Welcome to Plain Data: IMPLEMENTING ENSEMBLE METHODS WITH PYTHON

Hello, welcome to my blog. In my previous post I introduced the concept of ensemble classifiers. I also talked about how they work and covered two popular ensemble methods: Boosting and Random Forests.

In this post I want to demonstrate how to implement these two ensemble methods using the GraphLab library in Python. I will use the same dataset as before, the LendingClub dataset, so we can compare the performance of the single decision tree model to that of the ensemble models.

The dataset is in SFrame format, so I will use GraphLab’s ‘SFrame’ class to load the data. Feel free to use the standalone ‘sframe’ library to load the data instead. In this blog post I will just display screenshots of the results of the program.

PRELIMINARY STEPS

I undersampled the majority class because the dataset had a disproportionately large number of safe loans. GraphLab does not require the data to have numerical values, so there was no need to convert the categorical features to binary features via one-hot encoding. Finally, I split the data into training (80%) and test (20%) sets and used a seed to ensure reproducibility.
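The two preparation steps above can be sketched in plain Python. This is only an illustration, not the GraphLab code used for this post: the function names, the ‘safe_loans’ label key and the choice to shrink every class down to the size of the smallest one are all assumptions. GraphLab's SFrame has its own ‘sample’ and ‘random_split’ methods for the same purpose.

```python
import random


def undersample_majority(rows, label_key, seed=1):
    """Randomly drop rows from the larger classes until every class
    has as many rows as the smallest class (an assumed balancing rule)."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


def train_test_split(rows, train_fraction=0.8, seed=1):
    """Shuffle with a fixed seed, then cut into training and test lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy data: 90 safe loans (+1) and 10 risky loans (-1).
# "safe_loans" is a hypothetical column name.
loans = [{"id": i, "safe_loans": 1} for i in range(90)]
loans += [{"id": 90 + i, "safe_loans": -1} for i in range(10)]

balanced = undersample_majority(loans, "safe_loans", seed=1)
train, test = train_test_split(balanced, train_fraction=0.8, seed=1)
```

Because both functions build their own seeded random generator, running them again with the same seed gives the same balanced sample and the same split.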

TRAINING THE ENSEMBLE MODEL – BOOSTING

To create a boosted decision tree model I used the ‘boosted_trees_classifier.create’ method from GraphLab, specifying ‘max_iterations’ = 10 and ‘max_depth’ = 6. The ‘max_iterations’ parameter specifies how many weak learners (i.e. decision trees in this case) you want to generate. Since I set it to 10, I generated 10 weak learners from the data. The ‘max_depth’ parameter controls the depth of the decision tree generated in each iteration, so each tree has a maximum depth of 6.
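A call along these lines might look like the sketch below. It assumes GraphLab Create is installed; the helper name, the ‘params’ argument and the target and feature names in the usage comment are hypothetical, and only ‘max_iterations’ and ‘max_depth’ come from the text above.

```python
# Hyperparameters quoted in this post.
BOOSTING_PARAMS = {"max_iterations": 10, "max_depth": 6}


def train_boosted_trees(train_data, target, features, params=None):
    """Train a boosted trees classifier on an SFrame.

    `target` is the name of the label column and `features` is a list
    of feature column names; both are supplied by the caller.
    """
    # Imported lazily so this file can be loaded without GraphLab;
    # the call itself needs GraphLab Create to be installed.
    import graphlab

    settings = dict(BOOSTING_PARAMS if params is None else params)
    return graphlab.boosted_trees_classifier.create(
        train_data,
        target=target,
        features=features,
        **settings
    )


# Hypothetical usage (column names are assumptions):
# model = train_boosted_trees(train_data, "safe_loans", ["grade", "term"])
```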

EVALUATING THE ENSEMBLE MODEL – BOOSTING

I evaluated the boosted decision tree model on the test data using the ‘evaluate’ method. The model had an accuracy of 67%, a 4 percentage point improvement over the single decision tree model, which had an accuracy of 63%.
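For reference, accuracy is just the fraction of test examples the model labels correctly. A minimal sketch of that calculation, with a hypothetical function name and toy labels (+1 for safe, -1 for risky), could be:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one example")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# Toy example: 3 of 4 predictions are correct.
toy_accuracy = accuracy([1, -1, 1, 1], [1, -1, -1, 1])
```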

I decided to try different iteration values and see the effect they had on accuracy. I trained 30 boosted tree models with 1, 2, 3, …, 30 weak learners respectively and plotted their accuracy on the test data. The resulting plot is shown below.

From the plot you can see that accuracy generally increases with the number of iterations and reaches the maximum accuracy of 68.7% at 28 iterations.
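An experiment like this can be organised as a small loop. The sketch below is library-agnostic: ‘train_fn’ and ‘eval_fn’ are hypothetical callables that you would supply, and the GraphLab calls in the comments are only one possible way to fill them in.

```python
def sweep_ensemble_sizes(train_fn, eval_fn, sizes):
    """Train one model per ensemble size and record its accuracy.

    train_fn(n) returns a model built with n weak learners.
    eval_fn(model) returns that model's accuracy on the test data.
    """
    results = []
    for n in sizes:
        model = train_fn(n)
        results.append((n, eval_fn(model)))
    return results


def best_size(results):
    """Return the (size, accuracy) pair with the highest accuracy.
    On ties, the smallest size listed first wins."""
    return max(results, key=lambda pair: pair[1])


# With GraphLab, the two callables might be filled in like this
# (hypothetical, requires GraphLab Create and prepared SFrames):
#   train_fn = lambda n: graphlab.boosted_trees_classifier.create(
#       train_data, target=target, max_iterations=n, max_depth=6)
#   eval_fn = lambda model: model.evaluate(test_data)["accuracy"]
#   results = sweep_ensemble_sizes(train_fn, eval_fn, range(1, 31))
```

The list of (size, accuracy) pairs can then be passed to any plotting library to reproduce a curve like the one described above.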

TRAINING THE ENSEMBLE MODEL – RANDOM FORESTS

To create a random forest model I used the ‘random_forest_classifier.create’ method from GraphLab, specifying ‘max_iterations’ = 10 and ‘max_depth’ = 6. Here ‘max_iterations’ means we want 10 decision trees and ‘max_depth’ means we want each tree to have a maximum depth of 6.
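To picture what a forest of 10 trees does at prediction time, here is a toy sketch of majority voting, one common way of combining the trees' outputs. It is an illustration only: GraphLab may combine its trees differently (for example by averaging predicted probabilities), and the function name and data are made up.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree predictions into one prediction per example.

    tree_predictions is a list with one entry per tree; each entry is
    that tree's list of predicted labels, all of the same length.
    """
    n_examples = len(tree_predictions[0])
    combined = []
    for i in range(n_examples):
        votes = Counter(preds[i] for preds in tree_predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined


# Three toy trees voting on three loans (+1 safe, -1 risky).
votes_by_tree = [
    [1, -1, 1],
    [1, 1, -1],
    [-1, -1, -1],
]
forest_prediction = majority_vote(votes_by_tree)
```

Using an odd number of trees, as in the toy example, avoids tied votes when there are only two classes.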

EVALUATING THE ENSEMBLE MODEL – RANDOM FORESTS

I evaluated the random forest model on the test data using the ‘evaluate’ method. The model had an accuracy of 65%, a marginal improvement (2 percentage points) over the single decision tree model, which had an accuracy of 63%.

I decided to try different iteration values and see the effect they had on accuracy. I trained 30 random forest models with 1, 2, 3, …, 30 decision trees respectively and plotted their accuracy on the test data. The resulting plot is shown below.

As you can see, the accuracy of the random forest models also generally increases with the number of iterations, but there are more dips compared to the boosted decision tree models. The maximum accuracy was 66.4% at 22 iterations.

CONCLUSION

We can conclude that the boosted decision tree model performed better than the random forest model for the set of parameters we used. We can also see that both ensemble methods performed better than the single decision tree model on the test data. This shows that ensemble methods are indeed capable of generating strong classifiers by combining several weak classifiers.

SUMMARY

In this post I demonstrated how we can implement two ensemble methods, Boosting and Random Forests, using GraphLab. Thank you for reading this post. As always, if you have any questions please leave a comment and I will do my best to answer. Subscribe to this blog if you have not already. Here’s wishing you a wonderful week ahead. Cheers!!!

Sunday, 12 June 2016