Hello, welcome to my blog. In my previous post I introduced
the concept of ensemble classifiers. I also talked about their operation and
two popular ensemble methods – Boosting & Random Forests.

In this post I want to demonstrate how to implement the two
ensemble methods mentioned above using the GraphLab library in Python. I will
use the same LendingClub dataset as before so we can compare the performance of
the single tree model to the ensemble models.

The dataset is in SFrame format, so I will use the ‘SFrame’
constructor from GraphLab to load the data. Feel free to use the standalone
‘sframe’ library to load the data instead. In this blog post I will just display
screenshots of the results of the program.

PRELIMINARY STEPS

I undersampled the majority class because the dataset had a
disproportionately large number of safe loans. GraphLab does not require the
data to have numerical values so there was no need to convert the categorical
features to binary features via one-hot encoding. Finally, I split the data
into training (80%) and test (20%) sets and used a seed to ensure
reproducibility.
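The two preparation steps above can be sketched without any library. This is only an illustration in plain Python, not the GraphLab code I ran: the ‘safe_loans’ column name, the +1/-1 labels, the toy data and the choice to balance the classes exactly 50/50 are all assumptions made for the example.

```python
import random


def undersample(rows, label_key, majority_label, seed=1):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    majority = [r for r in rows if r[label_key] == majority_label]
    minority = [r for r in rows if r[label_key] != majority_label]
    # Keep only as many majority rows as there are minority rows.
    kept = rng.sample(majority, min(len(majority), len(minority)))
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced


def train_test_split(rows, train_fraction=0.8, seed=1):
    """Shuffle with a fixed seed, then cut into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy data (hypothetical): 90 safe loans (+1) and 10 risky loans (-1).
loans = [{"id": i, "safe_loans": 1 if i < 90 else -1} for i in range(100)]

balanced = undersample(loans, "safe_loans", majority_label=1)
train, test = train_test_split(balanced, train_fraction=0.8, seed=1)
print(len(balanced), len(train), len(test))
```

Because both functions build their own seeded random generator, running them again with the same seed gives the same rows in the same order, which is what makes the split reproducible.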

TRAINING THE ENSEMBLE MODEL – BOOSTING

To create a boosted decision tree model I used the
‘boosted_trees_classifier.create’ method from GraphLab specifying
‘max_iterations’ = 10 and ‘max_depth’ = 6. The ‘max_iterations’ parameter specifies
how many weak learners (i.e. decision trees in this case) you want to generate.
Since I set it to 10 it means I generated 10 weak learners from the data. The
‘max_depth’ parameter controls the depth of each decision tree generated in
each iteration. Therefore, each tree will have a maximum depth of 6.
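The training call can be sketched as follows. Treat it as a rough outline rather than my exact program: the function name ‘train_boosted_trees’ is made up for this example, the target and feature names are whatever your data uses, and passing ‘validation_set=None’ is my assumption to keep GraphLab from holding out part of the training data. GraphLab Create is imported inside the function, so the snippet can be loaded even on a machine where it is not installed.

```python
# Hyperparameters quoted in the post.
BOOSTING_PARAMS = {"max_iterations": 10, "max_depth": 6}


def train_boosted_trees(train_data, target, features=None):
    """Train a boosted decision tree classifier with GraphLab Create.

    train_data is expected to be a graphlab.SFrame, target the name of the
    label column and features a list of column names (None = all columns).
    """
    import graphlab  # imported lazily; requires GraphLab Create

    return graphlab.boosted_trees_classifier.create(
        train_data,
        target=target,
        features=features,
        validation_set=None,  # assumption: no automatic hold-out set
        **BOOSTING_PARAMS
    )
```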

EVALUATING THE ENSEMBLE MODEL – BOOSTING

I evaluated the boosted decision tree model on test data
using the ‘evaluate’ method. The model had an accuracy of 67%, which is an
improvement of four percentage points over the single decision tree model that
had an accuracy of 63%.
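For readers who are unsure what the accuracy number means, here is a small plain-Python sketch of the calculation: the fraction of test rows whose predicted label matches the true label. The labels and predictions below are invented for illustration. As far as I know, GraphLab's ‘evaluate’ returns a dictionary of metrics with the same quantity stored under the ‘accuracy’ key.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / float(len(labels))


# Hypothetical test labels (+1 = safe loan, -1 = risky loan) and predictions.
true_labels = [1, 1, -1, -1, 1, -1, 1, -1]
predicted = [1, -1, -1, -1, 1, 1, 1, -1]

print(accuracy(predicted, true_labels))  # 6 of 8 correct -> 0.75
```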

I decided to try different iteration values and see the
effect they had on accuracy. I trained 30 boosted tree models, where the first,
second, third, up to the thirtieth model had 1, 2, 3, …, 30 weak learners
respectively, and plotted their accuracy values on the test data. The resulting
plot is shown below.

From the plot you can see that accuracy generally increases
with the number of iterations and reaches the maximum accuracy of 68.7% at 28
iterations.
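The experiment above boils down to a loop over the number of trees. Below is a hedged sketch of that loop. ‘sweep_iterations’ and ‘best_setting’ are names made up for this example, and ‘create_fn’ stands for a GraphLab function such as ‘graphlab.boosted_trees_classifier.create’. So that the snippet runs without GraphLab installed, the demo at the bottom uses a stand-in model with invented accuracy values, not my real results.

```python
def sweep_iterations(create_fn, train_data, test_data, target,
                     max_trees=30, max_depth=6):
    """Train one model per tree count and record its test accuracy."""
    results = []
    for n_trees in range(1, max_trees + 1):
        model = create_fn(train_data, target=target,
                          max_iterations=n_trees, max_depth=max_depth,
                          validation_set=None, verbose=False)
        results.append((n_trees, model.evaluate(test_data)["accuracy"]))
    return results


def best_setting(results):
    """Return the (n_trees, accuracy) pair with the highest accuracy."""
    return max(results, key=lambda pair: pair[1])


# --- Stand-in for GraphLab, used only to make this sketch runnable ---
class FakeModel(object):
    def __init__(self, n_trees):
        self.n_trees = n_trees

    def evaluate(self, data):
        # Invented numbers: accuracy rises, then levels off after 5 trees.
        return {"accuracy": 0.60 + 0.01 * min(self.n_trees, 5)}


def fake_create(train_data, target, max_iterations, **kwargs):
    return FakeModel(max_iterations)


results = sweep_iterations(fake_create, None, None, target="safe_loans")
print(best_setting(results))
```

With GraphLab installed you would pass the real create function and your training and test SFrames instead of the stand-ins, then plot the resulting pairs.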

TRAINING THE ENSEMBLE MODEL – RANDOM FORESTS

To create a random forest model I used the
‘random_forest_classifier.create’ method from GraphLab specifying
‘max_iterations’ = 10 and ‘max_depth’ = 6. Here ‘max_iterations’ means we want 10
decision trees and ‘max_depth’ means we want each tree to have a maximum depth
of 6.
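The random forest call has the same shape as the boosting one. Again this is a sketch under assumptions rather than my exact code: ‘train_random_forest’ is a made-up name, and the ‘random_seed’ and ‘validation_set=None’ arguments are additions of mine to make runs repeatable and to avoid an automatic hold-out set.

```python
# Hyperparameters quoted in the post.
RANDOM_FOREST_PARAMS = {"max_iterations": 10, "max_depth": 6}


def train_random_forest(train_data, target, features=None, seed=1):
    """Train a random forest classifier with GraphLab Create.

    train_data is expected to be a graphlab.SFrame and target the name of
    the label column. features=None uses every other column as a feature.
    """
    import graphlab  # imported lazily; requires GraphLab Create

    return graphlab.random_forest_classifier.create(
        train_data,
        target=target,
        features=features,
        random_seed=seed,     # assumption: fixed seed for repeatable forests
        validation_set=None,  # assumption: no automatic hold-out set
        **RANDOM_FOREST_PARAMS
    )
```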

EVALUATING THE ENSEMBLE MODEL – RANDOM FORESTS

I evaluated the random forest model on test data
using the ‘evaluate’ method. The model had an accuracy of 65%, which is a
marginal improvement of two percentage points over the single decision tree
model that had an accuracy of 63%.

I decided to try different iteration values and see the
effect they had on accuracy. I trained 30 random forest models, where the first,
second, third, up to the thirtieth random forest model had 1, 2, 3, …, 30 decision
trees respectively, and plotted their accuracy values on the test data. The
resulting plot is shown below.

As you can see, the accuracy of the random forest models also generally
increases with the number of iterations, but there are more dips compared to the
boosted decision tree model. The maximum accuracy was 66.4% at 22 iterations.

CONCLUSION

We can conclude that the boosted decision tree model
performed better than the random forest model for the set of parameters we
used. We can also see that both ensemble methods performed better than the
single decision tree model on the test data. This shows that ensemble methods
are indeed capable of generating strong classifiers by combining several weak
classifiers.

SUMMARY

In this post I demonstrated how we can implement two ensemble
methods, Boosting and Random Forests, using GraphLab. Thank you for reading this
post. As always, if you have any questions please leave a comment and I will do
my best to answer you. Subscribe to this blog if you have not already. Here’s
wishing you a wonderful week ahead. Cheers!!!