Hello, welcome to my blog. Recently, I have been talking about two algorithms for classification namely logistic regression & decision trees. I also demonstrated how we can implement these algorithms using Python’s scikit-learn library.
Today, I want to talk about Ensemble classifiers. The fundamental idea behind ensemble classifiers is combining a set of classifiers to make one better classifier. Concretely, an ensemble classifier combines two or more classifiers (also called a weak learner or classifier) in order to make a stronger classifier (also called a strong learner or classifier).
OPERATION OF AN ENSEMBLE CLASSIFIER
The operation of an ensemble classifier can be summarized in the following steps:
- Given input data, learn T weak classifiers from the data.
- Aggregate the prediction from each weak classifier in order to make a prediction for a test data-point.
Let me explain this with an example. Remember from the previous post, I built a model to predict safe and unsafe loans using a single decision tree. The ensemble classifier approach is to learn, say, 10 decision trees from the dataset. Each of these trees is called a weak learner because individually they may not have very good accuracy. We then combine these trees in order to obtain a single strong classifier which will have good accuracy.
How do we combine a set of weak classifiers in order to make a strong classifier? One way to do this is to assign a weight to the prediction of each weak classifier based on how well we trust the predictions of that classifier (i.e. good classifiers will have high weights while bad classifiers will have low weights). Finally, we predict by the sign of the weighted sum of predictions of all weak classifiers i.e. if the weighted sum is positive we predict the positive class else we predict the negative class. Another way to do this is to make predictions based on the majority vote of predictions of the weak classifiers.
In this section I will talk about two popular ensemble methods – Boosting (specifically AdaBoost) and Random Forests.
AdaBoost or adaptive boosting is an algorithm that was introduced by Freund & Schapire in 1999. It generates weak learners by iteratively focusing on the hard-to-classify examples in the training data. An overview of its operation is given in the following paragraph.
We start by assigning the same weight to all points in the data. Next, we learn a classifier from these points. If a data-point is correctly classified its weight is reduced, if it is incorrectly classified its weight is increased. This means that correctly classified data-points will ‘disappear’ after some iterations leaving the misclassified examples from the previous iteration. Once we recompute the weights for all data-points we learn another classifier on the remaining data-points (i.e. the hard-to-classify examples). This procedure is repeated T times where T is the number of weak learners we want to generate. So if we want 10 weak learners we repeat the procedure 10 times. A summary of this algorithm is given below:
- Start with same weights for all data-points.
- Learn a classifier ft from the weighted data-points.
- For t = 1, ... , T
- Compute coefficient ŵt for the classifier based on the weighted classification error of the classifier ft.
- Recompute weights for the data-points.
- Normalize weights (this is necessary in order to avoid numerical instability).
The equation above says our prediction, ŷ, is the sign of the weighted sum of predictions for all the weak classifiers.
Although AdaBoost can be applied to nearly any kind of model, it most commonly used with decision trees. A very important advantage of boosting is that it is robust to over-fitting i.e. test-set performance is not very sensitive to the choice of T.
Impact of Boosting
- It is extremely useful in the field of computer vision
- It is used by most (nearly half of) winners of machine learning (ML) competitions
- Most deployed machine learning systems use model ensembles
This is another very popular ensemble method that has been successfully applied in various areas of machine learning due to its scalability, ease of use and good classification performance. This algorithm iteratively ‘grows’ a number of trees on random subsets of the data and aggregates the predictions of all the trees based on the majority vote. A summary of the Random Forests algorithm is given below:
- Draw a random bootstrap sample size of n (randomly choose n samples from the training set with replacement).
- Grow a decision tree from the bootstrap sample. At each node:
- Randomly select d features without replacement
- Split the node using the feature that provides the best split according to the objective function e.g. maximizing the information gain or classification accuracy.
Advantages of Random Forests
- It is a general purpose model that performs well on most problems
- It is capable of handling noisy or missing data; categorical or continuous features
- It can be used for extremely large datasets.
One weakness of Random Forests is that may not be easily interpreted unlike a decision tree. It also may require some work to tune the model to the data.
In this blog post, I introduced the concept of an ensemble classifier and I talked about the mode of operation of two popular ensemble methods – Boosting and Random Forests.
I announced in the last blog post that I created a Github repository that contains my implementation of various ML algorithms. I have added an implementation of k-nearest neighbours and I will add more soon.
Thank you for reading this post. I hoped you liked it. As always I look forward to your comments about this post or any other of my posts. If you have a question, comment about it and I will do my best to answer. Please subscribe to my blog posts if you have not done so. Wishing you a great and productive week. Cheers!!!