Hello, welcome to my blog. Recently, I have been talking
about two algorithms for classification, namely logistic regression and
decision trees. I also demonstrated how we can implement these algorithms using
Python’s scikit-learn library.
Today, I want to talk about ensemble classifiers. The fundamental
idea behind an ensemble classifier is to combine a set of classifiers to make one
better classifier. Concretely, an ensemble classifier combines two or more
classifiers (also called weak learners or weak classifiers) in order to make a stronger classifier (also called a strong learner or strong classifier).
OPERATION OF AN ENSEMBLE CLASSIFIER
The operation of an ensemble classifier can be summarized in
the following steps:
- Given input data, learn T weak classifiers from the data.
- Aggregate the prediction from each weak classifier in order to make a prediction for a test data-point.
Let me explain this with an example. Recall that in the
previous post I built a model to predict safe and unsafe loans using a single decision tree. The ensemble
classifier approach is to learn, say, 10 decision trees from the dataset. Each of
these trees is called a weak learner
because, on its own, it may not have very good accuracy. We then combine
these trees in order to obtain a single strong
classifier which will have good accuracy.
How do we combine a set of weak classifiers in order to make
a strong classifier? One way to do this is to assign a weight to the prediction
of each weak classifier based on how much we trust the predictions of that
classifier (i.e. good classifiers will have high weights while bad classifiers
will have low weights). We then predict by the sign of the weighted sum of
the predictions of all weak classifiers, i.e. if the weighted sum is positive we
predict the positive class, else we predict the negative class. Another way to
do this is to make predictions based on the majority vote of the predictions of the
weak classifiers. A small sketch of both approaches is shown below.
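To make this concrete, here is a minimal NumPy sketch of both combination rules. The predictions and weights below are made-up numbers purely for illustration, and the weak classifiers are assumed to output labels in {-1, +1}:

```python
import numpy as np

# Made-up predictions of T = 5 weak classifiers for one test point,
# encoded as -1 (negative class) or +1 (positive class).
weak_predictions = np.array([+1, -1, +1, +1, -1])

# How much we trust each weak classifier (higher = more trust).
classifier_weights = np.array([0.8, 0.3, 0.5, 0.9, 0.2])

# Option 1: weighted vote -- predict the sign of the weighted sum.
weighted_sum = np.dot(classifier_weights, weak_predictions)
weighted_vote = +1 if weighted_sum > 0 else -1

# Option 2: simple majority vote -- ignore the weights and count the labels.
majority_vote = +1 if np.sum(weak_predictions == +1) > len(weak_predictions) / 2 else -1

print(weighted_vote, majority_vote)  # both +1 for the numbers above
```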
ENSEMBLE METHODS
In this section I will talk about two popular ensemble
methods – Boosting (specifically AdaBoost) and Random Forests.
Boosting
AdaBoost, or adaptive boosting, is an algorithm that was
introduced by Freund & Schapire in the 1990s. It generates weak learners by iteratively
focusing on the hard-to-classify examples in the training data. An overview of
its operation is given in the following paragraph.
We start by assigning the same weight to every
point in the data. Next, we learn a classifier from these weighted points. If a data-point
is correctly classified, its weight is reduced; if it is incorrectly classified,
its weight is increased. This means that, after some iterations, correctly classified data-points
effectively ‘disappear’ (their weights become very small), leaving mostly the examples that the
previous classifiers got wrong. Once we have recomputed the weights for all data-points, we learn
another classifier on the re-weighted data, which now emphasizes the hard-to-classify
examples. This procedure is repeated T times,
where T is the number of weak
learners we want to generate. So if we want 10 weak learners, we repeat the
procedure 10 times. A summary of this algorithm is given below:
- Start with the same weight for all data-points.
- For t = 1, ..., T:
- Learn a classifier ft from the weighted data-points.
- Compute a coefficient ŵt for ft based on its weighted classification error.
- Recompute the weights of the data-points (increase the weights of misclassified points, decrease the weights of correctly classified points).
- Normalize the weights (this is necessary in order to avoid numerical instability).
After the T iterations, the final prediction for a point x is

ŷ = sign( ŵ1·f1(x) + ŵ2·f2(x) + ... + ŵT·fT(x) )

that is, our prediction, ŷ, is the sign of the weighted sum of the predictions of all the weak classifiers.
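To tie the summary and the prediction rule together, here is a compact sketch of how the loop might look in Python, using depth-1 decision trees (stumps) from scikit-learn as the weak learners. It assumes the labels are encoded as -1/+1 and is meant as an illustration of the steps above, not a polished implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=10):
    """Sketch of AdaBoost training; assumes labels y are encoded as -1/+1."""
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)
    weights = np.full(n, 1.0 / n)          # start with the same weight for every point
    stumps, coefficients = [], []

    for t in range(T):
        # Learn a weak classifier f_t (a decision stump) from the weighted data-points.
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)

        # Weighted classification error and coefficient w_hat_t for this classifier.
        error = np.sum(weights[pred != y]) / np.sum(weights)
        error = np.clip(error, 1e-10, 1 - 1e-10)   # guard against division by zero
        w_hat = 0.5 * np.log((1 - error) / error)

        # Recompute weights: increase for misclassified points, decrease otherwise,
        # then normalize to avoid numerical instability.
        weights *= np.exp(-w_hat * y * pred)
        weights /= weights.sum()

        stumps.append(stump)
        coefficients.append(w_hat)
    return stumps, coefficients

def adaboost_predict(stumps, coefficients, X):
    """y_hat = sign of the weighted sum of the weak classifiers' predictions."""
    scores = sum(w * s.predict(X) for s, w in zip(stumps, coefficients))
    return np.sign(scores)
```

Calling adaboost_fit on training data and then adaboost_predict on held-out points mirrors the weighted-vote prediction rule described above.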
Although AdaBoost can be applied to nearly any kind of model,
it is most commonly used with decision trees. A very important advantage of
boosting is that it is quite robust to over-fitting, i.e. test-set performance is not
very sensitive to the choice of T.
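One quick way to see this in practice is with scikit-learn's AdaBoostClassifier, whose staged_score method reports test accuracy after each boosting round. The synthetic dataset below is just a stand-in; the exact numbers will differ on real data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# A synthetic stand-in dataset, used only to illustrate the API.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# scikit-learn boosts depth-1 decision trees (stumps) by default.
model = AdaBoostClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# staged_score yields the test accuracy after 1, 2, ..., T weak learners,
# which lets us see how sensitive performance is to the choice of T.
for t, score in enumerate(model.staged_score(X_test, y_test), start=1):
    if t % 50 == 0:
        print(f"T = {t:3d}  test accuracy = {score:.3f}")
```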
Impact of Boosting
- It is extremely useful in the field of computer vision
- It is used by many (nearly half of the) winners of machine learning (ML) competitions
- Most deployed machine learning systems use model ensembles
RANDOM FORESTS
This is another very popular ensemble method that has been successfully
applied in various areas of machine learning due to its scalability, ease of
use and good classification performance. This algorithm iteratively ‘grows’ a
number of trees on random subsets of the data and aggregates the predictions of
all the trees based on the majority vote. A summary of the Random Forests
algorithm is given below, followed by a short scikit-learn sketch:
- Draw a random bootstrap sample of size n (randomly choose n samples from the training set with replacement).
- Grow a decision tree from the bootstrap sample. At each node:
- Randomly select d features without replacement
- Split the node using the feature that provides the best split according to the objective function e.g. maximizing the information gain or classification accuracy.
- Repeat steps 1 and 2 k times.
- Aggregate the prediction by each tree to assign the class label by majority vote.
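Here is the promised sketch of the same idea using scikit-learn's RandomForestClassifier. The dataset and parameter values are placeholders chosen only to illustrate how the steps above map onto the API:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data again, purely for illustration.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators plays the role of k (the number of trees), max_features the role
# of d (features considered at each split), and bootstrap=True draws the
# samples with replacement as in step 1.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))
```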
Advantages of Random Forests
- It is a general-purpose model that performs well on most problems
- It can handle noisy or missing data, as well as categorical or continuous features
- It can be used for extremely large datasets.
One weakness of Random Forests is that, unlike a single
decision tree, the resulting model is not easily interpreted. It may also require some work to tune the
model to the data; a sketch of one way to do this with a grid search is shown below.
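As a rough illustration of such tuning, here is a sketch using scikit-learn's GridSearchCV with a small, arbitrary parameter grid; the grid values are examples, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# An illustrative (not exhaustive) grid of hyper-parameters to try.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 10, 20],
}

# X_train and y_train are assumed to come from the earlier snippet.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```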
SUMMARY
In this blog post, I introduced the concept of an ensemble
classifier and I talked about the mode of operation of two popular ensemble
methods – Boosting and Random Forests.
I announced in the last blog post that I created a GitHub repository that contains my implementations of various ML
algorithms. I have added an implementation of k-nearest neighbours and will
add more soon.
Thank you for reading this post. I hope you liked it. As
always, I look forward to your comments on this post or any of my other
posts. If you have a question, leave a comment and I will do my best to
answer. Please subscribe to my blog if you have not done so already. Wishing you
a great and productive week. Cheers!!!