## Saturday, 30 April 2016

### EVALUATION METRICS FOR CLASSIFIERS

Hello, welcome to my blog. In my previous posts I introduced the concept of classification. I talked about the operation of a linear classifier, how to learn the coefficients for a linear classifier using logistic regression and I demonstrated how to implement logistic regression in Python.

In this post I want to talk about how we evaluate a classifier.  Concretely, how do we know if a classifier is doing good or poorly on data? This question is the theme of this post.

ACCURACY
This is usually the first metric used to judge a classifier. It is the fraction of predictions that the classifier got correct. Formally,

It has the following properties:
• Its best possible value is 1 – this means the classifier got all of its predictions correct
• Its worst possible value is 0 – this means the classifier got all of its predictions wrong
The opposite of accuracy is classification error which is the fraction of predictions the classifier got wrong. The accuracy and classification error of any classifier on data must sum up to 1. Accuracy is usually expressed in percentages. For example, if a classifier gets 85 out of 100 predictions correct it has an accuracy of 0.85 or 85%. A good classifier will have accuracy close to 1 while a bad classifier will have accuracy close to 0.

FLAWS OF ACCURACY
If a classifier has high accuracy it does not mean that you should trust its predictions. Let me give an example. Suppose a classifier was able to correctly identify whether or not 9,900 out of 10,000 reviews have negative sentiment. This means the classifier has an accuracy of 99%. This may seem very impressive but what if on further investigation it was discovered that only 50 out of the 10,000 reviews had positive sentiment? Suddenly this accuracy score is no longer impressive because if we had a ‘classifier’ that predicted ŷ = -1 (i.e. negative sentiment) for every review, this ‘classifier’ would have an accuracy of 99.5% which is better than the previous classifier. Keep in mind that this classifier does not learn anything from the data – it always predicts ŷ = -1 for every review and it still manages to do better than the previous classifier which went through the process of learning from data.

This is the major flaw of using accuracy to measure the performance of a classifier. It can give misleading information on the performance of a classifier especially in cases where the data is imbalanced i.e. majority of examples belong to one class. This tells us that accuracy alone is not good enough judge how well a classifier is doing. Before I discuss other classification metrics, I want to talk about the confusion matrix.

CONFUSION MATRIX
A confusion matrix (also called contingency table) is a way of displaying the predictions of a classifier. It helps us see how well a classifier’s predictions match the actual values in the data. For a binary classification problem, the table will be a 2x2 matrix. More generally, for an n-class classification problem (n stands for the number of classes), the table will be an nxn matrix. The confusion matrix for binary classification is shown below

 Predicted Class ŷ = +1 ŷ = -1 Actual  Class y = +1 True Positives (TP) False Negatives (FN) y = -1 False Positives (FP) True Negatives (TN)

A perfect classifier would have all its predictions on the diagonal i.e. in the cells labelled True Positives and True Negatives. The other cells describe the two kinds of error we can have. Let me define the terms in the confusion matrix
• True Positives: These are reviews where the classifier predicted positive sentiment and the class of the review is truly positive.
• True Negatives: These are reviews where the classifier predicted negative sentiment and the class of the review is truly negative.
• False Positives: These are reviews where the classifier predicted positive sentiment but the actual class of the review is negative.
• False Negatives: These are reviews where the classifier predicted negative sentiment but the actual class of the review is positive.

OTHER CLASSIFICATION METRICS – PRECISION & RECALL

PRECISION
Precision also known as positive predictive value is defined as the fraction of positive predictions that are truly positive. For the sentiment classification problem, precision asks – What fraction of the reviews we predicted as having positive sentiment actually had positive sentiment? Formally

If a classifier has high precision it means that if it makes a positive prediction, it is very likely to be correct. Like accuracy, its best possible value is 1 and its worst possible value is 0.

RECALL
Recall is defined as the fraction of positive examples that were correctly predicted to be positive. It is a measure of how complete the predictions are. For the sentiment classification problem, recall asks – What fraction of the positive reviews did we correctly classify as having positive sentiment? Formally

If a classifier has high recall it means it was able to identify most (or all) of the actually positive examples. Like accuracy, its best possible value is 1 and its worst possible value is 0.

PRECISION-RECALL EXTREMES
You can think of a model with high recall and low precision as being optimistic – it predicts almost everything as positive. Therefore, it is likely to capture all the positive reviews but it will most definitively include a lot of negative reviews.

A model with high precision and low recall can be thought of as being pessimistic – it only predicts positive when it is very sure. Therefore, if it predicts that a review has positive sentiment it is very likely to be so but it will also miss out on the positive reviews it was not sure about.

The question now is – Can we make a trade-off precision and recall? Yes!

In a previous post, I said that the probability assigned to a particular prediction is used to predict what class the prediction belongs to.

To make a high precision model, we increase the probability threshold for classifying a review as positive. This will reduce the number of reviews that are predicted as having positive sentiment. The reviews that are predicted as having positive sentiment will be the ones we are very sure of.

To make a high recall model, we decrease the probability threshold for classifying a review as positive. This increases the number of reviews that are classified as positive. While we are likely to capture all the positive reviews, many negative reviews will also be included in our predictions.

OTHER CLASSIFICATION METRICS
There are other metrics which can be used to judge the performance of a classifier. I just list some of them here
• Sensitivity
• Specificity
• F1 Score
• Kappa statistic

SUMMARY
In this post, I talked about accuracy and its flaws when dealing with imbalanced data. I also discussed precision and recall and how we balance them to achieve either an optimistic or pessimistic classifier. I ended the post my mentioning some other classification metrics.