Thursday, 14 April 2016


Hello, welcome to my blog. In my previous posts I introduced the concept of classification and I also described the operation of a linear classifier. Basically a linear classifier computes a score for an input and classifies the input to a class based on the value of that score.

One thing I did not talk about was how the coefficients used by a linear classifier is computed. This is going to be the subject of this post. I will talk about a model called logistic regression that is used to learn the coefficients for a linear classifier from data. Before I do that, it is important give some background which will help you understand what logistic regression is about. In this post I will still use the case study of a restaurant review system to make my explanations clear.

A logistic regression model does not explicitly predict a class label (for our restaurant review case study the classes are either positive or negative sentiment). Rather it predicts the conditional probability of either positive or negative sentiment given a review. Formally this is written as 

Ppred(y|x), where y stands for the class label (output) and x stands the input (a review).
The statement above can be read as ‘probability of y given x’These probabilities are very useful in practice because they show how confident we are of our predictions.

If a logistic regression model predicts that a review is positive with a very high probability, for example 0.99 it means that we are very confident that the review is positive. If we are unsure whether a review is either positive or negative we want the probability to be around 0.5 which shows that there is a 50-50 chance that the review has either positive or negative sentiment. Usually, we predict y=+1 if the predicted probability is greater than 0.5 and =-1 if the predicted probability is less than 0.5. This is not always the case though, depending on the application, we may choose to use a lower or higher threshold for predicting y=+1.

The score is a weighted sum of the inputs to the linear classifier. Formally, it is written as 

wTx is convenient shorthand for w0x0 + w1x1 + … + wNxN. Note that wmeans the transpose of w which is vector that contains the coefficients for our features. It does not mean w raised to power T.
The score for a given input ranges from -∞ to +∞. If ŷ is -1 or +1 it should be close -or +∞ respectively.

Stating this more formally, if the score is a large positive number (i.e. close to +∞) we predict that the review is positive (i.e. ŷ = +1) and Ppred(y=+1|xi) = 1 (or close to 1). Conversely if the score is a large negative number (i.e. close to -) we predict that the review is negative (i.e. ŷ = -1) and Ppred(y=+1|xi) = 0 (or close to 0). If the score is 0 (at the decision boundary) we are not sure if ŷ = +1 or ŷ = -1 and we say Ppred(y=+1|xi) = 0.5.
Now, probabilities range from 0 to 1 while the score ranges from -or +∞. The question now is how do we link scores to probability? That is, how we interpret a score in terms of probability?

This is done by passing a score through what is called a link function. This function ‘squeezes’ values from the range -to +∞ (the range of our scores) to the interval 0 to 1 (which is the range of probabilities). The output of the link function for a given score is interpreted as Ppred(y=+1|x). Formally this is written as

Where g is the link function, ŵ is the coefficients we learn from data and h(xi) are our features.
For logistic regression the link function that is used is called the logistic function (also called sigmoid or logit function). It is stated below

 The plot for this function is an S-shaped curve. At the end of this post, I will show python code that produces the plot for the sigmoid function and the resulting plot.

Let me demonstrate the effect of the sigmoid function on small and large values of scores.


You can see that as the value of score gets larger sigmoid(score) get close to 1 and reaches 1 at positive infinity.  Also, as the value of score gets smaller sigmoid(score) get close to 0 and eventually reaches 0 at negative infinity.

The logistic regression model takes a score as input in the range -to +∞; pushes the score through the sigmoid function to estimate Ppred(y=+1|x, w). Formally, for an input xi
Ppred(y=+1|x, w) = sigmoid(score(xi))

Now I can come back to the subject of this blog post – How do we learn weights for our features from data? The best coefficients for our features are coefficients, ŵ such that:

P(y=+1|x, ŵ) = 1 à for all positive reviews and
P(y=-1|x, ŵ) = 1 à for all negative reviews.

The goal now is to find the set of ŵ for our features that makes this possible or as close as possible because usually, no ŵ is able to achieve perfect predictions. 

The likelihood function is a quality metric for logistic regression. It measures the quality of fit for a model with coefficients, ŵ. The higher it is for a model, the better. 

Basically, it is product of the conditional probabilities for all reviews in our data. The best value for the likelihood function is 1 because if a set of coefficients, ŵ is perfect it will predict probability 1 for all positive and negative reviews.

Gradient ascent is the algorithm that we use to learn the best coefficients from data. Basically, in gradient ascent we start from some point in the parameter space and ‘climb up’ using the gradient of the likelihood function until we have reached the top of the likelihood function. The set of coefficients that we get at the top of the likelihood function will be ŵ. The outline of the gradient ascent algorithm is shown below

while not converged
       w(t+1) ß w(t) + η*gradient of the likelihood function with respect to w(t)

η (pronounced eta) is the ‘step size’ which controls how much we climb on each iteration of gradient ascent. If it is large we will take big steps, while if it is small we take little steps for each iteration of gradient ascent.
The coefficients at the beginning of gradient ascent can be initialized to be zero, random numbers or some value which you feel may help you to quickly converge to an optimal set of coefficients.

In this post, I about the logistic regression model. I talked about how it uses the logistic function to interpret scores as probabilities. I also talked about how the likelihood function and gradient ascent. In the next post I will show how to implement logistic regression in Python.

Thank you once again for reading my blog. Don’t forget to add your email address to the mailing list of this blog so you can quickly get my posts. As always, if you have questions, suggestions or comments feel free to leave a comment and I will do my best to attend to you. Cheers!!!

Python code to produce the plot for the sigmoid function

#import matplotlib library for plotting
import matplotlib.pyplot as plt
import numpy as np

#Define function for sigmoid
def sigmoid(x):
    result = 1 /(1 +np.exp(-x))
    return result

#A list to store some score values ranging from -10 to 10
someScores = range(-10, 11)

sigmoidScores = [sigmoid(x) for x in someScores]

plt.plot(someScores, sigmoidScores)
plt.ylabel('Sigmoid of scores')

The resulting plot is shown below: