Hello, welcome to my blog. In my previous posts I introduced
the concept of classification and I also described the operation of a linear
classifier. Basically a linear classifier computes a score for an input and
classifies the input to a class based on the value of that score.
One thing I did not talk about was how the coefficients used
by a linear classifier is computed. This is going to be the subject of this
post. I will talk about a model called logistic
regression that is used to learn the coefficients for a linear classifier
from data. Before I do that, it is important give some background which will
help you understand what logistic regression is about. In this post I will
still use the case study of a restaurant review system to make my explanations
clear.
PREDICTING PROBABILITIES
A logistic regression model does not explicitly predict a
class label (for our restaurant review case study the classes are either
positive or negative sentiment). Rather it predicts the conditional probability of either positive or negative sentiment given
a review. Formally this is written as
P_{pred}(yx), where y stands for the class label (output) and x stands the input (a review).
The statement above can be read as ‘probability of y given x’. These probabilities are very useful in practice because they
show how confident we are of our predictions.
HOW TO USE PROBABILITIES FOR CLASSIFICATION
If a logistic regression model predicts that a review is
positive with a very high probability, for example 0.99 it means that we are
very confident that the review is positive. If we are unsure whether a review
is either positive or negative we want the probability to be around 0.5 which
shows that there is a 5050 chance that the review has either positive or
negative sentiment. Usually, we predict y=+1
if the predicted probability is greater than 0.5 and y=1 if the predicted probability is less than 0.5. This is not
always the case though, depending on the application, we may choose to use a
lower or higher threshold for predicting y=+1.
INTERPRETING THE SCORE
The score is a weighted sum of the inputs to the linear
classifier. Formally, it is written as
w^{T}x is convenient shorthand for w_{0}x_{0}_{ }+ w_{1}x_{1} + … + w_{N}x_{N}_{. }Note that w^{T }means the transpose of w which is vector that contains
the coefficients for our features. It does
not mean w raised to power T.
The score for a given input ranges from ∞ to +∞.
If ŷ is 1 or +1 it should be close ∞ or +∞ respectively.
Stating this more formally, if the score is a large positive
number (i.e. close to +∞) we predict that the review is
positive (i.e. ŷ
= +1) and P_{pred}(y=+1x_{i}) = 1
(or close to 1). Conversely if the score is a large negative number (i.e. close
to ∞) we predict that the review is
negative (i.e. ŷ = 1) and P_{pred}(y=+1x_{i})
= 0 (or close to 0). If the score is 0 (at the decision boundary) we are not
sure if ŷ = +1 or ŷ = 1 and we say P_{pred}(y=+1x_{i})
= 0.5.
Now, probabilities
range from 0 to 1 while the score ranges from ∞ or +∞.
The question now is how do we link scores to probability? That
is, how we interpret a score in terms of probability?
THE LOGISTIC FUNCTION
This is done by
passing a score through what is called a link
function. This function ‘squeezes’ values from the range ∞ to +∞ (the range of our scores) to the
interval 0 to 1 (which is the range of probabilities). The output of the link
function for a given score is interpreted as P_{pred}(y=+1x).
Formally this is written as
Where g is the link
function, ŵ is the coefficients we learn from data and h(x_{i}) are our
features.
For logistic regression the link function that is used is
called the logistic function (also called sigmoid or logit function). It is
stated below
Let me demonstrate the effect of the sigmoid function on small
and large values of scores.
Score

∞

3

0

+3

+∞

Sigmoid(Score)

0

0.047

0.5

0.953

1

You can see that as the value of score gets larger
sigmoid(score) get close to 1 and reaches 1 at positive infinity. Also, as the value of score gets smaller
sigmoid(score) get close to 0 and eventually reaches 0 at negative infinity.
THE LOGISTIC REGRESSION MODEL
The logistic regression model takes a score as input in the
range ∞ to +∞; pushes the score through the sigmoid function to estimate P_{pred}(y=+1x,
w). Formally, for an input x_{i}
P_{pred}(y=+1x, w) = sigmoid(score(x_{i}))
LEARNING WEIGHTS FOR LOGISTIC REGRESSION
Now I can come back to the subject of this blog post – How do
we learn weights for our features from data? The best coefficients for our
features are coefficients, ŵ such that:
P(y=+1x, ŵ) = 1 à for all positive reviews and
P(y=1x, ŵ) = 1 à for all negative reviews.
The goal now is to find the set of ŵ for our features that makes this possible or as close as possible
because usually, no ŵ is able to
achieve perfect predictions.
LIKELIHOOD FUNCTION
The likelihood function is a quality metric for logistic
regression. It measures the quality of fit for a model with coefficients, ŵ.
The higher it is for a model, the better.
Basically, it is product of the conditional probabilities for
all reviews in our data. The best value for the likelihood function is 1 because
if a set of coefficients, ŵ is perfect it will predict probability 1 for all
positive and negative reviews.
GRADIENT ASCENT
Gradient ascent is the algorithm that we use to learn the best
coefficients from data. Basically, in gradient ascent we start from some point
in the parameter space and ‘climb up’ using the gradient of the likelihood
function until we have reached the top of the likelihood function. The set of
coefficients that we get at the top of the likelihood function will be ŵ. The
outline of the gradient ascent algorithm is shown below
while not converged
w^{(t+1) }ß w^{(t)} + η*gradient of the
likelihood function with respect to w^{(t)}
^{}
η (pronounced eta) is the ‘step size’ which controls how much
we climb on each iteration of gradient ascent. If it is large we will take big
steps, while if it is small we take little steps for each iteration of gradient
ascent.
The coefficients at the beginning of gradient ascent can be
initialized to be zero, random numbers or some value which you feel may help
you to quickly converge to an optimal set of coefficients.
SUMMARY
In this post, I about the logistic regression model. I talked
about how it uses the logistic function to interpret scores as probabilities. I
also talked about how the likelihood function and gradient ascent. In the next
post I will show how to implement logistic regression in Python.
Thank you once again for reading my blog. Don’t forget to add
your email address to the mailing list of this blog so you can quickly get my
posts. As always, if you have questions, suggestions or comments feel free to
leave a comment and I will do my best to attend to you. Cheers!!!
Python code to produce the plot for the sigmoid function
#import matplotlib library for plotting
import matplotlib.pyplot as plt
import numpy as np
#Define function for sigmoid
def sigmoid(x):
result = 1 /(1 +np.exp(x))
return result
#A list to store some score values ranging from 10 to 10
someScores = range(10, 11)
sigmoidScores = [sigmoid(x) for x in someScores]
plt.plot(someScores, sigmoidScores)
plt.xlabel('Scores')
plt.ylabel('Sigmoid of scores')
The resulting plot is shown below: