Saturday, 23 April 2016

IMPLEMENTING LOGISTIC REGRESSION IN PYTHON

Hello there. Welcome to my blog. In the last post I talked about the algorithm used to learn coefficients for linear classifiers – logistic regression. In case you have not read it, click here.

In this post, I want to demonstrate how to implement logistic regression using the scikit-learn library in Python. I will show the full program at the end of the post; here I will just display screenshots of the results of the program.
For this demonstration, I will be using a dataset of baby product reviews on Amazon.com. This dataset is in SFrame format and I will use the sframe library in Python to load it. Here is a list of the Python libraries I will be using:
  • sframe – To load the dataset
  • scikit-learn – To extract the features, train the logistic regression classifier and measure its accuracy
Here’s a look at the first five rows of the dataset




I started by removing the punctuation from the ‘review’ column so that words like “love” and “love!” are counted as the same word. The ratings for reviews in the dataset range from 1 (most negative rating) to 5 (most positive rating). I ignored reviews with a rating of 3 because these reviews tend to have neutral sentiment. I created the target column ‘sentiment’ by assigning a +1 class label to reviews with ratings of 4 or higher and a -1 class label to reviews with ratings of 2 or lower. I randomly split this dataset into two parts – 80% of the data was used for training the model while 20% of the data was used to test the model. Here is another look at the dataset with the new columns
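To make these preprocessing steps concrete, here is a minimal sketch in plain Python 3. It is not the program from this post (that one uses sframe and is listed at the end). The toy reviews, the `prepare` helper and the `split_train_test` helper are made up for illustration, and the shuffle-and-cut split is only a stand-in for sframe's `random_split`.

```python
import random
import string

# Hypothetical toy reviews standing in for the Amazon baby product dataset.
reviews = [
    {"review": "Love it! Works great.", "rating": 5},
    {"review": "It's okay, nothing special.", "rating": 3},
    {"review": "Broke after two days...", "rating": 1},
    {"review": "Pretty good, would buy again.", "rating": 4},
    {"review": "Poorly made; very disappointed.", "rating": 2},
]

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text):
    """Return text with ASCII punctuation stripped, so "love!" becomes "love"."""
    return text.translate(PUNCTUATION_TABLE)


def prepare(rows):
    """Drop 3-star reviews, clean the text and attach a +1/-1 sentiment label."""
    prepared_rows = []
    for row in rows:
        if row["rating"] == 3:
            continue  # neutral reviews are ignored
        prepared_rows.append({
            "review_clean": remove_punctuation(row["review"]),
            "sentiment": 1 if row["rating"] > 3 else -1,
        })
    return prepared_rows


def split_train_test(rows, train_fraction=0.8, seed=1):
    """Shuffle with a fixed seed, then cut the list at train_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


prepared = prepare(reviews)
train_rows, test_rows = split_train_test(prepared)
print(len(prepared), len(train_rows), len(test_rows))
```

Fixing the seed means the same rows land in the training set on every run, which keeps results comparable between runs.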



FEATURE EXTRACTION
I used the word count for each word that appears in the reviews in the training data to build a sparse matrix. Each row of this sparse matrix contains the word counts for the corresponding review. This matrix is sparse because each review contains only a small fraction of the words in the vocabulary, so most of its entries are zero. A vector that contains word counts is referred to as a bag-of-words feature vector. Here is a general outline for extracting word count vectors:
  • Learn a vocabulary (the set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction.
  • Compute the occurrences of the words in each review and collect them in a row vector.
  • Build a sparse matrix where each row is the word count vector for the corresponding review.
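
The three steps above can be sketched by hand in plain Python 3. This is only an illustration of the idea, not how scikit-learn does it internally. The toy reviews and the names `column_of` and `count_vector` are hypothetical, and each sparse row is stored as a dictionary that maps a column index to a count.

```python
from collections import Counter

# Hypothetical cleaned reviews; the real training data is the Amazon dataset.
train_reviews = ["love this bottle love it", "bottle leaked", "great stroller"]
test_reviews = ["love the stroller"]

# Step 1: learn the vocabulary from the training reviews only.
vocabulary = sorted({word for review in train_reviews for word in review.split()})
column_of = {word: index for index, word in enumerate(vocabulary)}


def count_vector(review):
    """Steps 2 and 3: one sparse row stored as {column index: count}."""
    counts = Counter(word for word in review.split() if word in column_of)
    return {column_of[word]: count for word, count in counts.items()}


train_matrix = [count_vector(review) for review in train_reviews]
test_matrix = [count_vector(review) for review in test_reviews]

print(vocabulary)
print(train_matrix[0])
print(test_matrix[0])  # "the" is not in the vocabulary, so it is dropped
```

Only the non-zero counts are stored, which is what makes the representation sparse.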
If you want to know more about bag-of-words, read this excellent post. Fortunately it is easy to create word count vectors in Python using ‘CountVectorizer’ from the sklearn.feature_extraction.text module.
The vocabulary is learned from the training data only, and the same word-to-column mapping is then used to transform both the training and the test data.
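Here is a small sketch of that fit-then-transform pattern on made-up reviews, assuming scikit-learn is installed. The `token_pattern` is the same one used in the full program at the end of the post; the toy reviews are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned reviews used only for illustration.
train_reviews = ["love this bottle love it", "bottle leaked", "great stroller"]
test_reviews = ["love the stroller"]

# Keep single-character tokens too (the default pattern drops them).
vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")

# fit_transform learns the vocabulary and builds the training matrix.
train_matrix = vectorizer.fit_transform(train_reviews)

# transform reuses the learned vocabulary; unseen words are ignored.
test_matrix = vectorizer.transform(test_reviews)

print(train_matrix.shape)
print(test_matrix.shape)
print(train_matrix[0, vectorizer.vocabulary_["love"]])
```

Both matrices have the same number of columns because the test data is mapped onto the vocabulary learned from the training data.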

TRAINING A LOGISTIC REGRESSION CLASSIFIER
We are now ready to learn a logistic regression classifier from the training data. To do this, we first import the LogisticRegression class from sklearn.linear_model. Next, we create an instance of that class. The ‘fit’ method is used to train the classifier using the sparse word count matrix as features and the ‘sentiment’ column of the training data as the target.
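As a rough illustration of these steps, the sketch below trains a classifier on six made-up reviews, assuming scikit-learn is installed. The reviews and labels are hypothetical, and LogisticRegression is used with its default settings, so the numbers say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training reviews with +1 / -1 sentiment labels.
train_reviews = [
    "love it works great",
    "excellent product love it",
    "great value excellent quality",
    "broke quickly very disappointed",
    "useless product poorly made",
    "disappointed useless and broke",
]
train_sentiment = [1, 1, 1, -1, -1, -1]

# Turn the reviews into a sparse word count matrix.
vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")
train_matrix = vectorizer.fit_transform(train_reviews)

# Create the classifier and learn one coefficient per word.
sentiment_model = LogisticRegression()
sentiment_model.fit(train_matrix, train_sentiment)

# Classify two unseen reviews using the same vocabulary.
new_matrix = vectorizer.transform(["love it great quality", "poorly made and useless"])
print(sentiment_model.predict(new_matrix))
print(sentiment_model.predict_proba(new_matrix))
```

The columns of `predict_proba` follow the order of `sentiment_model.classes_`, which here is -1 first and +1 second.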

INSPECTING THE COEFFICIENTS FOR THE MODEL
We can now inspect the coefficients learned by the classifier. Remember that this model uses the word counts from reviews in the training data as the features; therefore each word is going to have a coefficient. In order to better understand the model coefficients, I created a table to store words (in one column) and their corresponding coefficients (in another column). I sorted this table by the ‘coefficient’ column so we can see the top 10 words with the most positive coefficients and the top 10 words with the most negative coefficients. The tables are shown below.

The top 10 most positive words
The top 10 most negative words

From the tables, some of the most positive words include ‘amazed’, ‘pleasantly’, ‘excellent’, ‘outstanding’ and ‘perfect’. This makes sense as words like these frequently occur in reviews with positive sentiment. Some of the most negative words include ‘disappointed’, ‘worthless’, ‘useless’, ‘poorly’ and ‘worst’. This also makes sense because words like these frequently appear in reviews with negative sentiment.
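
Here is a sketch of how such a word and coefficient table can be built without sframe, assuming scikit-learn is installed. The toy reviews are made up, and the words are recovered from the vectorizer's `vocabulary_` mapping (word to column index) so that each word lines up with its column in `coef_`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training reviews with +1 / -1 sentiment labels.
train_reviews = [
    "love it works great",
    "excellent product love it",
    "great value excellent quality",
    "broke quickly very disappointed",
    "useless product poorly made",
    "disappointed useless and broke",
]
train_sentiment = [1, 1, 1, -1, -1, -1]

vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")
train_matrix = vectorizer.fit_transform(train_reviews)
sentiment_model = LogisticRegression().fit(train_matrix, train_sentiment)

# Order the words by column index so they match the coefficient order.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
coefficients = sentiment_model.coef_.flatten()

# Sort (word, coefficient) pairs from most positive to most negative.
table = sorted(zip(words, coefficients), key=lambda pair: pair[1], reverse=True)

print("Most positive words:")
for word, coefficient in table[:3]:
    print(word, round(float(coefficient), 3))

print("Most negative words:")
for word, coefficient in table[::-1][:3]:
    print(word, round(float(coefficient), 3))
```

A positive coefficient pushes a review towards the +1 class each time the word appears, while a negative coefficient pushes it towards -1.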

EVALUATING PERFORMANCE OF THE MODEL
Finally, I used the ‘accuracy_score’ function from the ‘sklearn.metrics’ module to get the accuracy of the model on the test data, that is, the fraction of test reviews whose predicted sentiment matches the true sentiment. The model had an accuracy of 93.2% on the test data, which means it classified roughly 93 out of every 100 test reviews correctly as having either positive or negative sentiment.
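To show what the accuracy number measures, here is a tiny Python 3 sketch with made-up labels and predictions, assuming scikit-learn is installed. It compares ‘accuracy_score’ with a manual count of correct predictions.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions for five test reviews.
y_true = [1, -1, 1, 1, -1]
y_pred = [1, -1, -1, 1, -1]

# Accuracy as computed by scikit-learn.
accuracy = accuracy_score(y_true, y_pred)

# The same quantity by hand: correct predictions divided by total predictions.
manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(accuracy, manual_accuracy)
```

Four of the five predictions match the true labels, so both calculations give the same fraction.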

SUMMARY
In this blog post, I demonstrated how to implement logistic regression using the scikit-learn library in Python. Thank you once again for reading my blog. If you have any questions, suggestions or comments, please feel free to leave a comment and I will do my best to attend to you. Please add your email to the mailing list of this blog if you have not done so. Enjoy the rest of your weekend. Cheers!!!

Python code for implementing logistic regression

#Import the library for loading the dataset
import sframe
products = sframe.SFrame('amazon_baby.gl/')

#Function to remove punctuation
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation)

products['review_clean'] = products['review'].apply(remove_punctuation)
#View the first five rows of the dataset
products.head(5)

#Ignore reviews with ratings = 3
products = products[products['rating'] != 3]

#Assign +1 to ratings of 4 or higher and -1 to ratings of 2 or lower
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)

#Let's view the products sframe again
products.head(5)

#Randomly split data into training and test sets
#seed=1 ensures that the data is always split the same way every time I run this program
train_data, test_data = products.random_split(.8, seed=1)

#Building a word count vector
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')

# First, learn vocabulary from the training data and assign columns to words
# Then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])

# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])

#Learn a logistic regression classifier from the data
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
sentiment_model = logreg.fit(train_matrix, train_data['sentiment'])

#Inspecting the coefficients
sentiment_model_coef_table=sframe.SFrame({'word':vectorizer.get_feature_names(),'coefficient':sentiment_model.coef_.flatten()})

#Let's see the 10 most positive words
sentiment_model_coef_table.sort('coefficient', ascending=False).print_rows(10, 2)

#Let's see the 10 most negative words
#By default the sort function sorts in ascending order, so the most negative coefficients come first
sentiment_model_coef_table.sort('coefficient').print_rows(10, 2)

#Get accuracy of the model on test data
import sklearn.metrics


#First make predictions on the test data
predictions = sentiment_model.predict(test_matrix)
accuracy = sklearn.metrics.accuracy_score(y_true=test_data['sentiment'].to_numpy(), y_pred=predictions)

print "Accuracy on test data ", accuracy

#This prints Accuracy on test data 0.932205423566