Hello there. Welcome to my blog. In the last post I talked about logistic
regression, the algorithm used to learn coefficients for linear classifiers.
In case you have not read it, you may want to go through it first.
In this post, I want to demonstrate how to implement logistic
regression using the scikit-learn library in Python. The full program is at
the end of the post; in the body I will only show screenshots of its
results.
For this demonstration, I will be using a dataset of baby
product reviews on Amazon.com. This dataset is in SFrame format and I will use
the sframe library in Python to load it. Here is a list of the Python libraries
I will be using:
- sframe – To load the dataset
- scikit-learn – This library provides functionality for implementing logistic regression
Here’s a look at the first five rows of the dataset
I started by removing the punctuation from the ‘review’
column so that words like “love” and “love!” are counted as the same word. The
ratings for reviews in the dataset range from 1 (most negative rating) to 5
(most positive rating). I ignored reviews with a rating of 3 because these
reviews tend to have neutral sentiment. I created the target column ‘sentiment’
by assigning a +1 class label to reviews with ratings of 4 or higher and a -1
class label to reviews with ratings of 2 or lower. I randomly split this dataset
into two parts: about 80% of the data was used for training the model while the
remaining 20% was used to test the model.
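To make these steps concrete, here is a small stand-alone sketch in Python 3 that uses only the standard library. The reviews are made up for illustration, and the names `reviews`, `PUNCT_TABLE` and `label_sentiment` are my own for this sketch. It also shuffles and cuts the list at 80%, which is a simplification and not the same procedure as sframe's `random_split`.

```python
import random
import string

# Made-up reviews for illustration only; not rows from the Amazon dataset.
reviews = [
    {"review": "Love it!", "rating": 5},
    {"review": "Okay, I guess.", "rating": 3},
    {"review": "Broke in a week...", "rating": 1},
    {"review": "Works well, love it", "rating": 4},
    {"review": "Useless.", "rating": 2},
]

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punctuation(text):
    return text.translate(PUNCT_TABLE)


def label_sentiment(rating):
    # +1 for ratings of 4 or 5, -1 for ratings of 1 or 2.
    return 1 if rating > 3 else -1


# Drop neutral (3-star) reviews, then add the cleaned text and the label.
labelled = [
    {
        "review_clean": remove_punctuation(r["review"]),
        "sentiment": label_sentiment(r["rating"]),
    }
    for r in reviews
    if r["rating"] != 3
]

# Shuffle with a fixed seed and cut at 80% so the split is reproducible.
rng = random.Random(1)
shuffled = labelled[:]
rng.shuffle(shuffled)
cut = int(0.8 * len(shuffled))
train, test = shuffled[:cut], shuffled[cut:]

print(len(train), "training reviews,", len(test), "test reviews")
```

With a fixed seed the split is the same on every run, which makes results easier to compare between experiments.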
Here is another look at the dataset with the new columns
FEATURE EXTRACTION
I used the word count for each word that appears in the
reviews in the training data to build a sparse matrix. Each row of this sparse matrix
contains the word counts for the corresponding review. This matrix is sparse
because most of the words appear in only a few reviews, so most of its entries
are zero. A vector that contains word counts is referred to as a bag-of-words
feature vector. Here is a general outline for extracting word count vectors:
- Learn a vocabulary (the set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction.
- Compute the occurrences of the words in each review and collect them in a row vector.
- Build a sparse matrix where each row is the word count vector for the corresponding review.
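The three steps above can be sketched by hand with the standard library. This is only an illustration on made-up reviews, with a deliberately simple whitespace tokenizer; the names `tokenize`, `column_of` and `count_vector` are hypothetical helpers for this sketch. Each sparse row is stored as a dictionary that maps a column index to a non-zero count.

```python
from collections import Counter

# Made-up, already-cleaned reviews; not taken from the Amazon dataset.
train_reviews = ["love this bottle", "love love the stroller", "bottle leaks"]
test_reviews = ["love the crib"]


def tokenize(text):
    # Lower-case and split on whitespace; a deliberately simple tokenizer.
    return text.lower().split()


# Step 1: learn the vocabulary from the training reviews only.
vocabulary = sorted({word for review in train_reviews for word in tokenize(review)})
column_of = {word: j for j, word in enumerate(vocabulary)}


def count_vector(text):
    # Step 2: count the known words, keeping only non-zero entries as {column: count}.
    counts = Counter(w for w in tokenize(text) if w in column_of)
    return {column_of[w]: c for w, c in counts.items()}


# Step 3: one sparse row per review.
train_rows = [count_vector(r) for r in train_reviews]
test_rows = [count_vector(r) for r in test_reviews]

print(vocabulary)
print(train_rows)
print(test_rows)
```

Note that "crib" never appears in the training reviews, so it has no column and is simply ignored when the test review is converted.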
If you want to know more about bag-of-words, read this
excellent post. Fortunately, it is easy to create word count
vectors in Python using ‘CountVectorizer’ from the sklearn.feature_extraction.text
module.
The vocabulary is learned from the training data only, and the same
word-to-column mapping is then used to transform both the training and the
test data.
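Here is a minimal sketch of that pattern on two made-up training reviews and one made-up test review. It assumes scikit-learn is installed and reuses the same `token_pattern` as the full program at the end of the post.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up reviews for illustration only; not rows from the Amazon dataset.
train_reviews = ["love this bottle", "bottle leaks"]
test_reviews = ["love the new bottle"]

vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")

# fit_transform learns the vocabulary and returns the sparse count matrix.
train_matrix = vectorizer.fit_transform(train_reviews)

# transform reuses the learned vocabulary; unseen words are dropped.
test_matrix = vectorizer.transform(test_reviews)

print(train_matrix.shape)
print(test_matrix.toarray())
```

The test matrix has exactly as many columns as the training matrix, which is what allows a classifier trained on one to make predictions on the other.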
TRAINING A LOGISTIC REGRESSION CLASSIFIER
We are now ready to learn a logistic regression classifier
from the training data. To do this, we first import the LogisticRegression
class from sklearn.linear_model. Next, we create an instance of that class. The
‘fit’ method is used to train the classifier using the sparse word count matrix
as features and the ‘sentiment’ column of the training data as the target.
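As a toy illustration of these three steps (import, instantiate, fit), the sketch below trains on four made-up reviews with scikit-learn's default settings and then predicts the sentiment of a new sentence. The data and variable names are assumptions for this example only; a model trained on four reviews is of course not useful in practice.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up reviews and labels for illustration only.
train_reviews = [
    "love it",
    "excellent product love it",
    "worst purchase",
    "useless and worst",
]
train_sentiment = [1, 1, -1, -1]

vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")
train_matrix = vectorizer.fit_transform(train_reviews)

# Create the classifier and learn one coefficient per word, plus an intercept.
model = LogisticRegression()
model.fit(train_matrix, train_sentiment)

# New text must go through the same vectorizer before prediction.
new_matrix = vectorizer.transform(["love this excellent stroller"])
print(model.predict(new_matrix))        # predicted class label
print(model.predict_proba(new_matrix))  # columns follow model.classes_
```

`predict` returns the class label, while `predict_proba` returns the estimated probability of each class in the order given by `model.classes_`.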
INSPECTING THE COEFFICIENTS FOR THE MODEL
We can now inspect the coefficients learned by the
classifier. Remember that this model uses the word counts from reviews in the
training data as the features; therefore each word is going to have a
coefficient. In order to better understand the model coefficients, I created a
table to store words (in one column) and their corresponding coefficients (in
another column). I sorted this table by the ‘coefficients’ column so we can see
the top 10 words with the most positive coefficients and the top 10 words with
the most negative coefficients. The tables are shown below:
The top 10 most positive words
The top 10 most negative words
From the tables, some of the most positive words include
‘amazed’, ‘pleasantly’, ‘excellent’, ‘outstanding’ and ‘perfect’. This makes
sense as words like these frequently occur in reviews with positive sentiment.
Some of the most negative words include ‘disappointed’, ‘worthless’, ‘useless’,
‘poorly’ and ‘worst’. This also makes sense because words like these frequently
appear in reviews with negative sentiment.
EVALUATING PERFORMANCE OF THE MODEL
Finally, I used the ‘accuracy_score’ function from the
‘sklearn.metrics’ module to get the accuracy of the model on the test data.
The model had an accuracy of 93.2% on the test data, which means it assigned
the correct sentiment label to roughly 93 out of every 100 test reviews.
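Accuracy is simply the fraction of predictions that match the true labels. The short sketch below, on made-up labels, computes it by hand and checks that `accuracy_score` gives the same number.

```python
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions for illustration only.
y_true = [1, 1, -1, 1, -1]
y_pred = [1, -1, -1, 1, 1]

# Fraction of positions where the prediction equals the true label.
manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```

Three of the five predictions are correct here, so both values come out as 0.6.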
SUMMARY
In this blog post, I demonstrated how to implement logistic
regression using the scikit-learn library in Python. Thank you once again for
reading my blog. If you have any questions, suggestions or comments, please feel
free to leave a comment and I will do my best to respond. Please add your
email to the mailing list of this blog if you have not done so. Enjoy the rest
of your weekend. Cheers!!!
Python code for implementing logistic regression
#Import the library used to load the dataset
import sframe
products = sframe.SFrame('amazon_baby.gl/')
#Function to remove punctuation (this two-argument str.translate call only works in Python 2)
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation)
products['review_clean'] = products['review'].apply(remove_punctuation)
#View the first five rows of the dataset
products.head(5)
#Ignore reviews with ratings = 3
products = products[products['rating'] != 3]
#Assign +1 to ratings of 4 or higher and -1 to ratings of 2 or lower
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
#Let's view the products sframe again
products.head(5)
#Randomly split data into training and test sets
#seed=1 ensures that the data is always split the same way every time I run this program
train_data, test_data = products.random_split(.8, seed=1)
#Building a word count vector
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# First, learn vocabulary from the training data and assign columns to words
# Then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])
#Learn a logistic regression classifier from the data
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
sentiment_model = logreg.fit(train_matrix, train_data['sentiment'])
#Inspecting the coefficients
sentiment_model_coef_table=sframe.SFrame({'word':vectorizer.get_feature_names(),'coefficient':sentiment_model.coef_.flatten()})
#Let's see the 10 most positive words
sentiment_model_coef_table.sort('coefficient', ascending=False).print_rows(10, 2)
#Let's see the 10 most negative words
#By default the sort function sorts in ascending order
sentiment_model_coef_table.sort('coefficient').print_rows(10, 2)
#Get accuracy of the model on test data
import sklearn.metrics
#First make predictions on the test data
predictions = sentiment_model.predict(test_matrix)
accuracy = sklearn.metrics.accuracy_score(y_true=test_data['sentiment'].to_numpy(), y_pred=predictions)
print "Accuracy on test data ", accuracy
#This prints Accuracy on test data 0.932205423566
 