Linear Classifiers and Logistic Regression

The goal of this lesson is to explore logistic regression and feature engineering with Sklearn functions. In the assignment, we will use product review data from Amazon.com to predict whether the sentiments about a product (from its reviews) are positive or negative.

Learning objectives:

  1. Use Pandas and Sklearn to do some feature engineering.
  2. Train a logistic regression model to predict the sentiment of product reviews.
  3. Inspect the weights (coefficients) of a trained logistic regression model.
  4. Make a prediction (both class and probability) of sentiment for a new product review.
  5. Given the logistic regression weights, predictors and ground truth labels, write a function to compute the accuracy of the model.
  6. Inspect the coefficients of the logistic regression model and interpret their meanings.
  7. Compare multiple logistic regression models.

Before we start, we import a few libraries and load the dataset of baby product reviews on Amazon.com from amazon_baby.csv, storing it in a data frame called products.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
In [2]:
products = pd.read_csv('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week1/amazon_baby.csv')
products.head()
Out[2]:
name review rating
0 Planetwise Flannel Wipes These flannel wipes are OK, but in my opinion ... 3
1 Planetwise Wipe Pouch it came early and was not disappointed. i love... 5
2 Annas Dream Full Quilt with 2 Shams Very soft and comfortable and warmer than it l... 5
3 Stop Pacifier Sucking without tears with Thumb... This is a product well worth the purchase. I ... 5
4 Stop Pacifier Sucking without tears with Thumb... All of my kids have cried non-stop when I trie... 5

1. Data Set-up

The first step is to prepare the data for the analysis. Specifically, we need to do the following:

  • Text cleaning: we will remove punctuation to ensure that words "cake." and "cake!" are counted as the same word. We also remove reviews with rating = 3, since they tend to have a neutral sentiment.
  • Create dependent variable (desired target): we will assign reviews with a rating of 4 or higher to be positive reviews, while those with a rating of 2 or lower are negative. Then we create a new column, called "sentiment", with +1 for the positive class label and -1 for the negative class label.
  • Train and test split: we split the data into a training set and a test set. The model learns from the training set, where the labels are known, and the held-out test set is used to check how well its predictions generalize to new data.
1.1 Text cleaning
In [3]:
products = products.fillna({'review':''}) # fill in N/A's in the review column
In [4]:
def remove_punctuation(text):
    import string
    return text.translate(str.maketrans('','',string.punctuation)) 

products['review_clean'] = products['review'].apply(remove_punctuation)
products = products[products['rating'] != 3]
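
As a quick sanity check (a small sketch, not part of the original assignment), we can confirm that the cleaning step collapses punctuation variants of the same word:

# "cake." and "cake!" should both become "cake" after cleaning
print(remove_punctuation('cake.'), remove_punctuation('cake!'))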
1.2 Create dependent variable (desired target)
In [5]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
1.3 Train and test split
In [6]:
import json
with open('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week1/test_data_idx.json') as test_data_file:    
    test_data_idx = json.load(test_data_file)
with open('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week1/train_data_idx.json') as train_data_file:    
    train_data_idx = json.load(train_data_file)

train_data = products.iloc[train_data_idx]
test_data = products.iloc[test_data_idx]
In [7]:
train_data.head(2)
Out[7]:
name review rating review_clean sentiment
1 Planetwise Wipe Pouch it came early and was not disappointed. i love... 5 it came early and was not disappointed i love ... 1
2 Annas Dream Full Quilt with 2 Shams Very soft and comfortable and warmer than it l... 5 Very soft and comfortable and warmer than it l... 1
In [8]:
test_data.head(2)
Out[8]:
name review rating review_clean sentiment
9 Baby Tracker® - Daily Childcare Journal, S... This has been an easy way for my nanny to reco... 4 This has been an easy way for my nanny to reco... 1
10 Baby Tracker® - Daily Childcare Journal, S... I love this journal and our nanny uses it ever... 4 I love this journal and our nanny uses it ever... 1

2. Building a Logistic Regression model

  • Build the word count vector for each review: we will now compute the word count for each word that appears in the reviews.

  • Create and train the model: here we create a LogisticRegression object and use the .fit() method to train the model.

  • Making predictions with logistic regression: after a model is trained, we can make predictions on the test data.

2.1 Build the word count vector for each review

Learn a vocabulary (set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction. Compute the occurrences of the words in each review and collect them into a row vector. Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix train_matrix. Using the same mapping between words and columns, convert the test data into a sparse matrix test_matrix.

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
test_matrix = vectorizer.transform(test_data['review_clean'])
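
To get a feel for what the vectorizer produced, it helps to look at the shape of the resulting sparse matrices (a quick sketch, assuming the cells above have been run; the exact vocabulary size depends on the training data):

# Each row is a review, each column a word from the training vocabulary;
# the test matrix reuses the same vocabulary, so it has the same number of columns.
print(train_matrix.shape)
print(test_matrix.shape)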
2.2 Train a sentiment classifier with logistic regression

First, we create a logistic regression classifier object with the LogisticRegression() function. Then, we fit the model on the training set using fit(); later on we will make predictions on the test set using predict().

In [10]:
logitreg = LogisticRegression() 
sentiment_model = logitreg.fit(train_matrix, train_data['sentiment'])
In [11]:
print("The number of coefficients is greater than zero:", np.sum(sentiment_model.coef_ > 0))
The number of coefficients is greater than zero: 86781
2.3 Making predictions with logistic regression

In this section, we will explore predictions in the context of 3 data points from the test data. Take the 11th, 12th, and 13th data points in the test data and save them to sample_test_data. The following cell extracts the three data points from test_data and prints their content:

In [12]:
sample_test_data = test_data[10:13]
sample_test_data
Out[12]:
name review rating review_clean sentiment
59 Our Baby Girl Memory Book Absolutely love it and all of the Scripture in... 5 Absolutely love it and all of the Scripture in... 1
71 Wall Decor Removable Decal Sticker - Colorful ... Would not purchase again or recommend. The dec... 2 Would not purchase again or recommend The deca... -1
91 New Style Trailing Cherry Blossom Tree Decal R... Was so excited to get this product for my baby... 1 Was so excited to get this product for my baby... -1
a. Score Prediction

We will now score the examples in sample_test_data. Based on this score, the sentiment_model predicts +1 if the sentiment is positive and -1 if the sentiment is negative. Recall that the score for the logistic regression model is defined as: $$ score_i = w^T h(x_i)$$ where $h(x_i)$ represents the features for example $i$.

In [13]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print(scores)
[  5.60167742  -3.16956136 -10.42345957]
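
As a sanity check (a sketch, assuming the fitted sentiment_model and sample_test_matrix from above), the same scores can be reproduced directly from the learned weights; note that decision_function computes $w^T h(x_i)$ plus the model's intercept:

# Manually compute w^T h(x_i) + intercept and compare with decision_function
manual_scores = sample_test_matrix.dot(sentiment_model.coef_.T).ravel() + sentiment_model.intercept_
print(manual_scores)  # should match the scores above up to floating-point error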
b. Class Label Prediction

The scores calculated above can be used to make class predictions as follows: $$ y = \begin{cases} +1 & \quad \text{if } w^T h(x_i) > 0\\ -1 & \quad \text{if } w^T h(x_i) \leq 0 \end{cases} $$ We can make this prediction in scikit-learn by calling the predict() function.

In [14]:
print(sentiment_model.predict(sample_test_matrix))
[ 1 -1 -1]
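
Equivalently (a minimal sketch), the class labels can be recovered by thresholding the scores at zero, which is what predict() does for a binary classifier:

# +1 when the score is strictly positive, -1 otherwise
print(np.where(scores > 0, 1, -1))  # should agree with sentiment_model.predict(sample_test_matrix)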
c. Probability Prediction

Recall that we can also calculate the probability predictions from the scores using the sigmoid function as follows: $$P(y_i=+1 \mid x_i, w) = \frac{1}{1+\exp(-w^T h(x_i))}$$

Using the variable scores calculated previously, we can write code to calculate the probability that a sentiment is positive using the above formula. For each row, the probabilities should be a number in the range [0, 1].

In [15]:
for i in range(len(sample_test_data)):
    prob = 1/(1+np.exp(-scores[i]))
    print('The probability that a sentiment is positive for observation',i+1, 'is:', prob)
The probability that a sentiment is positive for observation 1 is: 0.9963219122192273
The probability that a sentiment is positive for observation 2 is: 0.04032738776318565
The probability that a sentiment is positive for observation 3 is: 2.972597544224787e-05
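
The same probabilities can also be obtained without an explicit loop, either by applying the sigmoid to the whole score vector or directly from predict_proba (a sketch; the second column of predict_proba corresponds to the +1 class, since classes_ is sorted as [-1, +1]):

# Vectorised sigmoid of the scores
print(1 / (1 + np.exp(-scores)))
# Built-in equivalent: probability of the positive class
print(sentiment_model.predict_proba(sample_test_matrix)[:, 1])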
d. Find the most positive (and negative) review

We now turn to examining the full test dataset, test_data, and use the fitted sentiment_model to form predictions on all of the test data points. We will determine the 20 reviews in the entire test_data with the highest (lowest) probability of being classified as a positive (negative) review.

In [16]:
test_matrix = vectorizer.transform(test_data['review_clean'])
test_predict_proba = sentiment_model.predict_proba(test_matrix)[:,1]
result = {}
for name, proba in zip(test_data['name'], test_predict_proba):
    result[name] = proba

Top 20 reviews with the highest probability of being classified as a positive review

In [17]:
sorted(result.items(), key=lambda x: x[1], reverse=True)[:20]
Out[17]:
[('Britax Decathlon Convertible Car Seat, Tiffany', 1.0),
 ('Evenflo X Sport Plus Convenience Stroller - Christina', 1.0),
 ('Roan Rocco Classic Pram Stroller 2-in-1 with Bassinet and Seat Unit - Coffee',
  1.0),
 ("Graco Pack 'n Play Element Playard - Flint", 1.0),
 ('Buttons Cloth Diaper Cover - One Size - 8 Color Options', 1.0),
 ('Mamas & Papas 2014 Urbo2 Stroller - Black', 1.0),
 ('Summer Infant Wide View Digital Color Video Monitor', 0.9999999999999998),
 ('Phil & Teds Navigator Buggy Golden Kiwi Free 2nd Seat NAVPR12200',
  0.9999999999999996),
 ('Britax Frontier Booster Car Seat', 0.999999999999998),
 ('Roundabout Convertible Car Seat - Grey Wicker', 0.9999999999999882),
 ('Emily Green 6" Bowl, Sunshine Safari', 0.9999999999999811),
 ('Quinny 2012 Buzz Stroller, Rebel Red', 0.9999999999999656),
 ('Peg Perego Aria Light Weight One Hand Fold Stroller in Moka',
  0.9999999999999645),
 ('Prince Lionheart bebePOD Plus, Watermelon', 0.9999999999999567),
 ('Rainy Day Indoor Playground toddler swing to be used with support system',
  0.9999999999999527),
 ('Safety 1st Complete Air 70 Car Seat, Julianne', 0.9999999999998697),
 ('JJ Cole Swag Diaper Bag, Bronze Drop', 0.9999999999997538),
 ('Graco Quattro Tour Duo Stroller, Clairmont', 0.9999999999996945),
 ('Prince Lionheart Flexibath Foldable Bathtub, White', 0.9999999999996703),
 ('Mamas and papas Galaxy Crib Mobile', 0.9999999999996658)]

Top 20 reviews with the lowest probability of being classified as a positive review

In [18]:
sorted(result.items(), key=lambda x: x[1])[:20]
Out[18]:
[('The First Years True Choice P400 Premium Digital Monitor, 2 Parent Unit',
  3.3606734423009134e-13),
 ('Ellaroo Mei Tai Baby Carrier - Hershey', 4.352115010246832e-10),
 ('Thirsties Hemp Inserts 2 Pack, Small 6-18 Lbs', 1.6343277038091365e-09),
 ('Baby Trend Inertia Infant Car Seat - Horizon', 1.8982763435964858e-09),
 ('One Step Ahead Hide-Away Extra Long Bed Rail', 3.1939346262040935e-09),
 ("Fisher-Price Discover 'n Grow Take-Along Play Blanket",
  4.819466690798638e-09),
 ('Baby Jogger Summit XC Double Stroller, Red/Black', 5.257972080343987e-09),
 ('Valco Baby Tri-mode Twin Stroller EX- Hot Chocolate',
  6.654048657895673e-09),
 ('Levana BABYVIEW20 Interference Free Digital Wireless Video Baby Monitor with Night Light Lullaby Camera',
  8.365013066786632e-09),
 ('Playtex Nurser With Drop-Ins Liner, 4 Ounce, Colors May Vary, 3-Count',
  9.940460342010848e-09),
 ('Cloud b Gentle Giraffe On The Go Travel Sound Machine with Four Soothing Sounds',
  1.2218005551881237e-08),
 ('Peg-Perego Tatamia High Chair, White Latte', 2.6697097197057853e-08),
 ('Baby Trend Encore Travel System-Columbia', 3.9092268801919764e-08),
 ('Philips AVENT 3-in-1 Electric Steam Sterilizer', 4.139203969022401e-08),
 ("Phil and Ted's Vibe Baby Stroller in Black and Red", 4.987747831380129e-08),
 ("GuideCraft Noah's Ark Toy Chest", 6.846122157188161e-08),
 ("Carter's Monkey Bars Musical Mobile, Chocolate", 7.150985068398024e-08),
 ('Evenflo Take Me Too Premiere Tandem Stroller - Castlebay',
  9.945574873664368e-08),
 ("Mother's Lounge 5 Piece Carseat Canopy Whole Caboodle, Hawkslee",
  1.1014947473690548e-07),
 ('Boon Fluid Sippy Cup,Blue/Orange', 1.1899889812261643e-07)]
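
As an aside (a sketch, not required by the assignment), the same rankings can be produced with pandas instead of sorting a dictionary; this also avoids collapsing products that share the same name, since the dictionary above keeps only one probability per product name:

# Attach the positive-class probabilities to the test data and rank them
proba_df = test_data[['name']].copy()
proba_df['positive_proba'] = test_predict_proba
print(proba_df.nlargest(20, 'positive_proba'))   # most confidently positive reviews
print(proba_df.nsmallest(20, 'positive_proba'))  # most confidently negative reviews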

3. Model Evaluation

While there are other ways of measuring model performance (precision, recall, F1 score, ROC curve, etc.), we are going to keep this simple and use accuracy as our metric. To do this, we see how the model performs on the new data (the test set). Accuracy is defined as: $$ \text{accuracy} = \frac{\text{correctly classified examples}}{\text{total examples}}$$

In [19]:
print("Our accuracy was:", accuracy_score(test_data['sentiment'], sentiment_model.predict(test_matrix)))
Our accuracy was: 0.9322954163666907
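
Learning objective 5 asks for a hand-written accuracy function. Below is a minimal sketch (taking the fitted model rather than the raw weight vector, which is equivalent here) that should reproduce the accuracy_score value above:

def compute_accuracy(model, feature_matrix, true_labels):
    """Return the fraction of examples whose predicted class matches the ground truth."""
    predictions = model.predict(feature_matrix)
    return np.mean(predictions == np.asarray(true_labels))

print("Manually computed accuracy:", compute_accuracy(sentiment_model, test_matrix, test_data['sentiment']))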

4. Learn a Second Classifier with Fewer Words

For this assignment, we evaluate the performance of models that use different sets of words. Above, we used the word counts for all words in the reviews to train the sentiment classifier model. Now, we are going to follow a similar path, but train a simpler logistic regression model using only a subset of 20 selected words.

4.1 Train a logistic regression model on a subset of words
In [20]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

As above, we compute word count vectors for the training and test data to obtain the corresponding sparse matrices.

In [21]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])
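
Because the vocabulary is passed in explicitly as a list, the columns of these matrices follow the order of significant_words, which is what lets us pair each word with its coefficient later on. A quick check (a sketch; get_feature_names_out requires a reasonably recent scikit-learn, older versions expose get_feature_names instead):

# The vectorizer's feature order should match the significant_words list
print(list(vectorizer_word_subset.get_feature_names_out()) == significant_words)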

Then, we build a logistic regression classifier on this subset of words.

In [22]:
simple_model = LogisticRegression().fit(train_matrix_word_subset, train_data['sentiment'])  # a separate estimator, so sentiment_model above is not overwritten

Let's inspect the weights (coefficients) of the simple_model by building a table of (word, coefficient) pairs.

In [23]:
simple_model_coef_table = pd.DataFrame({'word':significant_words,'coefficient':simple_model.coef_.flatten()})
simple_model_coef_table.sort_values(['coefficient'], ascending=False)
Out[23]:
word coefficient
6 loves 1.673074
5 perfect 1.509812
0 love 1.363690
2 easy 1.192538
1 great 0.944000
4 little 0.520186
7 well 0.503760
8 able 0.190909
3 old 0.085513
9 car 0.058855
11 less -0.209563
16 product -0.320556
18 would -0.362167
12 even -0.511380
15 work -0.621169
17 money -0.898031
10 broke -1.651576
13 waste -2.033699
19 return -2.109331
14 disappointed -2.348298

Considering the coefficients of simple_model, we count how many of them are positive.

In [24]:
print("The number of coefficients are positive in the simple_model:", np.sum(simple_model.coef_>0))
The number of coefficients are positive in the simple_model: 10
4.2 Compare accuracy between models

We will compare the accuracy of the sentiment_model and the simple_model on the training data to see which one has higher accuracy. Both models are already fitted, so we only need to evaluate their predictions.

In [25]:
print("The accuracy of sentiment_model:", accuracy_score(train_data['sentiment'], sentiment_model.predict(train_matrix)))
The accuracy of sentiment_model: 0.9684895364873778
In [26]:
print("The accuracy of simple_model:", accuracy_score(train_data['sentiment'], simple_model.predict(train_matrix_word_subset)))
The accuracy of simple_model: 0.8668225700065959
4.3 Majority class prediction

It is common to use the majority class classifier as a baseline (or reference) model for comparison with your classifier. The majority class classifier predicts the majority class for all data points. At the very least, your model should comfortably beat the majority class classifier; otherwise, it is (usually) pointless. We can calculate the majority class baseline as follows:

In [27]:
num_positive = (train_data['sentiment'] == +1).sum()
num_negative = (train_data['sentiment'] == -1).sum()
print("The number of positive sentiment:",num_positive)
print("The number of positive sentiment:", num_negative)
print("Majority class ratio:", num_positive / len(train_data))
The number of positive sentiment: 112164
The number of positive sentiment: 21252
Majority class ratio: 0.8407087605684476
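
To turn this into a baseline accuracy for comparison, note that the majority class classifier simply predicts +1 for every review, so its accuracy is just the fraction of positive labels in whatever data we evaluate it on. A sketch for the test set (the exact value depends on the split):

# Accuracy of always predicting the majority (positive) class on the test set
baseline_accuracy = (test_data['sentiment'] == +1).mean()
print("Majority class baseline accuracy on test data:", baseline_accuracy)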