The goal of this lesson is to explore logistic regression and feature engineering with scikit-learn functions. In the assignment, we will use product review data from Amazon.com to predict whether the sentiment expressed in a product review is positive or negative.
Learning objectives:
Before we start, import a few libraries and load the dataset of baby product reviews from Amazon.com. The data in amazon_baby.csv is stored in a data frame called products.
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression # public import path; the private sklearn.linear_model.logistic module is gone from recent releases
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
products = pd.read_csv('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week1/amazon_baby.csv')
products.head()
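The first step is to prepare suitable data for the analysis. Specifically, we will fill in missing reviews, remove punctuation from the review text, drop reviews with a neutral rating of 3, label the remaining reviews as +1 (rating above 3) or -1 (rating below 3), and split the data into training and test sets using the provided index files.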
products = products.fillna({'review':''}) # fill in N/A's in the review column
def remove_punctuation(text):
    import string
    return text.translate(str.maketrans('', '', string.punctuation))
products['review_clean'] = products['review'].apply(remove_punctuation)
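products = products[products['rating'] != 3].copy() # drop neutral 3-star reviews; .copy() so the new column below is added to an independent frame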
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)
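The cleaning steps above can be tried on a few rows before running them on the full dataset. The following is a minimal sketch on a hypothetical toy data frame (the reviews and ratings are invented for illustration and are not taken from amazon_baby.csv):

```python
import string

import pandas as pd

# Hypothetical toy reviews, standing in for the real amazon_baby.csv rows.
toy = pd.DataFrame({
    'review': ['Love it!!!', None, "Broke after 2 days... don't buy.", 'It is okay.'],
    'rating': [5, 4, 1, 3],
})

def remove_punctuation(text):
    # Delete every ASCII punctuation character from the string.
    return text.translate(str.maketrans('', '', string.punctuation))

toy = toy.fillna({'review': ''})                      # missing reviews become empty strings
toy['review_clean'] = toy['review'].apply(remove_punctuation)
toy = toy[toy['rating'] != 3].copy()                  # drop neutral 3-star reviews
toy['sentiment'] = toy['rating'].apply(lambda r: +1 if r > 3 else -1)

print(toy[['review_clean', 'rating', 'sentiment']])
```

Note that removing punctuation also removes apostrophes, so "don't" becomes "dont" in review_clean.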
import json
with open('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week1/test_data_idx.json') as test_data_file:
    test_data_idx = json.load(test_data_file)
with open('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week1/train_data_idx.json') as train_data_file:
    train_data_idx = json.load(train_data_file)
train_data = products.iloc[train_data_idx]
test_data = products.iloc[test_data_idx]
train_data.head(2)
test_data.head(2)
Build the word count vector for each review: we will now compute the word count for each word that appears in the reviews.
Create and Train the Model: here we create a LogisticRegression Object and use the .fit() method to finally train the model.
Making predictions with logistic regression: after a model is trained, we can make predictions on the test data.
Learn a vocabulary (set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction. Compute the occurrences of the words in each review and collect them into a row vector. Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix train_matrix. Using the same mapping between words and columns, convert the test data into a sparse matrix test_matrix.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
test_matrix = vectorizer.transform(test_data['review_clean'])
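To see what the vectorizer does, it may help to run the same calls on a tiny corpus. This is a sketch on hypothetical toy sentences (toy_train and toy_test are invented names and data); it shows that the vocabulary is learned from the training text only, and that a word seen only in the test text is ignored:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora; the lesson itself uses train_data['review_clean'] and test_data['review_clean'].
toy_train = ['great stroller great price', 'a waste of money']
toy_test = ['great money saver']   # 'saver' never appears in toy_train

toy_vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')  # also keeps one-letter words such as 'a'
toy_train_matrix = toy_vectorizer.fit_transform(toy_train)  # learns the vocabulary, then counts
toy_test_matrix = toy_vectorizer.transform(toy_test)        # reuses the same word-to-column mapping

print(sorted(toy_vectorizer.vocabulary_))
print(toy_train_matrix.toarray())
print(toy_test_matrix.toarray())
```

Each column corresponds to one vocabulary word in alphabetical order, so the test row has counts only for 'great' and 'money'.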
First, create a logistic regression classifier object using LogisticRegression() (imported at the top of the lesson). Then, fit the model on the training set using fit() and make predictions on the test set using predict().
logitreg = LogisticRegression()
sentiment_model = logitreg.fit(train_matrix, train_data['sentiment'])
print("Number of coefficients greater than zero:", np.sum(sentiment_model.coef_ > 0))
In this section, we will explore predictions in the context of three data points in the test data. Take the 11th, 12th, and 13th data points in the test data and save them to sample_test_data. The following cell extracts the three data points from test_data and prints their content:
sample_test_data = test_data[10:13]
sample_test_data
We will now make a class prediction for the sample_test_data. The sentiment_model should predict +1 if the sentiment is positive and -1 if the sentiment is negative. Recall that the score for the logistic regression model is defined as: $$ \text{score}_i = w^T h(x_i) $$ where $h(x_i)$ represents the features for example $i$.
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print(scores)
The scores calculated above can be used to make class predictions as follows: $$ y = \begin{cases} +1 & \quad \text{if } w^T h(x_i) > 0\\ -1 & \quad \text{if } w^T h(x_i) \leq 0 \end{cases} $$ We can make this prediction in scikit-learn by calling the predict() function.
print(sentiment_model.predict(sample_test_matrix))
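The threshold rule can also be applied by hand and compared with predict(). The sketch below uses a hypothetical toy model (the reviews and labels are invented for illustration); note that scikit-learn's decision_function() also adds the fitted intercept to $w^T h(x_i)$:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with labels in {-1, +1}, mirroring the lesson's sentiment column.
toy_reviews = ['love it', 'great product love', 'waste of money', 'broke and returned it']
toy_labels = np.array([+1, +1, -1, -1])

toy_vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
toy_matrix = toy_vectorizer.fit_transform(toy_reviews)
toy_model = LogisticRegression().fit(toy_matrix, toy_labels)

toy_scores = toy_model.decision_function(toy_matrix)     # w^T h(x_i) plus the intercept
manual_predictions = np.where(toy_scores > 0, +1, -1)    # apply the threshold rule by hand

print(manual_predictions)
print(toy_model.predict(toy_matrix))
```

Both printed arrays should agree, since predict() assigns the positive class exactly when the score is greater than zero.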
Recall that we can also calculate the probability predictions from the scores using the sigmoid function as follows: $$ P(y_i = +1 \mid x_i, w) = \frac{1}{1 + \exp(-w^T h(x_i))} $$
Using the variable scores calculated previously, we can write code to calculate the probability that a sentiment is positive using the above formula. For each row, the probability should be a number in the range [0, 1].
for i in range(len(sample_test_data)):
    prob = 1 / (1 + np.exp(-scores[i]))
    print('The probability that a sentiment is positive for observation', i + 1, 'is:', prob)
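The same calculation can be done for all rows at once and checked against predict_proba(). This is a minimal sketch on a hypothetical toy feature matrix (toy_X and toy_y are invented), assuming a binary LogisticRegression with default settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy feature matrix (two word counts per review) and labels.
toy_X = np.array([[2, 0], [1, 0], [0, 1], [0, 3], [1, 1]])
toy_y = np.array([+1, +1, -1, -1, +1])

toy_model = LogisticRegression().fit(toy_X, toy_y)
toy_scores = toy_model.decision_function(toy_X)

manual_probs = 1 / (1 + np.exp(-toy_scores))          # sigmoid applied to all scores at once
sklearn_probs = toy_model.predict_proba(toy_X)[:, 1]  # column 1 corresponds to classes_[1], i.e. +1

print(np.allclose(manual_probs, sklearn_probs))
```

Because NumPy applies np.exp element-wise, no explicit loop is needed.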
We now turn to examining the full test dataset, test_data, and use sklearn.linear_model.LogisticRegression to form predictions on all of the test data points. Using the sentiment_model, we will determine the 20 reviews in the entire test_data with the highest probability of being classified as a positive review, and the 20 reviews with the lowest probability.
test_matrix = vectorizer.transform(test_data['review_clean'])
test_predict_proba = sentiment_model.predict_proba(test_matrix)[:,1]
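result = {}
# note: reviews that share a product name overwrite each other, so each name keeps only its last probability
for name, proba in zip(test_data['name'], test_predict_proba):
    result[name] = proba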
Top 20 reviews with the highest probability of being classified as a positive review
sorted(result.items(), key=lambda x: x[1], reverse=True)[:20]
Top 20 reviews with the lowest probability of being classified as a positive review
sorted(result.items(), key=lambda x: x[1])[:20]
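A dictionary keyed by product name holds only one probability per name, so products with several reviews appear once. As an alternative sketch (not the approach used in this lesson), the rows can be ranked in a data frame so that every review stays separate; the names and probabilities below are hypothetical:

```python
import pandas as pd

# Hypothetical product names and probabilities; 'Crib A' has two reviews.
toy_names = ['Crib A', 'Bottle B', 'Crib A', 'Monitor C']
toy_probas = [0.99, 0.10, 0.35, 0.80]

ranked = pd.DataFrame({'name': toy_names, 'proba': toy_probas})

top_2 = ranked.nlargest(2, 'proba')       # highest probability of a positive review
bottom_2 = ranked.nsmallest(2, 'proba')   # lowest probability of a positive review

print(top_2)
print(bottom_2)
```

With the real data, the same idea would use test_data['name'], test_predict_proba, and 20 in place of 2.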
While there are other ways of measuring model performance (precision, recall, F1 score, ROC curve, etc.), we are going to keep this simple and use accuracy as our metric. To do this, we will see how the model performs on new data (the test set). Accuracy is defined as: $$ \text{accuracy} = \frac{\text{correctly classified examples}}{\text{total examples}}$$
print("Our accuracy was:", accuracy_score(test_data['sentiment'], sentiment_model.predict(test_matrix)))
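The accuracy formula can be checked by hand on a small example. The labels and predictions below are hypothetical; the sketch only illustrates that accuracy_score() computes the fraction of matching entries:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for six reviews.
y_true = np.array([+1, +1, -1, -1, +1, -1])
y_pred = np.array([+1, -1, -1, -1, +1, +1])

manual_accuracy = np.sum(y_true == y_pred) / len(y_true)  # correctly classified / total
print(manual_accuracy)
print(accuracy_score(y_true, y_pred))
```

Four of the six predictions match, so both lines print the same value of 4/6.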
In this part of the assignment, we evaluate the performance of models that use different sets of words. Above, we used the word counts for all words in the reviews to train the sentiment classifier model. Now, we follow a similar path, but train a simpler logistic regression model using only a subset of 20 selected words.
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves',
'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed',
'work', 'product', 'money', 'would', 'return']
As above, we compute word count vectors for the training and test data and obtain the corresponding sparse matrices.
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])
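Passing vocabulary= fixes the columns of the matrix in advance, in the order given, and every other word is dropped. A small sketch with a hypothetical three-word vocabulary and invented reviews:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-word vocabulary and reviews, standing in for significant_words.
toy_words = ['love', 'broke', 'money']
toy_subset_vectorizer = CountVectorizer(vocabulary=toy_words)

toy_counts = toy_subset_vectorizer.transform(['love love this toy', 'broke fast waste of money'])
print(toy_subset_vectorizer.vocabulary_)   # word -> column index, following the list order
print(toy_counts.toarray())
```

The result has one column per listed word, so the first review is counted as two occurrences of 'love' and nothing else.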
Then, we build a logistic regression classifier using only this subset of words.
simple_model = LogisticRegression().fit(train_matrix_word_subset, train_data['sentiment']) # separate estimator, so sentiment_model is not overwritten
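Because fit() returns the estimator it was called on, reusing one LogisticRegression object for two fits leaves both variable names pointing at the same model. The following sketch on hypothetical toy arrays illustrates this, and shows that separate estimators keep the two fitted models independent:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy matrices: the same four reviews with three features and with one feature.
toy_X_full = np.array([[1, 0, 2], [2, 1, 0], [0, 3, 1], [0, 2, 0]])
toy_X_subset = toy_X_full[:, :1]
toy_y = np.array([+1, +1, -1, -1])

shared = LogisticRegression()
model_a = shared.fit(toy_X_full, toy_y)
model_b = shared.fit(toy_X_subset, toy_y)    # refits the very same object
print(model_a is model_b)                    # both names point to one estimator
print(model_a.coef_.shape)                   # now reflects the one-feature fit

# Separate estimators keep the two fitted models independent.
full_model = LogisticRegression().fit(toy_X_full, toy_y)
subset_model = LogisticRegression().fit(toy_X_subset, toy_y)
print(full_model.coef_.shape, subset_model.coef_.shape)
```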
Let's inspect the weights (coefficients) of the simple_model by building a table to store (word, coefficient) pairs.
simple_model_coef_table = pd.DataFrame({'word':significant_words,'coefficient':simple_model.coef_.flatten()})
simple_model_coef_table.sort_values(['coefficient'], ascending=False)
Considering the coefficients of simple_model, we count how many of them are positive.
print("Number of positive coefficients in the simple_model:", np.sum(simple_model.coef_ > 0))
We will compare the accuracy of the sentiment_model and the simple_model using the training data to see which one has higher accuracy.
sentiment_model = logitreg.fit(train_matrix, train_data['sentiment'])
print("The accuracy of sentiment_model:", accuracy_score(train_data['sentiment'], sentiment_model.predict(train_matrix)))
simple_model = logitreg.fit(train_matrix_word_subset, train_data['sentiment'])
print("The accuracy of simple_model:", accuracy_score(train_data['sentiment'], simple_model.predict(train_matrix_word_subset)))
It is common to use the majority class classifier as a baseline (or reference) model for comparison with your classifier model. The majority class classifier predicts the majority class for all data points. At the very least, your model should comfortably beat the majority class classifier; otherwise, the model is (usually) pointless. We can calculate the accuracy of the majority class classifier as follows:
num_positive = (train_data['sentiment'] == +1).sum()
num_negative = (train_data['sentiment'] == -1).sum()
print("The number of positive sentiments:", num_positive)
print("The number of negative sentiments:", num_negative)
print("Majority class ratio:", max(num_positive, num_negative) / len(train_data)) # max() picks whichever class is actually the majority
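scikit-learn also provides DummyClassifier, which can play the role of the majority class classifier. The sketch below is an alternative to the manual count above, shown on hypothetical toy labels rather than the lesson's data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 4 positive and 2 negative training reviews, 3 and 1 in the test set.
toy_train_y = np.array([+1, +1, +1, +1, -1, -1])
toy_test_y = np.array([+1, -1, +1, +1])

# Features are ignored by this strategy, so placeholder zeros are enough.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(np.zeros((len(toy_train_y), 1)), toy_train_y)

baseline_predictions = baseline.predict(np.zeros((len(toy_test_y), 1)))
print(baseline_predictions)
print(accuracy_score(toy_test_y, baseline_predictions))
```

The baseline predicts +1 for every test review, so its accuracy equals the share of positive reviews in the toy test labels.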