Maximum Likelihood Estimation (MLE) is a frequentist approach for estimating the parameters of a model given some observed data. The general approach for using MLE is: i) observe some data; ii) write down a model for how we believe the data were generated; iii) set the parameters of our model to the values that maximize the likelihood of the parameters given the data.
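As a quick standalone illustration of this recipe (using made-up coin flips, not the review data below), the likelihood of a Bernoulli parameter p given n flips with k heads is p^k (1-p)^(n-k), which is maximized at the sample proportion k/n; a small numerical sketch confirms this:
# Illustration only: MLE for a Bernoulli parameter on made-up coin flips (1 = heads)
import numpy as np
flips = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
n, k = len(flips), flips.sum()
p_grid = np.linspace(0.01, 0.99, 99)
log_likelihood = k*np.log(p_grid) + (n - k)*np.log(1 - p_grid)
print("MLE found by grid search:", p_grid[np.argmax(log_likelihood)])
print("Sample proportion k/n:   ", k/n)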
In this assignment, we’re going to cover the basics of creating models, understanding likelihood, and using the Maximum Likelihood Estimation process for fitting our parameters.
Learning objectives: build a simple model of review sentiment, understand the (log-)likelihood, and fit the model's parameters by maximum likelihood using gradient ascent.
Before we start, let's import libraries and load review data!
import pandas as pd
import numpy as np
For this assignment, we will use a subset of the Amazon product review dataset. The subset was chosen to contain similar numbers of positive and negative reviews, as the original dataset consisted primarily of positive reviews.
products = pd.read_csv('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week2/amazon_baby_subset.csv')
products.head(5)
Let us quickly explore more of this dataset.
products.describe()
products[products['sentiment']>0].count()
# Frequency table of the variable 'rating'
freq_table = products['rating'].value_counts().to_frame('count')
freq_table['percent'] = freq_table['count']/freq_table['count'].sum()
freq_table
We will perform some simple feature cleaning using data frames. The last assignment used all words when building bag-of-words features, but here we limit ourselves to 193 words (for simplicity). The 193 most frequent words have been compiled into the JSON file named important_words.json; load the words into a list important_words.
We will perform two simple data transformations: filling in N/A's in the review column and removing punctuation.
products = products.fillna({'review':''})
def remove_punctuation(text):
import string
return text.translate(str.maketrans('','',string.punctuation))
products['review_clean'] = products['review'].apply(remove_punctuation)
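As a quick check of the cleaning step, calling the function on a made-up review string (hypothetical example) should return the same text with the punctuation characters stripped out:
# Illustration only: punctuation characters are removed, words are left intact
print(remove_punctuation("Great product!!! My baby loves it :)"))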
For each word in important_words, we compute a count for the number of times the word occurs in the review. We will store this count in a separate column (one for each word). The result of this feature processing is a single column for each word in important_words which keeps a count of the number of times the respective word occurs in the review text.
As a result, the data frame products should contain one column for each of the 193 important_words. For example, the column perfect contains a count of the number of times the word perfect occurs in each of the reviews.
import json
with open('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week2/important_words.json') as f:
    important_words = json.load(f)
for word in important_words:
products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
products.head(2)
Calculate the number of product reviews that contain the word perfect
products['contains_perfect'] = np.where(products['perfect'] >= 1, 1, 0)
products['contains_perfect'].sum()
We will write a function that extracts columns from a data frame and converts them into a multi-dimensional array. The function should accept three parameters: the data frame, a list of feature names, and the name of the label column.
The function should return two values: one 2D array for features and one 1D array for class labels.
def get_numpy_data(dataframe, features, label):
    dataframe['constant'] = 1                    # add a constant column for the intercept
    features = ['constant'] + features
    features_frame = dataframe[features]
    feature_matrix = features_frame.to_numpy()   # as_matrix() was removed from recent pandas
    label_array = dataframe[label].to_numpy()
    return (feature_matrix, label_array)
Using the function we just wrote, we extract two arrays, feature_matrix and sentiment.
feature_matrix, sentiment = get_numpy_data(products, important_words, 'sentiment')
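As an optional sanity check, the feature matrix should have one row per review and 194 columns (the constant plus the 193 important words), and sentiment should be a 1D array of +1/-1 labels:
# Optional sanity check on the extracted arrays
print(feature_matrix.shape)    # expected: (number of reviews, 194)
print(np.unique(sentiment))    # expected: [-1  1]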
The link function (sigmoid) is given by: $$P(y_i=+1 \mid x_i, w) = \frac{1}{1+\exp(-w^T h(x_i))}$$
We will write a function to calculate this conditional probability. The function should take two parameters, feature_matrix and coefficients, and return the predictions given by the sigmoid function.
def predict_probability(feature_matrix, coefficients):
score = np.dot(feature_matrix,coefficients.transpose())
predictions = 1/(1+np.exp(-score))
return predictions
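To verify predict_probability behaves as expected, we can call it on a tiny made-up feature matrix and coefficient vector (the numbers below are arbitrary and only for illustration):
# Illustration only: two dummy data points and three dummy coefficients
dummy_feature_matrix = np.array([[1., 2., 3.], [1., -1., -1.]])
dummy_coefficients = np.array([1., 3., -1.])
# scores are [4, -1], so the probabilities should be 1/(1+exp(-4)) ≈ 0.982 and 1/(1+exp(1)) ≈ 0.269
print(predict_probability(dummy_feature_matrix, dummy_coefficients))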
We will write a function feature_derivative that computes the derivative of the log likelihood with respect to a single coefficient w_j. The function should take two parameters, errors and feature, and return the dot product of the two.
The derivative of the (log-)likelihood with respect to a single coefficient is given by: $$ \frac{\partial l}{\partial w_j} = \displaystyle\sum_{i=1}^{N} h_j(x_i)\Big(1[y_i = +1]-P(y_i=+1 \mid x_i, w)\Big)$$
def feature_derivative(errors, feature):
derivative = np.dot(errors.transpose(),feature)
return derivative
We will also write a function to calculate the log-likelihood using the following formula: $$ ll(w) = \displaystyle\sum_{i=1}^{N} \Big((1[y_i = +1]-1)\,w^T h(x_i)-\ln\big(1+\exp(-w^T h(x_i))\big)\Big) $$
def compute_log_likelihood(feature_matrix, sentiment, coefficients):
indicator = (sentiment==+1)
scores = np.dot(feature_matrix, coefficients)
lp = np.sum((indicator-1)*scores - np.log(1. + np.exp(-scores)))
return lp
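Again, a quick check with the same made-up inputs (illustration only) helps confirm the implementation; with scores [4, -1] and labels [+1, -1], the log likelihood works out to roughly -0.331:
# Illustration only: check compute_log_likelihood on tiny made-up data
dummy_feature_matrix = np.array([[1., 2., 3.], [1., -1., -1.]])
dummy_coefficients = np.array([1., 3., -1.])
dummy_sentiment = np.array([+1, -1])
# expected: (0*4 - ln(1+e^-4)) + (1 - ln(1+e^1)) ≈ -0.3314
print(compute_log_likelihood(dummy_feature_matrix, dummy_sentiment, dummy_coefficients))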
Now we are ready to implement our own logistic regression. All we have to do is write a gradient ascent function that takes gradient steps towards the optimum. We will write a function that fits a logistic regression model using gradient ascent. The function should accept the following parameters: feature_matrix, sentiment, initial_coefficients, step_size, and max_iter.
The function should carry out the following steps: initialize the coefficient vector; then, for each of max_iter iterations, compute the predicted probabilities, compute the errors as the difference between the indicator 1[y_i = +1] and the predictions, update every coefficient using its derivative scaled by step_size, and periodically print the log likelihood to monitor progress.
from math import sqrt
def logistic_regression(feature_matrix, sentiment, initial_coefficients, step_size, max_iter):
    coefficients = np.array(initial_coefficients)  # make sure it's a numpy array
    for itr in range(max_iter):
        # Predict P(y_i = +1 | x_i, w) using the current coefficients
        predictions = predict_probability(feature_matrix, coefficients)
        # Compute the errors as indicator - predictions
        indicator = (sentiment == +1)
        errors = indicator - predictions
        # Take a gradient ascent step for each coefficient
        for j in range(len(coefficients)):
            derivative = feature_derivative(errors, feature_matrix[:, j])
            coefficients[j] = coefficients[j] + step_size * derivative
        # Periodically print the log likelihood to monitor convergence
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
                or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood(feature_matrix, sentiment, coefficients)
            print('iteration %*d: log likelihood of observed labels = %.8f' % \
                  (int(np.ceil(np.log10(max_iter))), itr, lp))
    return coefficients
Now, let's try to run the logistic regression solver with the parameters below:
initial_coefficients = np.zeros(194)
step_size = 1e-7
max_iter = 301
coef = logistic_regression(feature_matrix, sentiment, initial_coefficients, step_size, max_iter)
Using the coefficients estimated above, we compute class predictions. We do this in two steps: first, compute the scores as a dot product between feature_matrix and coefficients; second, assign class +1 to each review whose score is non-negative and class -1 otherwise.
# Compute the scores as a dot product between feature_matrix and coefficients.
scores = np.dot(feature_matrix, coef)
class_pred = np.zeros_like(scores)
for i in range(feature_matrix.shape[0]):
if scores[i]<0:
class_pred[i] = -1
else:
class_pred[i] = +1
print("The number of positive sentiment:", np.sum(class_pred>0))
We now measure how well the model fits the data it was trained on. Accuracy is defined as: $$ \text{accuracy} = \frac{\text{correctly classified examples}}{\text{total examples}}$$
from sklearn.metrics import accuracy_score
print("Our accuracy was:", accuracy_score(sentiment, class_pred))
coefficients = list(coef[1:]) # exclude intercept
word_coefficient_tuples = [(word, coefficient) for word, coefficient in zip(important_words, coefficients)]
word_coefficient_tuples = sorted(word_coefficient_tuples, key=lambda x:x[1], reverse=True)
word_coefficient_tuples
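Because the tuples are sorted by coefficient in decreasing order, the head of the list contains the words most strongly associated with positive sentiment and the tail the words most associated with negative sentiment; for example, we can print the ten most extreme words at each end:
# Ten most positive and ten most negative words by learned coefficient
print("Most positive words:", [word for word, weight in word_coefficient_tuples[:10]])
print("Most negative words:", [word for word, weight in word_coefficient_tuples[-10:]])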