Overfitting is a common problem in machine learning: a model performs well on training data but generalizes poorly to unseen (test) data. Techniques such as cross-validation, regularization, early stopping, pruning, and Bayesian priors help avoid it. Regularization seeks a good bias-variance tradeoff by constraining the complexity of the model. It is useful for handling collinearity (high correlation among features) and filtering out noise from data, and thereby for reducing overfitting.

In this assignment, we aim to implement the logistic regression classifier with L2 regularization.

Learning objectives:

- Write a function to compute the derivative of the log likelihood function with an L2 penalty with respect to a single coefficient.
- Implement gradient ascent with an L2 penalty.
- Empirically explore how the L2 penalty can ameliorate overfitting.

Before we start, let's import libraries and load review data!

In [1]:

```
import pandas as pd
import numpy as np
```

In [2]:

```
products = pd.read_csv('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week2/amazon_baby_subset.csv')
products.head(5)
```

Out[2]:

Again, we perform some simple feature cleaning using data frames. As in the last assignment, we limit ourselves to 193 words (for simplicity). We compiled a list of the 193 most frequent words into a JSON file named important_words.json; we load these words into a list called important_words.

We will perform two simple data transformations: filling in N/A's in the review column and removing punctuation.

In [3]:

```
products = products.fillna({'review':''})
```

In [4]:

```
import string

def remove_punctuation(text):
    return text.translate(str.maketrans('','',string.punctuation))

products['review_clean'] = products['review'].apply(remove_punctuation)
```

For each word in important_words, we compute a count for the number of times the word occurs in the review. We will store this count in a separate column (one for each word). The result of this feature processing is a single column for each word in important_words which keeps a count of the number of times the respective word occurs in the review text.

In [5]:

```
import json

with open('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week2/important_words.json') as important_words_file:
    important_words = json.load(important_words_file)

# Add one count column per important word
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
```
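To make the two cleaning steps and the counting step concrete, here is a minimal sketch on a toy frame. The two reviews and the three-word list are hypothetical and exist only for illustration; they are not taken from the Amazon data.

```python
import string
import pandas as pd

# Hypothetical two-review frame and word list, used only for illustration.
toy = pd.DataFrame({'review': ["Great stroller, great price!", None]})
toy_words = ['great', 'price', 'bad']

# Step 1: replace missing reviews with empty strings.
toy = toy.fillna({'review': ''})

# Step 2: strip ASCII punctuation.
toy['review_clean'] = toy['review'].apply(
    lambda text: text.translate(str.maketrans('', '', string.punctuation)))

# Step 3: one count column per word (whitespace tokens, case-sensitive).
for word in toy_words:
    toy[word] = toy['review_clean'].apply(lambda s: s.split().count(word))

print(toy[toy_words])
```

Note that the count is case-sensitive: the capitalized "Great" at the start of the first review is not counted as an occurrence of "great".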

We split the data into training and validation sets, with 80% of the data in the training set and 20% in the validation set. The row indices for each set are loaded from JSON files.

In [6]:

```
with open('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week2/validation_data_idx.json') as validation_data_file:
    validation_data_idx = json.load(validation_data_file)
with open('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week2/train_data_idx.json') as train_data_file:
    train_data_idx = json.load(train_data_file)

validation_data = products.iloc[validation_data_idx]
train_data = products.iloc[train_data_idx]
```
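The notebook loads precomputed index lists, and how those lists were generated is not shown. As a rough sketch of one way an 80/20 split of row positions could be produced, assuming a simple random shuffle (the row count and seed below are hypothetical):

```python
import numpy as np

# Assumption: a reproducible random shuffle; n_rows and the seed are made up.
n_rows = 10
rng = np.random.default_rng(seed=0)
shuffled = rng.permutation(n_rows)

# First 80% of the shuffled positions go to training, the rest to validation.
n_train = int(0.8 * n_rows)
train_idx = sorted(shuffled[:n_train].tolist())
validation_idx = sorted(shuffled[n_train:].tolist())

print(len(train_idx), len(validation_idx))
```

Lists like these can be passed to `DataFrame.iloc` in the same way as the loaded index files.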

The remainder of this assignment covers three steps:

- Convert the data frame to a multi-dimensional array
- Add the L2 penalty to the derivative
- Explore the effects of L2 regularization

We will write a function that extracts columns from a data frame and returns two arrays: one 2D array for features and one 1D array for class labels.

In [7]:

```
def get_numpy_data(dataframe, features, label):
    # Prepend a constant column so that coefficient 0 acts as the intercept
    dataframe['constant'] = 1
    features = ['constant'] + features
    features_frame = dataframe[features]
    # to_numpy() returns the underlying NumPy array
    feature_matrix = features_frame.to_numpy()
    label_sarray = dataframe[label]
    label_array = label_sarray.to_numpy()
    return(feature_matrix, label_array)
```
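As a quick illustration of the shapes this conversion produces, the sketch below runs a variant of the function on a hypothetical two-row frame. The variant copies its input first, which is a choice made for this example so the toy frame is left unchanged; the column names and values are invented.

```python
import pandas as pd

def get_numpy_data(dataframe, features, label):
    # Work on a copy so the caller's frame is not modified (sketch-only choice).
    dataframe = dataframe.copy()
    dataframe['constant'] = 1
    features = ['constant'] + features
    feature_matrix = dataframe[features].to_numpy()
    label_array = dataframe[label].to_numpy()
    return feature_matrix, label_array

# Hypothetical toy data: two word-count features and a +1/-1 label.
toy = pd.DataFrame({'great': [2, 0], 'bad': [0, 1], 'sentiment': [1, -1]})
X, y = get_numpy_data(toy, ['great', 'bad'], 'sentiment')

print(X.shape, y.shape)
```

The feature matrix has one more column than the number of requested features, because the constant column is placed first.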

Using the function written above, we convert train_data and validation_data into multi-dimensional arrays.

In [8]:

```
feature_matrix_train, sentiment_train = get_numpy_data(train_data, important_words, 'sentiment')
feature_matrix_valid, sentiment_valid = get_numpy_data(validation_data, important_words, 'sentiment')
```

To apply regularization to our logistic regression, we just need to add the L2 penalty term to the per-coefficient derivative of the log likelihood: $$ \frac{\partial l}{\partial w_j} = \displaystyle\sum_{i=1}^{N} h_j(x_i)\big(1[y_i = +1]-P(y_i=+1|x_i ,w)\big)-2\lambda w_j $$ Via the regularization parameter $\lambda$, we can then control how well we fit the training data while keeping the weights small. By increasing the value of $\lambda$, we increase the regularization strength.

First, we write a function to calculate the conditional probability. The function should take two parameters: feature_matrix and coefficients. Then,
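A short derivation shows where the extra term comes from. It is sketched here under the assumption that the regularized objective is the log likelihood minus $\lambda \|w\|_2^2$. Only the penalty depends on $\lambda$, and differentiating it with respect to $w_j$ gives:

$$ \frac{\partial}{\partial w_j}\Big(-\lambda \sum_{k} w_k^2\Big) = -2\lambda w_j $$

Every term of the sum with $k \neq j$ vanishes, so the penalty pulls each coefficient toward zero in proportion to its current value.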

- compute the dot product of feature_matrix and coefficients.
- compute the sigmoid function P(y=+1|x,w).

The function returns the predictions given by the sigmoid function.

In [9]:

```
def predict_probability(feature_matrix, coefficients):
    score = np.dot(feature_matrix,coefficients.transpose())
    predictions = 1/(1+np.exp(-score))
    return predictions
```
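A small sanity check can make the behavior of the sigmoid concrete. The sketch below restates the function so it runs on its own and applies it to a hypothetical two-row feature matrix; the numbers are invented for illustration.

```python
import numpy as np

def predict_probability(feature_matrix, coefficients):
    score = np.dot(feature_matrix, coefficients)
    return 1 / (1 + np.exp(-score))

# Hypothetical inputs: the first column is the constant feature.
X = np.array([[1.0, 2.0],
              [1.0, -1.0]])

# With all-zero coefficients every score is 0, so every probability is 0.5.
probs_zero = predict_probability(X, np.array([0.0, 0.0]))

# With a positive weight on the second feature, a positive feature value
# pushes the probability above 0.5 and a negative one pushes it below.
probs = predict_probability(X, np.array([0.0, 1.0]))

print(probs_zero, probs)
```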

We now write a function to compute the derivative of the log likelihood with respect to a single coefficient $w_j$. The function accepts five parameters (errors, feature, coefficient, l2_penalty, feature_is_constant) and returns the derivative. The l2_penalty parameter is the L2 penalty constant. The feature_is_constant parameter is a Boolean value indicating whether the j-th feature is the constant (intercept) feature, which is not penalized.

In [10]:

```
def feature_derivative_with_L2(errors, feature, coefficient, l2_penalty, feature_is_constant):
    derivative = np.dot(errors.transpose(),feature)
    if not feature_is_constant:
        derivative -= 2*l2_penalty*coefficient
    return derivative
```
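To see the effect of the feature_is_constant flag, here is a small worked example. The function is restated so the block runs on its own, and the error vector, feature column, coefficient, and penalty are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np

def feature_derivative_with_L2(errors, feature, coefficient, l2_penalty, feature_is_constant):
    derivative = np.dot(errors, feature)
    if not feature_is_constant:
        derivative -= 2 * l2_penalty * coefficient
    return derivative

# Hypothetical inputs.
errors = np.array([0.5, -0.5, 0.25])
feature = np.array([1.0, 2.0, 4.0])

# Unpenalized part: 0.5*1 - 0.5*2 + 0.25*4 = 0.5
d_intercept = feature_derivative_with_L2(errors, feature, 3.0, 0.1, True)

# Penalized: 0.5 - 2*0.1*3.0 = -0.1
d_penalized = feature_derivative_with_L2(errors, feature, 3.0, 0.1, False)

print(d_intercept, d_penalized)
```

With the same errors and feature values, the penalty alone is enough here to flip the sign of the derivative, so the update would shrink the coefficient rather than grow it.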

To verify the correctness of the gradient ascent algorithm, we write a function for computing the log likelihood. As in the last assignment, we use a rearranged form of the log likelihood (derived in an advanced optional video) for its numerical stability: $$ ll(w) = \displaystyle\sum_{i=1}^{N} \Big((1[y_i = +1]-1)w^T h(x_i)-\ln(1+\exp(-w^T h(x_i)))\Big)-\lambda \|w\|_2^2 $$ In the code, the penalty sum excludes the intercept $w_0$.

In [11]:

```
def compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty):
    indicator = (sentiment==+1)
    scores = np.dot(feature_matrix, coefficients)
    lp = np.sum((indicator-1)*scores - np.log(1. + np.exp(-scores))) - l2_penalty*np.sum(coefficients[1:]**2)
    return lp
```
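One way to check that the derivative and the log likelihood agree is a finite-difference test: perturb each coefficient slightly, measure the change in the log likelihood, and compare it with the analytic gradient. The sketch below is a self-contained version of that idea; the helper names, the random toy data, the penalty value, and the step `eps` are all assumptions made for this example.

```python
import numpy as np

def log_likelihood_l2(X, y, w, l2):
    indicator = (y == +1)
    scores = X.dot(w)
    ll = np.sum((indicator - 1) * scores - np.log(1.0 + np.exp(-scores)))
    return ll - l2 * np.sum(w[1:] ** 2)  # intercept is not penalized

def gradient_l2(X, y, w, l2):
    predictions = 1 / (1 + np.exp(-X.dot(w)))
    errors = (y == +1) - predictions
    grad = X.T.dot(errors)
    grad[1:] -= 2 * l2 * w[1:]  # intercept is not penalized
    return grad

# Hypothetical random problem: constant column plus three features.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
y = np.where(rng.random(20) < 0.5, 1, -1)
w = rng.normal(size=4)
l2, eps = 0.7, 1e-6

# Central differences along each coordinate direction.
numeric = np.array([
    (log_likelihood_l2(X, y, w + eps * e, l2)
     - log_likelihood_l2(X, y, w - eps * e, l2)) / (2 * eps)
    for e in np.eye(4)
])
analytic = gradient_l2(X, y, w, l2)
max_abs_diff = np.max(np.abs(numeric - analytic))

print(max_abs_diff)
```

A small maximum difference suggests the two formulas are consistent; a large one usually points to a sign error or a mismatch in how the intercept is treated.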

We now write a function to fit a logistic regression model under L2 regularization. The function should accept the following parameters:

- feature_matrix: 2D array of features
- sentiment: 1D array of class labels
- initial_coefficients: 1D array containing initial values of coefficients
- step_size: a parameter controlling the size of the gradient steps
- l2_penalty: the L2 penalty constant $\lambda$
- max_iter: number of iterations to run gradient ascent

The function returns the last set of coefficients after performing gradient ascent.

The function carries out the following steps:

1. Initialize vector coefficients to initial_coefficients.
2. Predict the class probability $P(y_i=+1|x_i,w)$ using your predict_probability function and save it to variable predictions.
3. Compute the indicator value for $(y_i = +1)$ by comparing sentiment against +1. Save it to variable indicator.
4. Compute the errors as the difference between indicator and predictions. Save the errors to variable errors.
5. For each j-th coefficient, compute the per-coefficient derivative by calling feature_derivative_with_L2 with the j-th column of feature_matrix. Don't forget to supply the L2 penalty. Then increment the j-th coefficient by (step_size*derivative).
6. Once in a while, insert code to print out the log likelihood.
7. Repeat steps 2-6 max_iter times.

In [12]:

```
def logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, l2_penalty, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array
    for itr in range(max_iter):
        predictions = predict_probability(feature_matrix, coefficients)
        indicator = (sentiment==+1)
        errors = indicator - predictions
        for j in range(len(coefficients)): # loop over each coefficient
            is_intercept = (j == 0)
            derivative = feature_derivative_with_L2(errors, feature_matrix[:,j], coefficients[j], l2_penalty, is_intercept)
            coefficients[j] = coefficients[j] + step_size*derivative
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
                or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty)
            print('iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp))
    return coefficients
```
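Because the errors are computed once per iteration and each derivative depends only on its own coefficient, the per-coefficient loop can also be written as a single vector update. The following is a minimal sketch of that variant, not the notebook's implementation: the function name, the omission of progress printing, and the toy data are choices made for this example.

```python
import numpy as np

def logistic_regression_with_L2_vectorized(feature_matrix, sentiment, initial_coefficients,
                                           step_size, l2_penalty, max_iter):
    coefficients = np.array(initial_coefficients, dtype=float)
    indicator = (sentiment == +1)
    for _ in range(max_iter):
        predictions = 1 / (1 + np.exp(-feature_matrix.dot(coefficients)))
        errors = indicator - predictions
        # Gradient of the log likelihood for all coefficients at once.
        gradient = feature_matrix.T.dot(errors)
        # L2 penalty on every coefficient except the intercept (index 0).
        gradient[1:] -= 2 * l2_penalty * coefficients[1:]
        coefficients += step_size * gradient
    return coefficients

# Hypothetical toy problem: constant column plus two random features.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))
X = np.hstack([np.ones((50, 1)), x])
y = np.where(x[:, 0] + 0.5 * rng.normal(size=50) > 0, 1, -1)

w_free = logistic_regression_with_L2_vectorized(X, y, np.zeros(3), 0.01, 0.0, 200)
w_reg = logistic_regression_with_L2_vectorized(X, y, np.zeros(3), 0.01, 10.0, 200)

print(np.linalg.norm(w_free[1:]), np.linalg.norm(w_reg[1:]))
```

On this toy problem the penalized run ends with a smaller weight norm than the unpenalized run, which is the shrinkage effect the penalty is meant to produce.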

We explore the benefits of using L2 regularization by training models with 6 different L2 penalty values: 0, 4, 10, 1e2, 1e3, and 1e5.

In [13]:

```
feature_matrix = feature_matrix_train
sentiment = sentiment_train
initial_coefficients = np.zeros(194)
step_size = 5e-6
max_iter = 501
```

In [14]:

```
### L2_penalty = 0
coefficients_0_penalty = logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, 0, max_iter)
```

In [15]:

```
### L2_penalty = 4
coefficients_4_penalty = logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, 4, max_iter)
```

In [16]:

```
### L2_penalty = 10
coefficients_10_penalty = logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, 10, max_iter)
```

In [17]:

```
### L2_penalty = 1e2
coefficients_1e2_penalty = logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, 1e2, max_iter)
```

In [18]:

```
### L2_penalty = 1e3
coefficients_1e3_penalty = logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, 1e3, max_iter)
```

In [19]:

```
### L2_penalty = 1e5
coefficients_1e5_penalty = logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, 1e5, max_iter)
```

We compare the coefficients for each of the models that were trained above. We will create a table of features and learned coefficients associated with each of the different L2 penalty values.

In [20]:

```
table = pd.DataFrame({'word': ['(intercept)'] + important_words})

def add_coefficients_to_table(coefficients, column_name):
    table[column_name] = coefficients
    return table
```

In [21]:

```
add_coefficients_to_table(coefficients_0_penalty, 'coefficients [L2=0]')
add_coefficients_to_table(coefficients_4_penalty, 'coefficients [L2=4]')
add_coefficients_to_table(coefficients_10_penalty, 'coefficients [L2=10]')
add_coefficients_to_table(coefficients_1e2_penalty, 'coefficients [L2=1e2]')
add_coefficients_to_table(coefficients_1e3_penalty, 'coefficients [L2=1e3]')
add_coefficients_to_table(coefficients_1e5_penalty, 'coefficients [L2=1e5]')
```

Out[21]:

Using the coefficients trained with L2 penalty 0, we determine the 5 most positive and 5 most negative words (those with the largest positive/negative coefficients).

In [22]:

```
### the 5 most positive words
positive_words = list((table.sort_values('coefficients [L2=0]', ascending = False)[0:5][['word']]).values.flatten())
positive_words
```

Out[22]:

In [23]:

```
### the 5 most negative words
negative_words = list((table.sort_values('coefficients [L2=0]', ascending = True)[0:5][['word']]).values.flatten())
negative_words
```

Out[23]:
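The same selection can also be expressed with pandas' `nlargest` and `nsmallest`, which avoid sorting the whole table. The sketch below uses a hypothetical toy table; the words and coefficient values are placeholders, not results from the trained models.

```python
import pandas as pd

# Hypothetical toy table; values are placeholders, not notebook results.
toy_table = pd.DataFrame({
    'word': ['(intercept)', 'love', 'great', 'broke', 'returned'],
    'coefficients [L2=0]': [0.1, 0.9, 0.7, -0.6, -0.8],
})

# Two largest and two smallest coefficients, returned in ranked order.
top_positive = toy_table.nlargest(2, 'coefficients [L2=0]')['word'].tolist()
top_negative = toy_table.nsmallest(2, 'coefficients [L2=0]')['word'].tolist()

print(top_positive, top_negative)
```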

We observe the effect of increasing L2 penalty on the 10 words selected above by making a plot of the coefficients for the 10 words over the different values of L2 penalty.

In [24]:

```
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 6

def make_coefficient_plot(table, positive_words, negative_words, l2_penalty_list):
    cmap_positive = plt.get_cmap('Reds')
    cmap_negative = plt.get_cmap('Blues')

    xx = l2_penalty_list
    plt.plot(xx, [0.]*len(xx), '--', lw=1, color='k')

    # Index by word so each curve is looked up by the word used as its label
    # (assumes the table holds 'word' plus one coefficient column per penalty)
    coefficients_by_word = table.set_index('word')

    for i, word in enumerate(positive_words):
        color = cmap_positive(0.8*((i+1)/(len(positive_words)*1.2)+0.15))
        plt.plot(xx, coefficients_by_word.loc[word].to_numpy().flatten(),
                 '-', label=word, linewidth=4.0, color=color)

    for i, word in enumerate(negative_words):
        color = cmap_negative(0.8*((i+1)/(len(negative_words)*1.2)+0.15))
        plt.plot(xx, coefficients_by_word.loc[word].to_numpy().flatten(),
                 '-', label=word, linewidth=4.0, color=color)

    plt.legend(loc='best', ncol=3, prop={'size':16}, columnspacing=0.5)
    plt.axis([1, 1e5, -1, 2])
    plt.title('Coefficient path')
    plt.xlabel(r'L2 penalty ($\lambda$)')
    plt.ylabel('Coefficient value')
    plt.xscale('log')
    plt.rcParams.update({'font.size': 18})
    plt.tight_layout()

make_coefficient_plot(table, positive_words, negative_words, l2_penalty_list=[0, 4, 10, 1e2, 1e3, 1e5])
```