Overfitting and Regularization in Logistic Regression

Overfitting is a common problem in machine learning: a model performs well on the training data but does not generalize to unseen (test) data. To avoid overfitting, we need additional techniques such as cross-validation, regularization, early stopping, pruning, or Bayesian priors. Regularization finds a good bias-variance tradeoff by tuning the complexity of the model. It is a very useful way to handle collinearity (high correlation among features), filter out noise, and ultimately prevent overfitting.

In this assignment, we aim to implement the logistic regression classifier with L2 regularization.

Learning objectives:

  1. Write a function to compute the derivative of the log likelihood function with an L2 penalty with respect to a single coefficient.
  2. Implement gradient ascent with an L2 penalty.
  3. Empirically explore how the L2 penalty can ameliorate overfitting.

Before we start, let's import libraries and load review data!

In [1]:
import pandas as pd
import numpy as np
In [2]:
products = pd.read_csv('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week2/amazon_baby_subset.csv')
products.head(5)
Out[2]:
name review rating sentiment
0 Stop Pacifier Sucking without tears with Thumb... All of my kids have cried non-stop when I trie... 5 1
1 Nature's Lullabies Second Year Sticker Calendar We wanted to get something to keep track of ou... 5 1
2 Nature's Lullabies Second Year Sticker Calendar My daughter had her 1st baby over a year ago. ... 5 1
3 Lamaze Peekaboo, I Love You One of baby's first and favorite books, and it... 4 1
4 SoftPlay Peek-A-Boo Where's Elmo A Children's ... Very cute interactive book! My son loves this ... 5 1

1. Data Set-up

Again, we perform some simple feature cleaning using data frames. As in the last assignment, we limit ourselves to 193 words (for simplicity). The 193 most frequent words are compiled in the JSON file important_words.json, which we load into the list important_words.

1.1 Text cleaning

We will perform two simple data transformations: filling in N/A values in the review column and removing punctuation.

In [3]:
products = products.fillna({'review':''})
In [4]:
def remove_punctuation(text):
    import string
    return text.translate(str.maketrans('','',string.punctuation)) 

products['review_clean'] = products['review'].apply(remove_punctuation)
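
As a quick sanity check, the cleaner should strip punctuation while leaving words and whitespace intact (the example string below is made up, not taken from the data):

# Hypothetical example, not a review from the dataset
sample = "Great product!!! My son loves it -- highly recommended."
remove_punctuation(sample)
# 'Great product My son loves it  highly recommended'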
1.2 Compute word counts (only for important_words)

For each word in important_words, we compute a count for the number of times the word occurs in the review. We will store this count in a separate column (one for each word). The result of this feature processing is a single column for each word in important_words which keeps a count of the number of times the respective word occurs in the review text.

In [5]:
import json
with open('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week2/important_words.json') as important_words:    
    important_words = json.load(important_words)
    
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
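
Since the counting logic is just str.split followed by list.count, matching is exact at the token level; a toy review (made up for illustration) behaves as follows:

# Toy review, for illustration only
toy_review = 'i love this book and my son loves it too'
print(toy_review.split().count('love'))   # 1 -- counts only the exact token 'love'
print(toy_review.split().count('loves'))  # 1 -- 'loves' is counted separately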
1.3 Train-Validation split

We split the data into a train-validation split with 80% of the data in the training set and 20% of the data in the validation set.

In [6]:
with open('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week2/validation_data_idx.json') as validation_data_file:    
    validation_data_idx = json.load(validation_data_file)
with open('C:/Users/tn00230/OneDrive - University of Surrey/Python/Data/Week2/train_data_idx.json') as train_data_file:    
    train_data_idx = json.load(train_data_file)

validation_data = products.iloc[validation_data_idx]
train_data = products.iloc[train_data_idx]
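
The index files pin down the exact split used in the assignment so that results are reproducible. If they were not available, a random 80/20 split could be drawn instead; a minimal sketch (not the split used here):

# Hypothetical alternative: a random 80/20 split (NOT the assignment's fixed split)
np.random.seed(0)
shuffled_idx = np.random.permutation(len(products))
cutoff = int(0.8 * len(products))
train_data_alt = products.iloc[shuffled_idx[:cutoff]]
validation_data_alt = products.iloc[shuffled_idx[cutoff:]]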

2. Implement the Logistic Regression Classifier With L2 Regularization

  1. Convert data frame to multi-dimensional array
  2. Add L2 penalty to the derivative
  3. Explore effects of L2 regularization
2.1 Convert data frame to multi-dimensional array

We will write a function that extracts the specified columns from a data frame and returns two arrays: a 2D array of features and a 1D array of class labels.

In [7]:
def get_numpy_data(dataframe, features, label):
    dataframe['constant'] = 1                        # add a constant column for the intercept
    features = ['constant'] + features
    feature_matrix = dataframe[features].to_numpy()  # .as_matrix() was removed in recent pandas
    label_array = dataframe[label].to_numpy()
    return (feature_matrix, label_array)

Using the function written above, we convert train_data and validation_data into multi-dimensional arrays.

In [8]:
feature_matrix_train, sentiment_train = get_numpy_data(train_data, important_words, 'sentiment')
feature_matrix_valid, sentiment_valid = get_numpy_data(validation_data, important_words, 'sentiment') 
2.2 Add L2 penalty to the derivative

To apply regularization to our logistic regression, we just need to add the L2 penalty term to the per-coefficient derivative of the log likelihood: $$ \frac{\partial \ell}{\partial w_j} = \displaystyle\sum_{i=1}^{N} h_j(x_i)\Big(\mathbf{1}[y_i = +1]-P(y_i=+1|x_i,w)\Big)-2\lambda w_j $$ Via the regularization parameter λ we control how well we fit the training data while keeping the weights small: increasing λ increases the regularization strength.
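
For intuition, take a hypothetical coefficient $w_j = 0.5$ with λ = 10: the penalty term alone contributes $$-2\lambda w_j = -2 \cdot 10 \cdot 0.5 = -10$$ to the derivative, so every gradient ascent step pulls $w_j$ back toward zero in proportion to both λ and the current size of $w_j$. Following the usual convention (and the implementation below), the intercept $w_0$ is not penalized.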

First, we write a function to calculate the conditional probability. The function takes two parameters, feature_matrix and coefficients, and should:

  • compute the dot product of feature_matrix and coefficients.
  • compute the sigmoid function P(y=+1|x,w).

The function returns the predictions given by the sigmoid function.

In [9]:
def predict_probability(feature_matrix, coefficients):
    score = np.dot(feature_matrix, coefficients)   # score_i = w^T h(x_i)
    predictions = 1. / (1. + np.exp(-score))       # sigmoid gives P(y_i = +1 | x_i, w)
    return predictions
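
A quick check on a toy 2×2 feature matrix (made-up numbers) shows the expected behaviour: a large positive score maps to a probability near 1 and a large negative score to a probability near 0.

# Toy inputs, for illustration only
toy_features = np.array([[1., 3.], [1., -3.]])   # first column plays the role of the intercept
toy_coefficients = np.array([0., 2.])            # scores are +6 and -6
predict_probability(toy_features, toy_coefficients)
# array([0.99752738, 0.00247262])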

We now write a function to compute the derivative of the log likelihood with respect to a single coefficient w_j. The function accepts five parameters (errors, feature, coefficient, l2_penalty, feature_is_constant) and returns the derivative. The l2_penalty parameter is the L2 penalty constant; feature_is_constant is a Boolean indicating whether the j-th feature is the constant (intercept) feature, which is not penalized.

In [10]:
def feature_derivative_with_L2(errors, feature, coefficient, l2_penalty, feature_is_constant):
    derivative = np.dot(errors, feature)              # sum_i h_j(x_i) (1[y_i=+1] - P(y_i=+1|x_i,w))
    if not feature_is_constant:
        derivative -= 2 * l2_penalty * coefficient    # L2 penalty term (the intercept is not penalized)
    return derivative
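
Using the same kind of toy arrays, we can see the penalty pulling the derivative of a positive coefficient downward, and being skipped for the constant feature:

# Toy errors and feature column, for illustration only
toy_errors = np.array([0.1, -0.2])
toy_feature = np.array([3., -3.])
feature_derivative_with_L2(toy_errors, toy_feature, coefficient=2.0,
                           l2_penalty=10.0, feature_is_constant=False)
# 0.1*3 + (-0.2)*(-3) - 2*10*2.0 = 0.9 - 40.0 = -39.1
feature_derivative_with_L2(toy_errors, toy_feature, coefficient=2.0,
                           l2_penalty=10.0, feature_is_constant=True)
# 0.9 -- the constant feature is not penalized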

To verify the correctness of the gradient ascent algorithm, we write a function to compute the log likelihood (recall from the last assignment that this form was covered in an advanced optional video and is used here for its numerical stability), which is given by the formula $$ \ell\ell(w) = \displaystyle\sum_{i=1}^{N} \Big(\big(\mathbf{1}[y_i = +1]-1\big)\,w^T h(x_i)-\ln\big(1+\exp(-w^T h(x_i))\big)\Big)-\lambda \|w\|_2^2 $$ (in our implementation the penalty excludes the intercept $w_0$).

In [11]:
def compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty):
    indicator = (sentiment==+1)
    scores = np.dot(feature_matrix, coefficients)
    lp = np.sum((indicator-1)*scores - np.log(1. + np.exp(-scores))) - l2_penalty*np.sum(coefficients[1:]**2)
    return lp
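
A standard way to gain confidence in the derivative is a finite-difference check against the log likelihood: perturb a single coefficient by a small ε and compare the numerical slope with feature_derivative_with_L2. The sketch below uses a tiny random problem and a helper of our own (not part of the assignment); the two values should agree to several decimal places.

# Hypothetical gradient check on a small synthetic problem
def numerical_derivative(feature_matrix, sentiment, coefficients, l2_penalty, j, eps=1e-6):
    plus, minus = coefficients.copy(), coefficients.copy()
    plus[j] += eps
    minus[j] -= eps
    ll_plus = compute_log_likelihood_with_L2(feature_matrix, sentiment, plus, l2_penalty)
    ll_minus = compute_log_likelihood_with_L2(feature_matrix, sentiment, minus, l2_penalty)
    return (ll_plus - ll_minus) / (2 * eps)

np.random.seed(1)
X_check = np.hstack([np.ones((20, 1)), np.random.randn(20, 3)])  # constant column + 3 features
y_check = np.random.choice([-1, +1], size=20)
w_check = 0.1 * np.random.randn(4)
errors_check = (y_check == +1) - predict_probability(X_check, w_check)
j = 2  # any non-constant coefficient
analytic = feature_derivative_with_L2(errors_check, X_check[:, j], w_check[j], 10.0, False)
numeric = numerical_derivative(X_check, y_check, w_check, 10.0, j)
print(analytic, numeric)  # should be very close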

We now write a function to fit a logistic regression model under L2 regularization. The function should accept the following parameters:

  • feature_matrix: 2D array of features
  • sentiment: 1D array of class labels
  • initial_coefficients: 1D array containing initial values of coefficients
  • step_size: a parameter controlling the size of the gradient steps
  • l2_penalty: the L2 penalty constant λ
  • max_iter: number of iterations to run gradient ascent

The function returns the last set of coefficients after performing gradient ascent.

The function carries out the following steps:

  1. Initialize vector coefficients to initial_coefficients.
  2. Predict the class probability $P(y_i=+1|x_i,w)$ using your predict_probability function and save it to variable predictions.
  3. Compute indicator value for $(y_i = +1)$ by comparing sentiment against +1. Save it to variable indicator.
  4. Compute the errors as difference between indicator and predictions. Save the errors to variable errors.
  5. For each j-th coefficient, compute the per-coefficient derivative by calling feature_derivative_with_L2 with the j-th column of feature_matrix. Don't forget to supply the L2 penalty. Then increment the j-th coefficient by (step_size*derivative).
  6. Once in a while, insert code to print out the log likelihood.
  7. Repeat steps 2-6 for max_iter times.
In [12]:
def logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, l2_penalty, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array
    for itr in range(max_iter):
        predictions = predict_probability(feature_matrix, coefficients)
        indicator = (sentiment==+1)
        errors = indicator - predictions
        for j in range(len(coefficients)): # loop over each coefficient
            is_intercept = (j == 0)
            derivative = feature_derivative_with_L2(errors, feature_matrix[:,j], coefficients[j], l2_penalty, is_intercept)
            coefficients[j]=coefficients[j] + step_size*derivative
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty)
            print ('iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp))
    return coefficients
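
As an optional cross-check (not part of the assignment), scikit-learn's LogisticRegression minimizes an equivalent objective, (1/2)‖w‖² + C·Σᵢ log(1+exp(−yᵢ·scoreᵢ)), with an unpenalized intercept, so setting C = 1/(2λ) and dropping our explicit constant column targets the same solution that our gradient ascent approaches given enough iterations. A sketch, assuming scikit-learn is installed:

# Hypothetical cross-check with scikit-learn (assumes scikit-learn is installed)
from sklearn.linear_model import LogisticRegression

l2_penalty = 10.0
sk_model = LogisticRegression(C=1.0 / (2 * l2_penalty), fit_intercept=True, max_iter=1000)
sk_model.fit(feature_matrix_train[:, 1:], sentiment_train)  # drop our constant column; sklearn adds its own intercept
# sk_model.intercept_ and sk_model.coef_[0] approximate the converged coefficients for this penalty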
2.3 Explore effects of L2 regularization

We explore the benefits of L2 regularization by training models with 6 different L2 penalty values: 0, 4, 10, 1e2, 1e3, and 1e5.

In [13]:
feature_matrix = feature_matrix_train
sentiment = sentiment_train
initial_coefficients = np.zeros(194)
step_size = 5e-6
max_iter = 501
In [14]:
### L2_penalty = 0
coefficients_0_penalty = logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, 0, max_iter)
iteration   0: log likelihood of observed labels = -29179.39138303
iteration   1: log likelihood of observed labels = -29003.71259047
iteration   2: log likelihood of observed labels = -28834.66187288
iteration   3: log likelihood of observed labels = -28671.70781507
iteration   4: log likelihood of observed labels = -28514.43078198
iteration   5: log likelihood of observed labels = -28362.48344665
iteration   6: log likelihood of observed labels = -28215.56713122
iteration   7: log likelihood of observed labels = -28073.41743783
iteration   8: log likelihood of observed labels = -27935.79536396
iteration   9: log likelihood of observed labels = -27802.48168669
iteration  10: log likelihood of observed labels = -27673.27331484
iteration  11: log likelihood of observed labels = -27547.98083656
iteration  12: log likelihood of observed labels = -27426.42679977
iteration  13: log likelihood of observed labels = -27308.44444728
iteration  14: log likelihood of observed labels = -27193.87673876
iteration  15: log likelihood of observed labels = -27082.57555831
iteration  20: log likelihood of observed labels = -26570.43059938
iteration  30: log likelihood of observed labels = -25725.48742389
iteration  40: log likelihood of observed labels = -25055.53326910
iteration  50: log likelihood of observed labels = -24509.63590026
iteration  60: log likelihood of observed labels = -24054.97906083
iteration  70: log likelihood of observed labels = -23669.51640848
iteration  80: log likelihood of observed labels = -23337.89167628
iteration  90: log likelihood of observed labels = -23049.07066021
iteration 100: log likelihood of observed labels = -22794.90974921
iteration 200: log likelihood of observed labels = -21283.29527353
iteration 300: log likelihood of observed labels = -20570.97485473
iteration 400: log likelihood of observed labels = -20152.21466944
iteration 500: log likelihood of observed labels = -19876.62333410
In [15]:
### L2_penalty = 4
coefficients_4_penalty = logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, 4, max_iter)
iteration   0: log likelihood of observed labels = -29179.39508175
iteration   1: log likelihood of observed labels = -29003.73417180
iteration   2: log likelihood of observed labels = -28834.71441858
iteration   3: log likelihood of observed labels = -28671.80345068
iteration   4: log likelihood of observed labels = -28514.58077957
iteration   5: log likelihood of observed labels = -28362.69830317
iteration   6: log likelihood of observed labels = -28215.85663259
iteration   7: log likelihood of observed labels = -28073.79071393
iteration   8: log likelihood of observed labels = -27936.26093762
iteration   9: log likelihood of observed labels = -27803.04751805
iteration  10: log likelihood of observed labels = -27673.94684207
iteration  11: log likelihood of observed labels = -27548.76901327
iteration  12: log likelihood of observed labels = -27427.33612958
iteration  13: log likelihood of observed labels = -27309.48101569
iteration  14: log likelihood of observed labels = -27195.04624253
iteration  15: log likelihood of observed labels = -27083.88333261
iteration  20: log likelihood of observed labels = -26572.49874392
iteration  30: log likelihood of observed labels = -25729.32604153
iteration  40: log likelihood of observed labels = -25061.34245801
iteration  50: log likelihood of observed labels = -24517.52091982
iteration  60: log likelihood of observed labels = -24064.99093939
iteration  70: log likelihood of observed labels = -23681.67373669
iteration  80: log likelihood of observed labels = -23352.19298741
iteration  90: log likelihood of observed labels = -23065.50180166
iteration 100: log likelihood of observed labels = -22813.44844580
iteration 200: log likelihood of observed labels = -21321.14164794
iteration 300: log likelihood of observed labels = -20624.98634439
iteration 400: log likelihood of observed labels = -20219.92048845
iteration 500: log likelihood of observed labels = -19956.11341777
In [16]:
### L2_penalty = 10
coefficients_10_penalty = logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, 10, max_iter)
iteration   0: log likelihood of observed labels = -29179.40062984
iteration   1: log likelihood of observed labels = -29003.76654163
iteration   2: log likelihood of observed labels = -28834.79322654
iteration   3: log likelihood of observed labels = -28671.94687528
iteration   4: log likelihood of observed labels = -28514.80571589
iteration   5: log likelihood of observed labels = -28363.02048079
iteration   6: log likelihood of observed labels = -28216.29071186
iteration   7: log likelihood of observed labels = -28074.35036891
iteration   8: log likelihood of observed labels = -27936.95892966
iteration   9: log likelihood of observed labels = -27803.89576265
iteration  10: log likelihood of observed labels = -27674.95647005
iteration  11: log likelihood of observed labels = -27549.95042714
iteration  12: log likelihood of observed labels = -27428.69905549
iteration  13: log likelihood of observed labels = -27311.03455140
iteration  14: log likelihood of observed labels = -27196.79890162
iteration  15: log likelihood of observed labels = -27085.84308528
iteration  20: log likelihood of observed labels = -26575.59697506
iteration  30: log likelihood of observed labels = -25735.07304608
iteration  40: log likelihood of observed labels = -25070.03447306
iteration  50: log likelihood of observed labels = -24529.31188025
iteration  60: log likelihood of observed labels = -24079.95349572
iteration  70: log likelihood of observed labels = -23699.83199186
iteration  80: log likelihood of observed labels = -23373.54108747
iteration  90: log likelihood of observed labels = -23090.01500055
iteration 100: log likelihood of observed labels = -22841.08995135
iteration 200: log likelihood of observed labels = -21377.25595328
iteration 300: log likelihood of observed labels = -20704.63995428
iteration 400: log likelihood of observed labels = -20319.25685307
iteration 500: log likelihood of observed labels = -20072.16321721
In [17]:
### L2_penalty = 1e2
coefficients_1e2_penalty = logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, 1e2, max_iter)
iteration   0: log likelihood of observed labels = -29179.48385120
iteration   1: log likelihood of observed labels = -29004.25177457
iteration   2: log likelihood of observed labels = -28835.97382190
iteration   3: log likelihood of observed labels = -28674.09410083
iteration   4: log likelihood of observed labels = -28518.17112932
iteration   5: log likelihood of observed labels = -28367.83774654
iteration   6: log likelihood of observed labels = -28222.77708939
iteration   7: log likelihood of observed labels = -28082.70799392
iteration   8: log likelihood of observed labels = -27947.37595368
iteration   9: log likelihood of observed labels = -27816.54738615
iteration  10: log likelihood of observed labels = -27690.00588850
iteration  11: log likelihood of observed labels = -27567.54970126
iteration  12: log likelihood of observed labels = -27448.98991327
iteration  13: log likelihood of observed labels = -27334.14912742
iteration  14: log likelihood of observed labels = -27222.86041863
iteration  15: log likelihood of observed labels = -27114.96648229
iteration  20: log likelihood of observed labels = -26621.50201299
iteration  30: log likelihood of observed labels = -25819.72803950
iteration  40: log likelihood of observed labels = -25197.34035501
iteration  50: log likelihood of observed labels = -24701.03698195
iteration  60: log likelihood of observed labels = -24296.66378580
iteration  70: log likelihood of observed labels = -23961.38842316
iteration  80: log likelihood of observed labels = -23679.38088853
iteration  90: log likelihood of observed labels = -23439.31824267
iteration 100: log likelihood of observed labels = -23232.88192018
iteration 200: log likelihood of observed labels = -22133.50726528
iteration 300: log likelihood of observed labels = -21730.03957488
iteration 400: log likelihood of observed labels = -21545.87572145
iteration 500: log likelihood of observed labels = -21451.95551390
In [18]:
### L2_penalty = 1e3
coefficients_1e3_penalty = logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, 1e3, max_iter)
iteration   0: log likelihood of observed labels = -29180.31606471
iteration   1: log likelihood of observed labels = -29009.07176112
iteration   2: log likelihood of observed labels = -28847.62378912
iteration   3: log likelihood of observed labels = -28695.14439397
iteration   4: log likelihood of observed labels = -28550.95060743
iteration   5: log likelihood of observed labels = -28414.45771129
iteration   6: log likelihood of observed labels = -28285.15124375
iteration   7: log likelihood of observed labels = -28162.56976044
iteration   8: log likelihood of observed labels = -28046.29387744
iteration   9: log likelihood of observed labels = -27935.93902900
iteration  10: log likelihood of observed labels = -27831.15045502
iteration  11: log likelihood of observed labels = -27731.59955260
iteration  12: log likelihood of observed labels = -27636.98108219
iteration  13: log likelihood of observed labels = -27547.01092670
iteration  14: log likelihood of observed labels = -27461.42422295
iteration  15: log likelihood of observed labels = -27379.97375625
iteration  20: log likelihood of observed labels = -27027.18208317
iteration  30: log likelihood of observed labels = -26527.22737267
iteration  40: log likelihood of observed labels = -26206.59048765
iteration  50: log likelihood of observed labels = -25995.96903148
iteration  60: log likelihood of observed labels = -25854.95710284
iteration  70: log likelihood of observed labels = -25759.08109950
iteration  80: log likelihood of observed labels = -25693.05688014
iteration  90: log likelihood of observed labels = -25647.09929349
iteration 100: log likelihood of observed labels = -25614.81468705
iteration 200: log likelihood of observed labels = -25536.20998919
iteration 300: log likelihood of observed labels = -25532.57691220
iteration 400: log likelihood of observed labels = -25532.35543765
iteration 500: log likelihood of observed labels = -25532.33970049
In [19]:
### L2_penalty = 1e5
coefficients_1e5_penalty = logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, 1e5, max_iter)
iteration   0: log likelihood of observed labels = -29271.85955115
iteration   1: log likelihood of observed labels = -29271.71006589
iteration   2: log likelihood of observed labels = -29271.65738833
iteration   3: log likelihood of observed labels = -29271.61189923
iteration   4: log likelihood of observed labels = -29271.57079975
iteration   5: log likelihood of observed labels = -29271.53358505
iteration   6: log likelihood of observed labels = -29271.49988440
iteration   7: log likelihood of observed labels = -29271.46936584
iteration   8: log likelihood of observed labels = -29271.44172890
iteration   9: log likelihood of observed labels = -29271.41670149
iteration  10: log likelihood of observed labels = -29271.39403722
iteration  11: log likelihood of observed labels = -29271.37351294
iteration  12: log likelihood of observed labels = -29271.35492661
iteration  13: log likelihood of observed labels = -29271.33809523
iteration  14: log likelihood of observed labels = -29271.32285309
iteration  15: log likelihood of observed labels = -29271.30905015
iteration  20: log likelihood of observed labels = -29271.25729150
iteration  30: log likelihood of observed labels = -29271.20657205
iteration  40: log likelihood of observed labels = -29271.18775997
iteration  50: log likelihood of observed labels = -29271.18078247
iteration  60: log likelihood of observed labels = -29271.17819447
iteration  70: log likelihood of observed labels = -29271.17723457
iteration  80: log likelihood of observed labels = -29271.17687853
iteration  90: log likelihood of observed labels = -29271.17674648
iteration 100: log likelihood of observed labels = -29271.17669750
iteration 200: log likelihood of observed labels = -29271.17666862
iteration 300: log likelihood of observed labels = -29271.17666862
iteration 400: log likelihood of observed labels = -29271.17666862
iteration 500: log likelihood of observed labels = -29271.17666862

3. Compare Coefficients

We compare the coefficients for each of the models that were trained above. We will create a table of features and learned coefficients associated with each of the different L2 penalty values.

In [20]:
table = pd.DataFrame({'word': ['(intercept)'] + important_words})
def add_coefficients_to_table(coefficients, column_name):
    table[column_name] = coefficients
    return table
In [21]:
add_coefficients_to_table(coefficients_0_penalty, 'coefficients [L2=0]')
add_coefficients_to_table(coefficients_4_penalty, 'coefficients [L2=4]')
add_coefficients_to_table(coefficients_10_penalty, 'coefficients [L2=10]')
add_coefficients_to_table(coefficients_1e2_penalty, 'coefficients [L2=1e2]')
add_coefficients_to_table(coefficients_1e3_penalty, 'coefficients [L2=1e3]')
add_coefficients_to_table(coefficients_1e5_penalty, 'coefficients [L2=1e5]')
Out[21]:
word coefficients [L2=0] coefficients [L2=4] coefficients [L2=10] coefficients [L2=1e2] coefficients [L2=1e3] coefficients [L2=1e5]
0 (intercept) -0.063742 -0.063143 -0.062256 -0.050438 0.000054 0.011362
1 baby 0.074073 0.073994 0.073877 0.072360 0.059752 0.001784
2 one 0.012753 0.012495 0.012115 0.007247 -0.008761 -0.001827
3 great 0.801625 0.796897 0.789935 0.701425 0.376012 0.008950
4 love 1.058554 1.050856 1.039529 0.896644 0.418354 0.009042
5 use -0.000104 0.000163 0.000556 0.005481 0.017326 0.000418
6 would -0.287021 -0.286027 -0.284564 -0.265993 -0.188662 -0.008127
7 like -0.003384 -0.003442 -0.003527 -0.004635 -0.007043 -0.000827
8 easy 0.984559 0.977600 0.967362 0.838245 0.401904 0.008808
9 little 0.524419 0.521385 0.516917 0.460235 0.251221 0.005941
10 seat -0.086968 -0.086125 -0.084883 -0.069109 -0.017718 0.000611
11 old 0.208912 0.207749 0.206037 0.184332 0.105074 0.002741
12 well 0.453866 0.450969 0.446700 0.392304 0.194926 0.003945
13 get -0.196835 -0.196100 -0.195017 -0.181251 -0.122728 -0.004578
14 also 0.158163 0.157246 0.155899 0.139153 0.080918 0.001929
15 really -0.017906 -0.017745 -0.017508 -0.014481 -0.004448 -0.000340
16 son 0.128396 0.127761 0.126828 0.115192 0.070411 0.001552
17 time -0.072429 -0.072281 -0.072065 -0.069480 -0.057581 -0.002805
18 bought -0.151817 -0.150917 -0.149594 -0.132884 -0.072431 -0.001985
19 product -0.263330 -0.262328 -0.260854 -0.242391 -0.167962 -0.006211
20 good 0.156507 0.155270 0.153445 0.129972 0.047879 0.000266
21 daughter 0.263418 0.261775 0.259357 0.228685 0.117158 0.002401
22 much -0.013247 -0.013295 -0.013366 -0.014326 -0.015219 -0.000839
23 loves 1.052484 1.043903 1.031265 0.870794 0.345870 0.006150
24 stroller -0.037533 -0.036988 -0.036186 -0.025990 0.005912 0.001326
25 put -0.000330 -0.000323 -0.000312 -0.000127 0.001529 -0.000097
26 months -0.067995 -0.067315 -0.066314 -0.053594 -0.013083 -0.000157
27 car 0.193364 0.191904 0.189754 0.162531 0.072719 0.001765
28 still 0.188508 0.187071 0.184955 0.158163 0.068491 0.000976
29 back -0.268954 -0.267419 -0.265161 -0.236730 -0.134671 -0.003988
... ... ... ... ... ... ... ...
164 started -0.153174 -0.151852 -0.149905 -0.125084 -0.045084 -0.000877
165 anything -0.186801 -0.185242 -0.182943 -0.153602 -0.057284 -0.001053
166 last -0.099469 -0.098692 -0.097547 -0.083001 -0.034797 -0.000775
167 company -0.276548 -0.274151 -0.270621 -0.225839 -0.084898 -0.001719
168 come -0.032009 -0.031804 -0.031502 -0.027685 -0.014185 -0.000426
169 returned -0.572707 -0.567518 -0.559870 -0.462056 -0.150021 -0.002225
170 maybe -0.224076 -0.222015 -0.218976 -0.180192 -0.058149 -0.000945
171 took -0.046445 -0.046199 -0.045838 -0.041422 -0.025566 -0.000772
172 broke -0.555195 -0.550209 -0.542861 -0.448989 -0.148726 -0.002182
173 makes -0.009023 -0.008764 -0.008382 -0.003467 0.008757 0.000255
174 stay -0.300563 -0.297920 -0.294024 -0.244247 -0.083709 -0.001310
175 instead -0.193123 -0.191418 -0.188907 -0.156863 -0.054125 -0.000925
176 idea -0.465370 -0.461130 -0.454879 -0.374890 -0.118469 -0.001627
177 head -0.110472 -0.109559 -0.108215 -0.090992 -0.032986 -0.000502
178 said -0.098049 -0.097331 -0.096274 -0.082875 -0.037594 -0.000947
179 less -0.136801 -0.135652 -0.133958 -0.112360 -0.042260 -0.000873
180 went -0.106836 -0.106003 -0.104776 -0.089294 -0.039417 -0.001006
181 working -0.320363 -0.317559 -0.313427 -0.260764 -0.092334 -0.001674
182 high 0.003326 0.003282 0.003217 0.002404 0.000236 -0.000062
183 unit -0.196121 -0.194516 -0.192153 -0.162210 -0.066568 -0.001567
184 seems 0.058308 0.057905 0.057312 0.049753 0.022875 0.000329
185 picture -0.196906 -0.195273 -0.192866 -0.162143 -0.061171 -0.001151
186 completely -0.277845 -0.275461 -0.271947 -0.227098 -0.081775 -0.001421
187 wish 0.173191 0.171640 0.169352 0.140022 0.044374 0.000468
188 buying -0.132197 -0.131083 -0.129441 -0.108471 -0.040331 -0.000792
189 babies 0.052494 0.052130 0.051594 0.044805 0.021026 0.000365
190 won 0.004960 0.004907 0.004830 0.003848 0.001084 0.000017
191 tub -0.166745 -0.165367 -0.163338 -0.137693 -0.054778 -0.000936
192 almost -0.031916 -0.031621 -0.031186 -0.025604 -0.007361 -0.000125
193 either -0.228852 -0.226793 -0.223758 -0.184986 -0.061138 -0.000980

194 rows × 7 columns

Using the coefficients trained with L2 penalty 0, we determine the 5 most positive and 5 most negative words (those with the largest positive and negative coefficients).

In [22]:
### the 5 most positive words
positive_words = list((table.sort_values('coefficients [L2=0]', ascending = False)[0:5][['word']]).values.flatten())
positive_words
Out[22]:
['love', 'loves', 'easy', 'perfect', 'great']
In [23]:
### the 5 most negative words
negative_words = list((table.sort_values('coefficients [L2=0]', ascending = True)[0:5][['word']]).values.flatten())
negative_words
Out[23]:
['disappointed', 'money', 'return', 'waste', 'returned']

We observe the effect of increasing the L2 penalty on the 10 words selected above by plotting their coefficients against the different values of the L2 penalty.

In [24]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 6

def make_coefficient_plot(table, positive_words, negative_words, l2_penalty_list):
    cmap_positive = plt.get_cmap('Reds')
    cmap_negative = plt.get_cmap('Blues')
    
    xx = l2_penalty_list
    plt.plot(xx, [0.]*len(xx), '--', lw=1, color='k')
    
    # align row order with the word lists so each line gets the correct label
    table_positive_words = table[table['word'].isin(positive_words)].set_index('word').loc[positive_words]
    table_negative_words = table[table['word'].isin(negative_words)].set_index('word').loc[negative_words]
    
    for i in range(len(positive_words)):
        color = cmap_positive(0.8*((i+1)/(len(positive_words)*1.2)+0.15))
        plt.plot(xx, table_positive_words[i:i+1].to_numpy().flatten(),
                 '-', label=positive_words[i], linewidth=4.0, color=color)
        
    for i in range(len(negative_words)):
        color = cmap_negative(0.8*((i+1)/(len(negative_words)*1.2)+0.15))
        plt.plot(xx, table_negative_words[i:i+1].to_numpy().flatten(),
                 '-', label=negative_words[i], linewidth=4.0, color=color)
        
    plt.legend(loc='best', ncol=3, prop={'size':16}, columnspacing=0.5)
    plt.axis([1, 1e5, -1, 2])
    plt.title('Coefficient path')
    plt.xlabel(r'L2 penalty ($\lambda$)')
    plt.ylabel('Coefficient value')
    plt.xscale('log')
    plt.rcParams.update({'font.size': 18})
    plt.tight_layout()


make_coefficient_plot(table, positive_words, negative_words, l2_penalty_list=[0, 4, 10, 1e2, 1e3, 1e5])
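
Finally, to tie the plot back to learning objective 3, we can measure how the penalty affects generalization by computing classification accuracy (predicting +1 when the score is positive, −1 otherwise) on the training and validation sets for each set of coefficients. A sketch using a helper of our own:

# Hypothetical helper for comparing training vs. validation accuracy across penalties
def get_classification_accuracy(feature_matrix, sentiment, coefficients):
    scores = np.dot(feature_matrix, coefficients)
    predictions = np.where(scores > 0, +1, -1)   # classify by the sign of the score
    return np.mean(predictions == sentiment)

for l2_penalty, coefficients in [(0, coefficients_0_penalty), (4, coefficients_4_penalty),
                                 (10, coefficients_10_penalty), (1e2, coefficients_1e2_penalty),
                                 (1e3, coefficients_1e3_penalty), (1e5, coefficients_1e5_penalty)]:
    train_acc = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients)
    valid_acc = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients)
    print('L2 = %-8g  train accuracy = %.4f  validation accuracy = %.4f' % (l2_penalty, train_acc, valid_acc))

If regularization is doing its job, training accuracy generally decreases as λ grows, while validation accuracy typically peaks at a small-to-moderate penalty before dropping off for very large λ.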