In this lesson, we will use LASSO to select features, building on a pre-implemented solver for LASSO. In addition, we will implement our own LASSO solver using coordinate descent.
We will continue to use the House data from previous notebooks.
import pandas as pd
import numpy as np
from sklearn import linear_model
from math import sqrt, log
from sklearn.metrics import mean_squared_error
### Importing data:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int,
'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float,
'sqft_living':float, 'floors':float, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int,
'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
sales = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_data.csv', dtype= dtype_dict)
sales = pd.DataFrame(sales)
sales.head()
We consider features that are transformations of the inputs: squaring bedrooms increases the separation between houses with few and many bedrooms, while taking the square root of sqft_living decreases the separation between big and small houses.
sales['sqft_living_sqrt'] = sales['sqft_living'].apply(sqrt)
sales['sqft_lot_sqrt'] = sales['sqft_lot'].apply(sqrt)
sales['bedrooms_square'] = sales['bedrooms']*sales['bedrooms']
sales['floors_square'] = sales['floors']*sales['floors']
Let us fit a model with all the features available, plus the features we just created above.
all_features = ['bedrooms', 'bedrooms_square', 'bathrooms','sqft_living', 'sqft_living_sqrt',
'sqft_lot', 'sqft_lot_sqrt', 'floors', 'floors_square', 'waterfront', 'view',
'condition', 'grade', 'sqft_above', 'sqft_basement','yr_built', 'yr_renovated']
model_lasso = linear_model.Lasso(alpha = 5e2, normalize = True)
model_all = model_lasso.fit(sales[all_features], sales['price'])
### Find which features have non-zero weights.
coef_model_all1 = {}
for coef, feat in zip(model_all.coef_, all_features):
    coef_model_all1[feat] = coef
coef_model_all1
To find a good L1 penalty, we will explore multiple values using a validation set. Let us do a three-way split into train, validation, and test sets:
training = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_train_data.csv', dtype = dtype_dict)
validation = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_valid_data.csv', dtype = dtype_dict)
testing = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_test_data.csv', dtype = dtype_dict)
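We load pre-split CSV files here. If those files were not available, a comparable three-way split could be produced directly from sales with scikit-learn (a sketch; the proportions and random_state are illustrative assumptions, not the split used by the course files):

from sklearn.model_selection import train_test_split

# Carve out a test set first, then split the remainder into train and validation
train_valid_alt, testing_alt = train_test_split(sales, test_size=0.1, random_state=0)
training_alt, validation_alt = train_test_split(train_valid_alt, test_size=0.5, random_state=0)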
Next, we create the same four new features in the training, validation, and testing sets, then write a loop that, for each value of l1_penalty in np.logspace(1, 7, num=13), fits a model on the training data, reports the number of nonzero coefficients, and reports the RSS on the validation and test data:
testing['sqft_living_sqrt'] = testing['sqft_living'].apply(sqrt)
testing['sqft_lot_sqrt'] = testing['sqft_lot'].apply(sqrt)
testing['bedrooms_square'] = testing['bedrooms']*testing['bedrooms']
testing['floors_square'] = testing['floors']*testing['floors']
training['sqft_living_sqrt'] = training['sqft_living'].apply(sqrt)
training['sqft_lot_sqrt'] = training['sqft_lot'].apply(sqrt)
training['bedrooms_square'] = training['bedrooms']*training['bedrooms']
training['floors_square'] = training['floors']*training['floors']
validation['sqft_living_sqrt'] = validation['sqft_living'].apply(sqrt)
validation['sqft_lot_sqrt'] = validation['sqft_lot'].apply(sqrt)
validation['bedrooms_square'] = validation['bedrooms']*validation['bedrooms']
validation['floors_square'] = validation['floors']*validation['floors']
l1_penalty = np.logspace(1, 7, num=13)
for alpha in l1_penalty:
    print('Alpha:', alpha)
    model_lasso2 = linear_model.Lasso(alpha=alpha, normalize=True)
    model2 = model_lasso2.fit(training[all_features], training['price'])
    print('The number of nonzero coefficients:', np.count_nonzero(model2.coef_) + np.count_nonzero(model2.intercept_))
    # RSS = sum of squared errors; mean_squared_error returns the mean, so multiply by the number of rows
    rss_val = mean_squared_error(validation['price'], model2.predict(validation[all_features])) * len(validation)
    print('RSS for validation data:', rss_val)
    rss_test = mean_squared_error(testing['price'], model2.predict(testing[all_features])) * len(testing)
    print('RSS for testing data:', rss_test)
    print('-------------------------------------')
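To pick the best penalty programmatically rather than by reading the printout, one could track the value with the lowest validation RSS while looping (a sketch over the same grid and variables as above):

best_alpha = None
best_rss = float('inf')
for alpha in l1_penalty:
    model = linear_model.Lasso(alpha=alpha, normalize=True)
    model.fit(training[all_features], training['price'])
    rss = ((validation['price'] - model.predict(validation[all_features])) ** 2).sum()
    if rss < best_rss:
        best_alpha, best_rss = alpha, rss
print('Best alpha on validation:', best_alpha)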
What if we absolutely wanted to limit ourselves to, say, 7 features? This may be important if we want to derive a "rule of thumb" --- an interpretable model that has only a few features in it.
In this section, you are going to implement a simple, two-phase procedure to achieve this goal:
1. Explore a large range of l1_penalty values to find a narrow region of l1_penalty values where models are likely to have the desired number of non-zero weights.
2. Further explore the narrow region you found to find a good value for l1_penalty that achieves the desired sparsity. Here, we will again use a validation set to choose the best value for l1_penalty.
max_nonzeros = 7
l1_penalty = np.logspace(1, 4, num=20)
For each value of l1_penalty, fit a model on the training data, extract its weights, and count the number of nonzeros. Save the counts to a list.
non_zeros = []
for alpha in l1_penalty:
    print('Alpha:', alpha)
    model_lasso3 = linear_model.Lasso(alpha=alpha, normalize=True)
    model3 = model_lasso3.fit(training[all_features], training['price'])
    non_zeros.append(np.count_nonzero(model3.coef_) + np.count_nonzero(model3.intercept_))
    print('The number of nonzero coefs:', non_zeros[-1])
    print('-----------------------------------')
alpha_nonzeros = {}
for alpha, nonzero in zip(l1_penalty, non_zeros):
    alpha_nonzeros[alpha] = nonzero
alpha_nonzeros
Out of this large range, we want to find the two ends of our desired narrow range of l1_penalty. At one end, we will have l1_penalty values that have too few non-zeros, and at the other end, we will have l1_penalty values that have too many non-zeros. More formally, find:
- l1_penalty_min: the largest l1_penalty that has more non-zeros than max_nonzeros
- l1_penalty_max: the smallest l1_penalty that has fewer non-zeros than max_nonzeros
What values did you find for l1_penalty_min and l1_penalty_max?
# Largest l1_penalty whose model has more non-zeros than max_nonzeros
l1_penalty_min = max([k for k, v in alpha_nonzeros.items() if v > max_nonzeros])
print('l1_penalty_min:', l1_penalty_min)
# Smallest l1_penalty whose model has fewer non-zeros than max_nonzeros
l1_penalty_max = min([k for k, v in alpha_nonzeros.items() if v < max_nonzeros])
print('l1_penalty_max:', l1_penalty_max)
We will now explore the narrow region of l1_penalty values we found. We want the solution with the right number of non-zeros and the lowest RSS on the validation set.
alp_list = []
rss_list = []
for l1_penalty in np.linspace(l1_penalty_min, l1_penalty_max, 20):
    model_lasso4 = linear_model.Lasso(alpha=l1_penalty, normalize=True)
    model4 = model_lasso4.fit(training[all_features], training['price'])
    nonzeros = np.count_nonzero(model4.coef_) + np.count_nonzero(model4.intercept_)
    # RSS on the validation set (sum of squared errors)
    rss = mean_squared_error(validation['price'], model4.predict(validation[all_features])) * len(validation)
    if nonzeros == max_nonzeros:
        print('Alpha:', l1_penalty)
        print('The number of nonzero coefs:', nonzeros)
        print('RSS for validation data:', rss)
        print('-----------------------------------')
        alp_list.append(l1_penalty)
        rss_list.append(rss)
alp_rss = {}
for alp, rss in zip(alp_list, rss_list):
    alp_rss[alp] = rss
alp_rss
What value of l1_penalty in our narrow range has the lowest RSS on the VALIDATION set and has sparsity equal to max_nonzeros?
alpha_for_min_rss = min(alp_rss.keys(), key = lambda i: alp_rss[i])
print('Alpha for min RSS:',alpha_for_min_rss)
What features in this model have non-zero coefficients?
model_lasso5 = linear_model.Lasso(alpha=alpha_for_min_rss, normalize=True)
model5 = model_lasso5.fit(training[all_features], training['price'])
model5_coef = {}
for feat, coef in zip(all_features, model5.coef_):
    model5_coef[feat] = coef
model5_coef
We will now implement our very own LASSO solver via coordinate descent by writing a function to normalize features, implementing coordinate descent for LASSO, and exploring the effects of the L1 penalty.
### The get_numpy_data function
def get_numpy_data(data_frame, features, output):
    # Build the feature matrix with a constant column for the intercept.
    # Note: this prepends 'constant' to the features list in place; later cells rely on that side effect.
    features_matrix = data_frame[features].copy()
    features_matrix['constant'] = 1
    features.insert(0, 'constant')
    features_matrix = np.array(features_matrix[features])
    output_array = np.array(data_frame[output])
    return (features_matrix, output_array)
### The predict_output function
def predict_output(feature_matrix, weights):
    # dot product of each row of the feature matrix with the weight vector
    predictions = np.dot(feature_matrix, weights)
    return predictions
In the house dataset, features vary wildly in their relative magnitude: sqft_living is very large compared to bedrooms, for instance. As a result, the weight for sqft_living would be much smaller than the weight for bedrooms. This is problematic because "small" weights are dropped first as l1_penalty goes up.
To give equal consideration to all features, we need to normalize features as discussed in the lectures: we divide each feature by its 2-norm so that the transformed feature has norm 1.
Let's see how we can do this normalization easily with NumPy: we define a small helper and then check it on a small example matrix.
def normalize_features(features_matrix):
    # 2-norm of each column
    norms = np.linalg.norm(features_matrix, axis=0)
    # divide each column by its norm so every column has unit 2-norm
    normalized_features = features_matrix / norms
    return (normalized_features, norms)
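As a quick check, consider a small matrix (the numbers are chosen only for illustration): every column should come back with unit 2-norm, and the returned norms should recover the original scale.

small = np.array([[3., 6., 9.],
                  [4., 8., 12.]])
small_normalized, small_norms = normalize_features(small)
print(small_normalized)   # every column has 2-norm equal to 1
print(small_norms)        # [ 5. 10. 15.]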
Consider a simple model with two features, sqft_living and bedrooms. First, extract and normalize them:
feature, price = get_numpy_data(sales, ['sqft_living', 'bedrooms'], 'price')
normalized_features, norms = normalize_features(feature)
With the weights set to w = [1, 4, 1], compute ro[i] = SUM[ feature_i * (output - prediction + w[i]*feature_i) ] for each feature i; this quantity determines whether coordinate descent will keep or zero out w[i].
w = [1, 4, 1]
ro = [0, 0, 0]
for i in range(normalized_features.shape[1]):
    prediction = predict_output(normalized_features, w)
    error = price - prediction + normalized_features[:, i] * w[i]
    ro[i] = np.dot(normalized_features[:, i], error)
    print(ro[i])
print(ro)
Whenever ro[i] lies in [-l1_penalty/2, l1_penalty/2], coordinate descent sets w[i] to zero. The values of l1_penalty that would set w[2] to zero while keeping w[1] nonzero therefore lie between 2*ro[2] and 2*ro[1]:
print('The range of values of l1_penalty: between', 80966698.66623949*2, 'and', 87939470.82325175*2)
The coordinate descent update for feature i (with normalized features) is:

w[i] = ro[i] + l1_penalty/2   if ro[i] < -l1_penalty/2
w[i] = 0                      if -l1_penalty/2 <= ro[i] <= l1_penalty/2
w[i] = ro[i] - l1_penalty/2   if ro[i] > l1_penalty/2

where ro[i] = SUM[ feature_i * (output - prediction + w[i]*feature_i) ]. Using this formula, implement coordinate descent that minimizes the cost function over a single feature i. Note that the intercept (weight 0) is not regularized. The function should accept the feature matrix, output, current weights, l1 penalty, and the index of the feature to optimize over, and should return the new weight for feature i.
def zerolistmaker(n):
    listofzeros = [0] * n
    return listofzeros
def lasso_coordinate_descent_step(i, feature_matrix, output, weights, l1_penalty):
    # prediction with the current weights
    prediction = predict_output(feature_matrix, weights)
    # ro[i] = SUM[ feature_i * (output - prediction + weight_i * feature_i) ]
    error = output - prediction + feature_matrix[:, i] * weights[i]
    ro_i = np.dot(feature_matrix[:, i], error)
    if i == 0:  # intercept -- do not regularize
        new_weight_i = ro_i
    elif ro_i < -l1_penalty / 2.:
        new_weight_i = ro_i + l1_penalty / 2.
    elif ro_i > l1_penalty / 2.:
        new_weight_i = ro_i - l1_penalty / 2.
    else:
        new_weight_i = 0.
    return new_weight_i
# should print 0.425558846691
import math
print (lasso_coordinate_descent_step(1, np.array([[3./math.sqrt(13),1./math.sqrt(10)],
[2./math.sqrt(13),3./math.sqrt(10)]]), np.array([1., 1.]), np.array([1., 4.]), 0.1))
Now that we have a function that optimizes the cost function over a single coordinate, let us implement cyclical coordinate descent where we optimize coordinates 0, 1, ..., (d-1) in order and repeat.
def lasso_cyclical_coordinate_descent(feature_matrix, output, initial_weights, l1_penalty, tolerance):
    weights = list(initial_weights)                      # work on a copy of the initial weights
    weights_change = zerolistmaker(feature_matrix.shape[1])
    converged = False
    while not converged:
        # one full pass over coordinates 0, 1, ..., d-1
        for i in range(len(weights)):
            new_weight_i = lasso_coordinate_descent_step(i, feature_matrix, output, weights, l1_penalty)
            weights_change[i] = np.abs(new_weight_i - weights[i])
            weights[i] = new_weight_i
        # stop when the largest coordinate update in a full pass falls below the tolerance
        if max(weights_change) < tolerance:
            converged = True
    return weights
feature, price = get_numpy_data(sales, ['sqft_living', 'bedrooms'], 'price')
normalized_features, norms = normalize_features(feature)
initial_weights = [0., 0., 0.]
l1_penalty = 1e7
tolerance = 1.0
w = lasso_cyclical_coordinate_descent(normalized_features, price, initial_weights, l1_penalty, tolerance)
print(w)
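As a sanity check, we can also compute the RSS of this simple two-feature model on the normalized sales data (a short sketch using the variables above):

pred = predict_output(normalized_features, w)
rss_simple = np.sum((price - pred) ** 2)
print('RSS of the simple 2-feature model:', rss_simple)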
Let us split the sales dataset into training and test sets, and create a normalized feature matrix from the training data with more features.
training = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_train_data.csv', dtype = dtype_dict)
testing = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_test_data.csv', dtype = dtype_dict)
### Let us consider the following set of features.
more_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront',
'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated']
First, create a normalized feature matrix from the TRAINING data with these features. (Make sure you store the norms used for the normalization, since we'll need them later.)
feature1, price1 = get_numpy_data(training, more_features, 'price')
normalized_features1, norms1 = normalize_features(feature1)
### l1_penalty=1e7
initial_weights1 = [0.] * normalized_features1.shape[1]   # one weight per column, including the constant
l1_penalty = 1e7
tolerance = 1.0
weights1e7 = lasso_cyclical_coordinate_descent(normalized_features1, price1, initial_weights1, l1_penalty, tolerance)
weights1e7_dict = {}
# more_features now begins with 'constant' (get_numpy_data prepended it), so names align with the weights
for feat, coef in zip(more_features, weights1e7):
    weights1e7_dict[feat] = coef
weights1e7_dict
### l1_penalty=1e8
initial_weights1 = [0.] * normalized_features1.shape[1]
l1_penalty = 1e8
tolerance = 1.0
weights1e8 = lasso_cyclical_coordinate_descent(normalized_features1, price1, initial_weights1, l1_penalty, tolerance)
print(weights1e8)
weights1e8_dict = {}
for feat, coef in zip(more_features, weights1e8):
    weights1e8_dict[feat] = coef
weights1e8_dict
### l1_penalty=1e4
initial_weights1 = [0.] * normalized_features1.shape[1]
l1_penalty = 1e4
tolerance = 5e5
weights1e4 = lasso_cyclical_coordinate_descent(normalized_features1, price1, initial_weights1, l1_penalty, tolerance)
weights1e4_dict = {}
for feat, coef in zip(more_features, weights1e4):
    weights1e4_dict[feat] = coef
weights1e4_dict
To use the learned weights on unnormalized data such as the test set, create a rescaled version of each set of weights learned above (weights1e4, weights1e7, weights1e8) by dividing by the norms of the training features.
weights1e7_normalized = weights1e7 / norms1
weights1e8_normalized = weights1e8 / norms1
weights1e4_normalized = weights1e4 / norms1
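As a sanity check (using the training matrices from above), predictions made with the normalized features and the learned weights should match predictions made with the raw features and the rescaled weights:

pred_normalized = predict_output(normalized_features1, weights1e7)
pred_rescaled = predict_output(feature1, weights1e7_normalized)
print(np.allclose(pred_normalized, pred_rescaled))   # expect True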
Evaluating each of the learned models on the test data
# Redefine the feature list: get_numpy_data prepended 'constant' to it earlier
more_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront',
                 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated']
(feature2, price2) = get_numpy_data(testing, more_features, 'price')
weights_list = [weights1e4_normalized, weights1e7_normalized, weights1e8_normalized]
for name, w in zip(['weights1e4', 'weights1e7', 'weights1e8'], weights_list):
    y_pred = np.dot(feature2, np.array(w))
    rss = np.sum((price2 - y_pred) ** 2)
    print('RSS for', name + ':', rss)