In this lesson, we will run ridge regression multiple times with different L2 penalties to see which one produces the best fit. We will revisit the example of polynomial regression as a means to see the effect of L2 regularization.

Learning objectives:

- Using a pre-built implementation of regression to run polynomial regression;
- Using matplotlib to visualize polynomial regressions;
- Using a pre-built implementation of regression to run polynomial regression, this time with L2 penalty;
- Using matplotlib to visualize polynomial regressions under L2 regularization;
- Choosing best L2 penalty using cross-validation;
- Assessing the final fit using test data.

In addition, we will implement our own ridge regression learning algorithm using gradient descent.

We will continue to use the House data from previous notebooks.

In [1]:

```
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
```

In [2]:

```
### Importing data:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int,
'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float,
'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int,
'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
sales = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_data.csv', dtype= dtype_dict)
sales = pd.DataFrame(sales)
sales.sort_values(['sqft_living', 'price'], inplace = True)
sales.head()
```

Out[2]:

Polynominal_dataframe Function: This function creates an data frame consisting of the powers of an array up to a specific degree.

In [3]:

```
def polynominal_dataframe(feature, degree, dataset):
poly_dataframe = dataset[['id','price', feature]]
poly_dataframe.rename(columns = {feature: 'power_1'}, inplace =True)
if degree > 1:
for i in range(2,degree+1):
name = 'power_'+str(i)
poly_dataframe[name] = np.power(poly_dataframe.power_1, i)
return poly_dataframe
```

Let us revisit the 15th-order polynomial model using the 'sqft_living' input. Generate polynomial features up to degree 15 using polynominal_dataframe() function and fit a model with these features.

In [4]:

```
data1 = polynominal_dataframe('sqft_living', 15, sales)
x_arr = []
for i in range(1,15+1):
name_x = 'power_'+str(i)
x_arr.append(name_x)
x = data1[x_arr]
x.head()
y = data1['price']
```

When fitting the model, use an L2 penalty of 1e-5:

In [5]:

```
l2_small_penalty = 1.5e-5
```

Note: When we have so many features and so few data points, the solution can become highly numerically unstable, which can sometimes lead to strange unpredictable results. Thus, rather than using no regularization, we will introduce a tiny amount of regularization (l2_penalty=1.5e-5) to make the solution numerically stable. (In lecture, we discussed the fact that regularization can also help with numerical stability, and here we are seeing a practical example.)

With the L2 penalty specified above, fit the model and print out the learned weights. Add "alpha=l2_small_penalty" and "normalize=True" to the parameter list of linear_model.Ridge:

In [6]:

```
ridge1 = linear_model.Ridge(alpha=l2_small_penalty, normalize=True)
ridge1.fit(x,y)
print(ridge1.coef_)
```

Recall that the polynomial fit of degree 15 changed wildly whenever the data changed. In particular, when we split the sales data into four subsets and fit the model of degree 15, the result came out to be very different for each subset. The model had a high variance. We will see in a moment that ridge regression reduces such variance. But first, we must reproduce the experiment we did in the last lesson.

In [7]:

```
set1 = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_set_1_data.csv', dtype = dtype_dict)
set2 = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_set_2_data.csv', dtype = dtype_dict)
set3 = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_set_3_data.csv', dtype = dtype_dict)
set4 = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_set_4_data.csv', dtype = dtype_dict)
```

Visualizing Function: Let's use the function that report intercept, weights, and produce a scatter plot of the training data (just square feet vs price) and add the fitted model based on the coresponding degree polynomial feature ‘sqft_living’

In [8]:

```
l2_small_penalty=1e-9
def plot_lines(dataset, deg, penalty):
data = polynominal_dataframe('sqft_living', deg, dataset)
y = data['price'].values.reshape(-1,1)
arr_x = []
for i in range(deg):
name_var = 'power_'+str(i+1)
arr_x.append(name_var)
print(arr_x)
x = data[arr_x]
ridge = linear_model.Ridge(alpha = penalty, normalize = True)
model_poly = ridge.fit(x, y)
print('coef', model_poly.coef_)
print('intercept', model_poly.intercept_)
y_hat1 = model_poly.predict(x)
name_var1 = 'power_'+str(i)
x_line = data['power_1']
plt.scatter(x_line, y)
plt.plot(x_line, y_hat1)
```

In [9]:

```
### Trying the first data set with a 15th degree polynomial
plot_lines(set1, 15, l2_small_penalty)
```

In [10]:

```
### Trying the second data set with a 15th degree polynomial
plot_lines(set2, 15, l2_small_penalty)
```

In [11]:

```
### Trying the third data set with a 15th degree polynomial
plot_lines(set3, 15, l2_small_penalty)
```

In [12]:

```
### Trying the fourth data set with a 15th degree polynomial
plot_lines(set4, 15, l2_small_penalty)
```

Generally, whenever we see weights change so much in response to change in data, we believe the variance of our estimate to be large. Ridge regression aims to address this issue by penalizing "large" weights. (The weights looked quite small, but they are not that small because 'sqft_living' input is in the order of thousands.) Fit a 15th-order polynomial model on set_1, set_2, set_3, and set_4, this time with a large L2 penalty.

In [13]:

```
l2_large_penalty=1.23e2
```

In [14]:

```
### Trying the first data set with a 15th degree polynomial
plot_lines(set1, 15, l2_large_penalty)
```

In [15]:

```
### Trying the second data set with a 15th degree polynomial
plot_lines(set2, 15, l2_large_penalty)
```

In [16]:

```
### Trying the third data set with a 15th degree polynomial
plot_lines(set3, 15, l2_large_penalty)
```

In [17]:

```
### Trying the fourth data set with a 15th degree polynomial
plot_lines(set4, 15, l2_large_penalty)
```

Just like the polynomial degree, the L2 penalty is a "magic" parameter we need to select. We could use the validation set approach as we did in the last module, but that approach has a major disadvantage: it leaves fewer observations available for training. Cross-validation seeks to overcome this issue by using all of the training set in a smart way.

We will implement a kind of cross-validation called k-fold cross-validation. The method gets its name because it involves dividing the training set into k segments of roughtly equal size. Similar to the validation set method, we measure the validation error with one of the segments designated as the validation set. The major difference is that we repeat the process k times as follows:

Set aside segment 0 as the validation set, and fit a model on rest of data, and evalutate it on this validation set Set aside segment 1 as the validation set, and fit a model on rest of data, and evalutate it on this validation set ... Set aside segment k-1 as the validation set, and fit a model on rest of data, and evalutate it on this validation set

After this process, we compute the average of the k validation errors, and use it as an estimate of the generalization error. Notice that all observations are used for both training and validation, as we iterate over segments of data.

To estimate the generalization error well, it is crucial to shuffle the training data before dividing them into segments. GraphLab Create has a utility function for shuffling a given SFrame. We reserve 10% of the data as the test set and shuffle the remainder. (Make sure to use seed=1 to get consistent answer.)

In [18]:

```
train_valid_shuffled = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_train_valid_shuffled.csv', dtype = dtype_dict)
test_wk4 = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_test_data.csv', dtype = dtype_dict)
```

Once the data is shuffled, we divide it into equal segments. Each segment should receive n/k elements, where n is the number of observations in the training set and k is the number of segments. Since the segment 0 starts at index 0 and contains n/k elements, it ends at index (n/k)-1. The segment 1 starts where the segment 0 left off, at index (n/k). With n/k elements, the segment 1 ends at index (n*2/k)-1. Continuing in this fashion, we deduce that the segment i starts at index (n*i/k) and ends at (n*(i+1)/k)-1.

With this pattern in mind, we write a short loop that prints the starting and ending indices of each segment, just to make sure you are getting the splits right.

In [19]:

```
n = len(train_valid_shuffled)
k = 10 # 10-fold cross-validation
for i in range(k):
start = int((n*i)/k)
end = int((n*(i+1))/k-1)
print (i, (start, end))
```

Now we are ready to implement k-fold cross-validation. Write a function that computes k validation errors by designating each of the k segments as the validation set. It accepts as parameters (i) k, (ii) l2_penalty, (iii) dataframe, (iv) name of output column (e.g. price) and (v) list of feature names. The function returns the average validation error using k segments as validation sets.

For each i in [0, 1, ..., k-1]: Compute starting and ending indices of segment i and call 'start' and 'end' Form validation set by taking a slice (start:end+1) from the data. Form training set by appending slice (end+1:n) to the end of slice (0:start). Train a linear model using training set just formed, with a given l2_penalty Compute validation error using validation set just formed

In [20]:

```
from sklearn.metrics import mean_squared_error
def sample(dataset, deg):
y = dataset['price']
arr_x = []
for i in range(deg):
name_var = 'power_'+str(i+1)
arr_x.append(name_var)
x = dataset[arr_x]
return (x,y)
def k_fold_cross_validation(k, l2_penalty, data, deg):
n = len(train_valid_shuffled)
k = 10 # 10-fold cross-validation
ridge = linear_model.Ridge(alpha = l2_penalty, normalize = True)
rss =[]
for i in range(k):
start = int((n*i)/k)
end = int((n*(i+1))/k-1)
data_val = data[start:end]
data_train = data[0:start].append(data[end+1:n])
x_train,y_train = sample(data_train, deg)
x_val, y_val = sample(data_val, deg)
ridge1 = ridge.fit(x,y)
y_val_pred = ridge1.predict(x_val)
RSS = len(y_val)*mean_squared_error(y_val, ridge1.predict(x_val))
rss.append(RSS)
print('Minimum RSS:', min(rss), 'with alpha is', '%e' % l2_penalty)
return min(rss)
```

Once we have a function to compute the average validation error for a model, we can write another function to find the model that minimizes the average validation error. Write the function that does the following:

- We will again be aiming to fit a 15th-order polynomial model using the sqft_living input
- Loop for different l2_penalty
- Run 10-fold cross-validation with l2_penalty
- Report which L2 penalty produced the lowest average validation error.

In [21]:

```
def penalty(pen, k, data, deg):
rss = []
for i in range(len(pen)):
l2_penalty = pen[i]
rss1 = k_fold_cross_validation(k, l2_penalty, data, deg)
rss.append(rss1)
print(min(rss))
return min(rss)
```

In [22]:

```
data_crossk = polynominal_dataframe('sqft_living', 15, train_valid_shuffled)
```

In [23]:

```
pen = np.logspace(3, 9, num=13)
print(pen)
```

In [24]:

```
penalty(pen, 10, data_crossk, 15)
```

Out[24]:

In this assignment, we will implement ridge regression via gradient descent. We will convert an data frame into a Numpy array, write a Numpy function to compute the derivative of the regression weights with respect to a single feature, and write gradient descent function to compute the regression weights given an initial weight vector, step size, tolerance, and L2 penalty.

In [25]:

```
### Importing data:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int,
'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float,
'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int,
'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
data1 = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_train_data.csv', dtype= dtype_dict)
data_train = pd.DataFrame(data1)
data2 = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_test_data.csv', dtype= dtype_dict)
data_test = pd.DataFrame(data2)
data = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_data.csv', dtype= dtype_dict)
data = pd.DataFrame(data)
```

Now we will write a function that will accept a DataFrame, a list of feature names (e.g. ['sqft_living', 'bedrooms']) and an target feature e.g. ('price') and will return two things:

- A numpy matrix whose columns are the desired features plus a constant column (this is how we create an 'intercept')
- A numpy array containing the values of the output

In [26]:

```
def get_numpy_data(data_sframe, features, output):
features_matrix = data_sframe[features]
features_matrix['constant'] = 1
features.insert(0, 'constant')
features_matrix = features_matrix[features]
features_matrix = np.array(features_matrix)
output_array = data_sframe[output]
output_array = np.array(output_array)
return(features_matrix, output_array)
```

A function ‘predict_output’ which accepts a 2D array ‘feature_matrix’ and a 1D array ‘weights’ and returns a 1D array ‘predictions’.

In [27]:

```
def predict_output(feature_matrix, weights):
predictions = np.dot(feature_matrix, weights)
return(predictions)
```

We are now going to move to computing the derivative of the regression cost function. Recall that the cost function is the sum over the data points of the squared difference between an observed output and a predicted output, plus the L2 penalty term.

Cost(w) = SUM[ (prediction - output)^2 ]

- l2_penalty*(w[0]^2 + w[1]^2 + ... + w[k]^2). Since the derivative of a sum is the sum of the derivatives, we can take the derivative of the first part (the RSS) as we did in the notebook for the unregularized case in Week 2 and add the derivative of the regularization part. As we saw, the derivative of the RSS with respect to w[i] can be written as:

2*SUM[ error*[feature_i] ].
The derivative of the regularization term with respect to w[i] is:

2*l2_penalty*w[i].
Summing both, we get

2*SUM[ error*[feature_i] ] + 2*l2_penalty*w[i].
That is, the derivative for the weight for feature i is the sum (over data points) of 2 times the product of the error and the feature itself, plus 2*l2_penalty*w[i].

We will not regularize the constant. Thus, in the case of the constant, the derivative is just twice the sum of the errors (without the 2*l2_penalty*w[0] term).

Recall that twice the sum of the product of two vectors is just twice the dot product of the two vectors. Therefore the derivative for the weight for feature_i is just two times the dot product between the values of feature_i and the current errors, plus 2*l2_penalty*w[i].

With this in mind complete the following derivative function which computes the derivative of the weight given the value of the feature (over all data points) and the errors (over all data points). To decide when to we are dealing with the constant (so we don't regularize it) we added the extra parameter to the call feature_is_constant which you should set to True when computing the derivative of the constant and False otherwise.

In [28]:

```
def feature_derivative_ridge(errors, feature, weight, l2_penalty, feature_is_constant):
rss_w = 2*(np.dot(feature,errors.transpose()))
if feature_is_constant == True:
derivative = 2*(np.dot(feature,errors.transpose()))
if feature_is_constant == False:
derivative = 2*(np.dot(feature,errors.transpose())) + 2*l2_penalty*weight
return derivative
```

To test your feature derivartive run the following:

In [29]:

```
(example_features, example_output) = get_numpy_data(sales, ['sqft_living'], 'price')
my_weights = np.array([1., 10.])
test_predictions = predict_output(example_features, my_weights)
errors = test_predictions - example_output # prediction errors
# next two lines should print the same values
print (feature_derivative_ridge(errors, example_features[:,1], my_weights[1], 1, False))
print (np.sum(errors*example_features[:,1])*2+20.)
# next two lines should print the same values
print (feature_derivative_ridge(errors, example_features[:,0], my_weights[0], 1, True))
print (np.sum(errors)*2.)
```

Now we will write a function that performs a gradient descent. The basic premise is simple. Given a starting point we update the current weights by moving in the negative gradient direction. Recall that the gradient is the direction of increase and therefore the negative gradient is the direction of decrease and we're trying to minimize a cost function.

The amount by which we move in the negative gradient direction is called the 'step size'. We stop when we are 'sufficiently close' to the optimum. Unlike in Week 2, this time we will set a maximum number of iterations and take gradient steps until we reach this maximum number. If no maximum number is supplied, the maximum should be set 100 by default. (Use default parameter values in Python.)

With this in mind, complete the following gradient descent function below using your derivative function above. For each step in the gradient descent, we update the weight for each feature before computing our stopping criteria.

In [30]:

```
def ridge_regression_gradient_descent(feature_matrix, output, initial_weights, step_size, l2_penalty, max_iterations):
weights = np.array(initial_weights) # make sure it's a numpy array
j = 0
w = initial_weights
while j< max_iterations:
j+=1
test_predictions = predict_output(feature_matrix, w)
errors = test_predictions - output
for i in range(len(w)):
if i == 0:
der_cons = feature_derivative_ridge(errors, feature_matrix[:,i], w[i], l2_penalty, True)
w[i] = w[i] - step_size*der_cons
else:
der = feature_derivative_ridge(errors, feature_matrix[:,i], w[i], l2_penalty, False)
w[i] = w[i] - step_size*der
return w
```

In this part, we will only use 'sqft_living' to predict 'price'. Use the get_numpy_data function to get a Numpy versions of your data with only this feature, for both the train_data and the test_data.

In [31]:

```
simple_features = ['sqft_living']
my_output = 'price'
(simple_feature_matrix, output) = get_numpy_data(data_train, simple_features, my_output)
```

In [32]:

```
simple_features = ['sqft_living']
my_output = 'price'
(simple_test_feature_matrix, test_output) = get_numpy_data(data_test, simple_features, my_output)
```

let's consider no regularization. Set the l2_penalty to 0.0 and run your ridge regression algorithm to learn the weights of your model.

In [33]:

```
step_size = 1e-12
l2_penalty = 0
max_iterations = 1000
initial_weights = np.array([0., 0.])
simple_weights_0_penalty = ridge_regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, l2_penalty, max_iterations)
print(simple_weights_0_penalty)
```

let's consider high regularization. Set the l2_penalty to 1e11 and run your ridge regression algorithm to learn the weights of your model.

In [34]:

```
step_size = 1e-12
l2_penalty = 1e11
max_iterations = 1000
initial_weights = np.array([0., 0.])
simple_weights_high_penalty = ridge_regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, l2_penalty, max_iterations)
print(simple_weights_high_penalty)
```

This code will plot the two learned models. (The blue line is for the model with no regularization and the red line is for the one with high regularization.)

In [35]:

```
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(simple_feature_matrix,output,'k.',
simple_feature_matrix,predict_output(simple_feature_matrix, simple_weights_0_penalty),'b-',
simple_feature_matrix,predict_output(simple_feature_matrix, simple_weights_high_penalty),'r-')
```

Out[35]:

Compute the RSS on the TEST data for the following three sets of weights:

- The initial weights (all zeros)
- The weights learned with no regularization
- The weights learned with high regularization

In [36]:

```
### simple_test_feature_matrix, test_output
initial_weights = np.array([0., 0.])
rss_initial = (np.square(np.dot(simple_test_feature_matrix,initial_weights.transpose())-test_output)).sum()
print('rss_initial:', rss_initial)
rss_no_reg = (np.square(np.dot(simple_test_feature_matrix, simple_weights_0_penalty.transpose()) - test_output)).sum()
print('rss_no_reg:',rss_no_reg)
rss_reg = (np.square(np.dot(simple_test_feature_matrix, simple_weights_high_penalty.transpose()) - test_output)).sum()
print('rss_reg:', rss_reg)
```

Let us now consider a model with 2 features: ['sqft_living', 'sqft_living15'].

In [37]:

```
model_features = ['sqft_living', 'sqft_living15']
my_output = 'price'
(feature_matrix, output) = get_numpy_data(data_train, model_features, my_output)
```

In [38]:

```
model_features = ['sqft_living', 'sqft_living15']
my_output = 'price'
(test_feature_matrix, test_output) = get_numpy_data(data_test, model_features, my_output)
```

In [39]:

```
step_size = 1e-12
l2_penalty = 0
max_iterations = 1000
initial_weights = np.array([0., 0., 0.])
multiple_weights_0_penalty = ridge_regression_gradient_descent(feature_matrix, output, initial_weights, step_size, l2_penalty, max_iterations)
print(multiple_weights_0_penalty)
```

In [40]:

```
step_size = 1e-12
l2_penalty = 1e11
max_iterations = 1000
initial_weights = np.array([0., 0., 0.])
multiple_weights_high_penalty = ridge_regression_gradient_descent(feature_matrix, output, initial_weights, step_size, l2_penalty, max_iterations)
print(multiple_weights_high_penalty)
```

In [41]:

```
initial_weights = np.array([0., 0., 0.])
rss_initial = (np.square(np.dot(test_feature_matrix,initial_weights.transpose())-test_output)).sum()
print('rss_initial:', rss_initial)
rss_no_reg = (np.square(np.dot(test_feature_matrix, multiple_weights_0_penalty.transpose()) - test_output)).sum()
print('rss_no_reg:',rss_no_reg)
rss_reg = (np.square(np.dot(test_feature_matrix, multiple_weights_high_penalty.transpose()) - test_output)).sum()
print('rss_reg:', rss_reg)
```