Gradient Descent for Multiple Regression

In this lesson, we will use the Pandas and NumPy packages to solve for the regression weights with gradient descent.

Learning objectives:

  • Converting a DataFrame into a NumPy array;
  • Writing a predict_output() function using NumPy;
  • Writing a NumPy function to compute the derivative of the regression cost function with respect to a single feature's weight;
  • Writing a gradient descent function to compute the regression weights given an initial weight vector, step size, and tolerance;
  • Using the gradient descent function to estimate regression weights for multiple features.

Before we start, we import the libraries, load the King County house price dataset, and store the data in DataFrames.

In [1]:
import pandas as pd
import numpy as np 
In [2]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 
              'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 
              'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 
              'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
# pd.read_csv already returns a DataFrame, so no extra wrapping is needed
train_data = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_train_data.csv', dtype=dtype_dict)
test_data = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_test_data.csv', dtype=dtype_dict)
data = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_data.csv', dtype=dtype_dict)

1. Converting to a NumPy Array

Although a Pandas DataFrame offers a number of benefits (especially when working with large datasets), to understand the details of how the algorithms are implemented it is important to work with a library that allows direct (and optimized) matrix operations. NumPy is the Python solution for working with matrices (or any multi-dimensional "array").

Recall that the predicted value given the weights and the features is just the dot product between the feature vector and the weight vector. Similarly, if we put all of the features row-by-row in a matrix, then the predicted values for all the observations can be computed by right-multiplying the "feature matrix" by the "weight vector".
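
For example (toy numbers, purely for illustration, relying on the numpy import above), the predictions for every observation come from a single matrix-vector product:

# toy example: 3 observations, 2 columns (constant, sqft_living); values are made up
toy_features = np.array([[1., 1180.],
                         [1., 2570.],
                         [1.,  770.]])
toy_weights = np.array([-47000., 280.])  # [intercept, slope], illustrative values only

# right-multiplying the feature matrix by the weight vector predicts all rows at once
toy_predictions = np.dot(toy_features, toy_weights)
print(toy_predictions)  # one predicted value per observation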

First we need to take the DataFrame of our data and convert it into a 2D NumPy array (also called a matrix).

Now we will write a function that accepts a DataFrame, a list of feature names (e.g. ['sqft_living', 'bedrooms']), and a target feature (e.g. 'price'), and returns two things:

  • A numpy matrix whose columns are the desired features plus a constant column (this is how we create an 'intercept')
  • A numpy array containing the values of the output
In [3]:
def get_numpy_data(data_frame, features, output):
    # copy the requested feature columns so the original DataFrame is not modified
    features_matrix = data_frame[features].copy()
    # add a constant column of 1s (this is how we create an 'intercept')
    features_matrix['constant'] = 1
    # reorder the columns so the constant comes first
    features_matrix = features_matrix[['constant'] + features]
    features_matrix = np.array(features_matrix)
    output_array = np.array(data_frame[output])
    return(features_matrix, output_array)

To test the function above, we use the 'sqft_living' feature and a constant as our features, and 'price' as our output:

In [4]:
(example_features, example_output) = get_numpy_data(data, ['sqft_living'], 'price') 
print(example_features[0,:])
print(example_output[0])
[1.00e+00 1.18e+03]
221900.0

2. Predicting Output Given Regression Weights

If the features matrix (including a column of 1s for the constant) is stored as a 2D array (or matrix) and the regression weights are stored as a 1D array then the predicted output is just the dot product between the features matrix and the weights (with the weights on the right). Write a function ‘predict_output’ which accepts a 2D array ‘feature_matrix’ and a 1D array ‘weights’ and returns a 1D array ‘predictions’.

In [5]:
def predict_output(feature_matrix, weights):
    # assume feature_matrix is a numpy matrix containing the features as columns and weights is a corresponding numpy array
    # create the predictions vector by using np.dot()
    predictions = np.dot(feature_matrix, weights)
    return(predictions)

3. Computing the Derivative

We are now going to move to computing the derivative of the regression cost function. Recall that the cost function is the sum over the data points of the squared difference between an observed output and a predicted output.

Since the derivative of a sum is the sum of the derivatives we can compute the derivative for a single data point and then sum over data points. We can write the squared difference between the observed output and predicted output for a single point as follows:

(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i]*[feature_i] + ... + w[k]*[feature_k] - output)^2

where we have k features and a constant. So the derivative with respect to weight w[i], by the chain rule, is:

2*(w[0]*[CONSTANT] + w[1]*[feature_1] + ... + w[i]*[feature_i] + ... + w[k]*[feature_k] - output)*[feature_i]

The term inside the parentheses is just the error (the difference between the prediction and the output). So we can re-write this as:

2*error*[feature_i]

That is, the derivative for the weight of feature i is the sum (over data points) of 2 times the product of the error and the feature itself. In the case of the constant, this is just twice the sum of the errors!

Recall that twice the sum of the product of two vectors is just twice the dot product of the two vectors. Therefore the derivative for the weight for feature_i is just two times the dot product between the values of feature_i and the current errors.
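
As a quick numeric sanity check of that identity (with made-up errors and feature values), summing 2 * error * feature over the data points gives the same number as twice the dot product:

# toy errors and feature values for three data points (illustrative only)
toy_errors = np.array([10., -5., 3.])
toy_feature = np.array([2., 4., 6.])

loop_sum = sum(2 * e * f for e, f in zip(toy_errors, toy_feature))  # 40 - 40 + 36 = 36
dot_form = 2 * np.dot(toy_errors, toy_feature)                      # 2 * 18 = 36
print(loop_sum, dot_form)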

With this in mind, complete the following derivative function, which computes the derivative with respect to a weight given the values of the feature (over all data points) and the errors (over all data points).

3.1 Derivative Function

If we have the values of a single input feature in an array ‘feature’ and the prediction ‘errors’ (predictions - output), then the derivative of the regression cost function with respect to the weight of ‘feature’ is just twice the dot product between ‘feature’ and ‘errors’. Write a function that accepts a ‘feature’ array and an ‘errors’ array and returns the ‘derivative’ (a single number), as below:

In [6]:
def feature_derivative(errors, feature):
    # Assume that errors and feature are both numpy arrays of the same length (number of data points)
    # compute twice the dot product of these vectors as 'derivative' and return the value
    derivative = 2*np.dot(errors, feature)
    return(derivative)

To test the feature_derivative function, run the following:

In [7]:
(example_features, example_output) = get_numpy_data(data, ['sqft_living'], 'price') 
my_weights = np.array([0., 0.]) # this makes all the predictions 0
test_predictions = predict_output(example_features, my_weights) 
# two numpy arrays can be subtracted elementwise with '-':
errors = test_predictions - example_output # the prediction errors in this case are just -example_output
feature = example_features[:,0] # let's compute the derivative with respect to 'constant', the ":" indicates "all rows"
derivative = feature_derivative(errors, feature)
print(derivative)
print(-np.sum(example_output)*2) # should be the same as derivative
-23345850016.0
-23345850016.0

3.2 Gradient Descent

Now we will write a function that performs a gradient descent. The basic premise is simple. Given a starting point we update the current weights by moving in the negative gradient direction. Recall that the gradient is the direction of increase and therefore the negative gradient is the direction of decrease and we're trying to minimize a cost function.

The amount by which we move in the negative gradient direction is called the 'step size'. We stop when we are 'sufficiently close' to the optimum. We define this by requiring the magnitude (length) of the gradient vector to be smaller than a fixed 'tolerance'.

With this in mind, complete the gradient descent function below using your derivative function above. For each step of the gradient descent, we update the weight of each feature before computing our stopping criterion.

In [8]:
from math import sqrt
In [9]:
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False
    weights = np.array(initial_weights)  # make a copy; use float initial weights so the in-place update is not truncated
    while not converged:
        # compute predictions and errors with the current weights
        predictions = predict_output(feature_matrix, weights)
        errors = predictions - output
        gradient_sum_squares = 0
        for i in range(len(weights)):
            # derivative of the cost with respect to weight i
            derivative = feature_derivative(errors, feature_matrix[:, i])
            gradient_sum_squares += (derivative**2)
            # move in the negative gradient direction
            weights[i] -= (step_size * derivative)
        # stop once the gradient magnitude falls below the tolerance
        gradient_magnitude = sqrt(gradient_sum_squares)
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)

4. Running the Gradient Descent as Simple Regression

Although gradient descent is designed for multiple regression, since the constant is now a feature we can use the gradient descent function to estimate the parameters of a simple regression on square feet. The following cell sets up the feature_matrix, output, initial weights and step size for the first model:

In [10]:
# let's test out the gradient descent
simple_features = ['sqft_living']
my_output = 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7

Next, run your gradient descent with the above parameters.

In [11]:
test_weight = regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, tolerance)
print(test_weight)
[-46999.88716555    281.91211918]
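
As an optional sanity check (not part of the original assignment), we can compare the gradient descent result with NumPy's closed-form least-squares solver, which minimizes the same RSS objective and should therefore give similar weights:

# closed-form least squares on the same training matrix (optional verification)
lstsq_weights, _, _, _ = np.linalg.lstsq(simple_feature_matrix, output, rcond=None)
print(lstsq_weights)  # expect an intercept and slope close to the values above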

Use your newly estimated weights and your predict_output() function to compute the predictions on all the TEST data (you will need to create a numpy array of the test feature_matrix and test output first):

In [12]:
simple_features = ['sqft_living']
my_output = 'price'
(test_simple_feature_matrix, test_output) = get_numpy_data(test_data, simple_features, my_output)
In [13]:
test_predictions = predict_output(test_simple_feature_matrix, test_weight)
print("Example:",test_predictions[0])
Example: 356134.4432550024

Now that you have the predictions on test data, compute the RSS on the test data set. Save this value for comparison later. Recall that RSS is the sum of the squared errors (difference between prediction and output).

In [14]:
test_residuals = test_output - test_predictions
test_RSS = (test_residuals * test_residuals).sum()
print(test_RSS)
275400044902128.3

5. Running the Gradient Descent as Multiple Regression

Now we will use more than one actual feature. Use the following code to produce the weights for a second model with the following parameters:

In [15]:
model_features = ['sqft_living', 'sqft_living15'] # sqft_living15 is the average squarefeet for the nearest 15 neighbors. 
my_output = 'price'
(feature_matrix, output) = get_numpy_data(train_data, model_features, my_output)
initial_weights = np.array([-100000., 1., 1.])
step_size = 4e-12
tolerance = 1e9
In [16]:
weight_2 = regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance)
print(weight_2)
[-9.99999688e+04  2.45072603e+02  6.52795267e+01]
In [17]:
model_features = ['sqft_living', 'sqft_living15'] # sqft_living15 is the average squarefeet for the nearest 15 neighbors. 
my_output = 'price'
(test_feature_matrix, test_output) = get_numpy_data(test_data, model_features, my_output)

test_predictions_2 = predict_output(test_feature_matrix, weight_2)
print("Example:",test_predictions_2[0])
Example: 366651.4116294939

Now use your predictions and the output to compute the RSS for model 2 on TEST data.

In [18]:
test_residuals_2 = test_output - test_predictions_2
test_RSS_2 = (test_residuals_2**2).sum()
print(test_RSS_2)
270263443629803.56
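
Finally, comparing the two saved RSS values shows which model fits the TEST data better (the lower RSS wins):

# compare the two models on the TEST data
print("Model 1 (sqft_living only)       RSS:", test_RSS)
print("Model 2 (sqft_living + living15) RSS:", test_RSS_2)
print("Model 2 has lower RSS on TEST data:", test_RSS_2 < test_RSS)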