Nearest Neighbors & Kernel Regression

In this lesson, you will predict house prices using k-nearest neighbors regression.

Learning objectives:

  • Find the k-nearest neighbors of a given query input
  • Predict the output for the query input using the k-nearest neighbors
  • Choose the best value of k using a validation set
In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
In [2]:
### Importing data:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 
              'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 
              'sqft_living':float, 'floors':float, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 
              'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
sales = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_data_small.csv', dtype = dtype_dict)
training = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_data_small_train.csv', dtype = dtype_dict)
testing = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_data_small_test.csv', dtype = dtype_dict)
val = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_data_validation.csv', dtype = dtype_dict)

Import useful functions from previous notebooks

To efficiently compute pairwise distances among data points, we will convert the data frame into a 2D Numpy array. First import the numpy library and then copy and paste get_numpy_data() from the previous lesson.

In [3]:
def get_numpy_data(data_sframe, features, output):
    features_matrix = data_sframe[features] 
    features_matrix['constant'] = 1
    features.insert(0, 'constant')
    features_matrix = features_matrix[features]
    features_matrix = np.array(features_matrix)
    output_array = data_sframe[output]
    output_array = np.array(output_array)
    return(features_matrix, output_array)
def normalize_features(features_matrix):
    norms = np.linalg.norm(features_matrix, axis=0)
    normalized_features = features_matrix / norms
    return (normalized_features, norms)

Extract features and normalize

In [4]:
features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot','floors','waterfront', 
                'view','condition', 'grade', 'sqft_above','sqft_basement','yr_built',  
                'yr_renovated','lat','long', 'sqft_living15','sqft_lot15']
X_training, Y_training = get_numpy_data(training,features, 'price')
In [5]:
features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot','floors','waterfront', 
                'view','condition', 'grade', 'sqft_above','sqft_basement','yr_built',  
                'yr_renovated','lat','long', 'sqft_living15','sqft_lot15']
X_testing, Y_testing = get_numpy_data(testing,features, 'price')
In [6]:
features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot','floors','waterfront', 
                'view','condition', 'grade', 'sqft_above','sqft_basement','yr_built',  
                'yr_renovated','lat','long', 'sqft_living15','sqft_lot15']
X_val, Y_val = get_numpy_data(testing,features, 'price')

In computing distances, it is crucial to normalize features. Otherwise, for example, the sqft_living feature (typically on the order of thousands) would exert a much larger influence on distance than the bedrooms feature (typically on the order of ones). We divide each column of the training feature matrix by its 2-norm, so that the transformed column has unit norm.

IMPORTANT: Make sure to store the norms of the features in the training set. The features in the test and validation sets must be divided by these same norms, so that the training, test, and validation sets are normalized consistently.

In [7]:
X_training_normalize, norms = normalize_features(X_training)
X_testing_normalize = X_testing/norms
X_val_normalize = X_val/norms

Compute a single distance

To start, let's just explore computing the "distance" between two given houses. We will take our query house to be the first house of the test set and look at the distance between this house and the 10th house of the training set.

To see the features associated with the query house, print the first row (index 0) of the test feature matrix. You should get an 18-dimensional vector whose components are between 0 and 1.

In [8]:
print(X_training_normalize[9])
print(X_testing_normalize[0])
[ 0.01345102  0.01163464  0.00602491  0.0083488   0.00050756  0.01279425
  0.          0.          0.01938684  0.01390535  0.0096309   0.
  0.01302544  0.          0.01346821 -0.01346251  0.01195898  0.00156612]
[ 0.01345102  0.01551285  0.01807473  0.01759212  0.00160518  0.017059
  0.          0.05102365  0.0116321   0.01564352  0.01362084  0.02481682
  0.01350306  0.          0.01345387 -0.01346922  0.01375926  0.0016225 ]

What is the Euclidean distance between the query house and the 10th house of the training set?

In [9]:
Euclidean_distance = np.sqrt(((X_testing_normalize[0] - X_training_normalize[9])**2).sum())
print(Euclidean_distance)
0.05972359371398078

Among the first 10 training houses, which house is the closest to the query house?

In [10]:
def result_dict(keys_list, values_list):
    result = {}
    for keys, values in zip(keys_list, values_list):
        result[keys] = values
    return result
In [11]:
k = 10
distance_list = [0.]*k
training_house = [0.]*k
for i in range(k):
    training_house[i] = i
    distance_list[i] = np.sqrt(((X_testing_normalize[0] - X_training_normalize[i])**2).sum())
In [12]:
distance = result_dict(training_house, distance_list)
house_min_dist = min(distance.keys(), key = lambda i: distance[i])
print('The closest distance is:', min(distance_list))
print('The house with the closest distance is:', house_min_dist)
The closest distance is: 0.052383627840220305
The house with the closest distance is: 8

2. Perform Neighbor Regression

Now that we have the element-wise differences, it is not too hard to compute the Euclidean distances between our query house and all of the training houses. First, write a single-line expression to define a variable ‘diff’ such that ‘diff[i]’ gives the element-wise difference between the features of the query house and the i-th training house.

2.1 Perform 1-nearest neighbor regression

We will write a function that computes the distances from a query house to all training houses. The function should take two parameters: (i) the matrix of training features and (ii) the single feature vector associated with the query.

In [13]:
def compute_distances(features_instances, features_query):
    diff = features_instances - features_query
    distances = np.sqrt(np.sum(diff**2, axis = 1))
    return distances
In [14]:
dis_3idx = compute_distances(X_training_normalize, X_testing_normalize[2])
print('The closest distance is:', min(dis_3idx))
print('The house with the closest distance is:',np.argmin(dis_3idx))
The closest distance is: 0.0028604955575117085
The house with the closest distance is: 382
In [15]:
print('The predicted value of the query house is:', Y_training[np.argmin(dis_3idx)])
The predicted value of the query house is: 249000.0
2.2 Perform k-nearest neighbor regression

For k-nearest neighbors, we need to find a set of k houses in the training set closest to a given query house. We then make predictions based on these k nearest neighbors.

Writing a function that takes in: i) the value of k; ii)the feature matrix for the training houses; and iii) the feature vector of the query house and returns the indices of the k closest training houses. For instance, with 2-nearest neighbor, a return value of [5, 10] would indicate that the 6th and 11th training houses are closest to the query house.

In [16]:
def k_nearest_neighbors(k, features_instances, features_query):
    distance = compute_distances(features_instances, features_query)
    neighbors = np.argsort(distance, axis = 0)[:k]
    return neighbors
In [17]:
third_house_distance = k_nearest_neighbors(4, X_training_normalize, X_testing_normalize[2])
print(third_house_distance)
[ 382 1149 4087 3142]

Now that we know how to find the k-nearest neighbors, write a function that predicts the value of a given query house. For simplicity, take the average of the prices of the k nearest neighbors in the training set. The function should have the following parameters: i) the value of k; ii) the feature matrix for the training houses;iii) the output values (prices) of the training houses; iv) and the feature vector of the query house, whose price we are predicting. The function should return a predicted value of the query house.

In [18]:
def predict_output_of_query(k, features_train, output_train, features_query):
    neighbors = k_nearest_neighbors(k, features_train, features_query)
    prediction = np.mean(output_train[neighbors])
    return (prediction)
In [19]:
third_house_prediction = predict_output_of_query(4, X_training_normalize,Y_training, X_testing_normalize[2])
print(third_house_prediction)
413987.5
Make multiple predictions

Write a function to predict the value of each and every house in a query set. (The query set can be any subset of the dataset, be it the test set or validation set.) The idea is to have a loop where we take each house in the query set as the query house and make a prediction for that specific house. The new function should take the following parameters:i)the value of k; ii)the feature matrix for the training houses; iii) the output values (prices) of the training houses; iv) and the feature matrix for the query set. The function should return a set of predicted values, one for each house in the query set.

In [20]:
def predict_output(k, features_train, output_train, features_query):
    predictions = [0.]*features_query.shape[0]
    for i in range(features_query.shape[0]):
        predictions[i] = predict_output_of_query(k, features_train, output_train, features_query[i])
    predictions = np.array(predictions)
    return predictions
In [21]:
predicted_value = predict_output(10, X_training_normalize,Y_training, X_testing_normalize)
print(np.argmin(predicted_value[:10]))
6
In [22]:
a = predict_output_of_query(10, X_training_normalize,Y_training, X_testing_normalize[0])
print(a)
881300.0

3. Choosing the best value of k using a validation set

There remains a question of choosing the value of k to use in making predictions. Here, we use a validation set to choose this value. Write a loop that does the following:

  • For k in [1, 2, ..., 15]: Makes predictions for each house in the VALIDATION set using the k-nearest neighbors from the TRAINING set; computes the RSS for these predictions on the VALIDATION set; stores the RSS computed above in rss_all.
  • Report which k produced the lowest RSS on VALIDATION set.
In [23]:
rss_np = np.zeros(15)
for k in range(1,16):
    predictions = predict_output(k, X_training_normalize,Y_training, X_val_normalize)
    rss = ((Y_val - predictions)**2).sum()
    rss_np[k-1] = rss
    print('RSS for k =', k, ':',rss)
print(rss_np)
RSS for k = 1 : 189300603178121.0
RSS for k = 2 : 162969655347954.25
RSS for k = 3 : 149008586983833.7
RSS for k = 4 : 137914467769569.0
RSS for k = 5 : 132270467766797.27
RSS for k = 6 : 132736445589546.16
RSS for k = 7 : 131757081714411.9
RSS for k = 8 : 133118823551516.81
RSS for k = 9 : 133492572320285.5
RSS for k = 10 : 132558116652748.0
RSS for k = 11 : 132648119251013.47
RSS for k = 12 : 132270862294516.1
RSS for k = 13 : 132766702659606.86
RSS for k = 14 : 133006256365677.47
RSS for k = 15 : 134342939295287.66
[1.89300603e+14 1.62969655e+14 1.49008587e+14 1.37914468e+14
 1.32270468e+14 1.32736446e+14 1.31757082e+14 1.33118824e+14
 1.33492572e+14 1.32558117e+14 1.32648119e+14 1.32270862e+14
 1.32766703e+14 1.33006256e+14 1.34342939e+14]
In [24]:
print(np.argmin(rss_np))
6

To visualize the performance as a function of k, plot the RSS on the VALIDATION set for each considered k value:

In [25]:
import matplotlib.pyplot as plt
%matplotlib inline

rss_list = pd.DataFrame(rss_np)
kvals = range(1, 16)
plt.plot(kvals, rss_list,'bo-')
Out[25]:
[<matplotlib.lines.Line2D at 0xbcda048>]

What is the RSS on the TEST data using the value of k found above? To be clear, sum over all houses in the TEST set.

In [26]:
predicted_value = predict_output(6, X_training_normalize,Y_training, X_testing_normalize)
rss = ((Y_testing - predicted_value)**2).sum()
print(rss)
132736445589546.16