Assessing Performance - Polynomial Regression

In this lesson, we compare regression models of different complexity to assess which one fits best, using polynomial regression as the working example.

Learning objectives:

  • Writing a function that takes an array and a degree and returns a data frame whose columns are the array raised to each power from 1 up to that degree;
  • Using a plotting tool (e.g. matplotlib) to visualize polynomial regressions;
  • Using a plotting tool (e.g. matplotlib) to visualize the same polynomial degree on different subsets of the data;
  • Using a validation set to select a polynomial degree;
  • Assessing the final fit using test data.

We will continue to use the House data from previous notebooks.

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
In [2]:
### Importing data:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 
              'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 
              'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 
              'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
sales = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_data.csv', dtype= dtype_dict)
sales = pd.DataFrame(sales)
sales.head()
Out[2]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 20141013T000000 221900.0 3.0 1.00 1180.0 5650 1 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340.0 5650.0
1 6414100192 20141209T000000 538000.0 3.0 2.25 2570.0 7242 2 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690.0 7639.0
2 5631500400 20150225T000000 180000.0 2.0 1.00 770.0 10000 1 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720.0 8062.0
3 2487200875 20141209T000000 604000.0 4.0 3.00 1960.0 5000 1 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360.0 5000.0
4 1954400510 20150218T000000 510000.0 3.0 2.00 1680.0 8080 1 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800.0 7503.0

5 rows × 21 columns

1. Polynominal_dataframe Function

This function creates a data frame consisting of the powers of an array up to a specific degree:

In [3]:
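def polynominal_dataframe(feature, degree, dataset):
    # Copy the columns we need so the caller's data frame is never modified
    # (this also avoids pandas' SettingWithCopyWarning when adding columns)
    poly_dataframe = dataset[['id', 'price', feature]].copy()
    poly_dataframe = poly_dataframe.rename(columns={feature: 'power_1'})
    # Add one column per additional power: power_2, ..., power_<degree>
    for i in range(2, degree + 1):
        poly_dataframe['power_' + str(i)] = np.power(poly_dataframe['power_1'], i)
    return poly_dataframe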

Let's test our function:

In [4]:
poly1_data = polynominal_dataframe('sqft_living', 3, sales)
poly1_data.head()
Out[4]:
id price power_1 power_2 power_3
0 7129300520 221900.0 1180.0 1392400.0 1.643032e+09
1 6414100192 538000.0 2570.0 6604900.0 1.697459e+10
2 5631500400 180000.0 770.0 592900.0 4.565330e+08
3 2487200875 604000.0 1960.0 3841600.0 7.529536e+09
4 1954400510 510000.0 1680.0 2822400.0 4.741632e+09
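
As a quick sanity check on the construction above, the sketch below builds the same kind of power columns for a tiny made-up array and compares them with NumPy's `np.vander`. The helper name `power_columns` and the sample values are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd


def power_columns(values, degree):
    """Return a data frame with columns power_1 ... power_<degree> (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    return pd.DataFrame(
        {'power_' + str(i): values ** i for i in range(1, degree + 1)}
    )


sample = [1.0, 2.0, 3.0]          # made-up values, not taken from the House data
frame = power_columns(sample, 3)

# np.vander with increasing=True returns columns x^0, x^1, ..., x^3;
# dropping the constant column should reproduce the frame above.
reference = np.vander(sample, N=4, increasing=True)[:, 1:]
matches = np.allclose(frame.to_numpy(), reference)
print(frame)
print('matches np.vander:', matches)
```

Unlike `polynominal_dataframe`, this helper leaves out the 'id' and 'price' columns, since it only illustrates how the power columns are formed.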

2. Visualizing Polynomial Regression

Let's use matplotlib to visualize what a polynomial regression looks like on some real data. We start with a degree 1 polynomial using 'sqft_living' (i.e. a line) to predict 'price' and plot what it looks like.

In [5]:
# sort data by 'sqft_living' and 'price'
sales.sort_values(['sqft_living', 'price'], inplace = True)
poly1_data = polynominal_dataframe('sqft_living', 1, sales)
poly1_data.head()
Out[5]:
id price power_1
19452 3980300371 142000.0 290.0
15381 2856101479 276000.0 370.0
860 1723049033 245000.0 380.0
18379 1222029077 265000.0 384.0
4868 6896300380 228000.0 390.0
2.1 Visualizing Function

Let's write a function that reports the intercept and weights, draws a scatter plot of the training data (square feet vs. price), and overlays the model fitted on the polynomial features of 'sqft_living' for the given degree.

In [6]:
reg = linear_model.LinearRegression()
In [7]:
def plot_lines(dataset, deg):
    data = polynominal_dataframe('sqft_living', deg, dataset)
    y = data['price'].values.reshape(-1,1)
    # Feature columns power_1 ... power_<deg>
    arr_x = ['power_' + str(i) for i in range(1, deg + 1)]
    print(arr_x)
    x = data[arr_x]
    model_poly1 = reg.fit(x, y)
    print('coef', model_poly1.coef_)
    print('intercept', model_poly1.intercept_)
    y_hat1 = model_poly1.predict(x)
    # Scatter the observations and overlay the fitted curve
    x_line = data['power_1']
    plt.scatter(x_line, y)
    plt.plot(x_line, y_hat1)
    plt.show()
In [8]:
### Trying a 1st degree polynomial
plot_lines(sales, 1)
['power_1']
coef [[280.6235679]]
intercept [-43580.74309447]
In [9]:
### Trying a 2nd degree polynomial
plot_lines(sales, 2)
['power_1', 'power_2']
coef [[6.79940947e+01 3.85812609e-02]]
intercept [199222.27930549]
In [10]:
### Trying a 3rd degree polynomial
plot_lines(sales, 3)
['power_1', 'power_2', 'power_3']
coef [[-9.01819864e+01  8.70465089e-02 -3.84055260e-06]]
intercept [336819.74822121]
In [11]:
### Trying a 15th degree polynomial
plot_lines(sales, 15)
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6', 'power_7', 'power_8', 'power_9', 'power_10', 'power_11', 'power_12', 'power_13', 'power_14', 'power_15']
coef [[ 4.56404164e-91  6.91712192e-51 -5.84649119e-56  2.78197775e-88
   1.19863929e-74  2.68575522e-71  2.26147568e-67  1.85900299e-63
   1.47144116e-59  1.09771012e-55  7.43509038e-52  4.23015578e-48
   1.61618577e-44 -2.49283826e-48  9.59718336e-53]]
intercept [537116.32963771]

What do you think of the 15th degree polynomial? Do you think this is appropriate? If we were to change the data do you think you'd get pretty much the same curve? Let's take a look.
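
One thing worth noticing in the degree 15 output is how tiny the coefficients are. The 15th power of a square footage in the thousands is astronomically large, which may make the least squares problem numerically fragile. The sketch below is a side check under stated assumptions, not the notebook's method: it uses evenly spaced made-up square footages and a hypothetical helper `power_matrix`, and compares the numerical rank of the raw power matrix with that of a matrix built from the feature rescaled to [-1, 1].

```python
import numpy as np


def power_matrix(x, degree):
    """Columns x^1 ... x^degree (hypothetical helper)."""
    return np.column_stack([x ** i for i in range(1, degree + 1)])


# Made-up square footages, from a few hundred up to the low ten-thousands
sqft = np.linspace(300.0, 13000.0, 200)

# Rescale to [-1, 1] before taking powers
scaled = 2.0 * (sqft - sqft.min()) / (sqft.max() - sqft.min()) - 1.0

# matrix_rank counts the singular values above a floating-point tolerance
raw_rank = np.linalg.matrix_rank(power_matrix(sqft, 15))
scaled_rank = np.linalg.matrix_rank(power_matrix(scaled, 15))

print('numerical rank, raw powers:   ', raw_rank)
print('numerical rank, scaled powers:', scaled_rank)
```

A numerical rank below 15 for the raw powers means that, in floating point, some of the 15 columns cannot be told apart from combinations of the others, which is one reason to be cautious when reading individual high-degree coefficients.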

2.2 Changing the Data and Re-learning

The sales data has been split into four subsets of roughly equal size. We fit a 15th degree polynomial model on each of the four subsets, print the coefficients, and plot the resulting fit (as we did above).

In [12]:
set1 = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_set_1_data.csv', dtype = dtype_dict)
set2 = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_set_2_data.csv', dtype = dtype_dict)
set3 = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_set_3_data.csv', dtype = dtype_dict)
set4 = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_set_4_data.csv', dtype = dtype_dict)
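
The four files above were split ahead of time. Where only the full data set is available, a comparable split can be sketched with NumPy. The function name `split_into_subsets`, the seed, and the row count used in the example are assumptions, and the result will not reproduce the exact rows in the files above.

```python
import numpy as np


def split_into_subsets(n_rows, n_subsets=4, seed=0):
    """Shuffle row positions and cut them into roughly equal parts (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    return np.array_split(shuffled, n_subsets)


# Example with a made-up row count; with a data frame, pass len(sales)
# and select each subset with sales.iloc[positions].
parts = split_into_subsets(n_rows=10, n_subsets=4)
sizes = [len(p) for p in parts]
print(sizes)
```

`np.array_split` is used rather than `np.split` because it accepts a row count that is not an exact multiple of the number of subsets.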
In [13]:
### Trying the first data set with a 15th degree polynomial
plot_lines(set1, 15)
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6', 'power_7', 'power_8', 'power_9', 'power_10', 'power_11', 'power_12', 'power_13', 'power_14', 'power_15']
coef [[ 3.05794168e-90 -1.69394900e-49  2.35908952e-55  1.21888065e-88
   3.23082736e-74  1.10358333e-70  8.37724029e-67  6.23411957e-63
   4.49156442e-59  3.06938763e-55  1.91749300e-51  1.01335180e-47
   3.62176959e-44 -5.63501661e-48  2.18641116e-52]]
intercept [539058.81866818]
In [14]:
### Trying the second data set with a 15th degree polynomial
plot_lines(set2, 15)
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6', 'power_7', 'power_8', 'power_9', 'power_10', 'power_11', 'power_12', 'power_13', 'power_14', 'power_15']
coef [[ 2.71335943e-77  7.33542374e-39 -1.85052450e-44  1.39207185e-49
   5.73786189e-71  1.51934986e-58  3.64549609e-55  1.50416255e-51
   5.76015653e-48  1.95770493e-44  5.39396528e-41  9.40376341e-38
  -3.63529134e-41  4.65476514e-45 -1.97199988e-49]]
intercept [506913.13379136]
In [15]:
### Trying the third data set with a 15th degree polynomial
plot_lines(set3, 15)
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6', 'power_7', 'power_8', 'power_9', 'power_10', 'power_11', 'power_12', 'power_13', 'power_14', 'power_15']
coef [[ 2.83751934e-88 -7.80224148e-49 -1.38766434e-55  3.98272978e-59
   1.57170169e-72  4.27894908e-69  2.68009626e-65  1.63911362e-61
   9.66435015e-58  5.38044653e-54  2.72563636e-50  1.16253248e-46
   3.33756141e-43 -6.76238818e-47  3.43132932e-51]]
intercept [530874.31665333]
In [16]:
### Trying the fourth data set with a 15th degree polynomial
plot_lines(set4, 15)
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6', 'power_7', 'power_8', 'power_9', 'power_10', 'power_11', 'power_12', 'power_13', 'power_14', 'power_15']

Since the “best” polynomial degree is unknown to us, we will use a validation set to select it. For each degree from 1 to 15, we proceed by:

  • Building a polynomial data set using training_data['sqft_living'] as the feature and the current degree;
  • Adding training_data['price'] as a column to the polynomial data set;
  • Learning a model on the TRAINING data to predict 'price' from the polynomial data set at the current degree;
  • Computing the RSS on the VALIDATION data for the current model (print or save the RSS).
In [17]:
def model_poly(dataset, deg, val_data):
    # Fit a polynomial of degree `deg` on `dataset` and return the model
    # together with the matching feature columns of `val_data`
    data = polynominal_dataframe('sqft_living', deg, dataset)
    data_v = polynominal_dataframe('sqft_living', deg, val_data)
    y = data['price'].values.reshape(-1,1)
    arr_x = ['power_' + str(i) for i in range(1, deg + 1)]
    print(arr_x)
    x = data[arr_x]
    model_poly1 = reg.fit(x, y)
    x_val = data_v[arr_x]
    return model_poly1, x_val


def testing(training_data, validation_data, degree):
    # The held-out targets do not depend on the degree, so compute them once
    y_val = validation_data['price'].values.reshape(-1,1)
    for i in range(1,degree+1):
        model1, x_val1 = model_poly(training_data, i, validation_data)
        print('Model:', i)
        print('Coefficients: \n', model1.coef_)
        print('Intercept: \n', model1.intercept_)
        print(x_val1.shape)
        print(y_val.shape)
        # Residual sum of squares on the held-out data
        RSS_val = ((y_val - model1.predict(x_val1))**2).sum()
        print("{:.2e}".format(RSS_val))
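
The same selection loop can be written so that it returns the RSS for every degree and picks the smallest one. The sketch below is a self-contained illustration, not the notebook's code: it uses synthetic data in place of the House files, NumPy's `np.polyfit` in place of scikit-learn's LinearRegression, and the hypothetical helper names `rss` and `select_degree`.

```python
import numpy as np


def rss(y_true, y_pred):
    """Residual sum of squares."""
    return float(np.sum((y_true - y_pred) ** 2))


def select_degree(x_train, y_train, x_valid, y_valid, max_degree):
    """Fit one polynomial per degree on the training data and return
    (best_degree, {degree: validation RSS})."""
    scores = {}
    for degree in range(1, max_degree + 1):
        # Fit on the TRAINING data only
        coefs = np.polyfit(x_train, y_train, degree)
        # Score on the VALIDATION data
        scores[degree] = rss(y_valid, np.polyval(coefs, x_valid))
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic data: a noisy quadratic on [-1, 1] (an assumption for illustration)
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 300)
y = 1.0 + 2.0 * x + 3.0 * x ** 2 + rng.normal(0.0, 0.1, 300)

best_degree, scores = select_degree(x[:200], y[:200], x[200:], y[200:], max_degree=6)
print('validation RSS by degree:', scores)
print('selected degree:', best_degree)
```

Returning the scores as a dictionary keeps the selection step separate from the printing, so the chosen degree can be passed straight to the final test evaluation.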
In [18]:
traning_wk3 = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_train_data.csv', dtype = dtype_dict)
val_wk3 = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_valid_data.csv', dtype = dtype_dict)
test_wk3 = pd.read_csv('C:/Users/Duy Tung/Downloads/wk3_kc_house_test_data.csv', dtype = dtype_dict)
In [19]:
testing(traning_wk3, val_wk3, 15)
['power_1']
Model: 1
Coefficients: 
 [[288.59846375]]
Intercept: 
 [-59493.31716521]
(9635, 1)
(9635, 1)
6.29e+14
['power_1', 'power_2']
Model: 2
Coefficients: 
 [[1.22673842 0.0522949 ]]
Intercept: 
 [267506.28013224]
(9635, 2)
(9635, 1)
6.24e+14
['power_1', 'power_2', 'power_3']
Model: 3
Coefficients: 
 [[7.50292074e+00 5.03063603e-02 1.67090667e-07]]
Intercept: 
 [262170.64833998]
(9635, 3)
(9635, 1)
6.26e+14
['power_1', 'power_2', 'power_3', 'power_4']
Model: 4
Coefficients: 
 [[-1.53852689e+01  6.08970906e-02 -1.61496576e-06  9.11725213e-11]]
Intercept: 
 [277368.45598654]
(9635, 4)
(9635, 1)
6.30e+14
['power_1', 'power_2', 'power_3', 'power_4', 'power_5']
Model: 5
Coefficients: 
 [[ 3.19801946e-05  5.44553066e-02 -6.43007280e-07  4.83000202e-11
  -2.30348916e-16]]
Intercept: 
 [266164.86279303]
(9635, 5)
(9635, 1)
6.28e+14
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6']
Model: 6
Coefficients: 
 [[ 8.50063329e-12  1.43264294e-08  3.79412711e-05 -9.89794250e-09
   1.06074707e-12 -3.90874341e-17]]
Intercept: 
 [297506.75687836]
(9635, 6)
(9635, 1)
5.66e+14
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6', 'power_7']
Model: 7
Coefficients: 
 [[ 8.80209599e-19  1.06735191e-12  8.65239821e-12  1.40282516e-08
  -4.30479933e-12  4.64796174e-16 -1.67447771e-20]]
Intercept: 
 [344491.52315947]
(9635, 7)
(9635, 1)
1.07e+15
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6', 'power_7', 'power_8']
Model: 8
Coefficients: 
 [[ 5.37676814e-26 -5.00854864e-15  7.20257201e-19  2.17475857e-15
   3.74718437e-12 -1.17819508e-15  1.24986835e-19 -4.38361168e-24]]
Intercept: 
 [391486.56180874]
(9635, 8)
(9635, 1)
7.09e+15
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6', 'power_7', 'power_8', 'power_9']
Model: 9
Coefficients: 
 [[ 2.35440809e-33  5.55324966e-18  2.62095377e-23  1.61351476e-22
   4.53006377e-19  8.23523085e-16 -2.58575972e-19  2.69914449e-23
  -9.28485548e-28]]
Intercept: 
 [431228.23045065]
(9635, 9)
(9635, 1)
4.53e+16
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6', 'power_7', 'power_8', 'power_9', 'power_10']
Model: 10
Coefficients: 
 [[ 8.10194270e-41  1.71840710e-21  7.50379732e-27  8.37647524e-30
   2.82989842e-26  8.32511110e-23  1.58082448e-19 -4.93871231e-23
   5.10007290e-27 -1.73354064e-31]]
Intercept: 
 [461935.75119527]
(9635, 10)
(9635, 1)
2.48e+17
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6', 'power_7', 'power_8', 'power_9', 'power_10', 'power_11']
Model: 11
Coefficients: 
 [[ 2.32096276e-48  2.39357028e-25 -1.64821623e-30 -9.33016886e-35
   1.22821093e-33  4.56464992e-30  1.39644941e-26  2.74435909e-23
  -8.53531134e-27  8.75400297e-31 -2.95407695e-35]]
Intercept: 
 [484243.26382819]
(9635, 11)
(9635, 1)
1.19e+18
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6', 'power_7', 'power_8', 'power_9', 'power_10', 'power_11', 'power_12']
Model: 12
Coefficients: 
 [[ 5.72477700e-56 -5.43112074e-28  1.90876964e-34 -2.49593397e-39
   4.20354136e-41  1.78635499e-37  6.87060807e-34  2.16675289e-30
   4.37085447e-27 -1.35480626e-30  1.38346810e-34 -4.64795591e-39]]
Intercept: 
 [500337.65157448]
(9635, 12)
(9635, 1)
5.10e+18
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6', 'power_7', 'power_8', 'power_9', 'power_10', 'power_11', 'power_12', 'power_13']
Model: 13
Coefficients: 
 [[ 1.51503805e-73 -5.39749807e-42  2.85697086e-47 -1.33638236e-51
   8.66478085e-58  2.61981981e-54  1.65074167e-50  9.86650488e-47
   5.38893090e-43  2.48790292e-39  7.75959524e-36 -1.39618816e-39
   6.24901286e-44]]
Intercept: 
 [531845.57492908]
(9635, 13)
(9635, 1)
7.62e+17
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6', 'power_7', 'power_8', 'power_9', 'power_10', 'power_11', 'power_12', 'power_13', 'power_14']
Model: 14
Coefficients: 
 [[ 2.35065885e-81  8.24655512e-46 -5.90276221e-51 -3.26265223e-55
   1.65004191e-65  4.76035719e-62  3.14955042e-58  2.01353491e-54
   1.21821264e-50  6.72101619e-47  3.12875226e-43  9.82528405e-40
  -1.77391184e-43  7.95992831e-48]]
Intercept: 
 [533699.97470524]
(9635, 14)
(9635, 1)
2.30e+18
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6', 'power_7', 'power_8', 'power_9', 'power_10', 'power_11', 'power_12', 'power_13', 'power_14', 'power_15']
Model: 15
Coefficients: 
 [[ 3.65649652e-89 -3.49017023e-49 -4.97900608e-55  1.23631576e-86
   3.04293598e-73  8.36182560e-70  5.73133489e-66  3.83786524e-62
   2.47884969e-58  1.51269750e-54  8.40594099e-51  3.93663091e-47
   1.24240972e-43 -2.25041880e-47  1.01236550e-51]]
Intercept: 
 [534979.81242518]
(9635, 15)
(9635, 1)
6.96e+18

Degree 6 has the lowest RSS on the validation data (5.66e+14), so we assess polynomials up to degree 6 on the test data:

In [20]:
testing(traning_wk3, test_wk3, 6)
['power_1']
Model: 1
Coefficients: 
 [[288.59846375]]
Intercept: 
 [-59493.31716521]
(2217, 1)
(2217, 1)
1.42e+14
['power_1', 'power_2']
Model: 2
Coefficients: 
 [[1.22673842 0.0522949 ]]
Intercept: 
 [267506.28013224]
(2217, 2)
(2217, 1)
1.36e+14
['power_1', 'power_2', 'power_3']
Model: 3
Coefficients: 
 [[7.50292074e+00 5.03063603e-02 1.67090667e-07]]
Intercept: 
 [262170.64833998]
(2217, 3)
(2217, 1)
1.36e+14
['power_1', 'power_2', 'power_3', 'power_4']
Model: 4
Coefficients: 
 [[-1.53852689e+01  6.08970906e-02 -1.61496576e-06  9.11725213e-11]]
Intercept: 
 [277368.45598654]
(2217, 4)
(2217, 1)
1.35e+14
['power_1', 'power_2', 'power_3', 'power_4', 'power_5']
Model: 5
Coefficients: 
 [[ 3.19801946e-05  5.44553066e-02 -6.43007280e-07  4.83000202e-11
  -2.30348916e-16]]
Intercept: 
 [266164.86279303]
(2217, 5)
(2217, 1)
1.35e+14
['power_1', 'power_2', 'power_3', 'power_4', 'power_5', 'power_6']
Model: 6
Coefficients: 
 [[ 8.50063329e-12  1.43264294e-08  3.79412711e-05 -9.89794250e-09
   1.06074707e-12 -3.90874341e-17]]
Intercept: 
 [297506.75687836]
(2217, 6)
(2217, 1)
1.35e+14
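
RSS grows with the number of rows, so the validation figures (9635 rows) and the test figures (2217 rows) above are not directly comparable. One way to put them on the same scale is the root mean squared error, RMSE = sqrt(RSS / n). The sketch below applies it to the rounded degree 6 values printed above. The helper name `rmse_from_rss` is an assumption, and the results are only as precise as the three significant figures of the printed RSS.

```python
import math


def rmse_from_rss(rss, n_rows):
    """Root mean squared error from a residual sum of squares (hypothetical helper)."""
    return math.sqrt(rss / n_rows)


# Degree 6 figures as printed above (rounded to three significant figures)
valid_rmse = rmse_from_rss(5.66e14, 9635)
test_rmse = rmse_from_rss(1.35e14, 2217)

print('validation RMSE: {:,.0f}'.format(valid_rmse))
print('test RMSE:       {:,.0f}'.format(test_rmse))
```

Both values come out between 240,000 and 250,000 in the units of 'price', which suggests the degree 6 model's typical error on the test data is close to its error on the validation data.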