Multiple Regression with Scikit-Learn¶

In this lesson, we study what linear regression is and how it can be implemented for multiple variables using Scikit-Learn, which is one of the most popular machine learning libraries for Python. We will use data on house sales in King County (Seattle, WA) to predict prices using multiple regression.

Learning objectives:

Using Pandas package to read and manage data;
Using Scikit-Learn to build up model and compute the regression weights;
Computing the Residual Sum of Squares;
Looking at coefficients and interpreting their meanings;
Evaluating multiple models via RSS.

Before we start, importing libraries, loading the King County house price dataset, and then storing the data in a data frame.

import pandas as pd
import numpy as np 
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 
              'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 
              'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 
              'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
data1 = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_train_data.csv', dtype= dtype_dict)
data_train = pd.DataFrame(data1)
data2 = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_test_data.csv', dtype= dtype_dict)
data_test = pd.DataFrame(data2)
data = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_data.csv', dtype= dtype_dict)
data = pd.DataFrame(data)

data_train.head()

1. Data Set-up¶

Although we often think of multiple regression as including multiple different features (e.g. # of bedrooms, squarefeet, and # of bathrooms) but we can also consider transformations of existing features e.g. the log of the squarefeet or even "interaction" features such as the product of bedrooms and bathrooms. We will use the logarithm function to create a new feature.

Creating new variables for training and testing data.
Reporting some statistics.

1.1 Creating new variables¶

# Training data
data_train['bedrooms_squared'] = data_train.bedrooms**2
data_train['bed_bath_rooms'] = data_train.bedrooms*data_train.bathrooms
data_train['log_sqft_living'] = np.log(data_train.sqft_living)
data_train['lat_plus_long'] = data_train.lat+data_train.long

# Testing data
data_test['bedrooms_squared'] = data_test.bedrooms**2
data_test['bed_bath_rooms'] = data_test.bedrooms*data_train.bathrooms
data_test['log_sqft_living'] = np.log(data_test.sqft_living)
data_test['lat_plus_long'] = data_test.lat+data_train.long

1.2 Reporting some statistics¶

what are the mean (arithmetic average) values of your 4 new variables on TEST data?

print(np.mean(data_test.bedrooms_squared))
print(np.mean(data_test.bed_bath_rooms))
print(np.mean(data_test.log_sqft_living))
print(np.mean(data_test.lat_plus_long))

12.4466777015843
6.967427287774887
7.550274679645921
-74.6531748876802

2. Learning Multiple Models¶

Estimate the regression coefficients/weights for predicting ‘price’ for the following three models (traning data):

Model 1: ‘sqft_living’, ‘bedrooms’, ‘bathrooms’, ‘lat’, and ‘long’
Model 2: ‘sqft_living’, ‘bedrooms’, ‘bathrooms’, ‘lat’,‘long’, and ‘bed_bath_rooms’
Model 3: ‘sqft_living’, ‘bedrooms’, ‘bathrooms’, ‘lat’,‘long’, ‘bed_bath_rooms’, ‘bedrooms_squared’, ‘log_sqft_living’, and ‘lat_plus_long’

2.1 Creating set of features for each model¶

var1 = data_train[['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']]
var2 = data_train[['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', 'bed_bath_rooms']]
var3 = data_train[['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', 'bed_bath_rooms', 
                   'log_sqft_living', 'lat_plus_long']]
y = data_train[['price']]

2.2 Regression¶

Now that you have the features, learn the weights for the three different models for predicting target = 'price' using linear_model.LinearRegression() and look at the value of the weights/coefficients:

reg = linear_model.LinearRegression()
model1 = reg.fit(var1, y)
print("The weights for features in model 1:",model1.coef_)

The weights for features in model 1: [[ 3.12258646e+02 -5.95865332e+04  1.57067421e+04  6.58619264e+05
  -3.09374351e+05]]

model2 = reg.fit(var2, y)
print("The weights for features in model 2:",model2.coef_)

The weights for features in model 2: [[ 3.06610053e+02 -1.13446368e+05 -7.14613083e+04  6.54844630e+05
  -2.94298969e+05  2.55796520e+04]]

model3 = reg.fit(var3, y)
print("The weights for features in model 3:",model3.coef_)

The weights for features in model 3: [[ 5.32283566e+02  1.19729630e+03  9.43375296e+04  5.30911958e+05
  -4.03638426e+05 -1.62724577e+04 -5.64708989e+05  1.27273532e+05]]

2.3 Making Predictions¶

Once a model is built we can use the .predict() function to find the predicted values for data we pass. For example using the example model above.

### Model 1
model1 = reg.fit(var1, y)
y_model1_predicted = model1.predict(var1)
print("Example:", y_model1_predicted[0])

Example: [244657.18811044]

### Model 2
model1 = reg.fit(var2, y)
y_model2_predicted = model2.predict(var2)
print("Example:", y_model2_predicted[0])

Example: [251332.76667794]

### Model 2
model3 = reg.fit(var3, y)
y_model3_predicted = model3.predict(var3)
print("Example:", y_model3_predicted[0])

Example: [277659.33514468]

2.4 Evaluating the performance of the models¶

After learning three models and making predictions, we will use these estimated models to compute the RSS (Residual Sum of Squares) on the Training data.

rss_model1 = mean_squared_error(y, y_model1_predicted)
print("The RSS for model 1:",rss_model1)

The RSS for model 1: 55676481997.78795

rss_model2 = mean_squared_error(y, y_model2_predicted)
print("The RSS for model 2:",rss_model2)

The RSS for model 2: 55132284576.28106

rss_model3 = mean_squared_error(y, y_model3_predicted)
print("The RSS for model 2:",rss_model2)

The RSS for model 2: 55132284576.28106

	id	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	...	grade	sqft_above	sqft_basement	yr_built	yr_renovated	zipcode	lat	long	sqft_living15	sqft_lot15
0	7129300520	20141013T000000	221900.0	3.0	1.00	1180.0	5650	1	...	7	1180	0	1955	0	98178	47.5112	-122.257	1340.0	5650.0
1	6414100192	20141209T000000	538000.0	3.0	2.25	2570.0	7242	2	...	7	2170	400	1951	1991	98125	47.7210	-122.319	1690.0	7639.0
2	5631500400	20150225T000000	180000.0	2.0	1.00	770.0	10000	1	...	6	770	0	1933	0	98028	47.7379	-122.233	2720.0	8062.0
3	2487200875	20141209T000000	604000.0	4.0	3.00	1960.0	5000	1	...	7	1050	910	1965	0	98136	47.5208	-122.393	1360.0	5000.0
4	1954400510	20150218T000000	510000.0	3.0	2.00	1680.0	8080	1	...	8	1680	0	1987	0	98074	47.6168	-122.045	1800.0	7503.0