Multiple Regression with Scikit-Learn

In this lesson, we study what linear regression is and how it can be implemented for multiple variables using Scikit-Learn, which is one of the most popular machine learning libraries for Python. We will use data on house sales in King County (Seattle, WA) to predict prices using multiple regression.

Learning objectives:

  • Using Pandas package to read and manage data;
  • Using Scikit-Learn to build up model and compute the regression weights;
  • Computing the Residual Sum of Squares;
  • Looking at coefficients and interpreting their meanings;
  • Evaluating multiple models via RSS.

Before we start, importing libraries, loading the King County house price dataset, and then storing the data in a data frame.

In [1]:
import pandas as pd
import numpy as np 
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
In [2]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int, 
              'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float, 
              'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 
              'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
data1 = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_train_data.csv', dtype= dtype_dict)
data_train = pd.DataFrame(data1)
data2 = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_test_data.csv', dtype= dtype_dict)
data_test = pd.DataFrame(data2)
data = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_data.csv', dtype= dtype_dict)
data = pd.DataFrame(data)
In [3]:
data_train.head()
Out[3]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 20141013T000000 221900.0 3.0 1.00 1180.0 5650 1 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340.0 5650.0
1 6414100192 20141209T000000 538000.0 3.0 2.25 2570.0 7242 2 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690.0 7639.0
2 5631500400 20150225T000000 180000.0 2.0 1.00 770.0 10000 1 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720.0 8062.0
3 2487200875 20141209T000000 604000.0 4.0 3.00 1960.0 5000 1 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360.0 5000.0
4 1954400510 20150218T000000 510000.0 3.0 2.00 1680.0 8080 1 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800.0 7503.0

5 rows × 21 columns

1. Data Set-up

Although we often think of multiple regression as including multiple different features (e.g. # of bedrooms, squarefeet, and # of bathrooms) but we can also consider transformations of existing features e.g. the log of the squarefeet or even "interaction" features such as the product of bedrooms and bathrooms. We will use the logarithm function to create a new feature.

  • Creating new variables for training and testing data.
  • Reporting some statistics.
1.1 Creating new variables
In [4]:
# Training data
data_train['bedrooms_squared'] = data_train.bedrooms**2
data_train['bed_bath_rooms'] = data_train.bedrooms*data_train.bathrooms
data_train['log_sqft_living'] = np.log(data_train.sqft_living)
data_train['lat_plus_long'] = data_train.lat+data_train.long

# Testing data
data_test['bedrooms_squared'] = data_test.bedrooms**2
data_test['bed_bath_rooms'] = data_test.bedrooms*data_train.bathrooms
data_test['log_sqft_living'] = np.log(data_test.sqft_living)
data_test['lat_plus_long'] = data_test.lat+data_train.long
1.2 Reporting some statistics

what are the mean (arithmetic average) values of your 4 new variables on TEST data?

In [5]:
print(np.mean(data_test.bedrooms_squared))
print(np.mean(data_test.bed_bath_rooms))
print(np.mean(data_test.log_sqft_living))
print(np.mean(data_test.lat_plus_long))
12.4466777015843
6.967427287774887
7.550274679645921
-74.6531748876802

2. Learning Multiple Models

Estimate the regression coefficients/weights for predicting ‘price’ for the following three models (traning data):

  • Model 1: ‘sqft_living’, ‘bedrooms’, ‘bathrooms’, ‘lat’, and ‘long’
  • Model 2: ‘sqft_living’, ‘bedrooms’, ‘bathrooms’, ‘lat’,‘long’, and ‘bed_bath_rooms’
  • Model 3: ‘sqft_living’, ‘bedrooms’, ‘bathrooms’, ‘lat’,‘long’, ‘bed_bath_rooms’, ‘bedrooms_squared’, ‘log_sqft_living’, and ‘lat_plus_long’
2.1 Creating set of features for each model
In [6]:
var1 = data_train[['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']]
var2 = data_train[['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', 'bed_bath_rooms']]
var3 = data_train[['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', 'bed_bath_rooms', 
                   'log_sqft_living', 'lat_plus_long']]
y = data_train[['price']]
2.2 Regression

Now that you have the features, learn the weights for the three different models for predicting target = 'price' using linear_model.LinearRegression() and look at the value of the weights/coefficients:

In [7]:
reg = linear_model.LinearRegression()
model1 = reg.fit(var1, y)
print("The weights for features in model 1:",model1.coef_)
The weights for features in model 1: [[ 3.12258646e+02 -5.95865332e+04  1.57067421e+04  6.58619264e+05
  -3.09374351e+05]]
In [8]:
model2 = reg.fit(var2, y)
print("The weights for features in model 2:",model2.coef_)
The weights for features in model 2: [[ 3.06610053e+02 -1.13446368e+05 -7.14613083e+04  6.54844630e+05
  -2.94298969e+05  2.55796520e+04]]
In [9]:
model3 = reg.fit(var3, y)
print("The weights for features in model 3:",model3.coef_)
The weights for features in model 3: [[ 5.32283566e+02  1.19729630e+03  9.43375296e+04  5.30911958e+05
  -4.03638426e+05 -1.62724577e+04 -5.64708989e+05  1.27273532e+05]]
2.3 Making Predictions

Once a model is built we can use the .predict() function to find the predicted values for data we pass. For example using the example model above.

In [10]:
### Model 1
model1 = reg.fit(var1, y)
y_model1_predicted = model1.predict(var1)
print("Example:", y_model1_predicted[0])
Example: [244657.18811044]
In [11]:
### Model 2
model1 = reg.fit(var2, y)
y_model2_predicted = model2.predict(var2)
print("Example:", y_model2_predicted[0])
Example: [251332.76667794]
In [12]:
### Model 2
model3 = reg.fit(var3, y)
y_model3_predicted = model3.predict(var3)
print("Example:", y_model3_predicted[0])
Example: [277659.33514468]
2.4 Evaluating the performance of the models

After learning three models and making predictions, we will use these estimated models to compute the RSS (Residual Sum of Squares) on the Training data.

In [13]:
rss_model1 = mean_squared_error(y, y_model1_predicted)
print("The RSS for model 1:",rss_model1)
The RSS for model 1: 55676481997.78795
In [14]:
rss_model2 = mean_squared_error(y, y_model2_predicted)
print("The RSS for model 2:",rss_model2)
The RSS for model 2: 55132284576.28106
In [15]:
rss_model3 = mean_squared_error(y, y_model3_predicted)
print("The RSS for model 2:",rss_model2)
The RSS for model 2: 55132284576.28106