In this lesson, we study what linear regression is and how it can be implemented for multiple variables using Scikit-Learn, which is one of the most popular machine learning libraries for Python. We will use data on house sales in King County (Seattle, WA) to predict prices using multiple regression.
Learning objectives:
Before we start, we import the libraries, load the King County house price data sets, and store them in data frames.
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 'grade':int,
'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 'sqft_lot15':float,
'sqft_living':float, 'floors':str, 'condition':int, 'lat':float, 'date':str, 'sqft_basement':int,
'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}
# pd.read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapping is needed
data_train = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_train_data.csv', dtype=dtype_dict)
data_test = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_test_data.csv', dtype=dtype_dict)
data = pd.read_csv('C:/Users/Duy Tung/Downloads/kc_house_data.csv', dtype=dtype_dict)
data_train.head()
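The effect of the dtype argument can be seen on a small in-memory example. This is only a minimal sketch: the two-row CSV text and the shortened dtype mapping (small_dtypes) are made up for illustration and are not taken from the King County files.

```python
import io
import pandas as pd

# Made-up two-row CSV standing in for a real file (illustrative values only)
csv_text = "id,zipcode,bedrooms,price\n0001,98001,3,300000\n0002,98002,2,450000\n"

# A shortened, hypothetical version of dtype_dict
small_dtypes = {'id': str, 'zipcode': str, 'bedrooms': float, 'price': float}

# read_csv returns a DataFrame directly
df = pd.read_csv(io.StringIO(csv_text), dtype=small_dtypes)

print(type(df).__name__)
print(df['id'].tolist())   # leading zeros survive because id is read as str
print(df.dtypes)
```

Reading identifier-like columns such as id and zipcode as strings keeps them from being treated as numbers.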
We often think of multiple regression as including several different features (e.g. number of bedrooms, square feet, and number of bathrooms), but we can also consider transformations of existing features, e.g. the log of the square feet, or even "interaction" features such as the product of bedrooms and bathrooms. Below we create four new features: bedrooms squared, bedrooms times bathrooms, the log of the living area, and latitude plus longitude.
# Training data
data_train['bedrooms_squared'] = data_train.bedrooms**2
data_train['bed_bath_rooms'] = data_train.bedrooms*data_train.bathrooms
data_train['log_sqft_living'] = np.log(data_train.sqft_living)
data_train['lat_plus_long'] = data_train.lat+data_train.long
# Testing data
data_test['bedrooms_squared'] = data_test.bedrooms**2
# Every test feature must be built from test columns only (not from data_train)
data_test['bed_bath_rooms'] = data_test.bedrooms*data_test.bathrooms
data_test['log_sqft_living'] = np.log(data_test.sqft_living)
data_test['lat_plus_long'] = data_test.lat+data_test.long
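Because the same four transformations are applied to both data sets, one way to keep them in sync is to put them in a single function. The following is a minimal sketch under assumptions: add_features is a hypothetical helper name, and the toy frame contains made-up rows rather than real King County data.

```python
import numpy as np
import pandas as pd

def add_features(df):
    """Return a copy of df with the four derived columns added (hypothetical helper)."""
    out = df.copy()
    out['bedrooms_squared'] = out['bedrooms'] ** 2
    out['bed_bath_rooms'] = out['bedrooms'] * out['bathrooms']
    out['log_sqft_living'] = np.log(out['sqft_living'])
    out['lat_plus_long'] = out['lat'] + out['long']
    return out

# Made-up rows for illustration only
toy = pd.DataFrame({
    'bedrooms': [3.0, 2.0],
    'bathrooms': [1.0, 2.5],
    'sqft_living': [1000.0, 2000.0],
    'lat': [47.5, 47.7],
    'long': [-122.25, -122.5],
})
toy_features = add_features(toy)
print(toy_features[['bedrooms_squared', 'bed_bath_rooms', 'log_sqft_living', 'lat_plus_long']])
```

With such a helper, calling add_features(data_train) and add_features(data_test) would apply identical transformations to both sets, each using only its own columns.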
What are the mean (arithmetic average) values of the four new variables on the TEST data?
print(np.mean(data_test.bedrooms_squared))
print(np.mean(data_test.bed_bath_rooms))
print(np.mean(data_test.log_sqft_living))
print(np.mean(data_test.lat_plus_long))
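The same averages can also be obtained in one call by selecting the columns and using DataFrame.mean(). A small sketch, assuming a made-up frame (toy_test) with only two of the new columns:

```python
import pandas as pd

# Made-up values for illustration only
toy_test = pd.DataFrame({
    'bedrooms_squared': [9.0, 4.0, 16.0],
    'bed_bath_rooms': [3.0, 5.0, 10.0],
})

new_columns = ['bedrooms_squared', 'bed_bath_rooms']

# mean() on a DataFrame returns one average per selected column
means = toy_test[new_columns].mean()
print(means)
```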
Estimate the regression coefficients/weights for predicting 'price' for the following three models (training data):
var1 = data_train[['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']]
var2 = data_train[['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', 'bed_bath_rooms']]
var3 = data_train[['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', 'bed_bath_rooms',
'log_sqft_living', 'lat_plus_long']]
y = data_train[['price']]
Now that you have the features, learn the weights for the three different models for predicting target = 'price' using linear_model.LinearRegression() and look at the value of the weights/coefficients:
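reg = linear_model.LinearRegression()
# fit() returns the estimator itself, so model1, model2, and model3 all refer to reg.
# Each set of weights is therefore printed right after its fit, before the next fit replaces it.
model1 = reg.fit(var1, y)
print("The weights for features in model 1:", model1.coef_)
model2 = reg.fit(var2, y)
print("The weights for features in model 2:", model2.coef_)
model3 = reg.fit(var3, y)
print("The weights for features in model 3:", model3.coef_)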
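A raw coefficient array is easier to read when each weight is paired with its feature name. The sketch below assumes synthetic, noise-free data with hypothetical column names (x1, x2, target), so the recovered weights can be compared with the known ones used to build the target.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

rng = np.random.default_rng(0)
X = pd.DataFrame({'x1': rng.normal(size=50), 'x2': rng.normal(size=50)})

# Noise-free target built from known weights: target = 3*x1 - 2*x2 + 5
y_toy = pd.DataFrame({'target': 3.0 * X['x1'] - 2.0 * X['x2'] + 5.0})

toy_model = linear_model.LinearRegression().fit(X, y_toy)

# With a 2-D target, coef_ has shape (1, n_features); ravel() flattens it
weights = pd.Series(toy_model.coef_.ravel(), index=X.columns)
print(weights)
print("Intercept:", toy_model.intercept_)
```

The same pattern should carry over to the house data, e.g. building a Series from a fitted model's coef_ and the column names of the feature frame it was fit on.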
Once a model is built, we can use its .predict() method to compute predicted values for the data we pass in. For example, using the three models above:
# reg is refit before each prediction so that it matches the feature set being predicted
### Model 1
model1 = reg.fit(var1, y)
y_model1_predicted = model1.predict(var1)
print("Example:", y_model1_predicted[0])
### Model 2
model2 = reg.fit(var2, y)
y_model2_predicted = model2.predict(var2)
print("Example:", y_model2_predicted[0])
### Model 3
model3 = reg.fit(var3, y)
y_model3_predicted = model3.predict(var3)
print("Example:", y_model3_predicted[0])
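One detail of scikit-learn worth keeping in mind here is that fit() returns the estimator it was called on, so fitting one estimator several times yields several names for a single object. The sketch below illustrates this on random toy arrays (all names are hypothetical) and shows one common alternative, a separate estimator per model.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(1)
X_a = rng.normal(size=(30, 1))   # one feature
X_b = rng.normal(size=(30, 2))   # two features
y_toy = rng.normal(size=30)

shared = linear_model.LinearRegression()
first = shared.fit(X_a, y_toy)
second = shared.fit(X_b, y_toy)
print(first is second)           # both names refer to the same estimator
print(first.coef_.shape)         # reflects the most recent fit (two features)

# Separate estimators keep their own fitted weights
model_a = linear_model.LinearRegression().fit(X_a, y_toy)
model_b = linear_model.LinearRegression().fit(X_b, y_toy)
print(model_a.coef_.shape, model_b.coef_.shape)
```

With separate estimators, each model can be used for prediction later without being refit.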
After learning three models and making predictions, we will use these estimated models to compute the RSS (Residual Sum of Squares) on the Training data.
# RSS is the sum of squared residuals; mean_squared_error returns RSS / n, so multiply by n
n = len(y)
rss_model1 = mean_squared_error(y, y_model1_predicted) * n
print("The RSS for model 1:", rss_model1)
rss_model2 = mean_squared_error(y, y_model2_predicted) * n
print("The RSS for model 2:", rss_model2)
rss_model3 = mean_squared_error(y, y_model3_predicted) * n
print("The RSS for model 3:", rss_model3)
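RSS and mean squared error are related but not identical: RSS is the sum of the squared residuals, while the MSE divides that sum by the number of observations. A small check on made-up numbers (y_true and y_pred are illustrative arrays, not model output):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up observed and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true - y_pred                 # [1, 0, -1, -2]
rss = np.sum(residuals ** 2)                # 1 + 0 + 1 + 4 = 6
mse = mean_squared_error(y_true, y_pred)    # 6 / 4 = 1.5

print("RSS:", rss)
print("MSE:", mse)
print("MSE * n:", mse * len(y_true))        # equals the RSS
```

Since all three models here are evaluated on the same training set, ranking them by MSE or by RSS gives the same order; only the scale differs.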