Project Linear Regression: Boston House Price Prediction
Welcome to the project on Linear Regression. We will use the Boston house price data for the exercise.
The problem at hand is to predict the housing prices of a town or a suburb based on the features of the locality provided to us. In the process, we need to identify the most important features in the dataset. We will employ data preprocessing techniques and build a linear regression model that predicts the prices for us.
Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. Detailed attribute information can be found below.
Attribute Information (in order):
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- LSTAT: % lower status of the population
- MEDV: median value of owner-occupied homes in $1000s (the target variable)
# import libraries for data manipulation
import pandas as pd
import numpy as np
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.gofplots import ProbPlot
# import libraries for building linear regression model
from statsmodels.formula.api import ols
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
# import library for preparing data
from sklearn.model_selection import train_test_split
# import library for data preprocessing
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv("Boston.csv")
df.head()
   CRIM     ZN    INDUS  CHAS  NOX    RM     AGE   DIS     RAD  TAX  PTRATIO  LSTAT  MEDV
0  0.00632  18.0  2.31   0     0.538  6.575  65.2  4.0900  1    296  15.3     4.98   24.0
1  0.02731  0.0   7.07   0     0.469  6.421  78.9  4.9671  2    242  17.8     9.14   21.6
2  0.02729  0.0   7.07   0     0.469  7.185  61.1  4.9671  2    242  17.8     4.03   34.7
3  0.03237  0.0   2.18   0     0.458  6.998  45.8  6.0622  3    222  18.7     2.94   33.4
4  0.06905  0.0   2.18   0     0.458  7.147  54.2  6.0622  3    222  18.7     5.33   36.2
Observations
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 506 non-null float64
1 ZN 506 non-null float64
2 INDUS 506 non-null float64
3 CHAS 506 non-null int64
4 NOX 506 non-null float64
5 RM 506 non-null float64
6 AGE 506 non-null float64
7 DIS 506 non-null float64
8 RAD 506 non-null int64
9 TAX 506 non-null int64
10 PTRATIO 506 non-null float64
11 LSTAT 506 non-null float64
12 MEDV 506 non-null float64
dtypes: float64(10), int64(3)
memory usage: 51.5 KB
Observations
#write your code here
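A possible completion for the cell above (an assumption, since the intended check isn't specified): summary statistics of the numeric columns and a quick missing-value check.
# check for missing values in each column
print(df.isnull().sum())
# summary statistics of all numeric columns
df.describe().T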
Observations:____
Before performing the modeling, it is important to check the univariate distribution of the variables.
# let's plot all the columns to look at their distributions
for i in df.columns:
    plt.figure(figsize=(7, 4))
    sns.histplot(data=df, x=i, kde=True)
    plt.show()
Observations
As the dependent variable is slightly skewed, we will apply a log transformation on the 'MEDV' column and check the distribution of the transformed column.
df['MEDV_log'] = np.log(df['MEDV'])
sns.histplot(data=df, x='MEDV_log', kde = True)
Observations
Before creating the linear regression model, it is important to check the bivariate relationships between the variables. Let's do this using a heatmap and scatterplots.
plt.figure(figsize=(12,8))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(______________,annot=True,fmt='.2f',cmap=cmap ) #write your code here
plt.show()
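One way to fill the heatmap blank above, assuming a Pearson correlation matrix of the columns (all of which are numeric here) is intended:
# correlation matrix of the dataframe
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap=cmap)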
Observations:______
Now, we will visualize the relationship between the pairs of features having significant correlations.
# scatterplot to visualize the relationship between NOX and INDUS
plt.figure(figsize=(6, 6))
#___________________________ #write you code here
plt.show()
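A plausible completion for this blank (the two similar blanks that follow, for AGE vs. NOX and DIS vs. NOX, use the same call with the respective column pairs):
# scatterplot of NOX against INDUS
sns.scatterplot(x='NOX', y='INDUS', data=df)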
Observations:____
# scatterplot to visualize the relationship between AGE and NOX
plt.figure(figsize=(6, 6))
#_____________________________ #Write your code here
plt.show()
Observations:____
# scatterplot to visualize the relationship between DIS and NOX
plt.figure(figsize=(6, 6))
#_____________________________ #Write your code here
plt.show()
Observations:____
# scatterplot to visualize the relationship between AGE and DIS
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'AGE', y = 'DIS', data = df)
plt.show()
Observations:
# scatterplot to visualize the relationship between AGE and INDUS
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'AGE', y = 'INDUS', data = df)
plt.show()
Observations:
# scatterplot to visualize the relationship between RAD and TAX
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'RAD', y = 'TAX', data = df)
plt.show()
Observations:
Let's check the correlation after removing the outliers.
# remove the data corresponding to high tax rate
df1 = df[df['TAX'] < 600]
# import the required function
from scipy.stats import pearsonr
# calculate the correlation
print('The correlation between TAX and RAD is', pearsonr(df1['TAX'], df1['RAD'])[0])
The correlation between TAX and RAD is 0.24975731331429202
So the high correlation between TAX and RAD is driven by the outliers; the tax rate for some properties might be higher due to some other reason.
# scatterplot to visualize the relationship between INDUS and TAX
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'INDUS', y = 'TAX', data = df)
plt.show()
Observations:
# scatterplot to visualize the relationship between RM and MEDV
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'RM', y = 'MEDV', data = df)
plt.show()
Observations:
# scatterplot to visualize the relationship between LSTAT and MEDV
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'LSTAT', y = 'MEDV', data = df)
plt.show()
Observations:
We have seen that the variables LSTAT and RM have a linear relationship with the dependent variable MEDV. Also, there are significant relationships among a few independent variables, which is not desirable for a linear regression model. Let's first split the dataset.
Let's split the data into the dependent and independent variables, and further split it into train and test sets in a 70:30 ratio.
# separate the dependent and independent variable
Y = df['MEDV_log']
X = df.drop(columns=['MEDV', 'MEDV_log'])
# add the intercept term
X = sm.add_constant(X)
# splitting the data in 70:30 ratio of train to test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30 , random_state=1)
Next, we will check the multicollinearity in the train dataset.
We will use the Variance Inflation Factor (VIF), to check if there is multicollinearity in the data.
Features having a VIF score > 5 will be dropped/treated until all the features have a VIF score < 5.
from statsmodels.stats.outliers_influence import variance_inflation_factor
# function to check VIF
def checking_vif(train):
    vif = pd.DataFrame()
    vif["feature"] = train.columns
    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(train.values, i) for i in range(len(train.columns))
    ]
    return vif
print(checking_vif(X_train))
feature VIF
0 const 535.372593
1 CRIM 1.924114
2 ZN 2.743574
3 INDUS 3.999538
4 CHAS 1.076564
5 NOX 4.396157
6 RM 1.860950
7 AGE 3.150170
8 DIS 4.355469
9 RAD 8.345247
10 TAX 10.191941
11 PTRATIO 1.943409
12 LSTAT 2.861881
Observations:
# create the model after dropping TAX
X_train = #Write your code here
# check for VIF
print(checking_vif(X_train))
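One way to fill the blank above, assuming we drop TAX because it has the highest VIF (about 10.2) among the predictors:
# drop TAX from both splits to keep the train and test columns aligned
X_train = X_train.drop(columns=['TAX'])
X_test = X_test.drop(columns=['TAX'])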
Now that the VIF is less than 5 for all the independent variables, we can assume that multicollinearity has been removed, and we can create the linear regression model.
# create the model
model1 = #write your code here
# get the model summary
model1.summary()
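A minimal sketch for the blank above, assuming the statsmodels OLS API (the formula interface imported as ols would be an alternative):
# fit an ordinary least squares model of log(MEDV) on the remaining predictors
model1 = sm.OLS(y_train, X_train).fit()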
Observations:_____
It is not enough to fit a multiple regression model to the data; it is also necessary to check whether all the regression coefficients are significant, i.e., whether the population regression parameters are significantly different from zero.
From the above, it may be noted that the regression coefficients corresponding to ZN, AGE, and INDUS are not statistically significant at level α = 0.05. In other words, the regression coefficients corresponding to these three features are not significantly different from 0 in the population. Hence, we will eliminate these three features and create a new model.
# create the model after dropping columns 'MEDV', 'MEDV_log', 'TAX', 'ZN', 'AGE', 'INDUS' from df dataframe
Y = df['MEDV_log']
X = df.drop(_____________________________) #write your code here
X = sm.add_constant(X)
#splitting the data in 70:30 ratio of train to test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30 , random_state=1)
# create the model
model2 = __________________________ #write your code here
# get the model summary
model2.summary()
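A possible completion of the two blanks in the cell above, dropping the target columns along with TAX, ZN, AGE, and INDUS, and refitting with statsmodels OLS:
# blank 1: drop the target columns and the collinear/insignificant features
X = df.drop(columns=['MEDV', 'MEDV_log', 'TAX', 'ZN', 'AGE', 'INDUS'])
# blank 2 (after add_constant and train_test_split): refit OLS on the reduced feature set
model2 = sm.OLS(y_train, X_train).fit()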
Observations:
Now, we will check the linear regression assumptions.
residuals =
# Write your code here
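A possible completion, assuming we take the residuals of model2 and verify that their mean is close to zero:
# residuals of the second model
residuals = model2.resid
# the mean of the residuals should be approximately zero
print("Mean of residuals:", residuals.mean())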
Observations:____
from statsmodels.stats.diagnostic import het_white
from statsmodels.compat import lzip
import statsmodels.stats.api as sms
name = ["F statistic", "p-value"]
test = ____________________________
lzip(name, test)
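One plausible completion, using the Goldfeld-Quandt test from statsmodels.stats.api, which returns the F statistic and p-value that the name list expects (the het_white import suggests White's test as an alternative); it assumes residuals was computed as above:
# Goldfeld-Quandt test for heteroscedasticity of the residuals
test = sms.het_goldfeldquandt(residuals, X_train)
lzip(name, test)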
Observations:____
The linearity assumption states that the predictor variables must have a linear relationship with the dependent variable.
To test the assumption, we'll plot the residuals against the fitted values and check that the residuals do not form a strong pattern; they should be randomly and uniformly scattered around the x-axis.
# predicted values
fitted = model2.fittedvalues
# sns.set_style("whitegrid")
sns.residplot(x = ______, y = ________, color="lightblue", lowess=True) #write your code here
plt.xlabel("Fitted Values")
plt.ylabel("Residual")
plt.title("Residual PLOT")
plt.show()
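A possible way to fill the residplot blank, assuming the fitted values go on the x-axis and the residuals on the y-axis:
# residuals vs. fitted values with a lowess smoother
sns.residplot(x=fitted, y=residuals, color="lightblue", lowess=True)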
Observations:_____
The residuals should be normally distributed.
# Plot histogram of residuals
#write your code here
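A minimal sketch for the histogram blank, assuming the residuals from model2 as above:
# histogram of residuals with a KDE overlay to eyeball normality
sns.histplot(residuals, kde=True)
plt.show()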
# Plot q-q plot of residuals
import pylab
import scipy.stats as stats
stats.probplot(residuals, dist="norm", plot=pylab)
plt.show()
Observations:_____
# RMSE
def rmse(predictions, targets):
    return np.sqrt(((targets - predictions) ** 2).mean())

# MAPE
def mape(predictions, targets):
    return np.mean(np.abs((targets - predictions)) / targets) * 100

# MAE
def mae(predictions, targets):
    return np.mean(np.abs((targets - predictions)))
# Model Performance on test and train data
def model_pref(olsmodel, x_train, x_test):
    # In-sample Prediction
    y_pred_train = olsmodel.predict(x_train)
    y_observed_train = y_train

    # Prediction on test data
    y_pred_test = olsmodel.predict(x_test)
    y_observed_test = y_test

    print(
        pd.DataFrame(
            {
                "Data": ["Train", "Test"],
                "RMSE": [
                    rmse(y_pred_train, y_observed_train),
                    rmse(y_pred_test, y_observed_test),
                ],
                "MAE": [
                    mae(y_pred_train, y_observed_train),
                    mae(y_pred_test, y_observed_test),
                ],
                "MAPE": [
                    mape(y_pred_train, y_observed_train),
                    mape(y_pred_test, y_observed_test),
                ],
            }
        )
    )
# Checking model performance
model_pref(model2, X_train, X_test)
Data RMSE MAE MAPE
0 Train 0.195504 0.143686 4.981813
1 Test 0.198045 0.151284 5.257965
Observations:____
# import the required function
from sklearn.model_selection import cross_val_score
# build the regression model and cross-validate
linearregression = LinearRegression()
cv_Score11 = #write your code here
cv_Score12 = #write your code here
print("RSquared: %0.3f (+/- %0.3f)" % (cv_Score11.mean(), cv_Score11.std() * 2))
print("Mean Squared Error: %0.3f (+/- %0.3f)" % (-1*cv_Score12.mean(), cv_Score12.std() * 2))
Observations
We may want to iterate on the model-building process with new features or better feature engineering to increase the R-squared and decrease the MSE on cross-validation.
coef = #write your code here
# Let us write the equation of the fit
Equation = "log (Price) ="
print(Equation, end='\t')
for i in range(len(coef)):
    print('(', coef[i], ') * ', coef.index[i], '+', end=' ')
log (Price) = ( 4.649385823266652 ) * const + ( -0.012500455079103941 ) * CRIM + ( 0.11977319077019594 ) * CHAS + ( -1.0562253516683235 ) * NOX + ( 0.058906575109279144 ) * RM + ( -0.044068890799406124 ) * DIS + ( 0.007848474606244051 ) * RAD + ( -0.048503620794999036 ) * PTRATIO + ( -0.029277040479797338 ) * LSTAT +
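The printed equation above is consistent with using the fitted parameters of model2; a possible completion for the coef blank:
# fitted regression coefficients (including the intercept) of the final model
coef = model2.params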
Write Conclusions here
Write Recommendations here