Project Linear Regression: Boston House Price Prediction

Marks: 30

Welcome to the project on Linear Regression. We will use the Boston house price data for the exercise.


## Problem Statement

The problem on hand is to predict the housing prices of a town or a suburb based on the features of the locality provided to us. In the process, we need to identify the most important features in the dataset. We need to employ techniques of data preprocessing and build a linear regression model that predicts the prices for us.


## Data Information

Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. Detailed attribute information can be found below:

Attribute Information (in order):

  • CRIM: per capita crime rate by town
  • ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS: proportion of non-retail business acres per town
  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX: nitric oxides concentration (parts per 10 million)
  • RM: average number of rooms per dwelling
  • AGE: proportion of owner-occupied units built before 1940
  • DIS: weighted distances to five Boston employment centers
  • RAD: index of accessibility to radial highways
  • TAX: full-value property-tax rate per 10,000 dollars
  • PTRATIO: pupil-teacher ratio by town
  • LSTAT: % lower status of the population
  • MEDV: Median value of owner-occupied homes in 1000 dollars

Let us start by importing the required libraries

# import libraries for data manipulation
import pandas as pd
import numpy as np

# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.graphics.gofplots import ProbPlot

# import libraries for building linear regression model
from statsmodels.formula.api import ols
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

# import library for preparing data
from sklearn.model_selection import train_test_split

# import library for data preprocessing
from sklearn.preprocessing import MinMaxScaler

import warnings
warnings.filterwarnings("ignore")

Read the dataset

df = pd.read_csv("Boston.csv")
df.head()

      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  PTRATIO  \
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3  
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8  
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8  
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7  
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7  

   LSTAT  MEDV 
0   4.98  24.0 
1   9.14  21.6 
2   4.03  34.7 
3   2.94  33.4 
4   5.33  36.2 

Observations

  • The price of the house, indicated by the variable MEDV, is the target variable; the rest are the independent variables based on which we will predict the house price.

Get information about the dataset using the info() method

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64 
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64 
 9   TAX      506 non-null    int64 
 10  PTRATIO  506 non-null    float64
 11  LSTAT    506 non-null    float64
 12  MEDV     506 non-null    float64
dtypes: float64(10), int64(3)
memory usage: 51.5 KB

Observations

  • There are a total of 506 non-null observations in each of the columns. This indicates that there are no missing values in the data.
  • Every column in this dataset is numeric in nature.

Let's now check the summary statistics of this dataset

Question 1: Write the code to find the summary statistics and write your observations based on that. (1 Mark)

#write your code here
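One possible way to fill in the code cell above (a minimal sketch, assuming the dataframe is named df as loaded earlier):

# summary statistics of all numeric columns, transposed for readability
df.describe().T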

Observations:____

Before performing the modeling, it is important to check the univariate distribution of the variables.

Univariate Analysis

Check the distribution of the variables

# let's plot all the columns to look at their distributions
for i in df.columns:
    plt.figure(figsize=(7, 4))
    sns.histplot(data=df, x=i, kde = True)
    plt.show()

Observations

  • The variables CRIM and ZN are positively skewed. This suggests that most of the areas have lower crime rates and most residential plots are smaller than 25,000 sq. ft.
  • The variable CHAS, with only two possible values (0 and 1), is a binary variable, and the majority of the houses are away from the Charles River (CHAS = 0).
  • The distribution of the variable AGE suggests that many of the owner-occupied houses were built before 1940.
  • The variable DIS (weighted distances to five Boston employment centers) has a nearly exponential distribution, which indicates that most of the houses are relatively close to these employment centers.
  • The variables TAX and RAD have bimodal distributions, indicating that the tax rate is possibly higher for some properties that have a high index of accessibility to radial highways.
  • The dependent variable MEDV appears to be slightly right-skewed.

As the dependent variable is slightly skewed, we will apply a log transformation to the 'MEDV' column and check the distribution of the transformed column.

df['MEDV_log'] = np.log(df['MEDV'])

sns.histplot(data=df, x='MEDV_log', kde = True)

Observations

  • The log-transformed variable (MEDV_log) appears to have a nearly normal distribution without skew, and hence we can proceed.

Before creating the linear regression model, it is important to check the bivariate relationship between the variables. Let's check the same using the heatmap and scatterplot.

Bivariate Analysis

Let's check the correlation using the heatmap

Question 2 (3 Marks):

  • Write the code to plot the correlation heatmap between the variables (1 Mark)
  • Write your observations (2 Marks)

plt.figure(figsize=(12,8))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(______________,annot=True,fmt='.2f',cmap=cmap ) #write your code here
plt.show()
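A possible way to complete the blank above (a sketch; df.corr() computes the pairwise Pearson correlations of all numeric columns, including the added MEDV_log):

plt.figure(figsize=(12, 8))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# heatmap of the correlation matrix, annotated to two decimal places
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap=cmap)
plt.show()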

Observations:______

Now, we will visualize the relationship between the pairs of features having significant correlations.

Visualizing the relationship between the features having significant correlations (> 0.7)

Question 3 (6 Marks):

  • Create a scatter plot to visualize the relationship between the features having significant correlations (>0.7) (3 Marks)
  • Write your observations from the plots (3 Marks)

# scatterplot to visualize the relationship between NOX and INDUS
plt.figure(figsize=(6, 6))
#___________________________ #write you code here
plt.show()
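One way the blank above could be filled (a sketch; the same pattern applies to the AGE-NOX and DIS-NOX plots that follow):

# scatterplot of NOX against INDUS
sns.scatterplot(x='INDUS', y='NOX', data=df)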

Observations:____

# scatterplot to visualize the relationship between AGE and NOX
plt.figure(figsize=(6, 6))
#_____________________________ #Write your code here
plt.show()

Observations:____

# scatterplot to visualize the relationship between DIS and NOX
plt.figure(figsize=(6, 6))
#_____________________________ #Write your code here
plt.show()

**Observations:___**

# scatterplot to visualize the relationship between AGE and DIS
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'AGE', y = 'DIS', data = df)
plt.show()

Observations:

  • The distance of the houses to the Boston employment centers appears to decrease moderately as the proportion of old houses in the town increases. It is possible that the Boston employment centers are located in established towns, where the proportion of owner-occupied units built prior to 1940 is comparatively high.

# scatterplot to visualize the relationship between AGE and INDUS
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'AGE', y = 'INDUS', data = df)
plt.show()

Observations:

  • No trend between the two variables is visible in the above plot.

# scatterplot to visualize the relationship between RAD and TAX
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'RAD', y = 'TAX', data = df)
plt.show()

Observations:

  • The correlation between RAD and TAX is very high, but no clear trend is visible between the two variables. This might be due to outliers.

Let's check the correlation after removing the outliers.

# remove the data corresponding to high tax rate
df1 = df[df['TAX'] < 600]
# import the required function
from scipy.stats import pearsonr
# calculate the correlation
print('The correlation between TAX and RAD is', pearsonr(df1['TAX'], df1['RAD'])[0])

The correlation between TAX and RAD is 0.24975731331429202

So the high correlation between TAX and RAD is driven by the outliers. The tax rate for some of these properties might be higher for some other reason.

# scatterplot to visualize the relationship between INDUS and TAX
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'INDUS', y = 'TAX', data = df)
plt.show()

Observations:

  • The tax rate appears to increase with an increase in the proportion of non-retail business acres per town. This might be because the variables TAX and INDUS are both related to a third variable.

# scatterplot to visualize the relationship between RM and MEDV
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'RM', y = 'MEDV', data = df)
plt.show()

Observations:

  • The price of the house seems to increase as the value of RM increases. This is expected as the price is generally higher for more rooms.
  • There are a few outliers in a horizontal line as the MEDV value seems to be capped at 50.

# scatterplot to visualize the relationship between LSTAT and MEDV
plt.figure(figsize=(6, 6))
sns.scatterplot(x = 'LSTAT', y = 'MEDV', data = df)
plt.show()

Observations:

  • The price of the house tends to decrease with an increase in LSTAT. This is plausible, as house prices tend to be lower in areas with a higher proportion of lower-status population.
  • There are a few outliers, and the data again seems to be capped at MEDV = 50.

We have seen that the variables LSTAT and RM have a linear relationship with the dependent variable MEDV. Also, there are significant relationships among a few independent variables, which is not desirable for a linear regression model. Let's first split the dataset.

Split the dataset

Let's split the data into the dependent and independent variables, and then split it into train and test sets in a 70:30 ratio.

# separate the dependent and independent variable
Y = df['MEDV_log']
X = df.drop(columns = ['MEDV', 'MEDV_log'])

# add the intercept term
X = sm.add_constant(X)

# splitting the data in 70:30 ratio of train to test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30 , random_state=1)

Next, we will check the multicollinearity in the train dataset.

Check for Multicollinearity

We will use the Variance Inflation Factor (VIF), to check if there is multicollinearity in the data.

Features having a VIF score greater than 5 will be dropped or treated until all the features have a VIF score of less than 5.

from statsmodels.stats.outliers_influence import variance_inflation_factor

# function to check VIF
def checking_vif(train):
    vif = pd.DataFrame()
    vif["feature"] = train.columns

    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(train.values, i) for i in range(len(train.columns))
    ]
    return vif


print(checking_vif(X_train))

    feature         VIF
0     const  535.372593
1      CRIM    1.924114
2        ZN    2.743574
3     INDUS    3.999538
4      CHAS    1.076564
5       NOX    4.396157
6        RM    1.860950
7       AGE    3.150170
8       DIS    4.355469
9       RAD    8.345247
10      TAX   10.191941
11  PTRATIO    1.943409
12    LSTAT    2.861881

Observations:

  • There are two variables with a high VIF - RAD and TAX. Let's remove TAX as it has the highest VIF values and check the multicollinearity again.

Question 4: Drop the column 'TAX' from the training data and check whether multicollinearity is removed. (1 Mark)

# create the model after dropping TAX
X_train = #Write your code here

# check for VIF
print(checking_vif(X_train))
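A minimal sketch of one way to fill the blank above (dropping TAX, which has the highest VIF, before re-running the VIF check):

# drop TAX from the training features and keep the rest
X_train = X_train.drop(columns=['TAX'])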

Now, we will create the linear regression model, as the VIF is less than 5 for all the independent variables and we can assume that multicollinearity among the variables has been removed.

Question 5: Write the code to create the linear regression model and print the model summary. Write your observations from the model. (3 Marks)

# create the model
model1 = #write your code here

# get the model summary
model1.summary()
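One plausible way to fit the model (a sketch using the statsmodels OLS API imported above; the intercept term was already added to X via sm.add_constant):

# fit an ordinary least squares regression of log-price on the training features
model1 = sm.OLS(y_train, X_train).fit()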

Observations:_____

Question 6: Drop insignificant variables from the above model and create the regression model again. (2 Marks)

Examining the significance of the model

It is not enough to fit a multiple regression model to the data; it is also necessary to check whether all the regression coefficients are significant. Significance here means whether the population regression parameters are significantly different from zero.

From the model summary above, it may be noted that the regression coefficients corresponding to ZN, AGE, and INDUS are not statistically significant at level α = 0.05. In other words, the regression coefficients corresponding to these three variables are not significantly different from 0 in the population. Hence, we will eliminate these three features and create a new model.

# create the model after dropping columns 'MEDV', 'MEDV_log', 'TAX', 'ZN', 'AGE', 'INDUS' from df dataframe
Y = df['MEDV_log']
X = df.drop(_____________________________) #write your code here
X = sm.add_constant(X)

#splitting the data in 70:30 ratio of train to test data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30 , random_state=1)

# create the model
model2 = __________________________ #write your code here
# get the model summary
model2.summary()
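The two blanks above could plausibly be filled as follows (a minimal sketch, reusing the same OLS approach assumed for model1):

# drop the target columns, TAX (high VIF), and the insignificant features ZN, AGE, INDUS
X = df.drop(columns=['MEDV', 'MEDV_log', 'TAX', 'ZN', 'AGE', 'INDUS'])

# refit OLS on the reduced training set after the train-test split above
model2 = sm.OLS(y_train, X_train).fit()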

Observations:

  • We can see that the R-squared value has decreased by 0.002, since we have removed variables from the model, whereas the adjusted R-squared value has increased by 0.001, since we removed statistically insignificant variables only.

Now, we will check the linear regression assumptions.

Check the below linear regression assumptions

  1. Mean of residuals should be 0
  2. No Heteroscedasticity
  3. Linearity of variables
  4. Normality of error terms

Question 7: Write the code to check the above linear regression assumptions and provide insights. (4 Marks)

Check for mean residuals

residuals =

# Write your code here
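A minimal sketch (model2.resid holds the training residuals of the fitted statsmodels OLS model):

# residuals of the final model and their mean, which should be very close to 0
residuals = model2.resid
print('Mean of residuals:', np.mean(residuals))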

Observations:____

Check for homoscedasticity

  • Homoscedasticity - If the residuals are symmetrically distributed across the regression line, then the data is said to be homoscedastic.
  • Heteroscedasticity - If the residuals are not symmetrically distributed across the regression line, then the data is said to be heteroscedastic. In this case, the residuals can form a funnel shape or any other non-symmetrical shape.
  • We'll use the Goldfeld-Quandt test to test the following hypotheses with alpha = 0.05:
    • Null hypothesis: Residuals are homoscedastic
    • Alternate hypothesis: Residuals are heteroscedastic

from statsmodels.stats.diagnostic import het_white
from statsmodels.compat import lzip
import statsmodels.stats.api as sms

name = ["F statistic", "p-value"]
test = ____________________________
lzip(name, test)
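One plausible way to fill the blank (a sketch; het_goldfeldquandt is available through statsmodels.stats.api, imported above as sms):

# Goldfeld-Quandt test: returns the F statistic, the p-value, and an ordering label;
# lzip pairs the first two values with the names above
test = sms.het_goldfeldquandt(y_train, X_train)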

Observations:____

Linearity of variables

It states that the predictor variables must have a linear relation with the dependent variable.

To test the assumption, we'll plot residuals and fitted values on a plot and ensure that residuals do not form a strong pattern. They should be randomly and uniformly scattered on the x-axis.

# predicted values
fitted = model2.fittedvalues

# sns.set_style("whitegrid")
sns.residplot(x = ______, y = ________, color="lightblue", lowess=True) #write your code here
plt.xlabel("Fitted Values")
plt.ylabel("Residual")
plt.title("Residual Plot")
plt.show()

Observations:_____
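One way the blanked residplot call above could read (a sketch; residplot draws the residuals of y regressed on x, and the lowess curve helps reveal any remaining pattern):

# plot residuals of model2 against its fitted values
residuals = model2.resid
sns.residplot(x=fitted, y=residuals, color="lightblue", lowess=True)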

Normality of error terms

The residuals should be normally distributed.

# Plot histogram of residuals
#write your code here
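A minimal sketch for the histogram (assuming residuals was computed earlier as model2.resid):

sns.histplot(residuals, kde=True)
plt.show()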

# Plot q-q plot of residuals
import pylab
import scipy.stats as stats

stats.probplot(residuals, dist="norm", plot=pylab)
plt.show()

Observations:_____

Check the performance of the model on the train and test data set

Question 8: Write your observations by comparing model performance of train and test dataset (2 Marks)

# RMSE
def rmse(predictions, targets):
    return np.sqrt(((targets - predictions) ** 2).mean())


# MAPE
def mape(predictions, targets):
    return np.mean(np.abs((targets - predictions)) / targets) * 100


# MAE
def mae(predictions, targets):
    return np.mean(np.abs((targets - predictions)))


# Model Performance on test and train data
def model_pref(olsmodel, x_train, x_test):

    # In-sample Prediction
    y_pred_train = olsmodel.predict(x_train)
    y_observed_train = y_train

    # Prediction on test data
    y_pred_test = olsmodel.predict(x_test)
    y_observed_test = y_test

    print(
        pd.DataFrame(
            {
                "Data": ["Train", "Test"],
                "RMSE": [
                    rmse(y_pred_train, y_observed_train),
                    rmse(y_pred_test, y_observed_test),
                ],
                "MAE": [
                    mae(y_pred_train, y_observed_train),
                    mae(y_pred_test, y_observed_test),
                ],
                "MAPE": [
                    mape(y_pred_train, y_observed_train),
                    mape(y_pred_test, y_observed_test),
                ],
            }
        )
    )


# Checking model performance
model_pref(model2, X_train, X_test) 

    Data      RMSE       MAE      MAPE
0  Train  0.195504  0.143686  4.981813
1   Test  0.198045  0.151284  5.257965

Observations:____

Apply cross validation to improve the model and evaluate it using different evaluation metrics

Question 9: Apply the cross validation technique to improve the model and evaluate it using different evaluation metrics. (1 Mark)

# import the required function

from sklearn.model_selection import cross_val_score

# build the regression model and cross-validate
linearregression = LinearRegression()                                    

cv_Score11 = #write your code here
cv_Score12 = #write your code here                               


print("RSquared: %0.3f (+/- %0.3f)" % (cv_Score11.mean(), cv_Score11.std() * 2))
print("Mean Squared Error: %0.3f (+/- %0.3f)" % (-1*cv_Score12.mean(), cv_Score12.std() * 2))

Observations

  • The R-squared on cross-validation is 0.729, whereas on the training dataset it was 0.769.
  • The MSE on cross-validation is 0.041, whereas on the training dataset it was 0.038.

We may want to iterate on the model-building process with new features or better feature engineering to increase the R-squared and decrease the MSE on cross-validation.

Question 10: Get the model coefficients in a pandas DataFrame, with a column 'Feature' containing all the features and a column 'Coefs' containing the corresponding coefficients. Write the regression equation. (2 Marks)

coef = #write your code here

# Let us write the equation of the fit
Equation = "log (Price) ="
print(Equation, end='\t')
for i in range(len(coef)):
    print('(', coef[i], ') * ', coef.index[i], '+', end = ' ')

log (Price) =     ( 4.649385823266652 ) *  const + ( -0.012500455079103941 ) *  CRIM + ( 0.11977319077019594 ) *  CHAS + ( -1.0562253516683235 ) *  NOX + ( 0.058906575109279144 ) *  RM + ( -0.044068890799406124 ) *  DIS + ( 0.007848474606244051 ) *  RAD + ( -0.048503620794999036 ) *  PTRATIO + ( -0.029277040479797338 ) *  LSTAT +
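The blank above could be filled along these lines (a sketch; model2.params is a pandas Series whose index holds the feature names, which is what the printing loop iterates over, and the DataFrame form answers the 'Feature'/'Coefs' part of the question):

# coefficients of the final model as a Series (used by the equation loop above)
coef = model2.params

# the same coefficients arranged in a DataFrame with 'Feature' and 'Coefs' columns
coef_df = pd.DataFrame({'Feature': coef.index, 'Coefs': coef.values})
print(coef_df)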

Question 11: Write the conclusions and business recommendations derived from the model. (5 Marks)

Write Conclusions here

Write Recommendations here
