Project: Classification - Hotel Booking Cancellation Prediction
Marks: 30
Welcome to the project on classification. We will use the INN Hotels dataset for this problem.
A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellation include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is a less desirable and possibly revenue-diminishing factor for hotels. Such losses are particularly high for last-minute cancellations.
The new technologies involving online booking channels have dramatically changed customers' booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts:
The increasing number of cancellations calls for a machine-learning-based solution that can help predict which bookings are likely to be canceled. INN Hotels Group operates a chain of hotels in Portugal. It is facing a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable policies for cancellations and refunds.
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#To scale the data using z-score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
#Algorithms to use
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
#To tune the model
from sklearn.model_selection import GridSearchCV
#Metrics to evaluate the model
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve
hotel = pd.read_csv("INNHotelsGroup.csv")
# copying data to another variable to avoid any changes to original data
data = hotel.copy()
data.head()
Booking_ID no_of_adults no_of_children no_of_weekend_nights \
0 INN00001 2 0 1
1 INN00002 2 0 2
2 INN00003 1 0 2
3 INN00004 2 0 0
4 INN00005 2 0 1
no_of_week_nights type_of_meal_plan required_car_parking_space \
0 2 Meal Plan 1 0
1 3 Not Selected 0
2 1 Meal Plan 1 0
3 2 Meal Plan 1 0
4 1 Not Selected 0
room_type_reserved lead_time arrival_year arrival_month arrival_date \
0 Room_Type 1 224 2017 10 2
1 Room_Type 1 5 2018 11 6
2 Room_Type 1 1 2018 2 28
3 Room_Type 1 211 2018 5 20
4 Room_Type 1 48 2018 4 11
market_segment_type repeated_guest no_of_previous_cancellations \
0 Offline 0 0
1 Online 0 0
2 Online 0 0
3 Online 0 0
4 Online 0 0
no_of_previous_bookings_not_canceled avg_price_per_room \
0 0 65.00
1 0 106.68
2 0 60.00
3 0 100.00
4 0 94.50
no_of_special_requests booking_status
0 0 Not_Canceled
1 1 Not_Canceled
2 0 Canceled
3 0 Canceled
4 0 Canceled
data.tail()
Booking_ID no_of_adults no_of_children no_of_weekend_nights \
36270 INN36271 3 0 2
36271 INN36272 2 0 1
36272 INN36273 2 0 2
36273 INN36274 2 0 0
36274 INN36275 2 0 1
no_of_week_nights type_of_meal_plan required_car_parking_space \
36270 6 Meal Plan 1 0
36271 3 Meal Plan 1 0
36272 6 Meal Plan 1 0
36273 3 Not Selected 0
36274 2 Meal Plan 1 0
room_type_reserved lead_time arrival_year arrival_month \
36270 Room_Type 4 85 2018 8
36271 Room_Type 1 228 2018 10
36272 Room_Type 1 148 2018 7
36273 Room_Type 1 63 2018 4
36274 Room_Type 1 207 2018 12
arrival_date market_segment_type repeated_guest \
36270 3 Online 0
36271 17 Online 0
36272 1 Online 0
36273 21 Online 0
36274 30 Offline 0
no_of_previous_cancellations no_of_previous_bookings_not_canceled \
36270 0 0
36271 0 0
36272 0 0
36273 0 0
36274 0 0
avg_price_per_room no_of_special_requests booking_status
36270 167.80 1 Not_Canceled
36271 90.95 2 Canceled
36272 98.39 2 Not_Canceled
36273 94.50 0 Canceled
36274 161.67 0 Not_Canceled
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Booking_ID 36275 non-null object
1 no_of_adults 36275 non-null int64
2 no_of_children 36275 non-null int64
3 no_of_weekend_nights 36275 non-null int64
4 no_of_week_nights 36275 non-null int64
5 type_of_meal_plan 36275 non-null object
6 required_car_parking_space 36275 non-null int64
7 room_type_reserved 36275 non-null object
8 lead_time 36275 non-null int64
9 arrival_year 36275 non-null int64
10 arrival_month 36275 non-null int64
11 arrival_date 36275 non-null int64
12 market_segment_type 36275 non-null object
13 repeated_guest 36275 non-null int64
14 no_of_previous_cancellations 36275 non-null int64
15 no_of_previous_bookings_not_canceled 36275 non-null int64
16 avg_price_per_room 36275 non-null float64
17 no_of_special_requests 36275 non-null int64
18 booking_status 36275 non-null object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
data.Booking_ID.nunique()
36275
Observations:
data = data.drop(["Booking_ID"], axis=1)
data.head()
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights \
0 2 0 1 2
1 2 0 2 3
2 1 0 2 1
3 2 0 0 2
4 2 0 1 1
type_of_meal_plan required_car_parking_space room_type_reserved lead_time \
0 Meal Plan 1 0 Room_Type 1 224
1 Not Selected 0 Room_Type 1 5
2 Meal Plan 1 0 Room_Type 1 1
3 Meal Plan 1 0 Room_Type 1 211
4 Not Selected 0 Room_Type 1 48
arrival_year arrival_month arrival_date market_segment_type \
0 2017 10 2 Offline
1 2018 11 6 Online
2 2018 2 28 Online
3 2018 5 20 Online
4 2018 4 11 Online
repeated_guest no_of_previous_cancellations \
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
no_of_previous_bookings_not_canceled avg_price_per_room \
0 0 65.00
1 0 106.68
2 0 60.00
3 0 100.00
4 0 94.50
no_of_special_requests booking_status
0 0 Not_Canceled
1 1 Not_Canceled
2 0 Canceled
3 0 Canceled
4 0 Canceled
#Selecting numerical columns and checking summary statistics
num_cols = data.select_dtypes('number').columns
data[num_cols].describe().T
count mean std min \
no_of_adults 36275.0 1.844962 0.518715 0.0
no_of_children 36275.0 0.105279 0.402648 0.0
no_of_weekend_nights 36275.0 0.810724 0.870644 0.0
no_of_week_nights 36275.0 2.204300 1.410905 0.0
required_car_parking_space 36275.0 0.030986 0.173281 0.0
lead_time 36275.0 85.232557 85.930817 0.0
arrival_year 36275.0 2017.820427 0.383836 2017.0
arrival_month 36275.0 7.423653 3.069894 1.0
arrival_date 36275.0 15.596995 8.740447 1.0
repeated_guest 36275.0 0.025637 0.158053 0.0
no_of_previous_cancellations 36275.0 0.023349 0.368331 0.0
no_of_previous_bookings_not_canceled 36275.0 0.153411 1.754171 0.0
avg_price_per_room 36275.0 103.423539 35.089424 0.0
no_of_special_requests 36275.0 0.619655 0.786236 0.0
25% 50% 75% max
no_of_adults 2.0 2.00 2.0 4.0
no_of_children 0.0 0.00 0.0 10.0
no_of_weekend_nights 0.0 1.00 2.0 7.0
no_of_week_nights 1.0 2.00 3.0 17.0
required_car_parking_space 0.0 0.00 0.0 1.0
lead_time 17.0 57.00 126.0 443.0
arrival_year 2018.0 2018.00 2018.0 2018.0
arrival_month 5.0 8.00 10.0 12.0
arrival_date 8.0 16.00 23.0 31.0
repeated_guest 0.0 0.00 0.0 1.0
no_of_previous_cancellations 0.0 0.00 0.0 13.0
no_of_previous_bookings_not_canceled 0.0 0.00 0.0 58.0
avg_price_per_room 80.3 99.45 120.0 540.0
no_of_special_requests 0.0 0.00 1.0 5.0
Observations:_______
#Checking the rows where avg_price_per_room is 0
data[data["avg_price_per_room"] == 0]
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights \
63 1 0 0 1
145 1 0 0 2
209 1 0 0 0
266 1 0 0 2
267 1 0 2 1
... ... ... ... ...
35983 1 0 0 1
36080 1 0 1 1
36114 1 0 0 1
36217 2 0 2 1
36250 1 0 0 2
type_of_meal_plan required_car_parking_space room_type_reserved \
63 Meal Plan 1 0 Room_Type 1
145 Meal Plan 1 0 Room_Type 1
209 Meal Plan 1 0 Room_Type 1
266 Meal Plan 1 0 Room_Type 1
267 Meal Plan 1 0 Room_Type 1
... ... ... ...
35983 Meal Plan 1 0 Room_Type 7
36080 Meal Plan 1 0 Room_Type 7
36114 Meal Plan 1 0 Room_Type 1
36217 Meal Plan 1 0 Room_Type 2
36250 Meal Plan 2 0 Room_Type 1
lead_time arrival_year arrival_month arrival_date \
63 2 2017 9 10
145 13 2018 6 1
209 4 2018 2 27
266 1 2017 8 12
267 4 2017 8 23
... ... ... ... ...
35983 0 2018 6 7
36080 0 2018 3 21
36114 1 2018 3 2
36217 3 2017 8 9
36250 6 2017 12 10
market_segment_type repeated_guest no_of_previous_cancellations \
63 Complementary 0 0
145 Complementary 1 3
209 Complementary 0 0
266 Complementary 1 0
267 Complementary 0 0
... ... ... ...
35983 Complementary 1 4
36080 Complementary 1 3
36114 Online 0 0
36217 Online 0 0
36250 Online 0 0
no_of_previous_bookings_not_canceled avg_price_per_room \
63 0 0.0
145 5 0.0
209 0 0.0
266 1 0.0
267 0 0.0
... ... ...
35983 17 0.0
36080 15 0.0
36114 0 0.0
36217 0 0.0
36250 0 0.0
no_of_special_requests booking_status
63 1 Not_Canceled
145 1 Not_Canceled
209 1 Not_Canceled
266 1 Not_Canceled
267 1 Not_Canceled
... ... ...
35983 1 Not_Canceled
36080 1 Not_Canceled
36114 0 Not_Canceled
36217 2 Not_Canceled
36250 0 Not_Canceled
[545 rows x 18 columns]
data.loc[data["avg_price_per_room"] == 0, "market_segment_type"].value_counts()
Complementary 354
Online 191
Name: market_segment_type, dtype: int64
for col in ['lead_time', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room']:
    print(col)
    print('Skew :', round(data[col].skew(), 2))
    plt.figure(figsize=(15, 4))
    plt.subplot(1, 2, 1)
    data[col].hist(bins=10, grid=False)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x=data[col])
    plt.show()
lead_time
Skew : 1.29
no_of_previous_cancellations
Skew : 25.2
no_of_previous_bookings_not_canceled
Skew : 19.25
avg_price_per_room
Skew : 0.67
# Calculating the 25th quantile
Q1 = data["avg_price_per_room"].quantile(0.25)
# Calculating the 75th quantile
Q3 = data["avg_price_per_room"].quantile(0.75)
# Calculating IQR
IQR = Q3 - Q1
# Calculating value of upper whisker
Upper_Whisker = Q3 + 1.5 * IQR
Upper_Whisker
179.55
# capping only the extreme values (avg_price_per_room >= 500) at the upper whisker
data.loc[data["avg_price_per_room"] >= 500, "avg_price_per_room"] = Upper_Whisker
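As a quick check, the whisker value of 179.55 follows directly from the quartiles in the summary statistics (25% = 80.3, 75% = 120.0). The sketch below redoes that arithmetic and then applies the same capping rule to a small made-up price series. The name `toy_prices` and its values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Quartiles of avg_price_per_room, taken from the describe() output
q1, q3 = 80.3, 120.0
iqr = q3 - q1
upper_whisker = q3 + 1.5 * iqr
print(round(upper_whisker, 2))

# Made-up prices for illustration only (not rows from the dataset)
toy_prices = pd.Series([65.0, 100.0, 190.0, 540.0])

# Same rule as the notebook: only values >= 500 are replaced,
# so 190.0 stays untouched even though it lies above the whisker
capped = toy_prices.copy()
capped.loc[capped >= 500] = upper_whisker
print(capped.tolist())
```

Note that this rule is narrower than capping every value above the whisker. Whether to treat all such points or only the extreme ones is a modeling choice.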
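cat_cols = ['no_of_adults', 'no_of_children', 'no_of_week_nights', 'no_of_weekend_nights', 'required_car_parking_space',
            'type_of_meal_plan', 'room_type_reserved', 'arrival_month', 'market_segment_type', 'no_of_special_requests',
            'booking_status']
# One possible completion: print the share of each category for every column
for col in cat_cols:
    print(data[col].value_counts(normalize=True))
    print('*' * 40)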
Observations:________
Replacing values 9 and 10 for the number of children with 3 and encoding the target variable
# replacing 9 and 10 children with 3, then encoding the target (Canceled = 1, Not_Canceled = 0)
data["no_of_children"] = data["no_of_children"].replace([9, 10], 3)
data["booking_status"] = data["booking_status"].apply(lambda x: 1 if x == "Canceled" else 0)
We are done with univariate analysis and data preprocessing. Let's explore the data a bit more with bivariate analysis.
Let's check the relationship of market segment type with the average price per room.
plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x="market_segment_type", y="avg_price_per_room")
plt.show()
Let's see how booking status varies across different market segments, and how lead time impacts booking status.
plt.figure(figsize=(10, 6))
sns.countplot(x='market_segment_type', hue='booking_status', data=data)
plt.show()
plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x="booking_status", y="lead_time")
plt.show()
Now, let's check how the arrival month impacts the booking status
plt.figure(figsize=(10, 6))
sns.countplot(x='arrival_month', hue='booking_status', data=data)
plt.show()
Repeating guests are the guests who stay in the hotel often and are important to brand equity. Let's see what percentage of repeating guests cancel.
plt.figure(figsize=(10, 6))
sns.countplot(x='repeated_guest', hue='booking_status', data=data)
plt.show()
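The count plot shows raw counts, so the percentage has to be read off by eye. A row-normalized cross-tabulation gives the share directly. The sketch below uses a made-up frame named `toy` (an assumption for illustration) with the same two column names and the 0/1 encoding of `booking_status` applied earlier.

```python
import pandas as pd

# Made-up rows for illustration; booking_status is already encoded (1 = canceled)
toy = pd.DataFrame({
    "repeated_guest": [0, 0, 0, 0, 1, 1, 1, 1],
    "booking_status": [1, 0, 1, 0, 0, 0, 0, 1],
})

# normalize="index" turns each row of the table into shares that sum to 1
cancel_share = pd.crosstab(toy["repeated_guest"], toy["booking_status"], normalize="index")
print(cancel_share)
```

On the real data, the same call with `data` in place of `toy` would give the cancellation share for repeating and non-repeating guests.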
We have explored different combinations of variables. Now, let's see the pairwise correlations between all the variables.
plt.figure(figsize=(12, 7))
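# correlations are only defined for numeric columns, so select those explicitly
sns.heatmap(data.select_dtypes('number').corr(), annot=True, fmt=".2f")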
plt.show()
Now that we have explored our data, let's prepare it for modeling.
#Separating the independent variables (X) and the target variable (Y)
X = data.drop(columns=["booking_status"])
Y = data["booking_status"]
#Creating dummy variables
#drop_first=True is used to avoid redundant variables
X = pd.get_dummies(X, drop_first=True)
#Splitting the data into train and test sets
X_train,X_test,y_train,y_test=train_test_split(X, Y, test_size=0.30, random_state=1)
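To see what `drop_first=True` does, the sketch below encodes a tiny made-up frame. The name `toy` and its rows are assumptions for illustration, and the category labels mirror those in the dataset.

```python
import pandas as pd

# Made-up rows for illustration; the category labels mirror those in the dataset
toy = pd.DataFrame({
    "lead_time": [224, 5, 1],
    "market_segment_type": ["Offline", "Online", "Complementary"],
})

# Categories are sorted alphabetically and the first one ("Complementary")
# is dropped; a row with zeros in both dummy columns is a Complementary booking
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

The dropped category becomes the baseline against which the coefficients of the remaining dummy columns are read.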
Before training the model, let's choose the appropriate model evaluation criterion as per the problem on hand.
Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.
#function to print classification report and get confusion matrix in a proper format
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8,5))
    sns.heatmap(cm, annot=True, fmt='.2f', xticklabels=['Not Canceled', 'Canceled'], yticklabels=['Not Canceled', 'Canceled'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
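When weighing evaluation criteria, it helps to see how precision, recall, and F1 come out of the four cells of the confusion matrix. The sketch below uses made-up labels (the lists `actual` and `predicted` are assumptions for illustration). Which metric matters most for the hotel is left to the analysis.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (1 = canceled, 0 = not canceled)
actual = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 1, 1, 0]

# scikit-learn puts actual classes on the rows and predicted classes on the
# columns, so ravel() yields tn, fp, fn, tp for a binary problem
tn, fp, fn, tp = confusion_matrix(actual, predicted).ravel()

precision = tp / (tp + fp)  # share of predicted cancellations that were real
recall = tp / (tp + fn)     # share of real cancellations that were caught
f1 = 2 * precision * recall / (precision + recall)
print(tn, fp, fn, tp, precision, recall, f1)
```

A false negative here is a booking predicted to hold that ends up canceled, while a false positive is a booking flagged as a cancellation that the guest actually honors.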
#define logistic regression model (default hyperparameters assumed)
log_reg = LogisticRegression()
#fit the model on the training data
log_reg.fit(X_train, y_train)
Let's check the coefficient of each independent variable in the data
pd.Series(log_reg.coef_[0], index=X_train.columns).sort_values(ascending=False)
market_segment_type_Online 0.624071
type_of_meal_plan_Not Selected 0.321587
no_of_weekend_nights 0.180871
avg_price_per_room 0.019686
lead_time 0.015407
no_of_adults 0.011404
no_of_week_nights 0.006240
arrival_date 0.001224
type_of_meal_plan_Meal Plan 3 0.000400
room_type_reserved_Room_Type 3 0.000341
arrival_year -0.001730
room_type_reserved_Room_Type 2 -0.004286
market_segment_type_Complementary -0.008782
room_type_reserved_Room_Type 5 -0.011135
room_type_reserved_Room_Type 7 -0.017821
no_of_previous_cancellations -0.024671
market_segment_type_Corporate -0.031280
room_type_reserved_Room_Type 4 -0.032748
repeated_guest -0.043413
room_type_reserved_Room_Type 6 -0.045607
no_of_children -0.053844
arrival_month -0.057248
type_of_meal_plan_Meal Plan 2 -0.094291
required_car_parking_space -0.139098
no_of_previous_bookings_not_canceled -0.212323
market_segment_type_Offline -0.598263
no_of_special_requests -1.550049
dtype: float64
Observations:_________
odds = np.exp(log_reg.coef_[0]) # exponentiating the coefficients gives the odds ratios
# adding the odds to a dataframe and sorting the values
pd.DataFrame(odds, X_train.columns, columns=['odds']).sort_values(by='odds', ascending=False)
odds
market_segment_type_Online 1.866510
type_of_meal_plan_Not Selected 1.379315
no_of_weekend_nights 1.198261
avg_price_per_room 1.019881
lead_time 1.015527
no_of_adults 1.011469
no_of_week_nights 1.006260
arrival_date 1.001225
type_of_meal_plan_Meal Plan 3 1.000400
room_type_reserved_Room_Type 3 1.000341
arrival_year 0.998271
room_type_reserved_Room_Type 2 0.995723
market_segment_type_Complementary 0.991256
room_type_reserved_Room_Type 5 0.988927
room_type_reserved_Room_Type 7 0.982337
no_of_previous_cancellations 0.975631
market_segment_type_Corporate 0.969204
room_type_reserved_Room_Type 4 0.967782
repeated_guest 0.957516
room_type_reserved_Room_Type 6 0.955417
no_of_children 0.947580
arrival_month 0.944360
type_of_meal_plan_Meal Plan 2 0.910018
required_car_parking_space 0.870143
no_of_previous_bookings_not_canceled 0.808703
market_segment_type_Offline 0.549766
no_of_special_requests 0.212238
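Each value in this table is the exponential of the matching coefficient, and it can be read as a multiplicative change in the odds of cancellation for a one-unit increase in that feature, holding the others fixed. The sketch below reproduces two of the values from the coefficients listed earlier and converts them to percentage changes. The dictionary names are illustrative.

```python
import numpy as np

# Coefficients copied from the logistic regression output
coefs = {
    "market_segment_type_Online": 0.624071,
    "no_of_special_requests": -1.550049,
}

odds_ratios = {name: float(np.exp(coef)) for name, coef in coefs.items()}
pct_changes = {name: (ratio - 1) * 100 for name, ratio in odds_ratios.items()}

for name in coefs:
    print(f"{name}: odds ratio {odds_ratios[name]:.4f}, change in odds {pct_changes[name]:+.1f}%")
```

An odds ratio below 1 means the feature lowers the odds of cancellation, and a ratio above 1 means it raises them.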
Observations:_________
Now, let's check the performance of the model on the training set
# Checking performance on the training data
y_pred_train = log_reg.predict(X_train)
metrics_score(y_train, y_pred_train)
Reading confusion matrix (clockwise):
Observations:_____
Precision-Recall Curve for Logistic Regression
y_scores=log_reg.predict_proba(X_train) #predict_proba gives the probability of each observation belonging to each class
precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores[:,1])
#Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds, precisions[:-1], 'b--', label='precision')
plt.plot(thresholds, recalls[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()
Observations:
#calculating the exact threshold where precision and recall are equal.
for i in np.arange(len(thresholds)):
    if precisions[i]==recalls[i]:
        print(thresholds[i])
0.4174887148772929
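The exact-equality test on floating-point values happens to find a match here, but on other data it may print nothing or several thresholds. A more forgiving variant, sketched below, picks the threshold where the gap between precision and recall is smallest. The three arrays hold made-up values for illustration; in the notebook they would come from `precision_recall_curve`.

```python
import numpy as np

# Made-up curve values for illustration; in the notebook these arrays
# come from precision_recall_curve(y_train, y_scores[:, 1])
precisions = np.array([0.50, 0.60, 0.70, 0.80, 1.00])
recalls = np.array([1.00, 0.90, 0.72, 0.40, 0.00])
thresholds = np.array([0.10, 0.30, 0.42, 0.70])

# precisions and recalls have one more entry than thresholds, hence [:-1]
gap = np.abs(precisions[:-1] - recalls[:-1])
best_idx = int(np.argmin(gap))
best_threshold = thresholds[best_idx]
print(best_threshold)
```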
optimal_threshold1 = 0.42
metrics_score(y_train, y_scores[:,1]>optimal_threshold1)
              precision    recall  f1-score   support
           0       0.84      0.84      0.84     17029
           1       0.68      0.68      0.68      8363
    accuracy                           0.79     25392
   macro avg       0.76      0.76      0.76     25392
weighted avg       0.79      0.79      0.79     25392
Let's check the performance of the model on the test data
#Checking performance on the testing data
y_pred_test = log_reg.predict_proba(X_test)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold1)
              precision    recall  f1-score   support
           0       0.85      0.85      0.85      7361
           1       0.68      0.68      0.68      3522
    accuracy                           0.79     10883
   macro avg       0.77      0.77      0.77     10883
weighted avg       0.79      0.79      0.79     10883
Observations:_____
# scaling the data
scaler=StandardScaler()
X_train_scaled=pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns) #fit_transform the training data
X_test_scaled=pd.DataFrame(scaler.transform(X_test), columns=X_test.columns) #transform the testing data
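KNN relies on distances, so features measured on large scales would otherwise dominate the neighbor search. The sketch below shows the fit-on-train, transform-on-test pattern on made-up numbers. The names `toy_train`, `toy_test`, and `toy_scaler` are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (e.g. a count and a price)
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[2.0, 400.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from train only
test_scaled = toy_scaler.transform(toy_test)        # reuses the train mean and std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1, while the test row is expressed in the training set's units. This keeps information from the test set out of the preprocessing step.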
Points to note:
knn = KNeighborsClassifier()
params_knn = {'n_neighbors':np.arange(2,20,2), 'weights':['uniform','distance'], 'p':[1,2]}
grid_knn = GridSearchCV(estimator=knn, param_grid=params_knn, scoring='f1', cv=10)
model_knn=grid_knn.fit(X_train_scaled,y_train)
knn_estimator = model_knn.best_estimator_
print(knn_estimator)
KNeighborsClassifier(n_neighbors=14, p=1, weights='distance')
#Fit the KNN model on the scaled training data
knn_estimator.fit(X_train_scaled, y_train)
#Make predictions on the scaled training data and check the performance (using metrics_score function)
y_pred_train = knn_estimator.predict(X_train_scaled)
metrics_score(y_train, y_pred_train)
#Make predictions on the scaled testing data and check the performance (using metrics_score function)
y_pred_test = knn_estimator.predict(X_test_scaled)
metrics_score(y_test, y_pred_test)
Observations:_________
Write your conclusion here
Write your recommendations here