Project: Classification - Hotel Booking Cancellation Prediction

Marks: 30

Welcome to the project on classification. We will use the INN Hotels dataset for this problem.

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost. This benefits hotel guests, but for hotels it is less desirable and can reduce revenue. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers' booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

Loss of resources (revenue) when the hotel cannot resell the room.
Additional costs of distribution channels by increasing commissions or paying for publicity to help sell these rooms.
Lowering prices last minute, so the hotel can resell a room, resulting in reducing the profit margin.
Human resources to make arrangements for the guests.

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. They are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are going to be canceled, and help in formulating profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Booking_ID: unique identifier of each booking
no_of_adults: Number of adults
no_of_children: Number of Children
no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
type_of_meal_plan: Type of meal plan booked by the customer:
- Not Selected: No meal plan selected
- Meal Plan 1: Breakfast
- Meal Plan 2: Half board (breakfast and one other meal)
- Meal Plan 3: Full board (breakfast, lunch, and dinner)
required_car_parking_space: Does the customer require a car parking space? (0 - No, 1 - Yes)
room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
lead_time: Number of days between the date of booking and the arrival date
arrival_year: Year of arrival date
arrival_month: Month of arrival date
arrival_date: Date of the month
market_segment_type: Market segment designation.
repeated_guest: Is the customer a repeated guest? (0 - No, 1 - Yes)
no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
booking_status: Flag indicating if the booking was canceled or not.

Importing necessary libraries and overview of the dataset

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#To scale the data using z-score
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

#Algorithms to use
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

#To tune the model
from sklearn.model_selection import GridSearchCV

#Metrics to evaluate the model
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_curve

Loading data

hotel = pd.read_csv("INNHotelsGroup.csv")

# copying data to another variable to avoid any changes to original data
data = hotel.copy()

View the first and last 5 rows of the dataset

data.head()

Booking_ID no_of_adults no_of_children no_of_weekend_nights \
0   INN00001             2               0                     1
1   INN00002             2               0                     2
2   INN00003             1               0                     2
3   INN00004             2               0                     0
4   INN00005             2               0                     1

   no_of_week_nights type_of_meal_plan required_car_parking_space \
0                  2       Meal Plan 1                           0
1                  3      Not Selected                           0
2                  1       Meal Plan 1                           0
3                  2       Meal Plan 1                           0
4                  1      Not Selected                           0

room_type_reserved lead_time arrival_year arrival_month arrival_date \
0        Room_Type 1        224          2017             10             2
1        Room_Type 1          5          2018             11             6
2        Room_Type 1         1          2018              2            28
3        Room_Type 1        211          2018              5            20
4        Room_Type 1         48          2018              4            11

market_segment_type repeated_guest no_of_previous_cancellations \
0             Offline               0                             0
1              Online               0                             0
2              Online               0                             0
3              Online               0                             0
4              Online               0                             0

   no_of_previous_bookings_not_canceled avg_price_per_room \
0                                     0               65.00
1                                     0              106.68
2                                     0               60.00
3                                     0              100.00
4                                     0               94.50

   no_of_special_requests booking_status
0                       0   Not_Canceled
1                       1   Not_Canceled
2                       0       Canceled
3                       0       Canceled
4                       0       Canceled

data.tail()

      Booking_ID no_of_adults no_of_children no_of_weekend_nights \
36270   INN36271             3               0                     2
36271   INN36272             2               0                     1
36272   INN36273             2               0                     2
36273   INN36274             2               0                     0
36274   INN36275             2               0                     1

       no_of_week_nights type_of_meal_plan required_car_parking_space \
36270                  6       Meal Plan 1                           0
36271                  3       Meal Plan 1                           0
36272                  6       Meal Plan 1                           0
36273                  3      Not Selected                           0
36274                  2       Meal Plan 1                           0

      room_type_reserved lead_time arrival_year arrival_month \
36270        Room_Type 4         85          2018              8
36271        Room_Type 1        228          2018             10
36272        Room_Type 1        148          2018              7
36273        Room_Type 1         63          2018              4
36274        Room_Type 1        207          2018             12

       arrival_date market_segment_type repeated_guest \
36270             3              Online               0
36271            17              Online               0
36272             1              Online               0
36273            21              Online               0
36274            30             Offline               0

       no_of_previous_cancellations no_of_previous_bookings_not_canceled \
36270                             0                                     0
36271                             0                                     0
36272                             0                                     0
36273                             0                                     0
36274                             0                                     0

       avg_price_per_room no_of_special_requests booking_status
36270              167.80                       1   Not_Canceled
36271               90.95                       2       Canceled
36272               98.39                       2   Not_Canceled
36273               94.50                       0       Canceled
36274              161.67                       0   Not_Canceled

Check the info of the data

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Booking_ID                            36275 non-null  object
 1   no_of_adults                          36275 non-null  int64
 2   no_of_children                        36275 non-null  int64
 3   no_of_weekend_nights                  36275 non-null  int64
 4   no_of_week_nights                     36275 non-null  int64
 5   type_of_meal_plan                     36275 non-null  object
 6   required_car_parking_space            36275 non-null  int64
 7   room_type_reserved                    36275 non-null  object
 8   lead_time                             36275 non-null  int64
 9   arrival_year                          36275 non-null  int64
 10  arrival_month                         36275 non-null  int64
 11  arrival_date                          36275 non-null  int64
 12  market_segment_type                   36275 non-null  object
 13  repeated_guest                        36275 non-null  int64
 14  no_of_previous_cancellations          36275 non-null  int64
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64
 18  booking_status                        36275 non-null  object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB

The dataset has 36,275 rows and 19 columns.
Booking_ID, type_of_meal_plan, room_type_reserved, market_segment_type, and booking_status are of object type, while the rest of the columns are numeric.
There are no null values in the dataset.
Booking_ID column is an identifier. Let's check if each entry of the column is unique.

data.Booking_ID.nunique()

36275

Observations:

We can see that all the entries of this column are unique. Hence, this column would not add any value to our analysis.
Let's drop this column.

Dropping the Booking_ID column

data = data.drop(["Booking_ID"], axis=1)

data.head()

   no_of_adults no_of_children no_of_weekend_nights no_of_week_nights \
0             2               0                     1                  2
1             2               0                     2                  3
2             1               0                     2                  1
3             2               0                     0                  2
4             2               0                     1                  1

type_of_meal_plan required_car_parking_space room_type_reserved lead_time \
0       Meal Plan 1                           0        Room_Type 1        224
1      Not Selected                           0        Room_Type 1          5
2       Meal Plan 1                           0        Room_Type 1          1
3       Meal Plan 1                           0        Room_Type 1        211
4      Not Selected                           0        Room_Type 1         48

   arrival_year arrival_month arrival_date market_segment_type \
0          2017             10             2             Offline
1          2018             11             6              Online
2          2018              2            28              Online
3          2018              5            20              Online
4          2018              4            11              Online

   repeated_guest no_of_previous_cancellations \
0               0                             0
1               0                             0
2               0                             0
3               0                             0
4               0                             0

   no_of_previous_bookings_not_canceled avg_price_per_room \
0                                     0               65.00
1                                     0              106.68
2                                     0               60.00
3                                     0              100.00
4                                     0               94.50

   no_of_special_requests booking_status
0                       0   Not_Canceled
1                       1   Not_Canceled
2                       0       Canceled
3                       0       Canceled
4                       0       Canceled

Exploratory Data Analysis

Summary Statistics for numerical columns

Question 1: Write the observations from the below summary statistics (2 Marks)

#Selecting numerical columns and checking summary statistics
num_cols = data.select_dtypes('number').columns

data[num_cols].describe().T

                                        count         mean        std     min \
no_of_adults                          36275.0     1.844962   0.518715     0.0
no_of_children                        36275.0     0.105279   0.402648     0.0
no_of_weekend_nights                  36275.0     0.810724   0.870644     0.0
no_of_week_nights                     36275.0     2.204300   1.410905     0.0
required_car_parking_space            36275.0     0.030986   0.173281     0.0
lead_time                             36275.0    85.232557 85.930817     0.0
arrival_year                          36275.0 2017.820427   0.383836 2017.0
arrival_month                         36275.0     7.423653   3.069894     1.0
arrival_date                          36275.0    15.596995   8.740447     1.0
repeated_guest                        36275.0     0.025637   0.158053     0.0
no_of_previous_cancellations          36275.0     0.023349   0.368331     0.0
no_of_previous_bookings_not_canceled 36275.0     0.153411   1.754171     0.0
avg_price_per_room                    36275.0   103.423539 35.089424     0.0
no_of_special_requests                36275.0     0.619655   0.786236     0.0

                                         25%      50%     75%     max
no_of_adults                             2.0     2.00     2.0     4.0
no_of_children                           0.0     0.00     0.0    10.0
no_of_weekend_nights                     0.0     1.00     2.0     7.0
no_of_week_nights                        1.0     2.00     3.0    17.0
required_car_parking_space               0.0     0.00     0.0     1.0
lead_time                               17.0    57.00   126.0   443.0
arrival_year                          2018.0 2018.00 2018.0 2018.0
arrival_month                            5.0     8.00    10.0    12.0
arrival_date                             8.0    16.00    23.0    31.0
repeated_guest                           0.0     0.00     0.0     1.0
no_of_previous_cancellations             0.0     0.00     0.0    13.0
no_of_previous_bookings_not_canceled     0.0     0.00     0.0    58.0
avg_price_per_room                      80.3    99.45   120.0   540.0
no_of_special_requests                   0.0     0.00     1.0     5.0

Observations:_______

#Checking the rows where avg_price_per_room is 0
data[data["avg_price_per_room"] == 0]

       no_of_adults no_of_children no_of_weekend_nights no_of_week_nights \
63                1               0                     0                  1
145               1               0                     0                  2
209               1               0                     0                  0
266               1               0                     0                  2
267               1               0                     2                  1
...             ...             ...                   ...                ...
35983             1               0                     0                  1
36080             1               0                     1                  1
36114             1               0                     0                  1
36217             2               0                     2                  1
36250             1               0                     0                  2

      type_of_meal_plan required_car_parking_space room_type_reserved \
63          Meal Plan 1                           0        Room_Type 1
145         Meal Plan 1                           0        Room_Type 1
209         Meal Plan 1                           0        Room_Type 1
266         Meal Plan 1                           0        Room_Type 1
267         Meal Plan 1                           0        Room_Type 1
...                 ...                         ...                ...
35983       Meal Plan 1                           0        Room_Type 7
36080       Meal Plan 1                           0        Room_Type 7
36114       Meal Plan 1                           0        Room_Type 1
36217       Meal Plan 1                           0        Room_Type 2
36250       Meal Plan 2                           0        Room_Type 1

       lead_time arrival_year arrival_month arrival_date \
63             2          2017              9            10
145           13          2018              6             1
209            4          2018              2            27
266            1          2017              8            12
267            4          2017              8            23
...          ...           ...            ...           ...
35983          0          2018              6             7
36080          0          2018              3            21
36114          1          2018              3             2
36217          3          2017              8             9
36250          6          2017             12            10

      market_segment_type repeated_guest no_of_previous_cancellations \
63          Complementary               0                             0
145         Complementary               1                             3
209         Complementary               0                             0
266         Complementary               1                             0
267         Complementary               0                             0
...                   ...             ...                           ...
35983       Complementary               1                             4
36080       Complementary               1                             3
36114              Online               0                             0
36217              Online               0                             0
36250              Online               0                             0

       no_of_previous_bookings_not_canceled avg_price_per_room \
63                                        0                 0.0
145                                       5                 0.0
209                                      0                 0.0
266                                       1                 0.0
267                                       0                 0.0
...                                     ...                 ...
35983                                    17                 0.0
36080                                    15                 0.0
36114                                     0                 0.0
36217                                     0                 0.0
36250                                     0                 0.0

       no_of_special_requests booking_status
63                          1   Not_Canceled
145                         1   Not_Canceled
209                         1   Not_Canceled
266                         1   Not_Canceled
267                         1   Not_Canceled
...                       ...            ...
35983                       1   Not_Canceled
36080                       1   Not_Canceled
36114                       0   Not_Canceled
36217                       2   Not_Canceled
36250                       0   Not_Canceled

[545 rows x 18 columns]

In the market segment column, many of these rows have the value Complementary. Let's check the market segments where the room price is equal to 0.

data.loc[data["avg_price_per_room"] == 0, "market_segment_type"].value_counts()

Complementary 354
Online 191
Name: market_segment_type, dtype: int64

It makes sense that most bookings with a room price of 0 are rooms given by the hotel as a complimentary service.
The zero-priced rooms booked online might be part of a promotional campaign run by the hotel. We will not treat these rows, as we don't have the data to test this claim.

Check the distribution and outliers for some columns in the data

for col in ['lead_time', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room']:
    print(col)
    print('Skew :',round(data[col].skew(),2))
    plt.figure(figsize=(15,4))
    plt.subplot(1,2,1)
    data[col].hist(bins=10, grid=False)
    plt.ylabel('count')
    plt.subplot(1,2,2)
    sns.boxplot(x=data[col])
    plt.show()

lead_time
Skew : 1.29

no_of_previous_cancellations
Skew : 25.2

no_of_previous_bookings_not_canceled
Skew : 19.25

avg_price_per_room
Skew : 0.67

The distribution of lead time is right-skewed. Many customers made the booking on the day of arrival. There are many outliers; some customers booked more than 400 days in advance.
Very few customers have more than one previous cancellation. Some customers have canceled more than 12 times.
Very few customers have more than one previous booking that was not canceled. Some customers have close to 60 previous bookings that were not canceled.
The distribution of average price per room is right-skewed. There are outliers on both sides. The median price of a room is around 100 euros. There is 1 observation where the average price of the room is more than 500 euros, far away from the rest of the values. We can treat it by clipping the value to the upper whisker (Q3 + 1.5 * IQR).

# Calculating the 25th percentile (first quartile)
Q1 = data["avg_price_per_room"].quantile(0.25)

# Calculating the 75th percentile (third quartile)
Q3 = data["avg_price_per_room"].quantile(0.75)

# Calculating IQR
IQR = Q3 - Q1

# Calculating value of upper whisker
Upper_Whisker = Q3 + 1.5 * IQR
Upper_Whisker

179.55

# assigning the outliers the value of upper whisker
data.loc[data["avg_price_per_room"] >= 500, "avg_price_per_room"] = Upper_Whisker

Now, let's check percentage of each category for some variables

Question 2:

Write the code to check the percentage of each category for columns mentioned below (cat_cols) (2 Marks)
Write your observations (2 Marks)

cat_cols = ['no_of_adults', 'no_of_children', 'no_of_week_nights', 'no_of_weekend_nights', 'required_car_parking_space',
            'type_of_meal_plan', 'room_type_reserved', 'arrival_month', 'market_segment_type', 'no_of_special_requests',
            'booking_status']

#Write your code here

Observations:________
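
One possible way to get the percentage of each category is `value_counts(normalize=True)`. The sketch below runs on a hypothetical four-row frame (`toy`, `toy_cat_cols`, and `percentages` are illustrative names, not from the notebook); on the real data the same loop would presumably run over `data` and `cat_cols`.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the real frame has 36,275 rows
toy = pd.DataFrame({
    "market_segment_type": ["Online", "Online", "Offline", "Corporate"],
    "booking_status": ["Canceled", "Not_Canceled", "Not_Canceled", "Not_Canceled"],
})
toy_cat_cols = ["market_segment_type", "booking_status"]

percentages = {}
for col in toy_cat_cols:
    # normalize=True returns fractions that sum to 1 within the column
    percentages[col] = toy[col].value_counts(normalize=True) * 100
    print(percentages[col].round(2))
    print("-" * 40)
```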

Replacing values 9 and 10 for the number of children with 3 and encoding the target variable

# replacing 9 and 10 children with 3
data["no_of_children"] = data["no_of_children"].replace([9, 10], 3)

# encoding the target variable: 1 for canceled bookings, 0 otherwise
data["booking_status"] = data["booking_status"].apply(lambda x: 1 if x == "Canceled" else 0)

We are done with univariate analysis and data preprocessing. Let's explore the data a bit more with bivariate analysis.

Let's check the relationship of market segment type with the average price per room.

plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x="market_segment_type", y="avg_price_per_room")
plt.show()

Rooms booked online have the highest variations in prices.
The distributions of offline and corporate room prices are similar, except for some outliers.
The complementary market segment gets rooms at very low prices, which makes sense.

Let's see how booking status varies across market segments, and how lead time affects booking status.

plt.figure(figsize=(10, 6))
sns.countplot(x='market_segment_type', hue='booking_status', data=data)
plt.show()

Online bookings have the highest number of cancellations.
Bookings made offline are less prone to cancellations.
The corporate and complementary segments show very few cancellations.

plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x="booking_status", y="lead_time")
plt.show()

There's a big difference in the median lead time between bookings that were canceled and bookings that were not. The higher the lead time, the higher the chances of a booking being canceled.

Now, let's check how the arrival month impacts the booking status

plt.figure(figsize=(10, 6))
sns.countplot(x='arrival_month', hue='booking_status', data=data)
plt.show()

We observed earlier that the month of October has the highest number of bookings but the above plot shows that October has the highest number of cancellations as well.
Bookings made for December and January are less prone to cancellations.

Repeating guests are the guests who stay in the hotel often and are important to brand equity. Let's see how often repeating guests cancel.

plt.figure(figsize=(10, 6))
sns.countplot(x='repeated_guest', hue='booking_status', data=data)
plt.show()

There are very few repeat customers, and cancellations among them are rare. This is a good sign, as repeat customers are important for the hospitality industry and can help spread the word.
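
The count plot shows absolute numbers. To get an actual cancellation percentage per group, one option is to average the encoded target within each group. This is a minimal sketch on hypothetical rows (`toy` and `cancel_rate` are illustrative names), assuming `booking_status` is already encoded as 1 for canceled.

```python
import pandas as pd

# Hypothetical toy rows; booking_status is already encoded as 1 = canceled
toy = pd.DataFrame({
    "repeated_guest": [0, 0, 0, 0, 1, 1],
    "booking_status": [1, 1, 0, 0, 0, 0],
})

# The mean of a 0/1 flag within each group is the share of canceled bookings
cancel_rate = toy.groupby("repeated_guest")["booking_status"].mean() * 100
print(cancel_rate)
```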

We have explored different combinations of variables. Now, let's see the pairwise correlations between all the variables.

plt.figure(figsize=(12, 7))
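# correlations are only defined for numeric columns, so select them explicitly
sns.heatmap(data.select_dtypes("number").corr(), annot=True, fmt=".2f")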
plt.show()

There's a positive correlation between the number of customers (adults and children) and the average price per room. This makes sense: the more customers there are, the more rooms they will require, which increases the cost.
There's a negative correlation between average room price and repeated guests. The hotel might be giving some loyalty benefits to these customers.
Repeated guests have a positive correlation with the number of previous bookings canceled and previous bookings not canceled. This is expected, since only repeated guests have a booking history, and it shows that repeated customers do sometimes cancel their bookings.
There's a positive correlation between lead time and the number of week nights a customer is planning to stay in the hotel.
There's a positive correlation between booking status and lead time, indicating that the higher the lead time, the higher the chances of cancellation.
There's a negative correlation between the number of special requests from the customer and the booking status, indicating that if a customer has made some special requests, the chances of cancellation might decrease.

Now that we have explored our data, let's prepare it for modeling.

Preparing data for modeling

Models cannot take non-numeric inputs. So, we will first create dummy variables for all the categorical variables.
We will then split the data into train and test sets.

Question 3:

Drop the target variable from the original data and store it in a separate dataframe X (1 Mark)
Store the target variable in a separate series Y (1 Mark)

#Remove the blanks and complete the below code
X=___________
Y=___________

#Creating dummy variables
#drop_first=True is used to avoid redundant variables
X = pd.get_dummies(X, drop_first=True)

#Splitting the data into train and test sets
X_train,X_test,y_train,y_test=train_test_split(X, Y, test_size=0.30, random_state=1)
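
The preparation steps can be tried end to end on a small frame. The sketch below is an illustration under stated assumptions: the ten rows are a hypothetical sample (`toy`, `X_toy`, `y_toy` and the split names are not from the notebook), and the target is assumed to be already encoded as 0/1.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy sample with one numeric and one categorical feature
toy = pd.DataFrame({
    "lead_time": [224, 5, 1, 211, 48, 10, 90, 30, 7, 150],
    "market_segment_type": ["Offline", "Online", "Online", "Online", "Online",
                            "Corporate", "Offline", "Online", "Corporate", "Online"],
    "booking_status": [0, 0, 1, 1, 1, 0, 0, 1, 0, 1],
})

# Separate the features from the target
X_toy = toy.drop(columns=["booking_status"])
y_toy = toy["booking_status"]

# drop_first=True drops the first category (here "Corporate") of each encoded column
X_toy = pd.get_dummies(X_toy, drop_first=True)

# 70/30 split with a fixed seed so the result is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.30, random_state=1)

print(X_toy.columns.tolist())
print(X_tr.shape, X_te.shape)
```

Dropping the first level keeps the dummy columns from being perfectly collinear: a row with zeros in both remaining segment columns is understood to be "Corporate".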

Building Classification Models

Before training the model, let's choose the appropriate model evaluation criterion as per the problem on hand.

Model evaluation criterion

The model can make two kinds of wrong predictions:

Predicting a customer will not cancel their booking but in reality, the customer will cancel their booking.
Predicting a customer will cancel their booking but in reality, the customer will not cancel their booking.

Which case is more important?

Both cases are important:
If we predict that a booking will not be canceled and it does get canceled, the hotel will lose resources and will have to bear the additional costs of unsold rooms. The hotel might also have to bear the additional cost of advertising the room again on different distribution channels.
If we predict that a booking will be canceled and it does not get canceled, the hotel, having assumed the cancellation, might not be able to provide satisfactory service to the customer. This might damage the brand equity.

How to reduce the losses?

The hotel would want the F1 score to be maximized. The higher the F1 score, the better the chances of minimizing both False Negatives and False Positives.
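
To see why F1 balances both error types, it helps to compute it by hand. The sketch below uses hypothetical counts for the "Canceled" class (the helper `f1_from_counts` and the numbers are assumptions for illustration) and assumes the denominators are non-zero.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # share of predicted cancellations that were real
    recall = tp / (tp + fn)      # share of real cancellations that were caught
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 60 true positives, 20 false positives, 40 false negatives
tp, fp, fn = 60, 20, 40
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = f1_from_counts(tp, fp, fn)

print(round(precision, 2), round(recall, 2), round(f1, 4))
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 by reducing only one kind of error.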

Also, let's create a function to calculate and print the classification report and confusion matrix so that we don't have to rewrite the same code repeatedly for each model.

#function to print classification report and get confusion matrix in a proper format

def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8,5))
    sns.heatmap(cm, annot=True, fmt='.2f', xticklabels=['Not Canceled', 'Canceled'], yticklabels=['Not Canceled', 'Canceled'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

Logistic Regression

Question 4: Fit the logistic regression model on the train dataset using random_state=1 (2 Marks)

#define logistic regression model
log_reg= #write your code here

#fit the model
#write your code here
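
For reference, the general fitting pattern in scikit-learn looks like the sketch below. It runs on synthetic data rather than the hotel dataset: the arrays `X_toy` and `y_toy`, the model name `toy_model`, and the `max_iter=1000` setting are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: longer lead times are made more likely to be "canceled" (1)
rng = np.random.default_rng(1)
lead_time = rng.integers(0, 300, size=200)
y_toy = (lead_time + rng.normal(0, 40, size=200) > 150).astype(int)
X_toy = lead_time.reshape(-1, 1)

# Define the model; max_iter is raised here only so the unscaled toy data converges
toy_model = LogisticRegression(random_state=1, max_iter=1000)

# Fit the model on the (toy) training data
toy_model.fit(X_toy, y_toy)

print(toy_model.coef_[0], toy_model.intercept_)
```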

Let's check the coefficient of each independent variable (feature) in the data

Question 5: Write your observations on the below coefficients obtained from the logistic regression model (3 Marks)

pd.Series(log_reg.coef_[0], index=X_train.columns).sort_values(ascending=False)

market_segment_type_Online              0.624071
type_of_meal_plan_Not Selected          0.321587
no_of_weekend_nights                    0.180871
avg_price_per_room                      0.019686
lead_time                               0.015407
no_of_adults                            0.011404
no_of_week_nights                       0.006240
arrival_date                            0.001224
type_of_meal_plan_Meal Plan 3           0.000400
room_type_reserved_Room_Type 3          0.000341
arrival_year                           -0.001730
room_type_reserved_Room_Type 2         -0.004286
market_segment_type_Complementary      -0.008782
room_type_reserved_Room_Type 5         -0.011135
room_type_reserved_Room_Type 7         -0.017821
no_of_previous_cancellations           -0.024671
market_segment_type_Corporate          -0.031280
room_type_reserved_Room_Type 4         -0.032748
repeated_guest                         -0.043413
room_type_reserved_Room_Type 6         -0.045607
no_of_children                         -0.053844
arrival_month                          -0.057248
type_of_meal_plan_Meal Plan 2          -0.094291
required_car_parking_space             -0.139098
no_of_previous_bookings_not_canceled   -0.212323
market_segment_type_Offline            -0.598263
no_of_special_requests                 -1.550049
dtype: float64
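
Each coefficient is the change in the log-odds of cancellation for a one-unit increase in that feature, holding the others fixed. As a rough illustration, the sketch below shifts a hypothetical baseline probability by a coefficient taken from the output above and maps the result back through the logistic function. The baseline of 0.33 and the 30-day change in lead_time are assumptions chosen for illustration; they are not derived from the model's intercept.

```python
import math

def shift_probability(baseline_prob, coef, delta):
    """Return the probability after adding coef * delta to the baseline log-odds."""
    log_odds = math.log(baseline_prob / (1 - baseline_prob))
    new_log_odds = log_odds + coef * delta
    return 1 / (1 + math.exp(-new_log_odds))

baseline = 0.33  # hypothetical baseline cancellation probability (assumption)

# Coefficients copied from the output above
p_lead_time = shift_probability(baseline, 0.015407, 30)    # 30 more days of lead time
p_special_req = shift_probability(baseline, -1.550049, 1)  # one more special request

print(round(p_lead_time, 3), round(p_special_req, 3))
```

A positive coefficient moves the probability above the baseline and a negative one moves it below, but the size of the move depends on where the baseline sits on the logistic curve.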

Observations:_________

Question 6: Write your interpretations of the odds calculated from the logistic regression model coefficients (3 Marks)

odds = np.exp(log_reg.coef_[0]) #exponentiating a coefficient gives the odds ratio for a one-unit increase in that feature

# adding the odds to a dataframe and sorting the values
pd.DataFrame(odds, X_train.columns, columns=['odds']).sort_values(by='odds', ascending=False)

                                          odds
market_segment_type_Online            1.866510
type_of_meal_plan_Not Selected        1.379315
no_of_weekend_nights                  1.198261
avg_price_per_room                    1.019881
lead_time                             1.015527
no_of_adults                          1.011469
no_of_week_nights                     1.006260
arrival_date                          1.001225
type_of_meal_plan_Meal Plan 3         1.000400
room_type_reserved_Room_Type 3        1.000341
arrival_year                          0.998271
room_type_reserved_Room_Type 2        0.995723
market_segment_type_Complementary     0.991256
room_type_reserved_Room_Type 5        0.988927
room_type_reserved_Room_Type 7        0.982337
no_of_previous_cancellations          0.975631
market_segment_type_Corporate         0.969204
room_type_reserved_Room_Type 4        0.967782
repeated_guest                        0.957516
room_type_reserved_Room_Type 6        0.955417
no_of_children                        0.947580
arrival_month                         0.944360
type_of_meal_plan_Meal Plan 2         0.910018
required_car_parking_space            0.870143
no_of_previous_bookings_not_canceled  0.808703
market_segment_type_Offline           0.549766
no_of_special_requests                0.212238
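
An exponentiated coefficient is an odds ratio: the factor by which the odds of cancellation are multiplied for a one-unit increase in the feature, with the other features held fixed. A value above 1 raises the odds and a value below 1 lowers them. The helper below (a hypothetical name, not part of the notebook) converts a few coefficients from the earlier output into a percentage change in the odds.

```python
import math

def odds_change_percent(coef):
    """Percentage change in the odds for a one-unit increase in the feature."""
    return (math.exp(coef) - 1) * 100

# Coefficients copied from the logistic regression output above
coefs = {
    "market_segment_type_Online": 0.624071,
    "lead_time": 0.015407,
    "no_of_special_requests": -1.550049,
}

for name, coef in coefs.items():
    print(f"{name}: odds ratio {math.exp(coef):.4f}, change {odds_change_percent(coef):+.1f}%")
```

Read this way, the odds ratio of about 0.21 for no_of_special_requests means each additional special request cuts the odds of cancellation by roughly 79%, while the 1.87 for market_segment_type_Online means online bookings have about 87% higher odds than the reference market segment.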

Observations:_________

Now, let's check the performance of the model on the training set

Question 7: Check the performance on the training data and write your observations from the classification report and confusion matrix for the training set (3 Marks)

# Checking performance on the training data
y_pred_train = ________
metrics_score__________

Reading the confusion matrix (clockwise from the top-left cell, with Canceled, class 1, as the positive class, as in the classification report):

True Negative: Predicting the customer will not cancel the booking and the customer does not cancel the booking
False Positive: Predicting the customer will cancel the booking but the customer does not cancel the booking
True Positive: Predicting the customer will cancel the booking and the customer cancels the booking
False Negative: Predicting the customer will not cancel the booking but the customer cancels the booking
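
To make the cell positions concrete, here is a small dependency-free sketch that builds the same 2x2 layout that confusion_matrix returns (rows are actual classes, columns are predicted classes, with 0 = Not Canceled and 1 = Canceled) and derives the class-1 precision and recall reported in the classification report. The toy labels and variable names are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Return a 2x2 list cm where cm[a][p] counts rows with actual a and predicted p."""
    cm = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        cm[a][p] += 1
    return cm

# Toy labels (assumption): 0 = Not Canceled, 1 = Canceled
actual    = [0, 0, 0, 0, 1, 1, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 1, 0]

cm = confusion_counts(actual, predicted)

# Clockwise from the top-left cell of the heatmap
stayed_and_predicted_stay     = cm[0][0]
stayed_but_predicted_cancel   = cm[0][1]
canceled_and_predicted_cancel = cm[1][1]
canceled_but_predicted_stay   = cm[1][0]

# Precision and recall for the Canceled class (class 1)
precision_canceled = cm[1][1] / (cm[0][1] + cm[1][1])
recall_canceled    = cm[1][1] / (cm[1][0] + cm[1][1])

print(cm, round(precision_canceled, 2), round(recall_canceled, 2))
```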

Observations:_____

Precision-Recall Curve for Logistic Regression

y_scores=log_reg.predict_proba(X_train) #predict_proba gives the probability of each observation belonging to each class

precisions, recalls, thresholds = precision_recall_curve(y_train, y_scores[:,1])

#Plot values of precisions, recalls, and thresholds
plt.figure(figsize=(10,7))
plt.plot(thresholds, precisions[:-1], 'b--', label='precision')
plt.plot(thresholds, recalls[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc='upper left')
plt.ylim([0,1])
plt.show()

Observations:

We can see that precision and recall are balanced for the threshold of about 0.4.
Let's try to calculate the exact threshold where precision and recall are equal.

#calculating the exact threshold where precision and recall are equal.
for i in np.arange(len(thresholds)):
    if precisions[i]==recalls[i]:
        print(thresholds[i])

0.4174887148772929

The threshold of 0.42 would give a balanced precision and recall.
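
The exact equality test above only prints a threshold when precision and recall happen to coincide to the last floating-point digit, which is not guaranteed on every dataset. A more forgiving variant, sketched here in plain Python on made-up scores, picks the threshold where the gap between the two is smallest. The helper names and the toy data are assumptions; with the notebook's arrays the same idea reduces to taking the index i that minimizes abs(precisions[i] - recalls[i]).

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall for class 1 when predicting 1 if score >= threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall

# Toy labels and predicted probabilities (assumption, for illustration only)
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
scores = [0.1, 0.3, 0.35, 0.8, 0.45, 0.6, 0.9, 0.2]

# Candidate thresholds: each distinct score
candidates = sorted(set(scores))

def gap(threshold):
    """Absolute difference between precision and recall at this threshold."""
    precision, recall = precision_recall_at(y_true, scores, threshold)
    return abs(precision - recall)

best_threshold = min(candidates, key=gap)
print(best_threshold, precision_recall_at(y_true, scores, best_threshold))
```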

Question 8: Compare the performance of the model on training and testing sets after changing the threshold (2 Marks)

optimal_threshold1 = 0.42
metrics_score(y_train, y_scores[:,1]>optimal_threshold1)

              precision    recall f1-score   support

           0       0.84      0.84      0.84     17029
           1       0.68      0.68      0.68      8363

    accuracy                           0.79     25392
   macro avg       0.76      0.76      0.76     25392
weighted avg       0.79      0.79      0.79     25392

Let's check the performance of the model on the test data

#Checking performance on the testing data
y_pred_test = log_reg.predict_proba(X_test)
metrics_score(y_test, y_pred_test[:,1]>optimal_threshold1)

              precision    recall f1-score   support

           0       0.85      0.85      0.85      7361
           1       0.68      0.68      0.68      3522

    accuracy                           0.79     10883
   macro avg       0.77      0.77      0.77     10883
weighted avg       0.79      0.79      0.79     10883

Observations:_____

K - Nearest Neighbors (KNN)

KNN is a distance-based algorithm, and all distance-based algorithms are affected by the scale of the data.
We will scale the attributes (dataframe X defined above) before building the KNN model.
Then we need to identify the value of K to be used in KNN. We will use GridSearchCV to find the optimal value of K along with other hyperparameters.

# scaling the data
scaler=StandardScaler()
X_train_scaled=pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns) #fit_transform the training data
X_test_scaled=pd.DataFrame(scaler.transform(X_test), columns=X_test.columns) #transform the testing data
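
The reason for calling fit_transform on the training data but only transform on the test data is that the mean and standard deviation must come from the training set alone, so no information from the test set leaks into the model. A minimal plain-Python sketch of that standardization for a single column follows; the numbers are made up, and the population standard deviation is used because that is what StandardScaler computes.

```python
import statistics

train_col = [10.0, 20.0, 30.0]   # toy training values (assumption)
test_col = [20.0, 40.0]          # toy test values (assumption)

# "fit": learn the statistics from the training column only
mean = statistics.mean(train_col)
std = statistics.pstdev(train_col)  # population standard deviation (ddof = 0)

# "transform": apply the training statistics to both sets
train_scaled = [(x - mean) / std for x in train_col]
test_scaled = [(x - mean) / std for x in test_col]

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

The scaled training column has mean 0 and unit variance by construction, whereas the scaled test column generally does not, which is expected.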

Using GridSearchCV to find the value of K and tune the hyperparameters

Points to note:

Hyperparameter tuning is tricky in the sense that there is no direct way to calculate how a change in a hyperparameter value will reduce the loss of your model, so we usually resort to experimentation.
Grid search is a tuning technique that evaluates every combination of the hyperparameter values passed to it and keeps the combination with the best cross-validated score.
Because the search is exhaustive, the run time grows with the number of values and hyperparameters passed. The code might take up to 30 minutes to run.
The hyperparameters that we are tuning are:
- n_neighbors: Number of neighbors to use.
- weights: Weight function used in prediction, either 'uniform' or 'distance'.
  - uniform : uniform weights. All points in each neighborhood are weighted equally.
  - distance : weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors that are further away.
- p: Power parameter of the Minkowski distance. p = 1 is equivalent to the Manhattan distance (L1), and p = 2 to the Euclidean distance (L2).
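
To see what p and weights change in practice, the following plain-Python sketch computes the Minkowski distance for p = 1 and p = 2 and then compares a uniform vote with an inverse-distance vote. The points, distances, and helper names are made up for illustration and do not reflect how KNeighborsClassifier is implemented internally.

```python
from collections import defaultdict

def minkowski(a, b, p):
    """Minkowski distance: p=1 is Manhattan (L1), p=2 is Euclidean (L2)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def vote(neighbors, weights):
    """neighbors is a list of (distance, label) with distance > 0; returns the winning label."""
    totals = defaultdict(float)
    for distance, label in neighbors:
        totals[label] += 1.0 if weights == "uniform" else 1.0 / distance
    return max(totals, key=totals.get)

print(minkowski((0, 0), (3, 4), p=1))  # Manhattan distance
print(minkowski((0, 0), (3, 4), p=2))  # Euclidean distance

# Toy neighborhood (assumption): one close Canceled (1) and two farther Not Canceled (0)
neighbors = [(0.5, 1), (2.0, 0), (2.5, 0)]
print(vote(neighbors, "uniform"), vote(neighbors, "distance"))
```

With uniform weights the two farther neighbors outvote the close one; with distance weights the single close neighbor carries more total weight and wins.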

knn = KNeighborsClassifier()

params_knn = {'n_neighbors':np.arange(2,20,2), 'weights':['uniform','distance'], 'p':[1,2]}

grid_knn = GridSearchCV(estimator=knn, param_grid=params_knn, scoring='f1', cv=10)

model_knn=grid_knn.fit(X_train_scaled,y_train)

knn_estimator = model_knn.best_estimator_
print(knn_estimator)

KNeighborsClassifier(n_neighbors=14, p=1, weights='distance')
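
For a sense of how much work this search does, the sketch below enumerates the grid defined in params_knn using only the standard library. Every combination is cross-validated with cv=10, and GridSearchCV then refits the best combination once on the full training data (its default behavior). The variable names here are illustrative.

```python
from itertools import product

# Same values as params_knn above; range(2, 20, 2) mirrors np.arange(2, 20, 2)
param_grid = {
    "n_neighbors": list(range(2, 20, 2)),
    "weights": ["uniform", "distance"],
    "p": [1, 2],
}
cv_folds = 10

# Every combination of one value per hyperparameter
combinations = list(product(*param_grid.values()))
total_cv_fits = len(combinations) * cv_folds

print(len(combinations), total_cv_fits)
```

Adding one more hyperparameter with three candidate values would triple both numbers, which is why the grid is kept small.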

Question 9:

Fit the KNN model on the scaled training data using the optimal values of hyperparameters obtained from GridSearchCV (1 Mark)
Check the performance of the model on the scaled training and testing sets (2 Marks)
Compare the performance and write your observations (1 Mark)

#Fit the KNN model on the scaled training data
___________

#Make predictions on the scaled training data and check the performance (using metrics_score function)

y_pred_train = ___________
___________

#Make predictions on the scaled testing data and check the performance (using metrics_score function)
y_pred_test = __________
____________

Observations:_________

Question 10: Write the conclusion on the key factors that are driving the cancellations and write your recommendations to the business on how they can minimize the number of cancellations. (5 Marks)
