Individual Assignment M3.1
In this assignment, you will replicate the Play Golf example from class on a familiar dataset: Boston Housing! We will simplify the dataset a bit and give you a four-column version to work with: three candidate predictors (RM, LSTAT and PTRATIO) plus the target (MEDV). You will only build the tree to a depth of one. To be completed ON YOUR OWN: any plagiarism (i.e. copying code or comments, or submitting work that is not your own) will be dealt with according to Graduate School policy.
Each student will enter their student ID and get a different set of rows to compute their calculations. Good luck!
Rubric:
100 pts: Student has a clearly labeled notebook with no errors. Your headers should match the class example, but the numbers and symbols should be updated to match your example. Calculations for reduction in global standard deviation mimic class example and are correct. Plot the tree at the end with a max_depth=1 to ensure you got the same answer.
80 pts: A minor error is carried throughout the notebook, lack of comments or headers, and/or no decision tree visualization to check the work.
50 pts: Major error, sloppy code and/or no decision tree visualization to check the work.
Data Prep
Let's read in the Boston Housing data and subset a few columns to make things more intuitive.
# enter your Student ID here
studentID = 1234567 # update this based on your student ID!
import pandas as pd
import numpy as np
# read in the Boston Housing data
df = pd.read_csv('https://raw.githubusercontent.com/michelpf/mlnd-boston-housing/master/housing.csv')
df.info() # note that this version only has a few columns (RM, LSTAT, PTRATIO and MEDV)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 489 entries, 0 to 488
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   RM       489 non-null    float64
 1   LSTAT    489 non-null    float64
 2   PTRATIO  489 non-null    float64
 3   MEDV     489 non-null    float64
dtypes: float64(4)
memory usage: 15.4 KB
So that you don't have to evaluate ALL possible split points, let's recode RM, LSTAT and PTRATIO as 0/1 flags based on their median values.
df['RM'] = np.where(df['RM'] > np.median(df['RM']), 1, 0)
df['LSTAT'] = np.where(df['LSTAT'] > np.median(df['LSTAT']), 1, 0)
df['PTRATIO'] = np.where(df['PTRATIO'] > np.median(df['PTRATIO']), 1, 0)
df # we leave the target variable as is, we are doing regression!
     RM  LSTAT  PTRATIO      MEDV
0     1      0        0  504000.0
1     1      0        0  453600.0
2     1      0        0  728700.0
3     1      0        0  701400.0
4     1      0        0  760200.0
..   ..    ...      ...       ...
484   1      0        1  470400.0
485   0      0        1  432600.0
486   1      0        1  501900.0
487   1      0        1  462000.0
488   0      0        1  249900.0
489 rows × 4 columns
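The recode above uses a strict greater-than comparison, so any value exactly equal to the median is coded 0 and the two groups need not be the same size. The sketch below illustrates this on a small toy column; the values and the `RM_flag` name are hypothetical and not taken from the housing data.

# Hypothetical toy values, used only to show how the median recode behaves.
import numpy as np
import pandas as pd

toy = pd.DataFrame({"RM": [5.0, 6.0, 6.0, 7.0, 8.0]})

# Median of the toy column (the middle of the five sorted values).
cutoff = np.median(toy["RM"])

# Strict ">" sends values equal to the median to the 0 group.
toy["RM_flag"] = np.where(toy["RM"] > cutoff, 1, 0)
print(toy)

Here three rows fall in the 0 group and two in the 1 group, because both 6.0 values sit at the median.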
This is where every student gets a different dataset for modeling: the student ID seeds the random sample of 15 rows.
df = df.sample(n=15, random_state=studentID)
df
     RM  LSTAT  PTRATIO      MEDV
99    1      0        0  697200.0
280   1      0        1  783300.0
167   0      1        0  401100.0
264   1      0        0  680400.0
388   0      1        1  105000.0
309   1      0        1  499800.0
254   1      0        0  651000.0
427   0      1        1  226800.0
390   0      1        1  585900.0
408   0      1        1  174300.0
118   0      1        0  428400.0
156   0      1        0  275100.0
113   0      1        0  392700.0
344   0      0        1  432600.0
322   1      0        1  466200.0
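Passing the student ID as random_state is what makes each student's 15 rows reproducible: the same seed should return the same rows on every run, while a different seed will generally return a different set. A minimal sketch, assuming a hypothetical stand-in frame (`demo`) with the same number of rows as the housing data:

# Hypothetical stand-in for the 489-row frame; only the row labels matter here.
import pandas as pd

demo = pd.DataFrame({"row": range(489)})

# Two draws with the same seed, one draw with a different (made-up) seed.
first = demo.sample(n=15, random_state=1234567)
again = demo.sample(n=15, random_state=1234567)
other = demo.sample(n=15, random_state=7654321)

print(first.index.equals(again.index))   # same seed, same rows
print(sorted(first.index), sorted(other.index))

Sampling is without replacement by default, so no row appears twice within a draw.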
Good luck!
Now that you have your data, you can replicate and update the notebook from class. Your subheaders should look like this (with your numbers):
[Figure: example of DTR from scratch.PNG]
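As a rough guide to the calculation the rubric asks for, the sketch below computes the reduction in standard deviation for each candidate column on the 15 rows printed above for studentID = 1234567. It assumes the class example uses the population standard deviation (ddof=0) and weights each group by its share of rows; the helper name `std_reduction` is hypothetical, so adjust it to match the notebook from class.

# A sketch of standard deviation reduction (SDR), assuming population std (ddof=0).
import pandas as pd

# The 15 sampled rows shown above (studentID = 1234567).
sample = pd.DataFrame(
    {
        "RM":      [1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
        "LSTAT":   [0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0],
        "PTRATIO": [0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1],
        "MEDV": [697200.0, 783300.0, 401100.0, 680400.0, 105000.0,
                 499800.0, 651000.0, 226800.0, 585900.0, 174300.0,
                 428400.0, 275100.0, 392700.0, 432600.0, 466200.0],
    },
    index=[99, 280, 167, 264, 388, 309, 254, 427, 390, 408, 118, 156, 113, 344, 322],
)


def std_reduction(frame, feature, target="MEDV"):
    """Global std of the target minus the row-weighted std within each group."""
    global_std = frame[target].std(ddof=0)
    groups = frame.groupby(feature)[target]
    weights = groups.size() / len(frame)          # share of rows in each group
    weighted_std = (weights * groups.std(ddof=0)).sum()
    return global_std - weighted_std


results = {col: std_reduction(sample, col) for col in ["RM", "LSTAT", "PTRATIO"]}
best = max(results, key=results.get)

for col, value in results.items():
    print(f"{col}: SDR = {value:,.0f}")
print("Column with the largest reduction:", best)

The column with the largest reduction is the one to split on at depth one. Your own rows will differ, so the numbers and possibly the winning column will differ too.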
# don't forget the viz at the end to check your work - all of the numbers should match!
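One possible way to do this check is with scikit-learn's DecisionTreeRegressor and plot_tree, sketched below on the 15 rows printed above for studentID = 1234567. Note one assumption that matters when comparing numbers: with its default criterion the tree reports squared error (a variance) at each node, so take the square root to compare against a population standard deviation.

# A sketch of the depth-one check, assuming scikit-learn and matplotlib are installed.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

# The 15 sampled rows shown above (studentID = 1234567).
sample = pd.DataFrame(
    {
        "RM":      [1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
        "LSTAT":   [0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0],
        "PTRATIO": [0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1],
        "MEDV": [697200.0, 783300.0, 401100.0, 680400.0, 105000.0,
                 499800.0, 651000.0, 226800.0, 585900.0, 174300.0,
                 428400.0, 275100.0, 392700.0, 432600.0, 466200.0],
    },
    index=[99, 280, 167, 264, 388, 309, 254, 427, 390, 408, 118, 156, 113, 344, 322],
)

X = sample[["RM", "LSTAT", "PTRATIO"]]
y = sample["MEDV"]

# Depth one means a single split on one of the three recoded columns.
model = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, y)

# Draw the tree; in a notebook the figure renders inline.
fig, ax = plt.subplots(figsize=(6, 4))
plot_tree(model, feature_names=list(X.columns), filled=True, ax=ax)

# Pull out the root split and convert the root's squared error to a std.
root_feature = X.columns[model.tree_.feature[0]]
root_std = np.sqrt(model.tree_.impurity[0])
print("Root split on:", root_feature)
print("Root std (sqrt of squared error):", round(root_std, 1))

If the root split and the converted standard deviations agree with your hand calculations, the notebook is consistent with the tree.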