Individual Assignment M3.2 (DTC from scratch)

In this assignment, you will replicate the Play Golf example on a familiar dataset - Boston Housing! This is related to M3.1 but redone for classification. We will simplify the dataset, recode the target variable to 0/1, and give you only a few candidate columns to use. You will perform only one split and then stop (this is enough to evaluate your proficiency).

To be completed ON YOUR OWN - any plagiarism (i.e. copying code or comments, or submitting work that is not your own) will be dealt with according to Graduate School policy.

Each student will enter their student ID and get a different set of rows to use in their calculations. Good luck!

Rubric:

100 pts: Student has a clearly labeled notebook with no errors. Your headers should match the class example, but the numbers and symbols should be updated to match your example. Calculations for information gain mimic the class example and are correct. Plot the tree at the end with max_depth=1 to confirm you got the same answer.
80 pts: A minor error is carried throughout the notebook, lack of comments or headers, and/or no decision tree visualization to check the work.
50 pts: Major error, sloppy code and/or no decision tree visualization to check the work.

Data Prep

Let's read in the Boston Housing data and subset a few columns to make things more intuitive.

# enter your Student ID here
studentID = 1234567 # update this based on your student ID!

import pandas as pd
import numpy as np
# read in the Boston Housing data
df = pd.read_csv('https://raw.githubusercontent.com/michelpf/mlnd-boston-housing/master/housing.csv')
df.info() # note that this version only has a few columns (RM, LSTAT, PTRATIO and MEDV)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 489 entries, 0 to 488
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   RM       489 non-null    float64
 1   LSTAT    489 non-null    float64
 2   PTRATIO  489 non-null    float64
 3   MEDV     489 non-null    float64
dtypes: float64(4)
memory usage: 15.4 KB

So that you don't have to evaluate ALL possible combinations - let's recode RM, LSTAT and PTRATIO to 0/1 based on their median values (1 if above the median, 0 otherwise).

df['RM'] = np.where(df['RM'] > np.median(df['RM']), 1, 0)
df['LSTAT'] = np.where(df['LSTAT'] > np.median(df['LSTAT']), 1, 0)
df['PTRATIO'] = np.where(df['PTRATIO'] > np.median(df['PTRATIO']), 1, 0)
df['MEDV'] = np.where(df['MEDV'] > np.median(df['MEDV']), 1, 0) # we recoded the target variable!
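
Note that the recode uses a strict > comparison, so a value exactly equal to the median is coded 0. A minimal plain-Python sketch of the same rule that np.where applies above, using hypothetical toy values (not taken from the housing data):

```python
from statistics import median

# Hypothetical toy values for illustration only, not from the housing data.
values = [5.9, 6.2, 6.2, 6.5, 7.1]

# The median of the toy values is the cutoff, as in the np.where recode.
cutoff = median(values)

# 1 if strictly above the median, 0 otherwise (ties with the median become 0).
recoded = [1 if v > cutoff else 0 for v in values]

print("cutoff:", cutoff)
print("recoded:", recoded)
```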

This is where every student gets a different dataset for modeling.

df = df.sample(n=15, random_state=studentID)
df

     RM  LSTAT  PTRATIO  MEDV
322   1      0        1     1
388   0      1        1     0
113   0      1        0     0
427   0      1        1     0
408   0      1        1     0
118   0      1        0     0
390   0      1        1     1
344   0      0        1     0
309   1      0        1     1
264   1      0        0     1
254   1      0        0     1
167   0      1        0     0
280   1      0        1     1
156   0      1        0     0
99    1      0        0     1

Now that you have your data, you can replicate and update the notebook from class. Your subheaders should look like this (with your numbers):

# don't forget the viz at the end to check your work - all of the numbers should match!
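
The class notebook is not reproduced here, so the following is only a sketch of how the single-split calculation might look. It assumes the standard Shannon entropy (base 2) and information gain definitions, and it hard-codes the 15 rows shown above for the placeholder studentID; your own rows will differ. The helper names entropy and information_gain are hypothetical, not taken from the class example.

```python
from math import log2

# The 15 sampled rows shown above, as (RM, LSTAT, PTRATIO, MEDV) tuples.
rows = [
    (1, 0, 1, 1), (0, 1, 1, 0), (0, 1, 0, 0), (0, 1, 1, 0), (0, 1, 1, 0),
    (0, 1, 0, 0), (0, 1, 1, 1), (0, 0, 1, 0), (1, 0, 1, 1), (1, 0, 0, 1),
    (1, 0, 0, 1), (0, 1, 0, 0), (1, 0, 1, 1), (0, 1, 0, 0), (1, 0, 0, 1),
]
features = {"RM": 0, "LSTAT": 1, "PTRATIO": 2}  # column positions in each tuple
TARGET = 3  # position of MEDV, the recoded 0/1 target


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * log2(p)
    return result


def information_gain(data, col):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    parent = entropy([r[TARGET] for r in data])
    weighted = 0.0
    for value in {r[col] for r in data}:
        subset = [r[TARGET] for r in data if r[col] == value]
        weighted += len(subset) / len(data) * entropy(subset)
    return parent - weighted


gains = {name: information_gain(rows, col) for name, col in features.items()}
best = max(gains, key=gains.get)  # candidate column with the largest gain

print(f"entropy of MEDV: {entropy([r[TARGET] for r in rows]):.4f}")
for name, gain in gains.items():
    print(f"gain for {name}: {gain:.4f}")
print("best first split:", best)
```

The column with the largest gain is the one a depth-1 tree fitted with the entropy criterion would be expected to split on, which is what the visualization check at the end is meant to confirm.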
