CSCI433/CSCI933: Machine Learning - Algorithms and Applications
Assignment Problem Set #1
Motivation
The goal of this assignment is to design and compare two classifiers using the data set provided. Support vector machines and logistic regression are two well-known classifiers worthy of your attention. The idea is to use them as a vehicle for learning how to design a classifier using a library such as ScikitLearn. You will need to read the library's documentation (https://scikit-learn.org/stable/user_guide.html) and understand how to select the various options.
The first step to take is to study the data carefully by reading about the features (variables), particularly the range of plausible values, meaning, method of measurement, etc. It is expected that a good deal of effort will need to be expended on data preparation (scaling, imputation, etc.).
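As a rough illustration of the preparation steps mentioned above, the sketch below chains imputation and scaling in a ScikitLearn Pipeline. The small array is made up, and treating a zero as a missing value is only an assumption for the sake of the example; check the feature descriptions before doing the same with the real data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up measurements (rows = subjects, columns = features); not the real data.
X = np.array([
    [148.0, 72.0, 33.6],
    [85.0, 0.0, 26.6],    # 0.0 is an implausible reading in the second column
    [183.0, 64.0, 0.0],   # 0.0 is an implausible reading in the third column
    [89.0, 66.0, 28.1],
])

# Assumption: a zero in these columns means "not recorded", so mark it as NaN.
X_marked = np.where(X == 0.0, np.nan, X)

prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaN with the column median
    ("scale", StandardScaler()),                   # zero mean, unit variance per column
])
X_ready = prep.fit_transform(X_marked)
print(X_ready.round(2))
```

Fitting such a pipeline on the training portion only, and then calling transform on the validation and test portions, keeps test statistics from leaking into the preprocessing.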
The Machine Learning/Python books provided on Moodle will be of great help in this regard. These books can also be used as a de facto reference manual for the Python modules (ScikitLearn, matplotlib, numpy, scipy, etc.) used for Machine Learning. You should refer to the books on Machine Learning (also provided on Moodle) for the theory underlying the various classifiers that you may choose to use in your experiment.
This dataset was originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The aim of gathering this data was to design a classifier (or predictor) that will use the diagnostic measurements (features) and classify a subject (person) as having/not having diabetes. All subjects in the dataset are females at least 21 years old of Pima Indian heritage.
Features/variables
The dataset is organised such that each row contains the features for one subject. The columns contain the following features:
The following should be taken as the specifications of this assignment.
python3 classify -d data_file
and output your comparative result on standard output. The following code snippet will allow you to parse command-line arguments:
import argparse

parser = argparse.ArgumentParser(
    description='Comparative study of logistic regression and svm')
parser.add_argument('-d', metavar='dataset', required=True,
                    dest='dataset', action='store',
                    help='path and name of dataset')
args = parser.parse_args()
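One possible way to use the parsed path is sketched below. The file layout (comma-separated, one header row, numeric values) is an assumption, since the specification does not state it, and the helper names build_parser and load_rows are hypothetical. The demo writes a throwaway file and passes the arguments as a list so that it runs without a real command line.

```python
import argparse
import csv
import os
import tempfile


def build_parser():
    """Same option as the assignment snippet, wrapped in a function."""
    parser = argparse.ArgumentParser(
        description='Comparative study of logistic regression and svm')
    parser.add_argument('-d', metavar='dataset', required=True,
                        dest='dataset', action='store',
                        help='path and name of dataset')
    return parser


def load_rows(path):
    """Read a comma-separated file (assumed layout) into a header and float rows."""
    with open(path, newline='') as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = [[float(value) for value in row] for row in reader]
    return header, rows


# Demo only: a tiny made-up file standing in for the real data file.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('f1,f2,label\n1.0,2.0,0\n3.0,4.0,1\n')

args = build_parser().parse_args(['-d', tmp.name])
header, rows = load_rows(args.dataset)
os.remove(tmp.name)
print(header, len(rows))
```

In the actual program, parse_args() would be called without a list so that the path is taken from the command line.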
Your report should be according to the following format (i.e. headings):
Title (5 marks) - Give your report a nice and meaningful title. Write your name and student number on the title page.
Introduction (10 marks) - Describe the dataset in your own words and highlight various statistics (mean, variance, etc.) along with any significant observations that can be gleaned from the data. Are there missing values? Is there class imbalance?
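The kind of summary asked for in the Introduction could be produced along the following lines. This is only a sketch using pandas on a made-up table; the column names feature_a, feature_b and label are placeholders, not the dataset's real headers.

```python
import numpy as np
import pandas as pd

# Made-up table; column names are placeholders, not the real headers.
df = pd.DataFrame({
    "feature_a": [148.0, 85.0, np.nan, 89.0, 137.0, 116.0],
    "feature_b": [33.6, 26.6, 23.3, np.nan, 43.1, 25.6],
    "label": [1, 0, 1, 0, 1, 0],
})

# Per-feature statistics (NaN entries are skipped by default).
summary = df.drop(columns="label").agg(["mean", "var", "min", "max"])

# Number of missing entries in each column.
missing_per_column = df.isna().sum()

# Fraction of subjects in each class, to check for imbalance.
class_share = df["label"].value_counts(normalize=True)

print(summary)
print(missing_per_column)
print(class_share)
```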
Data preparation (20 marks) - Describe the various methods and implications of the data preparation you undertook. Note that this is very important, as it will have a significant impact on the accuracy obtained from your classifier. You should discuss how you split the data for training, validation and testing.
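One common way to obtain training, validation and test portions is to call train_test_split twice, as sketched below. The 60/20/20 proportions, the random seed and the synthetic data are illustrative assumptions, not requirements of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # synthetic features
y = np.array([0] * 65 + [1] * 35)    # synthetic, imbalanced labels

# Step 1: hold out 20% of the data as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)

# Step 2: split the remaining 80% into 75% training and 25% validation,
# which gives roughly 60/20/20 overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

print(len(y_train), len(y_val), len(y_test))
```

Passing stratify keeps the class proportions similar in each portion, which matters when the classes are imbalanced.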
Classifiers (40 marks) - Describe the two classifiers you have tested in your experimentation. In addition to the default parameter values, you are required to experiment with other parameter values and report on their effect on the results. Read the documentation of the respective functions in ScikitLearn. This is very important because it shows how well you understand the properties of the classifiers. It is expected that you will write mathematical equations that describe the classifier model (i.e. a short derivation).
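Experimenting with parameter values can be organised with a grid search, as in the following sketch. The data are synthetic (make_classification), and the particular values of C and the kernels tried are arbitrary choices for illustration; the documentation lists many other options worth exploring.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data set.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Each entry: (pipeline, parameter grid). make_pipeline names each step after
# its lowercased class name, hence the "logisticregression__" / "svc__" prefixes.
candidates = {
    "logistic regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.01, 0.1, 1, 10]},
    ),
    "svm": (
        make_pipeline(StandardScaler(), SVC()),
        {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # 5-fold cross-validated accuracy for every parameter combination.
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

The full table of scores per parameter combination is available in search.cv_results_, which is useful when reporting how each parameter affects the results.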
Evaluation (20 marks) - Describe and justify the methods of performance evaluation you have adopted. State the comparative evaluation estimates and justify the differences. You can use a table to present your results.
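A comparative table of the sort mentioned above might be assembled as follows. This is a sketch on synthetic data; accuracy, F1 and the confusion matrix are example metrics only, and the choice of metrics still needs to be justified in the report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, mildly imbalanced stand-in for the real data set.
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.65, 0.35], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "svm": SVC(),
}

rows = []
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    rows.append((name,
                 accuracy_score(y_test, y_pred),
                 f1_score(y_test, y_pred),
                 confusion_matrix(y_test, y_pred)))

# Plain-text results table on standard output.
print(f"{'model':<22}{'accuracy':>10}{'F1':>8}")
for name, acc, f1, cm in rows:
    print(f"{name:<22}{acc:>10.3f}{f1:>8.3f}")
    print(cm)  # rows = true class, columns = predicted class
```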
Conclusions (30 marks) - You are required to reflect and write about the differences amongst the various classifier models relative to their parameters, the amount of data required for training, the nature/format of data required and the accuracy obtained. In addition, you are required to reflect on and describe any significant trend/observation you discovered with regard to which features may be dominant in determining whether a subject will have diabetes. For example, is there a subgroup of subjects that is more likely to have diabetes?
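One simple, and admittedly rough, way to look for dominant features is to compare the magnitudes of logistic regression coefficients after standardising the inputs. The sketch below uses synthetic data in which the first column is deliberately made the main driver of the label; the feature names are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

# Synthetic labels driven mostly by column 0, weakly by column 1, not by column 2.
signal = 3.0 * X[:, 0] + 0.3 * X[:, 1]
y = (signal + rng.normal(scale=0.5, size=500) > 0).astype(int)

names = ["feature_0", "feature_1", "feature_2"]  # placeholder names

# Standardise so that coefficient magnitudes are on a comparable scale.
X_std = StandardScaler().fit_transform(X)
clf = LogisticRegression(max_iter=1000).fit(X_std, y)

# Rank features by absolute coefficient, largest first.
ranking = sorted(zip(names, clf.coef_[0]), key=lambda pair: abs(pair[1]), reverse=True)
for name, weight in ranking:
    print(f"{name}: {weight:+.2f}")
```

Coefficient size is only indicative: strongly correlated features can share or swap weight, so any conclusion drawn this way should be checked against the data itself.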