CSCI433/CSCI933: Machine Learning - Algorithms and Applications Assignment

Problem Set #1

Motivation

The goal of this assignment is to design and compare two classifiers using the data set provided. Support vector machine and logistic regression are two well-known classifiers worthy of your attention. The idea is to use them as a vehicle for learning how to design a classifier using a library such as scikit-learn. You will need to read the documentation of the library (https://scikit-learn.org/stable/user_guide.html) and understand how to select the various options.

The first step to take is to study the data carefully by reading about the features (variables), particularly the range of plausible values, meaning, method of measurement, etc. It is expected that a good deal of effort will need to be expended on data preparation (scaling, imputation, etc.).
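
As an illustration of the preparation steps mentioned above, here is a minimal sketch of imputation followed by scaling. It assumes (the assignment does not state this) that a zero in a measurement such as Glucose or BMI marks a missing reading rather than a real value. The small table is made-up illustration data, not rows from the provided file.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up rows for illustration only; column names follow the feature list.
df = pd.DataFrame({
    "Glucose": [148.0, 85.0, 0.0, 89.0],
    "BMI": [33.6, 0.0, 23.3, 28.1],
})

# Assumption: a zero is physiologically implausible here, so treat it as missing.
prepared = df.replace(0.0, np.nan)

# Fill missing entries with the column median, then standardise each column.
prep = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
X = prep.fit_transform(prepared)
print(X)
```

In a real experiment the pipeline would normally be fitted on the training split only, so that held-out rows do not influence the medians or the scaling statistics.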

The Machine Learning/Python books provided on Moodle will be of great help in this regard. These books could also be used as a de facto reference manual for the Python modules (scikit-learn, matplotlib, numpy, scipy, etc.) used for Machine Learning. You should refer to the books on Machine Learning (also provided on Moodle) for the theory underlying the various classifiers that you may choose to use in your experiment.

 

About the data

This dataset was originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The aim of gathering this data was to design a classifier (or predictor) that will use the diagnostic measurements (features) and classify a subject (person) as having or not having diabetes. All subjects in the dataset are females at least 21 years old of Pima Indian heritage.

 

Features/variables

The dataset is organised such that each row contains the features for a subject. The columns contain the following features:

 

  1. Pregnancies: Number of times pregnant
  2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  3. BloodPressure: Diastolic blood pressure (mm Hg)
  4. SkinThickness: Triceps skin fold thickness (mm)
  5. Insulin: 2-Hour serum insulin (mu U/ml)
  6. BMI: Body mass index (weight in kg/(height in m)^2)
  7. DiabetesPedigreeFunction: Diabetes pedigree function
  8. Age: Age (years)
  9. Outcome: Class variable (0 or 1)

 

 

Task

The following should be taken as the specifications of this assignment.

 

  1. This is an individual assignment, but you may discuss it with your classmates.
  2. The two classifiers to be implemented and used in this assignment are support vector machine and logistic regression.
  3. Design and test your classifiers on the dataset provided.
  4. You must report the test accuracy for a number of classifier parameters.
  5. Your code must be written using the Python programming language.
  6. You can use the API exposed in the machine learning library scikit-learn. This should simplify your work significantly.
  7. Your code must run on the command line as:

python3 classify -d data_file

and output your comparative result on standard output. The following code snippet will allow you to parse command line arguments:

import argparse

# Read the dataset path from the mandatory -d option.
parser = argparse.ArgumentParser(
    description='Comparative study of logistic regression and svm')
parser.add_argument('-d', metavar='dataset', required=True, dest='dataset',
                    action='store', help='path and name of dataset')
args = parser.parse_args()

  8. Submit a four-page (i.e. 2,000 words) report on your results for grading. See the specifications of the report below. You should use figures and tables where they make your results clearer and more intelligible. The four pages do not include the title page.
  9. Ability to communicate your work using good English will be marked. This implies that if your submitted report is not intelligible, you will lose marks.
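
The argument-parsing snippet in the task list can be wired into a complete script along the following lines. This is only a sketch under stated assumptions: the function names `parse_args`, `compare` and `main` are hypothetical, the data file is assumed to be a CSV with a header row and an `Outcome` column, and the 80/20 split and default hyperparameters are illustrative choices rather than requirements of the assignment.

```python
import argparse

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def parse_args(argv=None):
    """Parse the -d option; argv=None means 'use the real command line'."""
    parser = argparse.ArgumentParser(
        description='Comparative study of logistic regression and svm')
    parser.add_argument('-d', metavar='dataset', required=True, dest='dataset',
                        action='store', help='path and name of dataset')
    return parser.parse_args(argv)


def compare(path):
    """Return {classifier name: test accuracy} for the CSV file at `path`."""
    # Assumption: CSV with a header row and an 'Outcome' class column.
    data = pd.read_csv(path)
    X = data.drop(columns='Outcome')
    y = data['Outcome']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    models = {
        'logistic regression': make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)),
        'svm': make_pipeline(StandardScaler(), SVC()),
    }
    # Fit each pipeline on the training rows and score it on the held-out rows.
    return {name: model.fit(X_train, y_train).score(X_test, y_test)
            for name, model in models.items()}


def main(argv=None):
    """Print one line of test accuracy per classifier on standard output."""
    args = parse_args(argv)
    for name, accuracy in compare(args.dataset).items():
        print(f'{name}: test accuracy = {accuracy:.3f}')
```

Ending the file with a call to `main()` under the usual `if __name__ == '__main__':` guard would let it run from the command line with the `-d` option.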

 

Report

Your report should follow this format (i.e. headings):

 

Title (5 marks) - Give your report a nice and meaningful title. Write your name and student number on the title page.

Introduction (10 marks) - Describe the dataset in your own words and highlight various statistics (mean, variance, etc.) along with any significant observations that could be gleaned from the data. Are there missing values? Is there class imbalance?
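
The statistics asked for in the Introduction could be gathered with pandas roughly as follows. The rows are made up for illustration, and counting zeros as a hint of missing values is an assumption, not something the assignment states.

```python
import pandas as pd

# Made-up rows for illustration; the real file has the nine columns listed earlier.
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "Age": [50, 31, 32, 21, 33, 30],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

features = df.drop(columns="Outcome")

# Per-feature summary statistics.
summary = features.agg(["mean", "var", "min", "max"])

# Share of each class, to judge whether the classes are imbalanced.
class_share = df["Outcome"].value_counts(normalize=True)

# Assumption: zeros in these measurements may stand for missing readings.
zero_counts = (features == 0).sum()

print(summary)
print(class_share)
print(zero_counts)
```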

Data preparation (20 marks) - Describe the data preparation methods you used and their implications. Note that this is very important, as it will have a significant impact on the accuracy obtained from your classifiers. You should discuss how you split the data for training, validation and testing.
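
One possible way to split the data into training, validation and test parts is sketched below. The 60/20/20 proportions, the random seed and the synthetic arrays are assumptions chosen for illustration; the assignment does not prescribe a particular split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the Outcome column.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = np.array([0] * 65 + [1] * 35)

# Hold out 20% as the test set, then take 25% of the remainder for validation,
# which gives 60/20/20 overall. stratify keeps the class ratio similar in each part.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```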

Classifiers (40 marks) - Describe the two classifiers you have tested in your experimentation. In addition to some default values for parameters, you are required to experiment with other parameter values and report on their effect on the results. Read the documentation of the respective functions in scikit-learn. This is very important because it shows how well you understand the properties of the classifiers. It is expected that you will write mathematical equations that describe the classifier model (i.e. a short derivation).
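
Experimenting with parameter values can be organised as a cross-validated grid search. The sketch below uses synthetic data in place of the prepared training split, and the grids are hypothetical: the listed values of `C` and the kernels are illustrative, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data standing in for the prepared training split.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grids; make_pipeline names each step after its lowercased class.
candidates = {
    "logistic regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    ),
    "svm": (
        make_pipeline(StandardScaler(), SVC()),
        {"svc__C": [0.1, 1.0, 10.0], "svc__kernel": ["linear", "rbf"]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # 5-fold cross-validation over every parameter combination in the grid.
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

The `cv_results_` attribute of each fitted search holds the score for every combination, which is convenient for building the table of parameter effects.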

 

Evaluation (20 marks) - Describe and justify the methods of performance evaluation you have adopted. State the comparative evaluation estimates and justify the differences. You can use a table to present your results.
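
Accuracy is only one possible evaluation measure. As a small sketch, the hand-made labels below (not results from the dataset) show how a confusion matrix yields accuracy, precision, recall and F1 with scikit-learn.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hand-made labels for illustration: 1 = diabetes, 0 = no diabetes.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)
recall = tp / (tp + fn)       # share of actual positives that were found
precision = tp / (tp + fp)    # share of positive predictions that were right
f1 = f1_score(y_true, y_pred)

print(accuracy, precision, recall, f1)
```

Here the accuracy is 0.7 while the recall is only 0.5, which illustrates why accuracy alone can be misleading when the classes are imbalanced.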

 

Conclusions (30 marks) - You are required to reflect and write about the differences amongst the various classifier models relative to their parameters, amount of data required for training, nature/format of data required and the accuracy obtained. In addition, you are required to reflect on and describe any significant trend or observation you discovered with regard to which features may be dominant in determining whether a subject will have diabetes. For example, is there a subgroup of subjects that is more likely to have diabetes?
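
One simple way to look for dominant features is to compare logistic regression coefficients after standardising the inputs. The sketch below uses synthetic data in which, by construction, only the first feature drives the label; the feature names are borrowed from the feature list purely as labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: by construction only the first column influences the label.
rng = np.random.default_rng(0)
names = ["Glucose", "BMI", "Age"]
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

# Standardise so that coefficient magnitudes are on a comparable scale.
model = LogisticRegression(max_iter=1000).fit(StandardScaler().fit_transform(X), y)

# Rank features by absolute coefficient, largest first.
ranking = sorted(zip(names, model.coef_[0]),
                 key=lambda item: abs(item[1]), reverse=True)
for name, weight in ranking:
    print(f"{name}: {weight:+.2f}")
```

A large coefficient indicates association within this linear model, not causation, and the ranking can be unstable when features are strongly correlated.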

 

What needs to be submitted?

PLEASE READ VERY CAREFULLY

 

  • You are required to submit your 4-page (2,000 words) report using the section/heading format specified above. The report should be typed (or typeset using LaTeX) with 11-point font, one-and-a-half spacing and 1.5 cm margins all round. The four pages do not include the title page. The submitted report MUST be a PDF file. Any WORD document must be converted to PDF before submission. Non-PDF reports will not be marked.

 

  • You must submit the code for the classifiers you implemented for this assignment and it must produce the results quoted in your report.

 

  • Place your report and your source code in a folder with your name and "zip" or "rar" the folder before submission. This is important because of the way Moodle submission works.
