This assignment is inspired by a real-life scenario

Computer Science

This assignment is inspired by a real-life scenario. Imagine you have been hired as a Data Scientist by a major e-commerce retailer. Your responsibility is to analyse customer reviews to determine the rating of new products.

For this assignment you will be given a dataset of 5000 customer reviews from Amazon.

Each review consists of a short text, and one of five ratings: a number from 1 to 5. You are required to evaluate various supervised machine learning methods using a variety of features and settings.

You are required to apply three machine learning methods discussed in the lecture: Naïve Bayes (NB), Decision Tree (DT) and k-Nearest Neighbour (KNN). These methods will be trained on the first 4000 reviews and tested on the remaining 1000. An incomplete solution is provided to minimize the required coding effort so you can focus on the result analysis.

For all models, consider a review to be a collection of words, where a word is a string of at least two letters, numbers or the symbols / (slash), - (hyphen), $ or %, delimited by a space, after replacing three dots (…) by a space. The (…) part has been done, but you need to handle the other characters.
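The word definition above can be sketched as a small tokenizer. This is only an illustration under stated assumptions, not the code in the provided incomplete solution: the name `tokenize` is hypothetical, "letters" is taken to mean ASCII letters, and characters outside the allowed set are assumed to be deleted rather than replaced by a space.

```python
import re

# Assumption: any character that is not a letter, digit, whitespace,
# or one of / - $ % is deleted (not replaced by a space).
_DISALLOWED = re.compile(r"[^A-Za-z0-9/\-$%\s]")


def tokenize(review):
    """Split a review into words of at least two allowed characters."""
    # Three dots (typed as "..." or as the single character "…") become a space.
    text = review.replace("...", " ").replace("\u2026", " ")
    # Drop every character outside the allowed set.
    text = _DISALLOWED.sub("", text)
    # Words are delimited by whitespace; keep those with two or more characters.
    return [tok for tok in text.split() if len(tok) >= 2]


print(tokenize("Great price... only $5/unit, 100% worth it!"))
```

A function like this could be passed to CountVectorizer through its `tokenizer` argument; whether the incomplete solution does it that way or through `token_pattern` is not stated in the brief.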

The default parameters in your code should be as follows:

• All models: no stemming, no conversion of text to lowercase (lowercase=False in CountVectorizer), stop words removed (stop_words="english" in CountVectorizer), first 4000 reviews used for training, last 1000 used for testing

• Decision Tree: max_depth=None, criterion="entropy", random_state=0. Setting max_depth to None means the full Decision Tree is used.

• K-Nearest Neighbour: n_neighbors = 5
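The defaults listed above could be collected in one place roughly as follows. This is a sketch assuming scikit-learn; the helper names (`make_vectorizer`, `make_models`, `split`) are hypothetical, and `MultinomialNB` is an assumption because the brief does not say which Naïve Bayes variant to use.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

TRAIN_SIZE = 4000  # first 4000 reviews for training, the rest for testing


def make_vectorizer():
    # No lowercasing, English stop words removed (the stated defaults).
    return CountVectorizer(lowercase=False, stop_words="english")


def make_models():
    return {
        # Assumption: the Naive Bayes variant is not specified in the brief.
        "NB": MultinomialNB(),
        "DT": DecisionTreeClassifier(max_depth=None, criterion="entropy", random_state=0),
        "KNN": KNeighborsClassifier(n_neighbors=5),
    }


def split(texts, labels, train_size=TRAIN_SIZE):
    # Positional split: no shuffling, so the test set is always the last rows.
    return texts[:train_size], texts[train_size:], labels[:train_size], labels[train_size:]
```

Keeping the defaults in factory functions makes it easier to change a single setting (for example stop-word removal) per question while leaving everything else fixed.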

Provided for this assignment:

1. Dataset.tsv: 5000 customer reviews in the tsv format (ins_number, text, rating)

2. Incomplete_solution.py: an incomplete solution that you can use as a base for developing your code.

Deliverables:

1. ZIP file containing:

a. Report in PDF format (.pdf) answering the questions specified below. The report should contain answers to each question clearly numbered with tables showing results, if required. Do not use screenshots of classification_report. Questions may require studying additional material in the textbooks listed in the Outline of this course and experimenting with the code.

b. All code developed for this assignment in one file (.py)

Questions to answer in the report:

In questions 2-6, "metrics" means the following set of metrics obtained on the test part of the data (values to 3 decimal places):

                      Precision   Recall   F1
Micro: all ratings
Macro: all ratings

This table can be used to present the results.
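To make the difference between the two rows concrete, here is a minimal pure-Python sketch of micro- and macro-averaged precision, recall and F1. The function name `micro_macro` is hypothetical; macro F1 is taken as the mean of the per-class F1 scores, which is the convention scikit-learn follows. In practice the same numbers can be obtained from `sklearn.metrics.precision_recall_fscore_support` with `average="micro"` or `average="macro"`.

```python
from collections import Counter


def _prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_macro(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    # Macro: compute each metric per class, then average over classes.
    per_class = [_prf(tp[c], fp[c], fn[c]) for c in labels]
    macro = tuple(round(sum(col) / len(labels), 3) for col in zip(*per_class))

    # Micro: pool the counts over all classes, then compute once.
    micro = tuple(
        round(v, 3)
        for v in _prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    )
    return {"micro": micro, "macro": macro}


result = micro_macro([1, 1, 2, 3], [1, 2, 2, 3])
print(result["micro"])  # (0.75, 0.75, 0.75)
print(result["macro"])  # (0.833, 0.833, 0.778)
```

When every review has exactly one rating, the micro-averaged precision, recall and F1 all equal accuracy, so the macro row is the one that reflects performance on rare ratings.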

1. [1 mark] Show the distribution of all instances over the ratings in tabular or chart form. How might this distribution affect training/testing, and what metrics would be suitable to measure model performance? Briefly justify your answer.

2. [1 mark] Show metrics for all three methods with default parameters. Do not take screenshots from the terminal.

3. [1 mark] Change the default parameters in the code by not removing stop words. Show metrics for all three methods, compare to the default setting, and briefly explain the differences.

4. [1 mark] Change the default parameters by enabling stemming and lowercasing in the code. Show metrics for all three methods, compare to the default setting, and briefly explain the differences.

5. [2 marks] Vary k in KNN from 1 to 10 and choose the best k as measured by macro F1. Compare this metric to the default setting, and briefly explain the difference. You can use a Python loop to run KNN for different values of k.

6. [3 marks] Limiting the depth of a Decision Tree is one way to prune it and to balance bias against variance. Vary max_depth in DT from 5 to 15 and choose the best model as measured by macro F1. Compare to the default setting, and briefly explain the difference. You can use a Python loop to run DT for different values of max_depth.

7. [1 mark] One of the first steps of the provided incomplete_solution.py is to create a new column. Please refer to the function “stentiment_converter” and explain what the data in the new column represents.

Instead of training the standard models on the rating, train them on the new column. Show metrics for the three methods with default parameters on the new column. Compare these results with the results on the “rating” data. Briefly explain the difference and give what you believe is the reason for it.

8. [5 marks] For this section, you are required to describe your chosen “best” method for rating prediction. You are free to choose from existing methods and tune parameters, or you can introduce a new method that you found to be useful in this context.

You need to give new experimental results for your method, trained on the training set of 4000 reviews and tested on the test set of the last 1000 reviews. Explain how this experimental evaluation justifies your choice of model, including settings and parameters, against a range of alternatives. Provide new experiments and justifications: do not just refer to previous answers.
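The parameter loops suggested in questions 5 and 6 could look roughly like the sketch below, assuming scikit-learn. The names `sweep`, `knn_factory` and `dt_factory` are hypothetical, and the feature matrices are expected to come from whatever vectorizer the solution already uses.

```python
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

K_VALUES = range(1, 11)      # k = 1 .. 10 for KNN
DEPTH_VALUES = range(5, 16)  # max_depth = 5 .. 15 for the Decision Tree


def knn_factory(k):
    return KNeighborsClassifier(n_neighbors=k)


def dt_factory(depth):
    # Other Decision Tree settings stay at the stated defaults.
    return DecisionTreeClassifier(max_depth=depth, criterion="entropy", random_state=0)


def sweep(make_model, values, X_train, y_train, X_test, y_test):
    """Fit one model per value and score each by macro F1 on the test part."""
    scores = {}
    for value in values:
        model = make_model(value).fit(X_train, y_train)
        predictions = model.predict(X_test)
        scores[value] = round(f1_score(y_test, predictions, average="macro"), 3)
    # Ties resolve to the earliest value tried.
    best = max(scores, key=scores.get)
    return best, scores
```

A call such as `sweep(knn_factory, K_VALUES, X_train, y_train, X_test, y_test)` returns the best k together with the full score table, which can be copied into the report alongside the default-setting result.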
