This assignment is inspired by a real-life scenario. Imagine you have been hired as a Data
Scientist by a major e-commerce retailer. Your responsibility is to analyse customer reviews to determine the rating of new products.
For this assignment you will be given a dataset of 5000 customer reviews from Amazon.
Each review consists of a short text, and one of five ratings: a number from 1 to 5. You are required to evaluate various supervised machine learning methods using a variety of features and settings.
You are required to apply three machine learning methods discussed in the lecture: Naïve
Bayes (NB), Decision Tree (DT) and k-Nearest Neighbour (KNN). These methods will be trained on the first 4000 reviews and tested on the remaining 1000. An incomplete solution is provided to minimize the required coding effort so you can focus on the result analysis.
For all models, consider a review to be a collection of words, where a word is a string of at least two letters, numbers or the symbols / (slash), - (hyphen), $ or %, delimited by a space,
after replacing three dots (…) by a space. The (…) part has been done, but you need to handle
the other characters.
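One possible reading of this word definition can be sketched with Python's standard re module. The sketch assumes that characters outside the allowed set are deleted rather than replaced by a space, which the description above does not state. The name tokenize is illustrative and is not taken from Incomplete_solution.py.

```python
import re

# Anything that is not a letter, digit, / - $ % or whitespace.
# ASSUMPTION: such characters are deleted, not turned into spaces.
DISALLOWED = re.compile(r"[^A-Za-z0-9/\-$%\s]")


def tokenize(text):
    """Split a review into words of at least two allowed characters."""
    # Three dots (typed or as the single ellipsis character) become a space.
    text = text.replace("...", " ").replace("\u2026", " ")
    text = DISALLOWED.sub("", text)
    # split() with no argument splits on any run of whitespace.
    return [token for token in text.split() if len(token) >= 2]
```

A function like this could be handed to CountVectorizer through its tokenizer argument, in which case the vectorizer's own token_pattern is not used.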
The default parameters in your code should be as follows:
• All models: no stemming, no conversion of text to lowercase (lowercase=False in
CountVectorizer), stop words removed (stop_words=“english” in CountVectorizer), first 4000 reviews used for training, last 1000 used for testing
• Decision Tree: max_depth=None, criterion=“entropy”, random_state=0. Setting max_depth to None means the full Decision Tree is used.
• K-Nearest Neighbour: n_neighbors = 5
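The defaults listed above could be set up in scikit-learn roughly as follows. This is a sketch, not the contents of Incomplete_solution.py: the helper names are hypothetical, and the choice of MultinomialNB as the Naïve Bayes variant is an assumption, since the brief does not say which variant is intended.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def make_vectorizer():
    """Bag-of-words features: case preserved, English stop words removed."""
    return CountVectorizer(lowercase=False, stop_words="english")


def make_models():
    """The three classifiers with the default parameters from the brief."""
    return {
        # ASSUMPTION: the multinomial variant; the brief only says "Naive Bayes".
        "NB": MultinomialNB(),
        "DT": DecisionTreeClassifier(max_depth=None, criterion="entropy",
                                     random_state=0),
        "KNN": KNeighborsClassifier(n_neighbors=5),
    }


def split_reviews(texts, labels, n_train=4000):
    """First n_train items for training, the rest for testing (no shuffling)."""
    return texts[:n_train], labels[:n_train], texts[n_train:], labels[n_train:]
```

Because the split is positional rather than random, every run trains and tests on the same reviews, which keeps results comparable across the questions.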
Provided for this assignment:
1. Dataset.tsv: 5000 customer reviews in the tsv format (ins_number, text, rating)
2. Incomplete_solution.py: an incomplete solution that you can use as a basis for
your code.
Deliverables:
1. ZIP file containing:
a. Report in PDF format (.pdf) answering the questions specified below. The report should contain answers to each question clearly numbered with tables showing results, if required. Do not use screenshots of classification_report. Questions may require studying additional material in the textbooks listed in the Outline of this course and experimenting with the code.
b. All code developed for this assignment in one file (.py)
Questions to answer in the report:
In questions 2-6, "metrics" means the following set of metrics obtained on the test part of the
data (values to 3 decimal places):

                     Precision   Recall   F1
Micro: all ratings
Macro: all ratings
1
2
3
4
5

This table can be used to present the results.
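To make the rows of this table concrete, here is a minimal pure-Python sketch of how they can be computed from true and predicted ratings. The function names are illustrative. Macro F1 is taken here as the unweighted mean of the per-rating F1 values, which is the convention scikit-learn uses; scikit-learn's precision_recall_fscore_support is a library alternative.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def metrics_table(y_true, y_pred, labels=(1, 2, 3, 4, 5)):
    """Per-rating, micro and macro rows, each as (precision, recall, F1)."""
    pairs = list(zip(y_true, y_pred))
    counts = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        counts[label] = (tp, fp, fn)

    rows = {label: prf(*counts[label]) for label in labels}
    # Micro: pool the counts over all ratings, then compute once.
    pooled = [sum(counts[label][i] for label in labels) for i in range(3)]
    micro = prf(*pooled)
    # Macro: compute per rating, then take the unweighted mean.
    macro = tuple(sum(rows[label][i] for label in labels) / len(labels)
                  for i in range(3))
    rows["micro"] = micro
    rows["macro"] = macro
    return {name: tuple(round(v, 3) for v in vals) for name, vals in rows.items()}
```

In single-label classification every misclassified review counts as one false positive and one false negative, so micro precision, recall and F1 all equal accuracy. The macro row weights each rating equally regardless of how many reviews carry it.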
1. [1 mark] Show the distribution of all instances over the ratings in tabular or chart
form. How might this distribution affect training/testing, and what metrics would
be suitable to measure model performance? Briefly justify your answer.
2. [1 mark] Show metrics for all three methods with default parameters. Do not take
screenshots from the terminal.
3. [1 mark] Change the default parameters in the code by not removing stop words.
Show metrics for all three methods, compare to the default setting, and briefly
explain the differences.
4. [1 mark] Change the default parameters by enabling stemming and lowercasing in
the code. Show metrics for all three methods, compare to the default setting and
briefly explain the differences.
5. [2 marks] Vary k in KNN from 1 to 10 and choose the best k as measured by
macro F1. Compare this metric to the default setting, and briefly explain the
difference. You can use a Python loop to run KNN for different k.
6. [3 marks] Limiting the depth of a Decision Tree is one way to prune it and
balance bias against variance. Vary max_depth in DT from 5 to 15 and choose the
best model as measured by macro F1. Compare to the default setting, and briefly
explain the difference. You can use a Python loop to run DT for different max_depth.
7. [1 mark] One of the first steps of the provided incomplete_solution.py is to create
a new column. Please refer to the function “stentiment_converter” and explain what
the data in the new column represents.
Instead of training all the standard models on the rating, train them on the new
column. Show metrics of the three methods with default parameters on the new
column. Compare these results with the results on the “rating” data. Briefly explain
the difference and give what you believe is the reason for it.
8. [5 marks] For this section, you are required to describe your chosen “best” method
for rating prediction. You are free to choose from existing methods and tune
parameters or you can introduce a new method that you found to be useful in this
context.
You need to give new experimental results for your method, trained on the training
set of 4000 reviews and tested on the test set of the last 1000 reviews. Explain how
this experimental evaluation justifies your choice of model, including settings and
parameters, against a range of alternatives. Provide new experiments and
justifications: do not just refer to previous answers.
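The parameter sweeps in questions 5 and 6 follow the same pattern: fit one model per parameter value and keep the value with the highest macro F1 on the test part. A minimal sketch with scikit-learn is given below. The helper sweep and the toy reviews are hypothetical stand-ins and are not part of Dataset.tsv or Incomplete_solution.py.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def sweep(make_model, values, X_train, y_train, X_test, y_test):
    """Fit one model per parameter value and score it by macro F1 on the test part."""
    scores = {}
    for value in values:
        model = make_model(value).fit(X_train, y_train)
        predictions = model.predict(X_test)
        macro_f1 = f1_score(y_test, predictions, average="macro")
        scores[value] = round(float(macro_f1), 3)
    best = max(scores, key=scores.get)  # ties resolve to the first value tried
    return best, scores


# Toy stand-in data, NOT the assignment dataset.
train_texts = ["excellent sturdy"] * 6 + ["broken refund"] * 6
train_labels = [5] * 6 + [1] * 6
test_texts = ["excellent sturdy product", "broken refund item"]
test_labels = [5, 1]

vectorizer = CountVectorizer(lowercase=False, stop_words="english")
X_train = vectorizer.fit_transform(train_texts)  # vocabulary learned on training text only
X_test = vectorizer.transform(test_texts)

# Question 5: k from 1 to 10.
best_k, knn_scores = sweep(
    lambda k: KNeighborsClassifier(n_neighbors=k),
    range(1, 11), X_train, train_labels, X_test, test_labels)

# Question 6: max_depth from 5 to 15.
best_depth, dt_scores = sweep(
    lambda d: DecisionTreeClassifier(max_depth=d, criterion="entropy", random_state=0),
    range(5, 16), X_train, train_labels, X_test, test_labels)
```

Returning the full score dictionary alongside the best value makes it straightforward to tabulate every setting in the report rather than only the winner.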