Assessment 2 - Individual report of the data analysis

Your task: You are required to access a large data set and apply the CRISP-DM methodology to meaningfully clean, transform, analyse and evaluate it. As part of this process, you are required to subsequently apply two or more machine learning techniques of your choice to perform classification, association, numerical prediction and/or clustering tasks (or combinations thereof). You will present the outcome of the above tasks in the form of a technical report containing the five sections listed in Table 1 below.

Table 1: Report sections, weightings and recommended pages per section.

Section                                                  Weighting   Recommended pages
1. Introduction and Business Context                     0.1         1
2. Data Selection and Pre-Processing                     0.3         3
3. Machine Learning Method(s) and their Implementation   0.3         3
4. Evaluation of Results                                 0.2         2
5. Discussion                                            0.1         1

As shown in Table 1, a page limit of 10 pages for your report is recommended. The report, in total, however, must NOT exceed 13 pages (excluding title page, contents page, references, bibliography and appendices), with a minimum font size of 10 point, single spaced, with a 1-inch margin on all sides. A penalty of a single grade will be incurred if you exceed the 13-page limit. Further information (supporting experimental results) can be added as appendices. You are free to select the style of the report (i.e., section headings, format, etc.), although it must address the content listed in Table 1; it is therefore recommended that you use the section headings provided above.

You are expected to submit the following electronic files to Canvas by the submission deadline:
1. Training, validation and test sets (before and after pre-processing). Note that if cross-validation is used, only the training and test sets are required, AND;
2. Report (MUST be in MS Word or PDF format).

Your report will be assessed according to the assessment criteria specified in the module handbook. The remainder of this section provides you with detailed requirements for each area of content - you should READ IT VERY CAREFULLY.

Section 1: Introduction and Business Context
• A brief narrative as to how your organisation or company currently performs data analytics and how the CRISP-DM methodology may help your organisation to better meet its strategic priorities with respect to data analytics and business intelligence.
• A brief overview of the data analytics task you are going to perform.
• A clear justification as to why the task you are attempting is of value to your business or, more broadly, industry, government, university research and/or the community. You should support your justification with references to the appropriate industry or academic literature.
• State the insight you intend to gain.

Section 2: Data Selection and Pre-Processing
• Select a data set consisting of at least 2,000 observations/records and preferably more than 10,000. You are strongly encouraged to identify an anonymised data set relevant to either your role at work or, more broadly, the strategic objectives of the business. However, if this is not possible, then you are advised to select a data set from one of the following sources:
  o Kaggle - Datasets
  o WEKA Wiki - Datasets
  o The MNIST Database
  o Springboard - Find Free Public Data Sets for Your Data Science Project
• Briefly describe your data set and reference its origin.
• If you have 15 or fewer attributes, table your attributes with attribute name, description and data type, and then show the minimum/average/maximum and standard deviation values for the training set and test set. For nominal variables, show the most and least frequently occurring nominal value(s). If you have more than 15 attributes, group attributes into themes (e.g., customer, orders, employees) and describe the type of information and data types in each theme, including the number of each variable type (e.g., nominal, interval, ratio, etc.). You may want to highlight significant variables identified by an attribute selection algorithm.
• Briefly table the following characteristics of the entire data set: number of instances, missing values, outliers/erroneous values.
• Explain how you have sampled your data to create the 'in sample' and 'out of sample' data sets. If you have used instance weightings to balance your data set(s), explain how the weightings were determined.
• Provide a statistical summary in tabular form for the resulting 'in sample' (training/validation set) and 'out of sample' (test set). Also, state whether or not there was any overlap in training and test set instances and, if so, justify why your test set is not compromised.
• What pre-processing and transformation was performed on the variables and why? (e.g., standardising numerical variables and/or using scaling; taking logs to reduce skewness, or log differences to reduce non-stationarity; converting numerical variables to discrete ones; converting numerical or symbolic patterns into bit patterns; removing patterns with missing or outlier values; adding noise or jitter to patterns to expand the data set; adding instance weightings or replicating certain pattern classes to improve class distributions; transforming time-series data into static training/test patterns).
• How did you ensure that your pre-processing did not compromise your test set (e.g., use of standardisation)?
• For those seeking a higher Distinction, you must clearly show how you have addressed the 'curse of dimensionality' issue, i.e., if you reduced the number of dimensions (e.g., from 30 attributes to 10 attributes), how did you do this? An autoencoder? PCA? A filter using an InfoGain measurement? A clusterer? How do these methods work and what are their advantages/disadvantages? Also, if you increased the number of training instances, how did you do this? (A minimal Weka sketch illustrating the splitting, standardisation and attribute-selection steps follows this list.)
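As an illustration of the preparation steps above, the sketch below shows one possible way of carrying them out with Weka's Java API (the same operations are available through the Explorer GUI). It is only a sketch under assumptions: the file name dataset.arff is hypothetical, the 70/30 split, the Standardize filter and the InfoGain/Ranker selection of 10 attributes are example choices you would replace with, and justify as, your own.

```java
import java.util.Random;

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.Standardize;

public class PrepareData {

    public static void main(String[] args) throws Exception {
        // Hypothetical file name -- substitute your own data set (ARFF or CSV).
        Instances data = new DataSource("dataset.arff").getDataSet();

        // Assumption: the class attribute is the last column and is nominal
        // (a classification task); InfoGain requires a nominal class.
        data.setClassIndex(data.numAttributes() - 1);

        // Shuffle with a fixed seed, then hold out 30% as the 'out of sample' test set.
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.7);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        // Standardise numeric attributes. The filter's statistics (means and
        // standard deviations) are computed from the training batch only and
        // re-used on the test batch, so the test set is not compromised.
        Standardize std = new Standardize();
        std.setInputFormat(train);
        train = Filter.useFilter(train, std);
        test  = Filter.useFilter(test, std);

        // Address the 'curse of dimensionality': keep the 10 attributes with the
        // highest Information Gain, ranked on the training set only.
        AttributeSelection select = new AttributeSelection();
        select.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(10);
        select.setSearch(ranker);
        select.setInputFormat(train);
        train = Filter.useFilter(train, select);
        test  = Filter.useFilter(test, select);

        System.out.println("Training instances: " + train.numInstances()
                + ", attributes: " + train.numAttributes());
        System.out.println("Test instances: " + test.numInstances()
                + ", attributes: " + test.numAttributes());
    }
}
```

The filtered training and test sets can then be written back to ARFF (for example with weka.core.converters.ArffSaver) so that the before/after versions can be submitted alongside the report, as the brief requires.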
Section 3: Machine Learning Method(s) and their Implementation
• Clearly state the machine learning methods you will be using and the function(s) you will be expecting them to perform (e.g., classification, association, regression, clustering, or combinations thereof for self-supervised learning). You must describe the expected 'input to' and 'output from' each model.
• Explain and justify the machine learning method(s) chosen for the task. You must also use a simple benchmark model with which to compare your chosen machine learning model(s) (e.g., benchmark a neural network trained with back-propagation against a simple OneR or Naive Bayes approach). (A sketch of such a benchmark comparison follows this list.)
• Briefly highlight the strengths and weaknesses of the chosen learning method(s).
• Describe your 'model fitting' and 'model selection' process (e.g., leave-one-out validation, cross-validation, bagging and boosting, etc.). You must state and justify the hyper-parameters used for model fitting and how 'overtraining' will be minimised.
• Describe the tool you used to implement the machine learning method(s) (e.g., Weka; NOTE: only Weka should be used for this report).
• For those seeking grades within the Distinction band, you must either:
  o Use advanced features of the chosen analytics tool, including (though not limited to) clear evidence of meaningful programming/scripting activity to use machine learning and/or pre-processing tools in a bespoke way (e.g., install and use advanced Weka packages via the Package Manager - examples might be simple recurrent networks, convolutional neural networks, self-organising maps, or time series processing with ARIMA models), OR
  o Provide an in-depth mathematical treatment of the chosen machine learning method(s) with clear explanations as to how you will optimise them using the built-in features of the data analytics tool.
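The sketch below shows one way the benchmark comparison and model selection described above might be run against the Weka Java API. The choices are assumptions, not requirements of the brief: a classification task, J48 as the chosen model, OneR and Naive Bayes as the simple benchmarks, and 10-fold cross-validation on the training set with a fixed seed.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.rules.OneR;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class BenchmarkComparison {

    /** Cross-validates each model on the 'in sample' data and prints a summary. */
    public static void compare(Instances train) throws Exception {
        // J48 stands in for 'your chosen model'; OneR and Naive Bayes act as
        // the simple benchmarks the brief asks for.
        J48 tree = new J48();
        tree.setConfidenceFactor(0.25f);   // example hyper-parameter: pruning confidence
        tree.setMinNumObj(5);              // example hyper-parameter: min instances per leaf

        Classifier[] models = { new OneR(), new NaiveBayes(), tree };
        String[] names = { "OneR (benchmark)", "Naive Bayes (benchmark)", "J48 (chosen model)" };

        for (int i = 0; i < models.length; i++) {
            Evaluation eval = new Evaluation(train);
            // 10-fold cross-validation on the training data only; the held-out
            // test set is reserved for the final evaluation in Section 4.
            eval.crossValidateModel(models[i], train, 10, new Random(1));

            System.out.println("=== " + names[i] + " ===");
            System.out.printf("Percent correct: %.2f%%%n", eval.pctCorrect());
            System.out.printf("Weighted precision: %.3f  recall: %.3f%n",
                    eval.weightedPrecision(), eval.weightedRecall());
        }
    }
}
```

Hyper-parameter settings such as the pruning confidence above would be varied across trial runs, with the cross-validated results recorded for the Section 4 tables; the Weka Experimenter automates this kind of repeated comparison across models and settings.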
Section 4: Evaluation of Results
• Table the resulting 'in sample' (training) and 'out of sample' (test) performance of your model for the different model configurations and trial runs (e.g., a neural net with different numbers of hidden nodes, different random starting weights and/or different learning rates). You should use at least one of the following performance metrics, as appropriate:
  o Percent correct/incorrect
  o Confusion matrix
  o Recall and precision
  o Evaluating numeric prediction (e.g., mean squared error (MSE), root mean squared error (RMSE), correlation coefficients)
  o ROC curve
  (A minimal Weka sketch for producing these metrics on the held-out test set appears at the end of this brief.)
• Critically review the performance of the different models. Which type of pre-processing appeared to be most advantageous and why? For each model, which hyper-parameter settings (e.g., learning rate, tree pruning, momentum term) were most effective?
• Critically compare models - was there a model or model class whose performance on the test set was statistically significantly better than that of the other models/model classes (with a p-value < 0.05), perhaps using the Experimenter in Weka?

Section 5: Discussion
• Briefly summarise your task and your findings (i.e., whether the model learnt the problem).
• How do your findings relate to similar tasks found in the relevant industry or academic literature?
• Did you gain the insight you intended to? If not, what else could you do to enhance the usefulness of your analytics?
• How did you decide on the most appropriate machine learning method, and what do you understand about appropriateness?
• Finally, briefly state how you are going to use the knowledge and skills you have developed in the module to further your professional ambitions and/or the strategic objectives of your organisation.

Assessment Criteria
The overall rubric used for marking your work will be the Keele University - Generic Assessment Criteria Level 7 (PGT), which you can find here.

References
The assignment should make extensive use of research and reading to provide evidence to support arguments and conclusions. Any reading or use of other resources should be appropriately referenced using the Harvard Style of referencing.
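For the Section 4 metrics listed above, the sketch below shows one way to evaluate the final chosen model on the held-out test set with Weka's Evaluation class. It assumes a classification task, the pre-processed train/test sets produced earlier, and J48 as the chosen model; a statistical comparison against the benchmarks (corrected paired t-test at p < 0.05) is most easily run in the Weka Experimenter rather than in code.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class FinalEvaluation {

    /** Fits the chosen model on the training set and reports test-set metrics. */
    public static void evaluateOnTest(Instances train, Instances test) throws Exception {
        J48 model = new J48();
        model.buildClassifier(train);            // fit on the 'in sample' data only

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(model, test);         // score on the 'out of sample' data

        // Headline metrics for the Section 4 tables.
        System.out.printf("Percent correct: %.2f%%  incorrect: %.2f%%%n",
                eval.pctCorrect(), eval.pctIncorrect());
        System.out.printf("Root mean squared error: %.4f%n", eval.rootMeanSquaredError());
        System.out.printf("Weighted precision: %.3f  recall: %.3f%n",
                eval.weightedPrecision(), eval.weightedRecall());
        System.out.printf("ROC area (first class): %.3f%n", eval.areaUnderROC(0));

        // Full confusion matrix and per-class detail.
        System.out.println(eval.toMatrixString("Confusion matrix:"));
        System.out.println(eval.toClassDetailsString());
    }
}
```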
