
Project 1 contributes 30% to the total assessment of this unit as an individual effort. The deadlines are recorded on cssubmit. It consists of:
•    5% environment setup check-off - demonstrate your Week 2 and 3 lab outputs to any of the lab demonstrators during a lab session by the end of Week 4.
o    Note: if you have work commitments and cannot attend a lab, you can submit a short video to cssubmit (or a link to the video) by the end of Week 4.
o    ALREADY DONE
•    25% for the data warehousing solution. Datasets and descriptions are given below.
Clarifications (when necessary) and FAQs will be available here as the semester progresses. 

The overall objectives of this project are to build a data warehouse from real-world datasets, and to carry out a basic data mining activity, in this case, association rule mining.
Datasets and Problem Domain
For the datasets and problem domain, you have one of the following two options:
•    Prescribed datasets: the source data used to design and populate the data warehouse in this project is based on the Crime in Atlanta (2009-2017) dataset.
o    A local copy of an augmented dataset is available here: crime.csv (only the first 25470 records have been enriched). Below is a list of data segments that have been enriched. Note that you do not need to analyse the entire dataset; for example, if your device has limited capacity, using 10,000 records to demonstrate the DW process is acceptable. A comprehensive process making use of the entire dataset will attract bonus marks.
-    crime_25471_50000.csv
-    crime_50001_75000.csv
-    crime_75001_100000.csv
-    crime_100001_125000.csv
-    crime_125001_150000.csv
-    crime_150001_175000.csv
-    crime_175001_200000.csv
-    crime_200001_225000.csv
-    more data segments to come.
o    Note that this dataset has been augmented and enriched by using the geopy Python package.
o    Data Dictionary: there isn't a data dictionary accompanying the original dataset. Most of the field names are self-explanatory. Below are explanations of the two fields that may cause confusion:
-    npu: NPU stands for Neighborhood Planning Unit. In the Atlanta area, neighborhoods and NPUs represent the unit of local community organization as well as local identity and pride. Reference: https://data.bloomberglp.com/company/sites/2/2016/09/paper_27.pdf
-    beat: in police terminology, a beat is the territory that a police officer is assigned to patrol (from Wikipedia).

Data Warehousing Design and Implementation 
Following the four steps of dimensional modelling below (i.e. Kimball's four steps), design a data warehouse for the dataset(s).
1.    Identify the process being modelled.
2.    Determine the grain at which facts can be stored.
3.    Choose the dimensions.
4.    Identify the numeric measures for the facts.
To realise the four steps, we can start by drawing and refining a StarNet with the above four steps in mind.
1.    Think about a few business questions that your data warehouse could help answer.
2.    Draw a StarNet with the aim to identify the dimensions and concept hierarchies for each dimension. This should be based on the lowest level information you have access to.
3.    Use the StarNet footprints to illustrate how the business queries can be answered with your design. Refine the StarNet if the desired queries cannot be answered, for example, by adding more dimensions or concept hierarchies.
4.    Once the StarNet diagram is completed, draw it using software such as Microsoft Visio (free to download under the Azure Education Link: https://aka.ms/devtoolsforteaching) or a drawing program of your own choice. Paste it onto a Power BI Dashboard.
5.    Implement a star or snowflake schema using SQL Server Management Studio (SSMS). Paste the database ER diagram generated by SSMS onto your Power BI Dashboard.
6.    Load the data from the CSV files to populate the tables. You may need to create separate data files for your dimension tables. (A hedged T-SQL sketch of steps 5 and 6 is given after this list.)
7.    Use SQL Server Data Tools (SSDT) to build a multi-dimensional analysis services solution, with a cube designed to answer your business queries. Make sure the concept hierarchies match your StarNet design. Paste the cube diagram onto your Power BI Dashboard.
8.    Use Power BI to visualise the data returned from your business queries.
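To give a feel for steps 5 and 6, here is a minimal T-SQL sketch of one possible star schema and load. All table, column, and file names (DimDate, DimLocation, DimCrimeType, FactCrime, StagingCrime, C:\data\crime_staging.csv) are illustrative assumptions, not a prescribed design; adapt them to your own StarNet and to the actual columns in crime.csv. The BULK INSERT options assume SQL Server 2017 or later.

```sql
-- Dimension tables: each column is a level of a possible concept hierarchy.
CREATE TABLE DimDate (
    DateKey   INT IDENTITY(1,1) PRIMARY KEY,
    FullDate  DATE NOT NULL,
    [Day]     TINYINT NOT NULL,
    [Month]   TINYINT NOT NULL,
    [Quarter] TINYINT NOT NULL,
    [Year]    SMALLINT NOT NULL
);

CREATE TABLE DimLocation (
    LocationKey  INT IDENTITY(1,1) PRIMARY KEY,
    Neighborhood NVARCHAR(100),
    NPU          NVARCHAR(10),
    Beat         NVARCHAR(10)
);

CREATE TABLE DimCrimeType (
    CrimeTypeKey INT IDENTITY(1,1) PRIMARY KEY,
    CrimeType    NVARCHAR(100)    -- e.g. 'LARCENY-NON VEHICLE'
);

-- Fact table: one row per crime incident, with a count measure for aggregation.
CREATE TABLE FactCrime (
    FactKey       INT IDENTITY(1,1) PRIMARY KEY,
    DateKey       INT NOT NULL REFERENCES DimDate(DateKey),
    LocationKey   INT NOT NULL REFERENCES DimLocation(LocationKey),
    CrimeTypeKey  INT NOT NULL REFERENCES DimCrimeType(CrimeTypeKey),
    IncidentCount INT NOT NULL DEFAULT 1
);

-- Step 6: load an (assumed) pre-processed CSV into a staging table,
-- then populate the dimensions from it.
CREATE TABLE StagingCrime (
    OccurDate    DATE,
    CrimeType    NVARCHAR(100),
    Neighborhood NVARCHAR(100),
    NPU          NVARCHAR(10),
    Beat         NVARCHAR(10)
);

BULK INSERT StagingCrime
FROM 'C:\data\crime_staging.csv'
WITH (FORMAT = 'CSV', FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

INSERT INTO DimCrimeType (CrimeType)
SELECT DISTINCT CrimeType FROM StagingCrime WHERE CrimeType IS NOT NULL;
```

The Day -> Month -> Quarter -> Year columns in DimDate (and Beat -> Neighborhood -> NPU in DimLocation) are what later become the cube's concept hierarchies, so roll-up queries over FactCrime should line up with the radial lines of your StarNet.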
Association Rule Mining
Make sure you complete the relevant lab before attempting this task. We are using this example to show how association rule mining can be applied to this Crime dataset; there are certainly other, more suitable scenarios for association rule mining, in particular where the market basket analogy fits well.
Treat each neighbourhood's crime records for each day as a transaction; each transaction then contains multiple crime types. For example, the neighbourhood Downtown on 31 October 2010 has multiple crime types, e.g. "LARCENY-NON VEHICLE", "RAPE" and "LARCENY-FROM VEHICLE".
Process the dataset into a case table and a nested table for association rule mining (a hedged SQL sketch is given below). In your submission PDF, provide screenshots of the rules mined in both the Rules view and the Dependency Network view. Do you gain any insights from the mining results? If no meaningful rules are found, what could be the reason?
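For reference, a rough T-SQL sketch of this case/nested table preparation is below. The source table and column names (StagingCrime, Neighborhood, OccurDate, CrimeType) are carried over from the illustrative staging table above and are assumptions only; map them onto whatever tables your own ETL produces.

```sql
-- Case table: one row per (neighbourhood, day) "transaction".
SELECT
    CONCAT(Neighborhood, '_', CONVERT(CHAR(10), OccurDate, 23)) AS TransactionId,
    Neighborhood,
    OccurDate
INTO CrimeCase
FROM StagingCrime
GROUP BY Neighborhood, OccurDate;

-- Nested table: the distinct crime types observed within each transaction.
SELECT DISTINCT
    CONCAT(Neighborhood, '_', CONVERT(CHAR(10), OccurDate, 23)) AS TransactionId,
    CrimeType
INTO CrimeNested
FROM StagingCrime;
```

In the SSAS mining structure, CrimeCase would then serve as the case table (keyed on TransactionId) and CrimeNested as the nested table (also keyed on TransactionId, with CrimeType as the item column used as both input and predictable).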
Meanwhile, in the submitted PDF, you need to:
o    Explain the top k rules (according to importance or probability) that have the "crime" type (or other suitable columns) on the right-hand side, where k >= 1.
o    Explain the meaning of the k rules in plain English.
NOTE: you are not allowed to use Python, R, or Weka to do association rule mining in this project. Microsoft Visual Studio is the only option (for association rule mining) in this project.
NOTE: Visual Studio seems to have problems displaying the dependency network. In the past, students have seen only two nodes displayed when they had multiple rules returned. This is acceptable: as long as the rules are discovered, the process is valid, and no marks will be deducted if the dependency network does not reflect the rule set.

Can I use ChatGPT?
You are more than welcome to use ChatGPT, but please include the ChatGPT suggestions in your report and justify why you adopted or did not adopt them. It is not compulsory to use ChatGPT. However, as a data science professional, you are strongly encouraged to learn to critique and become conversant with modern tools that can potentially enhance your productivity. Here is my attempt at getting ChatGPT to design a concept hierarchy. Not really a hierarchy, but a good starting point with reasonable explanations.
 
What to submit
•    A PDF report generated from Power BI, structured according to the above 8 steps, explaining:
o    the fact table and dimension table design;
o    the concept hierarchies for each dimension and how they support the granularity of your data warehouse queries;
o    the Extraction, Transformation and Loading (ETL) process, showing how the source file can be loaded into the data warehouse;
o    how multi-dimensional cubes are used to facilitate roll-up and drill-down analysis;
o    how Power BI can assist with query result visualisation.
•    The SQL Script file and the CSV files for building and populating your database.
•    The solution project file (and its folder) of the SSDT analysis service multi-dimensional project.
•    The scripts of the data cleaning/preprocessing/ETL process for data transformation.
•    The Power BI file (.pbix).
•    Include in the submission a PDF describing the association rule mining process and results.
All files need to be zipped up in a single zip file and submitted to cssubmit.

 
Marking scheme
[50 marks]
[5 marks] Schema of each Dimension and Concept Hierarchies for each Dimension
[5 marks] Corresponding StarNet to illustrate query capabilities of the DW
[5 marks] At least 5 types of business queries that the StarNet can answer
[5 marks] Star/Snowflake Schema (Fact Table) for DW design
[5 marks] Description of the ETL process for data transformation with code or screenshots
[5 marks] SQL Script file for building the database and loading the datasets
[5 marks] Power BI visualisation corresponding to the 5 business queries
[5 marks] Coherence between the design and implementation, quality and complexity of the solution, reproducibility of the solution
[5 marks] Association rule mining meaningful set up
[5 marks] Interpretation of top rules

*********************
Data warehousing exercises are often open-ended. In other words, there is almost always a better solution. You can interpret the scale of marks as:
5 - Exemplary (comprehensive solution demonstrating professional application of the knowledge taught in the class with initiative beyond just meeting the project requirement)
4 - Proficient (correct application of the taught concepts with clear understanding demonstrated)
3 - Satisfactory (managed to meet most of the project requirements)
2 - Developing (some skills are demonstrated but need revision)
1 - Not yet Satisfactory (minimal effort)
0 - Not attempted.
*********************
Bonus Marks (5 marks)

Up to 5 bonus marks can be awarded for (but not limited to):
•    Outstanding data manipulation
•    Intuitive and innovative visualisation
•    Exemplary data integration with other data sources
