The overall objectives of this project are to build a data warehouse from real-world datasets, and to carry out a basic data mining activity, in this case, association rule mining.
Datasets and Problem Domain
For the datasets and problem domain, you have one of the following two options:
• Prescribed datasets: the source data used to design and populate the data warehouse in this project is based on the Crime in Atlanta (2009-2017) dataset.
o A local copy of an augmented dataset is available here:
crime.csv (only the first 25470 records have been enriched). Below is a list of data segments that have been enriched. Note you do not need to analyse the entire dataset. For example, if your device has limited capacity, using 10,000 records to demonstrate the DW process is acceptable. A comprehensive process making use of the entire dataset will attract bonus marks.
▪ crime_25471_50000.csv
▪ crime_50001_75000.csv
▪ crime_75001_100000.csv
▪ crime_100001_125000.csv
▪ crime_125001_150000.csv
▪ crime_150001_175000.csv
▪ crime_175001_200000.csv
▪ crime_200001_225000.csv
▪ more data segments to come.
o Note this dataset has been augmented and enriched by using the geopy python package.
o Data Dictionary: there isn't a data dictionary accompanying the original dataset. Most of the field names are self-explanatory. Below are explanations of the two fields that may cause confusion:
▪ npu: NPU stands for Neighborhood Planning Unit. In the Atlanta area, neighborhoods and NPUs represent the unit of local community organization as well as local identity and pride. Reference: https://data.bloomberglp.com/company/sites/2/2016/09/paper_27.pdf
▪ beat: in police terminology, a beat is the territory that a police officer is assigned to patrol (from Wikipedia).
Data Warehousing Design and Implementation
Following the four steps of dimensional modelling below (i.e. Kimball's four steps), design a data warehouse for the dataset(s).
1. Identify the process being modelled.
2. Determine the grain at which facts can be stored.
3. Choose the dimensions.
4. Identify the numeric measures for the facts.
To realise the four steps, we can start by drawing and refining a StarNet with the above four steps in mind.
1. Think about a few business questions that your data warehouse could help answer.
2. Draw a StarNet with the aim to identify the dimensions and concept hierarchies for each dimension. This should be based on the lowest level information you have access to.
3. Use the StarNet footprints to illustrate how the business queries can be answered with your design. Refine the StarNet if the desired queries cannot be answered, for example, by adding more dimensions or concept hierarchies.
4. Once the StarNet diagram is completed, draw it using software such as Microsoft Visio (free to download under the Azure Education Link: https://aka.ms/devtoolsforteaching) or a drawing program of your own choice. Paste it onto a Power BI Dashboard.
5. Implement a star or snowflake schema using SQL Server Management Studio (SSMS); an illustrative T-SQL sketch follows this list. Paste the database ER diagram generated by SSMS onto your Power BI Dashboard.
6. Load the data from the CSV files to populate the tables. You may need to create separate data files for your dimension tables.
7. Use SQL Server Data Tools to build a multi-dimensional analysis service solution, with a cube designed to answer your business queries. Make sure the concept hierarchies match your StarNet design. Paste the cube diagram to your Power BI Dashboard.
8. Use Power BI to visualise the data returned from your business queries.
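To make steps 5 and 6 concrete, here is a minimal T-SQL sketch of one possible star schema; all table and column names are illustrative assumptions, not a prescribed design:

```sql
-- Illustrative star schema only: all names are assumptions, not a prescribed design.
CREATE TABLE DimDate (
    DateKey    INT PRIMARY KEY,   -- surrogate key, e.g. 20101031
    FullDate   DATE,
    [Month]    TINYINT,
    [Quarter]  TINYINT,
    [Year]     SMALLINT
);

CREATE TABLE DimLocation (
    LocationKey  INT IDENTITY PRIMARY KEY,
    Neighborhood NVARCHAR(100),
    NPU          NVARCHAR(5),
    Beat         NVARCHAR(10),
    City         NVARCHAR(50),
    County       NVARCHAR(50)
);

CREATE TABLE DimCrimeType (
    CrimeTypeKey INT IDENTITY PRIMARY KEY,
    CrimeType    NVARCHAR(100)
);

-- Grain: one row per recorded crime incident.
CREATE TABLE FactCrime (
    DateKey      INT NOT NULL REFERENCES DimDate(DateKey),
    LocationKey  INT NOT NULL REFERENCES DimLocation(LocationKey),
    CrimeTypeKey INT NOT NULL REFERENCES DimCrimeType(CrimeTypeKey),
    CrimeCount   INT NOT NULL DEFAULT 1  -- simple additive measure
);
```

The loading itself can be done via SSMS's flat-file import wizard or BULK INSERT; a loading sketch appears under FAQ question 6 below.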
Association Rule Mining
Make sure you complete the relevant lab before attempting this task. We are using this example to show how association rule mining can be applied to this crime dataset; there are certainly other, more suitable scenarios for association rule mining, particularly where the market basket analogy fits well.
If we treat each neighbourhood's crime records for each day as a transaction, then each transaction contains multiple crime types. For example, the neighbourhood Downtown on 31 October 2010 has multiple crime types, e.g. "LARCENY-NON VEHICLE", "RAPE" and "LARCENY-FROM VEHICLE".
Process the dataset into a case table and a nested table for association rule mining (an illustrative view sketch follows the list below). In your submission PDF, provide screenshots of the rules mined in both the Rules view and the Dependency Network view. Do you get any insights from the mining results? If no meaningful rules are found, what could be the reason?
Meanwhile, in the submitted PDF, you need to:
o Explain the top k rules (according to importance or probability) that have the "crime" type (or other suitable columns) on the right-hand side, where k >= 1.
o Explain the meaning of the k rules in plain English.
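One possible way to realise the case and nested tables is as SSMS views, assuming the illustrative star schema sketched earlier (view and column names are assumptions, not a prescribed design):

```sql
-- Case table: one row per (neighbourhood, day) transaction.
CREATE VIEW vCrimeCase AS
SELECT DISTINCT
    CONCAT(l.Neighborhood, '_', d.FullDate) AS CaseId
FROM FactCrime f
JOIN DimDate     d ON f.DateKey     = d.DateKey
JOIN DimLocation l ON f.LocationKey = l.LocationKey;
GO

-- Nested table: the set of crime types observed in each transaction.
CREATE VIEW vCrimeNested AS
SELECT DISTINCT
    CONCAT(l.Neighborhood, '_', d.FullDate) AS CaseId,
    c.CrimeType
FROM FactCrime f
JOIN DimDate      d ON f.DateKey      = d.DateKey
JOIN DimLocation  l ON f.LocationKey  = l.LocationKey
JOIN DimCrimeType c ON f.CrimeTypeKey = c.CrimeTypeKey;
GO
```

In the SSDT mining structure you would then mark CaseId as the case-level key and join the nested view on CaseId.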
NOTE: you are not allowed to use Python, R, or Weka for association rule mining in this project; Microsoft Visual Studio is the only option (for association rule mining).
NOTE: Visual Studio seems to have problems displaying the dependency network. In the past, students have seen only two nodes displayed when they have multiple rules returned. This is acceptable: as long as the rules are discovered, the process is valid, and no marks will be deducted if the dependency network does not reflect the rule set.
Can I use ChatGPT?
You are more than welcome to use ChatGPT, but please include the ChatGPT suggestions in your report and justify why you did or did not adopt them. It is not compulsory to use ChatGPT. However, as a data science professional, you are strongly encouraged to learn to critique, and be conversant with, modern tools that can potentially enhance your productivity. Here is my attempt at getting ChatGPT to design a concept hierarchy. Not really a hierarchy, but a good starting point with reasonable explanations.
What to submit
• A PDF report generated from Power BI, structured according to the above 8 steps, to explain:
o the fact table and dimension tables design;
o the concept hierarchies for each dimension and how they can support the granularity of your data warehouse queries;
o the Extraction Transformation and Loading (ETL) process to show how the source file can be loaded into the data warehouse;
o how multi-dimensional cubes are used to facilitate the roll-up and drill-down analysis;
o how Power BI can assist with query result visualisation.
• The SQL Script file and the CSV files for building and populating your database.
• The solution project file (and its folder) of the SSDT analysis service multi-dimensional project.
• The scripts of the data cleaning/preprocessing/ETL process for data transformation.
• The Power BI file (.pbix).
• Include in the submission a PDF describing the association rule mining process and results.
All files need to be zipped up in a single zip file and submitted to cssubmit.
Marking scheme
[50 marks]
[5 marks] Schema of each Dimension and Concept Hierarchies for each Dimension
[5 marks] Corresponding StarNet to illustrate query capabilities of the DW
[5 marks] At least 5 types of business queries that the StarNet can answer
[5 marks] Star/SnowFlake Schema (Fact Table) for DW design
[5 marks] Description of the ETL process for data transformation with code or screenshots
[5 marks] SQL Script file for building the database and loading the datasets
[5 marks] Power BI visualisation corresponding to the 5 business queries
[5 marks] Coherence between the design and implementation, quality and complexity of the solution, reproducibility of the solution
[5 marks] Association rule mining meaningful set up
[5 marks] Interpretation of top rules
*********************
Data warehousing exercises are often open-ended. In other words, there is almost always a better solution. You can interpret the scale of marks as:
5 - Exemplary (comprehensive solution demonstrating professional application of the knowledge taught in the class with initiative beyond just meeting the project requirement)
4 - Proficient (correct application of the taught concepts with clear understandings demonstrated)
3 - Satisfactory (managed to meet most of the project requirement)
2 - Developing (some skills are demonstrated but need revision)
1 - Not yet Satisfactory (minimal effort)
0 - Not attempted.
*********************
Bonus Marks (5 marks)
Up to 5 bonus marks can be awarded for (but not limited to):
• Outstanding data manipulation
• Intuitive and innovative visualisation
• Exemplary data integration with other data sources
1. Do you have any advice on how many dimensions we need, and which dimensions we should use for project 1?
[Wei] This is your design decision. As a starting point, you will at least need to have a Date dimension, a Location dimension, and a Crime Type dimension.
2. Is “neighborhood” = NPU in the dataset?
[Wei] No, check this link: https://citycouncil.atlantaga.gov/other/npu-by-neighborhood/neighborhood-planning-unit
3. Can I use Python, R, Excel, or SQL to clean the dataset?
[Wei] Yes
4. Do I need to use both “neighborhood” and “neighborhood_lookup”?
[Wei] Your design choice. Note the neighbourhood_lookup column may contain different values than the neighborhood column, as the lookup is based on the longitude and latitude information of the location.
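A quick way to gauge the disagreement between the two columns, assuming an illustrative staging table named CrimeStaging holding the raw columns:

```sql
-- How often does the geopy-derived lookup disagree with the original column?
SELECT COUNT(*) AS MismatchCount
FROM CrimeStaging
WHERE ISNULL(neighborhood, '') <> ISNULL(neighbourhood_lookup, '');
```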
5. What is ETL?
[Wei] Extract, Transform, Load: the typical actions that you need to perform to pre-process the raw data. In other words, the single CSV file needs to go through ETL to produce multiple CSV files representing the fact table and dimension tables before loading into the corresponding SQL database tables.
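As a rough sketch of the Transform/Load step, assuming an illustrative staging table CrimeStaging has already been bulk-loaded (see the next question) and the dimension/fact tables follow the earlier schema sketch:

```sql
-- Populate each dimension from the distinct values found in staging.
INSERT INTO DimCrimeType (CrimeType)
SELECT DISTINCT crime FROM CrimeStaging;

INSERT INTO DimLocation (Neighborhood, NPU, Beat, City, County)
SELECT DISTINCT neighborhood, npu, beat, city, county FROM CrimeStaging;

-- DimDate would be populated analogously from the distinct dates in staging.
-- Fact rows then look the surrogate keys back up (the join columns must
-- uniquely identify a dimension row; handle NULLs before this step).
INSERT INTO FactCrime (DateKey, LocationKey, CrimeTypeKey, CrimeCount)
SELECT CONVERT(INT, FORMAT(s.[date], 'yyyyMMdd')),
       l.LocationKey, c.CrimeTypeKey, 1
FROM CrimeStaging s
JOIN DimCrimeType c ON c.CrimeType = s.crime
JOIN DimLocation  l ON l.Neighborhood = s.neighborhood
                   AND l.NPU = s.npu AND l.Beat = s.beat;
```

The same splitting can equally be done outside SQL (question 3 permits Python, R, or Excel), exporting each SELECT DISTINCT result to its own CSV file for a dimension table.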
6. What does it mean by the data segments being enriched, and is crime_25471_50000.csv part of crime.csv?
[Wei] The original data only has {crime, number, date, location, beat, neighborhood, npu, lat, long}; the enriched dataset has added information looked up using the lat and long, with extra columns {type, road, neighbourhood_lookup, city, county, state, postcode, country}.
The crime.csv has only 25470 rows with enriched data; the rest of the rows have empty values for the extra columns. You need to remove those rows with missing data and append the remaining CSV files (e.g. crime_25471_50000.csv) to obtain more complete records.
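A minimal T-SQL sketch of this cleaning step, assuming SQL Server 2017+ (for BULK INSERT's CSV format) and illustrative file paths:

```sql
-- Stage the base file, drop un-enriched rows, then append each enriched segment.
-- Paths are illustrative; FORMAT = 'CSV' needs SQL Server 2017 or later.
BULK INSERT CrimeStaging FROM 'C:\data\crime.csv'
WITH (FORMAT = 'CSV', FIRSTROW = 2);

DELETE FROM CrimeStaging
WHERE postcode IS NULL OR postcode = '';  -- empty enriched columns => not enriched

BULK INSERT CrimeStaging FROM 'C:\data\crime_25471_50000.csv'
WITH (FORMAT = 'CSV', FIRSTROW = 2);
-- ...repeat for the remaining segment files.
```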
7. For our project are we allowed to use neighborhood column as a value to search up other statistics for example poverty level in the neighborhood?
[Jichunyang] I encourage you to do more research for this project, as long as your results are correlated with your design and can improve it. Please remember to indicate the new data source in your report, and clearly explain how you performed ETL (Extract, Transform, Load) on the new data.
8. How do you recommend we choose numerical measures since the dataset contains few quantitative values?
[Jichunyang] If you are unable to find sufficient measures from the dataset, you may want to consider trying several types of fact tables, or even a factless fact table. Alternatively, consider incorporating numerical data from external sources.
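As a rough illustration of the factless option, the fact table carries only foreign keys and the "measure" is derived by counting rows at query time (names reuse the illustrative schema sketched earlier):

```sql
-- A factless fact table records only that an incident occurred.
CREATE TABLE FactCrimeEvent (
    DateKey      INT NOT NULL REFERENCES DimDate(DateKey),
    LocationKey  INT NOT NULL REFERENCES DimLocation(LocationKey),
    CrimeTypeKey INT NOT NULL REFERENCES DimCrimeType(CrimeTypeKey)
);

-- Example: crimes per neighbourhood per year via COUNT(*).
SELECT l.Neighborhood, d.[Year], COUNT(*) AS CrimeCount
FROM FactCrimeEvent f
JOIN DimDate     d ON f.DateKey     = d.DateKey
JOIN DimLocation l ON f.LocationKey = l.LocationKey
GROUP BY l.Neighborhood, d.[Year];
```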
9. When data is missing in a row, am I correct in assuming that we do not delete the row but rather try to fill in the missing information?
[Jichunyang] When making a design choice, you may choose to either delete a row or fill in missing information. However, it is important to explain the reasons why you chose to do so in your report. If you decide to fill in the missing information, you should also specify the source of the data and provide references or data sources to support your decision.
10. What is a view?
[Jichunyang] Please check the link: https://learn.microsoft.com/en-us/sql/relational-databases/views/views?view=sql-server-ver16
11. Creating new views in SSMS is necessary for association rules mining. How can this be done?
[Jichunyang] Please check this link: https://learn.microsoft.com/en-us/sql/relational-databases/views/create-views?view=sql-server-ver16
12. Must we consider association rule mining prior to making the fact table and dimension table and completing the Power BI visualisation, or is it completely separate?
[Jichunyang] You need to design a case table and a nested table for association rules mining, but your design may depend on your fact table and dimension tables. I encourage you to complete the week 6 lab first and then do some research on how to design a case table and a nested table. I have posted some useful links on the lab channel of our Teams group.
13. Can I use different sets of data for Project 1? For instance, could I use the first 1,000 to 2,000 rows of data for designing the data warehouse, and then use rows 2,001 to 5,000 for data mining?
[Jichunyang] I strongly advise that you don’t do that. Using ONLY one dataset for the project is a smart solution. Using different sets of data is unconventional. Firstly, the pre-processing effort might be reused for DM; Secondly, the insights from DW can either be complementary or reinforce the DM findings. It is beneficial to use the same dataset to follow the KDD process.
14. Are we allowed to use a coarser grain if we need to? For example, if we need to split one of the columns into smaller parts?
[Jichunyang] Yes, you can.
15. What is a nested table and a case table?
[Jichunyang] Please check this link: https://learn.microsoft.com/en-us/analysis-services/data-mining/nested-tables-analysis-services-data-mining?view=asallproducts-allversions
16. In the Microsoft Association Algorithm, support, probability, and importance are three important concepts. However, lift and confidence cannot be found in the Microsoft Association Algorithm. Therefore, how can I evaluate the rules generated by the algorithm?
[Jichunyang] Please check this link: https://learn.microsoft.com/en-us/analysis-services/data-mining/microsoft-association-algorithm-technical-reference?view=asallproducts-allversions
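For orientation, the standard textbook measures are defined as follows; as I understand the linked reference, Microsoft's "probability" corresponds to a rule's confidence, and "importance" is a log-scaled score serving a purpose similar to lift (check the reference for the exact formula):

support(A => B) = P(A and B)    (fraction of transactions containing both A and B)
confidence(A => B) = P(A and B) / P(A)
lift(A => B) = confidence(A => B) / P(B)

A rule with lift > 1 means the antecedent and consequent co-occur more often than chance; by analogy, larger importance scores indicate stronger associations.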
17. How does the Microsoft Association Algorithm work? What data is required for Microsoft Association Models?
[Jichunyang] Please check this link: https://learn.microsoft.com/en-us/analysis-services/data-mining/microsoft-association-algorithm?view=asallproducts-allversions
18. If there are two “attributes” (not sure of the word) that are of similar scope but are otherwise incomparable, e.g. police patrol beats and zip codes, how are we meant to represent them in a concept hierarchy or StarNet? Do we omit one from our Data Warehouse?
[Original Poster] Zip codes and patrol beats are incomparable: they have different boundaries but are of similar size. Both are location dimensions; however, there is no easily identifiable hierarchy between the two (neither contains the other). How would we deal with this?
[Pascal] That’s a good question; glad you're thinking about this. That’s basically why we include this information for Master's students. You need to think about how you should handle it: you can ignore one of them, use only one of them, or try to incorporate them together and use both. Put your justification for why you made the decision in the report.
19. Can I have more dots in the crime dimension?
[Jichunyang] Yes, you can. For instance, you may differentiate between crime types according to their severity, such as considering homicide to be more serious than auto theft. This would require adding an additional category (hierarchy) to your crime dimension.
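For example, a severity level could be added to the illustrative DimCrimeType table like this (the groupings below are assumptions for demonstration, not an official classification):

```sql
-- Illustrative severity groupings only; not an official classification.
ALTER TABLE DimCrimeType ADD Severity NVARCHAR(20);
GO

UPDATE DimCrimeType
SET Severity = CASE
    WHEN CrimeType IN ('HOMICIDE', 'RAPE') THEN 'Violent'
    WHEN CrimeType IN ('AUTO THEFT', 'LARCENY-FROM VEHICLE',
                       'LARCENY-NON VEHICLE') THEN 'Property'
    ELSE 'Other'
END;
```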
20. If we are using a snowflake schema, how do we represent nested dimensions on a StarNet diagram?
[Jichunyang] On a StarNet diagram, you would typically use a series of dots to represent the levels in the hierarchy. In the schema itself, you can hold the hierarchy either in a single dimension table or split it into separate sub-dimension tables; for a date dimension, the levels (date, month, quarter, and year) could live in one table or in linked sub-dimension tables.
Either way, the hierarchy of nested dimensions remains the same: you would not add any new levels, and each level is represented by a column (or set of columns) in the appropriate table.
So, in the case of a date dimension, the hierarchy would still consist of the same levels (date, month, quarter, and year), even if they are split into separate sub-dimension tables. Each sub-dimension table would have a relationship to the main date dimension table, linking the levels together in the hierarchy.
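For concreteness, a minimal T-SQL sketch of a snowflaked date hierarchy (table names such as DimDateSnow are illustrative and chosen to avoid clashing with the single-table DimDate sketched earlier):

```sql
-- Snowflaked date hierarchy: one table per level, each linking to its parent.
CREATE TABLE DimYear (
    YearKey    SMALLINT PRIMARY KEY              -- e.g. 2010
);
CREATE TABLE DimQuarter (
    QuarterKey INT PRIMARY KEY,                  -- e.g. 20104 for 2010 Q4
    YearKey    SMALLINT REFERENCES DimYear(YearKey)
);
CREATE TABLE DimMonth (
    MonthKey   INT PRIMARY KEY,                  -- e.g. 201010
    QuarterKey INT REFERENCES DimQuarter(QuarterKey)
);
CREATE TABLE DimDateSnow (
    DateKey    INT PRIMARY KEY,                  -- e.g. 20101031
    MonthKey   INT REFERENCES DimMonth(MonthKey)
);
```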
21. Can any of these columns (crime, number, date, location, beat, neighborhood, npu, lat, long, type, road, neighbourhood_lookup, city, county, state, postcode, country) be used directly as a measure? Or should we come up with our own measures, for example crime_count?
[Jichunyang] While using "count (=1)" as a measure is a common practice in fact table design, it may not always be the best approach. To design an effective fact table, it's important to research and consider various measures that align with the business requirements and data being captured.
22. Can I use MS Word to generate the final report as a pdf? I will make clear references to the power BI dashboard (labeled them) and attach sufficient screenshots to demonstrate the visualization report.
[Jichunyang] Yes, you can.