CMM534 Reassessment Coursework Instructions 2020-21

Preliminaries

The ‘Assessment Information’ topic on the CampusMoodle page includes the dropbox for this assessment and the files:

• Holiday.txt (graph create script used in element 1)
• Gas.tsv (data used in element 3)
• Oil.tsv (data used in element 3)

First download these files to your own computer.

Element 1 – Neo4j Coding Exercise

Preliminaries

In order to complete this exercise, you will need to test your code on Neo4j. We recommend that you use the Desktop Tool, but you may also work in the online sandbox. Run Holiday.txt on Neo4j to create a graph. You are going to apply Cypher commands to this graph. Your commands should be copied to a text file Element1.txt.

Scenario

The graph represents a database for planning holidays in India. The following domain model shows the node and relationship types.

The following table gives a brief description of each type of node and relationship, along with any applicable attributes.

| Node Label | Description | Attributes |
|---|---|---|
| Attraction | For example, a beach or bird sanctuary. | name |
| Category | Type of resort, e.g. family. | name; description |
| Facility | Facilities offered by resorts, such as a spa or childcare. | name |
| Person | A person. | name; email |
| Place | An area within a state. | name; details |
| Stars | The number of quality stars given to a resort. | stars |
| Resort | A hotel or other place to stay. The average daily rate in INR is avg_rate. | name; address; avg_rate |
| Season | A calendar month. | name |
| State | An Indian state – currently only Kerala is included. | name; details |

| Relationship | Description | Attributes |
|---|---|---|
| ATTRACTION_IN | Links an attraction to its place. | |
| BEST_SEASON | Links a place to the months it is good to visit. | |
| CATEGORY | Links a resort to the appropriate categories. | |
| FACILITY | Links a resort to the facilities it provides. | |
| PLACE | Links a resort to its place. | |
| STARS | Links a resort to its stars. | |
| STATE | Links a place to its state. | |
| VISITED | Links a person to a resort they have stayed at in the given month. | month |

Your Task

You are required to write a number of Cypher queries to meet the specifications below.

Simple Retrieval Queries

Write queries to return:

R1) All Facility nodes.
R2) The name and description of each Category.
R3) The name and average rate of those resorts with an average rate less than 5000 INR. Order the output by average rate, cheapest first.
R4) The names of all persons who have visited ‘The Zuri Kumarakom’, together with the month in which they visited.

[R1-R4 = 30 marks]

More Advanced Retrieval Queries

R5) For each area (i.e. Place) with more than two resorts, return the name of the area, the number of resorts and the average rate for the area. The average rate for an area is calculated as the average of the individual average rates of its resorts.
R6) For each attraction, return its name, the name and details of its area and a list of the months in which it is good to visit.
R7) For recommendation purposes, it is useful to consider the social network of persons linked by resorts. If two persons have visited the same resort, we shall call them ‘friends’. Find the shortest ‘friend of a friend’ path connecting ‘Geeta’ to ‘Sutanu’. Your query should return the emails of all persons and names of all resorts in the path.

[R5-R7 = 50 marks]

Updating Queries

These queries should add data or alter existing data as instructed.

U1) Add a person called ‘Rob Lothian’ with email: ‘r.m.lothian@rgu.ac.uk’.
U2) Add the information that the Place ‘Chennai’ is in the State of ‘Tamil Nadu’. Your query should work whether this place and state already exist or not and should not create any duplicate nodes or relationships. If ‘Chennai’ has to be created, then its details should be set to ‘A great city with a fine IIT.’

[U1 + U2 = 20 marks]

Total Marks for Element 1 = 100. The mark will be converted to a grade A-F.

Element 2 – Essay on NoSQL Databases

Write a short essay (maximum 500 words, excluding references) on the topic below.
Any references consulted should be listed at the bottom of the essay and cited within it where appropriate (RGU Harvard style).

An environmental protection agency is looking for advice regarding which databases to use. For one of the four main types of NoSQL database* (Tabular, Key-Value, Document, or Graph):

• Select two (possibly hybrid) databases of that type that the agency could use, giving examples of environmental protection data that each of them could contain.
• Compare and contrast these databases using examples from an environmental protection scenario to illustrate your explanations.

Here are some questions you might address:

• What are the main differences between the two databases?
• What advantages does each have over the other?
• Are the databases suited to different applications?

The list of questions above is for guidance; it is not intended to be exhaustive. You should submit an essay, not simply answer each question in turn.

Element 2 will return a grade in the range A-F.

Element 3 – Spark

Background

The New York City Housing Authority (NYCHA) has provided some data about its operations. NYCHA is responsible for a large number of buildings distributed across several sites (‘developments’) in New York City. The data concerns energy consumption in NYCHA developments over the period 2010-20. The file Gas.tsv contains information about heating gas consumption. The file Oil.tsv contains information about heating oil consumption.

Data Upload

Copy the files to your virtual machine. Create a directory in the hdfs with name ‘CWR21’. Place copies of the files in this directory. As a result of this operation, the paths to your files should be: /CWR21/Gas.tsv and /CWR21/Oil.tsv.

Start PySpark and create a notebook called Element3.ipynb. The notebook should contain the commands necessary to carry out the tasks below.

Initial Processing

Create RDDs that have as elements the lines of the files. For each file, output the number of elements and the first three elements.
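As a rough sketch of this step: in a PySpark notebook the lines of a file are usually loaded with `sc.textFile(path)`, counted with `count()` and sampled with `take(3)`. The helper below wraps those two calls. The name `preview` and the class `LocalLines` are illustrative only and are not part of the coursework; `LocalLines` is a hypothetical stand-in that lets the helper be tried without a cluster, and the sample lines are invented rather than taken from Gas.tsv or Oil.tsv.

```python
def preview(rdd, n=3):
    """Return (number of elements, first n elements) of an RDD-like object."""
    return rdd.count(), rdd.take(n)


class LocalLines:
    """Hypothetical stand-in for an RDD of lines, for local experiments only."""

    def __init__(self, lines):
        self._lines = list(lines)

    def count(self):
        return len(self._lines)

    def take(self, n):
        return self._lines[:n]


# In the notebook, assuming the paths given in the instructions:
#   gas_lines = sc.textFile("/CWR21/Gas.tsv")
#   print(preview(gas_lines))

# Invented sample lines (NOT the contents of the real files).
sample = LocalLines(["col_a\tcol_b", "1\tx", "2\ty", "3\tz"])
n_lines, first_three = preview(sample)
print(n_lines, first_three)
```

Because a real RDD also provides `count()` and `take(n)`, the same helper should work unchanged on the RDDs created from the two files.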
You should see that each line consists of a number of data values and that each file has a header line identifying the data fields.

From the file Gas.tsv, create an RDD with each element a list of the data values corresponding to the following fields: [TDS #, Funding Source, AMP number, Current Charges].

From the file Oil.tsv, create an RDD with elements corresponding to the following list of fields: [TDS #, Funding Source, AMP number, Current Charges].

These RDDs contain the data we need for further analysis. For each RDD, output its first three elements.

Data Description

The data fields selected in the initial processing stage are explained below.

Fields Common to Oil and Gas Data

• TDS #: the Tenant Data System number is a unique identifier for developments.
• Funding Source: indicates the source of funding for the development.
• AMP number: the Asset Management Project number is an asset tracking number that may cover more than one development.

Gas Data

The gas data has one line per bill for gas consumed. Each development has paid several gas bills in the period 2010-20. Consequently, each TDS number occurs multiple times in the data.

• Current Charges: the cost (in dollars) of the gas consumed in the billing period.

Oil Data

The oil data has one line per bill for oil consumed. Each development has paid several oil bills in the period 2010-20. Consequently, each TDS number occurs multiple times in the data.

• Current Charges: the cost (in dollars) of the oil consumed in the billing period.

Further Pre-processing

We are only interested in data for developments that are federally funded. These are indicated by the Funding Source value ‘FEDERAL’. From each of your initial RDDs, remove the data that does not correspond to federally funded projects. At this stage, you should also remove the header lines. Finally, convert the Current Charges field from a string to a floating point number. Your new RDDs should be called gasdata and oildata. The analysis tasks should start from these RDDs.
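The field selection and pre-processing described above can be sketched as a few small functions that are then handed to `map` and `filter`. This is only one possible approach under stated assumptions: the function names are invented, the header names are assumed to match the field list exactly, Current Charges is assumed to be a plain numeric string (real data may need commas or currency symbols stripped first), and the sample lines, including the extra `Development` column and the `MIXED` value, are made up for the demonstration.

```python
FIELDS = ["TDS #", "Funding Source", "AMP number", "Current Charges"]


def column_indices(header_line, wanted=FIELDS):
    """Find the position of each wanted field in the tab-separated header."""
    names = header_line.split("\t")
    return [names.index(name) for name in wanted]


def select_fields(line, indices):
    """Split one tab-separated line and keep only the chosen columns."""
    values = line.split("\t")
    return [values[i] for i in indices]


def is_federal(row):
    """Keep rows whose Funding Source is 'FEDERAL'; the header row fails this test."""
    return row[1] == "FEDERAL"


def with_float_charge(row):
    """Convert Current Charges (the last column) from a string to a float."""
    return row[:3] + [float(row[3])]


# Invented lines with an invented layout (NOT the real Gas.tsv).
lines = [
    "Development\tTDS #\tFunding Source\tAMP number\tCurrent Charges",
    "A\t001\tFEDERAL\tNY001\t120.50",
    "B\t002\tMIXED\tNY002\t80.00",
    "A\t001\tFEDERAL\tNY001\t99.25",
]
idx = column_indices(lines[0])
rows = [select_fields(line, idx) for line in lines]
data = [with_float_charge(row) for row in rows if is_federal(row)]
print(data)
```

With an RDD of lines, the same functions could be chained as `lines_rdd.map(lambda l: select_fields(l, idx)).filter(is_federal).map(with_float_charge)`, taking `idx` from `column_indices(lines_rdd.first())`. Filtering on the Funding Source value removes the header as a side effect, since the header's second selected value is the text ‘Funding Source’ rather than ‘FEDERAL’.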
Output a count of each RDD and its first three elements.

Analysis Tasks

(A) Use gasdata to create a list of AMP numbers and the developments they cover.

First create a pair RDD with value a TDS number and key the corresponding AMP number. Each pair should occur only once in this RDD. Output a count of the RDD and its first six elements.

Now create a pair RDD with key an AMP number and value the list of TDS numbers corresponding to that AMP number. All AMP numbers should occur exactly once in the RDD. The RDD should be sorted in descending order of the number of developments listed. Output a count of the RDD and its first six elements.

(B) This task uses both oildata and gasdata.

Create an RDD that gives the total expenditure on oil and gas by each development over the entire period covered by the data. Each element of the RDD should be a list [TDS number, Total Cost of Gas, Total Cost of Oil]. If a development uses only one energy source, the data value for the other source should be None (this is Spark’s NULL value). Each TDS number should occur exactly once in the RDD. Output a count of the RDD and its first five elements.

[25 marks for data upload and initial processing.]
[15 marks for further pre-processing.]
[30 marks each for analysis tasks A and B.]

Full marks will only be awarded for code which is neatly laid out and adequately commented.

Total Marks for Element 3 = 100. The mark will be converted to a grade A-F.

Element 4 – Data Visualisation

This question concerns the data visualisation shown in Appendix A. Critically appraise the visualisation using the approach described in the lectures for this module. Your discussion should at least include:

• Appraisal against general guidelines and appropriate metrics.
• Discussion of the marks and channels used in the visualisation.
• Suggested improvements to the given visualisation.
• Alternative presentations of the data.
For the purposes of this exercise, you should assume that the explanatory text at the bottom is a caption and ignore the logos below it. Everything above the caption should be treated as part of the visualisation.

Your discussion should be a maximum of 500 words in length. Submit as a Word document or PDF file.

Element 4 will return a grade in the range A-F.

Appendix A: Data Visualisation for Element 4

(sourced at: https://www.statista.com/chart/24965/share-of-smokers-and-world-populationby-country/)
