Data Translation Challenge

Sociology

This sheet describes the data translation challenge assignment.

In this assignment, you will be given two datasets from the July monthly release from SafeGraph, a company that works with cell phone location data. If you’ve pressed “OK” to the agreements in certain apps, the app sends anonymized information about where you are at a given time to location-data services, which then sell it to companies like SafeGraph. You can see documentation for this data in the “Patterns” section of the SafeGraph documentation.

You will take this data, examine it, find some interesting stories to tell, and tell them using visualizations and (a small amount of) text. I expect your document will contain at least five well-designed and well-executed visualizations, telling a coherent and interesting story.

You will turn in an RMarkdown notebook containing your visualizations and a small amount of text linking them together or helping to tell the story. The result will be a report-style document that you might deliver in a business setting.

Just because you’re using RMarkdown doesn’t mean your visualizations necessarily have to be in R. You can use Excel or (if you know it already) Tableau, export your visualizations as images, and include them as images using the ![]() image syntax. To get the data into Excel, open it up in R, do whatever cleaning or preparation you want on it (don’t forget to save your code in an .R file!) and then export it using the write_excel_csv() function in the tidyverse. You can then open that CSV file in Excel. In Tableau you can just open the .Rdata file directly (or open the file in R, do whatever preprocessing you like, then use saveRDS() to save it back as a different .Rdata file to open in Tableau). You might find it worthwhile to peruse the RMarkdown Cheat Sheet in any case.

The Data

I have already done some preprocessing on the data, which results in two data sets that you can choose between.
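For the Excel and Tableau routes described above, the export step might look roughly like the following. This is only a sketch: the data frame `cleaned`, its values, and the two output file names are made up for illustration, and the files are written to a temporary folder rather than your working directory.

```r
library(readr)  # write_excel_csv() lives in readr, part of the tidyverse

# Toy stand-in for a data set you have already cleaned (values are made up).
cleaned <- data.frame(
  date = as.Date(c("2020-07-01", "2020-07-02")),
  naics_code = c(336411, 336411),
  visits_by_day = c(10, 8)
)

# Hypothetical output paths; tempdir() keeps the sketch self-contained.
csv_path <- file.path(tempdir(), "cleaned_for_excel.csv")
rds_path <- file.path(tempdir(), "cleaned_for_tableau.Rdata")

# CSV file that can then be opened in Excel
write_excel_csv(cleaned, csv_path)

# Single-object R file that readRDS() can load again
saveRDS(cleaned, rds_path)
```

In your own project you would replace the temporary paths with file names in your working directory.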
king_dailyvisits.Rdata

king_dailyvisits.Rdata can be loaded in using the readRDS() function. Put the file in your working directory (generally, the same folder you’ve saved your .RMD file in; you can then go Session → Set Working Directory to be sure), and load it in:

    dv <- readRDS('king_dailyvisits.Rdata')
    dv %>% arrange(brands, naics_title, date) %>% head()
    ##         date naics_code            naics_title brands visits_by_day
    ## 1 2020-07-01     336411 Aircraft Manufacturing                   10
    ## 2 2020-07-02     336411 Aircraft Manufacturing                    8
    ## 3 2020-07-03     336411 Aircraft Manufacturing                    3
    ## 4 2020-07-04     336411 Aircraft Manufacturing                    1
    ## 5 2020-07-05     336411 Aircraft Manufacturing                    0
    ## 6 2020-07-06     336411 Aircraft Manufacturing                    3

This data set contains five variables:

• date, the date (in July 2020)
• naics_code, a code from the NAICS industry coding system
• naics_title, an explanatory title for the associated naics_code
• brands, a specific brand. Note that this is only present for large brands (McDonald’s, Macy’s, etc.). Smaller brands have no value here and are all lumped together
• visits_by_day, the number of cell phones that SafeGraph recorded as visiting any location in brand brands and NAICS code naics_code on date date in King County

So what we have here is information on foot traffic to various brands and industries in King County over a period of one month. This can be used to get a sense of the relative popularity of different brands or industries (or brands within industries), to see how popularity changes over time, or to see how different traffic is on weekends versus weekdays. Thinking up what to look at is up to you!

king_originvisits.Rdata

king_originvisits_andmap.Rdata can be read in using the readRDS() function. Put the file in your working directory (generally, the same folder you’ve saved your .RMD file in; you can then go Session → Set Working Directory to be sure), and load it in:

    ov <- readRDS('king_originvisits_andmap.Rdata')
    ov %>% as.data.frame() %>% select(GEOID, naics_code, naics_title, brands, visits) %>% head()
    ##          GEOID naics_code                                         naics_title brands visits
    ## 1 530330057001     312120                                           Breweries             4
    ## 2 530330317033     238220 Plumbing, Heating, and Air-Conditioning Contractors             8
    ## 3 530330069002       6233                                                <NA>             4
    ## 4 530330319071       3231                                                <NA>             4
    ## 5 530330207002     312130                                            Wineries             5
    ## 6 530330050003     111421                         Nursery and Tree Production             4

The important variables we have here are:

• GEOID, the census block group identifier. This can be broken up into the state (53, Washington), the county (033, King County), the census tract (005700 in the first row), and the census block group (1 in the first row). These are common mapping distinctions created by the Census, generally used over zip codes because zip codes overlap counties and aren’t really intended for geographic use.
• naics_code, a code from the NAICS industry coding system
• naics_title, an explanatory title for the associated naics_code
• brands, a specific brand. Note that this is only present for large brands (McDonald’s, Macy’s, etc.). Smaller brands have no value here and are all lumped together
• visits, the number of visits SafeGraph recorded, by cell phones whose owners appear to live in GEOID, to any location in brand brands and NAICS code naics_code over the entire month of July (note that unlike king_dailyvisits, the whole month is lumped together). Also note that GEOID is the location where the phone doing the visiting appears to live (it spent most of its nights there over the past six weeks), not the location where the visit occurs.

You can use this data to see how different kinds of industries and/or brands are visited by people from different parts of the county.
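The GEOID breakdown described above can be sketched with substr(). The two GEOID values are copied from the preview; the sketch assumes GEOID is held as text, so if it turns out to be numeric in your copy of the data you would wrap it in as.character() first. The column names state, county, tract and block_group are my own labels, not columns in the data.

```r
# GEOID values copied from the preview above, stored as character strings.
geoid <- c("530330057001", "530330317033")

parts <- data.frame(
  GEOID = geoid,
  state = substr(geoid, 1, 2),         # 53 = Washington
  county = substr(geoid, 3, 5),        # 033 = King County
  tract = substr(geoid, 6, 11),        # six-digit census tract
  block_group = substr(geoid, 12, 12), # one-digit block group
  stringsAsFactors = FALSE
)

print(parts)
```

Grouping by the tract column instead of the full GEOID is one way to look at a coarser geography than block groups.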
king_originvisits_andmap contains additional columns related to GEOID:

    ov %>% as.data.frame() %>% select(GEOID, ALAND, AWATER, NEIGHBORHOOD_DISTRICT_NAME) %>% head()
    ##          GEOID   ALAND  AWATER NEIGHBORHOOD_DISTRICT_NAME
    ## 1 530330057001 3055837 2247577        Magnolia/Queen Anne
    ## 2 530330317033 1892598   18118                       <NA>
    ## 3 530330069002  255072       0        Magnolia/Queen Anne
    ## 4 530330319071  326893       0                       <NA>
    ## 5 530330207002  481273       0                       <NA>
    ## 6 530330050003  174683       0                 Lake Union

(There are some others I left out, STATEFP for example, since it’s the same for everyone.)

• ALAND: total land area in this GEOID
• AWATER: total water area in this GEOID
• NEIGHBORHOOD_DISTRICT_NAME: the Seattle “neighborhood” this GEOID is in (only applies to GEOIDs in Seattle; anywhere else in King County, this is missing)
• geometry (not pictured because it prints out super long): a description of how to draw this GEOID

With the exception of NEIGHBORHOOD_DISTRICT_NAME, which I merged in myself from some Seattle city government files, these variables came from running:

    library(tigris)
    mapdata [...]

    [...] %>%
      # Only label each neighborhood once
      group_by(NEIGHBORHOOD_DISTRICT_NAME) %>%
      mutate(neighborhood_label = case_when(
        row_number() == 1 ~ NEIGHBORHOOD_DISTRICT_NAME,
        TRUE ~ NA_character_
      )) %>%
      ungroup() %>%
      # THE IMPORTANT PART:
      ggplot() +
      geom_sf(aes(fill = NEIGHBORHOOD_DISTRICT_NAME)) +
      theme_void() +
      # guides(fill = FALSE) is deprecated in recent ggplot2; "none" hides the legend
      guides(fill = "none") +
      labs(title = 'Map of Seattle Neighborhoods') +
      # geom_sf_label_repel() comes from the ggsflabel package
      geom_sf_label_repel(aes(label = neighborhood_label))
    ## Warning in st_point_on_surface.sfc(data$geometry): st_point_on_surface may not
    ## give correct results for longitude/latitude data
    ## Warning: Removed 470 rows containing missing values (geom_label_repel).

theme_void() can be nice for this, unless you want a grid in the background or the latitude/longitude markers for some reason.
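The same geom_sf() idea can color each area by a number instead of a neighborhood name. The sketch below is self-contained and uses two made-up square "block groups" built by hand with the sf package, so nothing here comes from the real data; the helper square() and the object toy_map are hypothetical names. With the real ov data you would map fill to visits in the same way.

```r
library(sf)
library(ggplot2)

# Helper (hypothetical): a unit square whose lower-left corner is at (x0, 0).
square <- function(x0) {
  st_polygon(list(rbind(
    c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0)
  )))
}

# Two made-up areas with made-up visit counts.
toy_map <- st_sf(
  GEOID = c("A", "B"),
  visits = c(4, 8),
  geometry = st_sfc(square(0), square(1))
)

# Fill each polygon by its visit count.
p <- ggplot(toy_map) +
  geom_sf(aes(fill = visits)) +
  theme_void() +
  labs(title = "Visits by origin area (toy data)", fill = "Visits")

p
```

A continuous fill like this keeps its legend, unlike the neighborhood map above, because readers need the scale to interpret the colors.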
Grading

• 5%: Completing all tasks (five visualizations in an RMarkdown doc, with a story and text connecting them)
• 15%: Coherent and well-explained story
• 60%: Solid visual design following the principles we’ve discussed, with graphs that actually relate to the story and reveal information about it
• 20%: Clear and concise writing

Notes on Working with Data

• You can use one data set for all your visualizations, or both. There are no bonus points for picking one over the other, or for doing both; just use whatever is effective. You’re also not required to do a map if you use king_originvisits_andmap.
• The industry codes are at the very fine six-digit level. However, they are designed such that the first two digits are broad categories, the first four digits are more specific categories, and the full six digits are the most specific. If you want to look at a broader category (say, “retail”), you can look up the two- (or four-) digit broad categories. Then, you can turn the six-digit categories into two-digit categories using

      dv <- dv %>% mutate(naics_two_digit = floor(naics_code/10000))

  or to four-digit using

      dv <- dv %>% mutate(naics_four_digit = floor(naics_code/100))

• A given naics_code is in there multiple times per day: once for each brand there is in the data for it, plus one for locations without brand information in the data (brands == ''). If you don’t care about the different brands and just want to use a full naics_code, you should collapse to the naics_code level using

      dv_naics <- dv %>% group_by(naics_code, naics_title, date) %>% summarize(visits_by_day = sum(visits_by_day))

  (drop the naics_title part if you’re using naics_two_digit or naics_four_digit instead of naics_code). Or use a pivot table if you’re in Excel. We can do the same with ov (be sure to include GEOID in the group_by()), but it will make the mapping not work any more.
  If you want to summarize() ov and then map it, do the same sort of

      ov_naics <- ov %>% group_by(GEOID, naics_code, naics_two_digit) %>% summarize(visits = sum(visits))

  as in the last step, but be sure to load in library(tigris), and then do

      mapdata [...] %>% summarize(visits = sum(visits))

  followed by

      ov_naics [...]
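The collapsing steps in these notes can be tried end to end on a toy version of dv. Everything in dv_toy is made up. One deliberate difference from the notes: instead of floor(naics_code/10000), the sketch takes the first two characters of the code, because the ov preview earlier in this sheet shows a few four-digit codes (6233, 3231) for which the division would return 0. The weekend flag is an extra illustration of the weekday-versus-weekend idea, not part of the original notes.

```r
library(dplyr)

# Toy stand-in for dv (all values are made up).
dv_toy <- data.frame(
  date = as.Date(c("2020-07-03", "2020-07-03", "2020-07-04", "2020-07-04")),
  naics_code = c(722511, 722513, 722511, 722513),
  brands = c("", "McDonald's", "", "McDonald's"),
  visits_by_day = c(10, 20, 30, 40)
)

dv_two_digit <- dv_toy %>%
  # Leading two digits as text; also works for codes shorter than six digits.
  mutate(naics_two_digit = substr(as.character(naics_code), 1, 2)) %>%
  # Collapse over brands and finer codes: one row per broad category per day.
  group_by(naics_two_digit, date) %>%
  summarize(visits_by_day = sum(visits_by_day), .groups = "drop") %>%
  # as.POSIXlt()$wday is 0 for Sunday and 6 for Saturday.
  mutate(weekend = as.POSIXlt(date)$wday %in% c(0, 6))

print(dv_two_digit)
```

Setting .groups = "drop" returns an ungrouped result, which avoids surprises if you pipe the summary straight into ggplot() or another mutate().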
