Lab 2: Data Cleaning/Preparation and Visualization
Stats 10: Introduction to Statistical Reasoning
Spring 2022
All rights reserved, Adam Chaffee, Michael Tsiang and Maria Cha, 2017-2022.
Do not post, share, or distribute anywhere or with anyone without explicit permission.
Some exercises based on labs by Nicolas Christou.
Objectives
1. Understand logical statements and subsetting
2. Reinforce knowledge on visualization techniques
Collaboration Policy
In lab, you are encouraged to work in pairs or small groups to discuss the concepts on the
assignments. However, DO NOT copy each other’s work, as this constitutes cheating. The work
you submit must be entirely your own. If you get stuck or have a question in lab, feel free to
reach out to other groups or talk to your TA.
Intro Logical Statements/Relational Operators
Logical Expressions: Type ?Comparison to see the R documentation on the list of all relational
operators you can apply. Many logical expressions in R use these relational operators.
Try running the lines of code below that use the relational operators >, >=, <=, ==, !=:
4 > 3 # Is 4 greater than 3?
c(3, 8) >= 3 # Is each of 3 and 8 greater than or equal to 3?
c(3, 8) <= 3 # Is each of 3 and 8 less than or equal to 3?
c(1, 4, 9) == 9 # Is each of 1, 4, and 9 exactly equal to 9?
c(1, 4, 9) != 9 # Is each of 1, 4, and 9 not (exactly) equal to 9?
Notice that the output is a logical vector (i.e., it contains only TRUE and FALSE) with the same
length as the vector on the left of the relational operator.
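As a small illustration of the point about length, the sketch below compares a made-up vector x (not part of the lab data) to a single number and checks the class and length of the result.

```r
# Hypothetical vector, used only for illustration
x <- c(1, 4, 9, 16)

# Comparing a vector to a single number checks each element in turn
result <- x >= 4
result                        # FALSE TRUE TRUE TRUE

# The result is a logical vector with one entry per element of x
class(result)                 # "logical"
length(result) == length(x)   # TRUE
```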
Applications of logical statements: calculations
We can perform certain calculations on logical vectors because R reads TRUE as 1 and FALSE
as 0. Create the NCbirths object from last lab and try these examples:
sum(NCbirths$weight > 100) #the number of babies that weighed more than 100 ounces
mean(NCbirths$weight > 100) #the proportion of babies that weighed more than 100 ounces
mean(NCbirths$gender == "Female") #the proportion of female babies
mean(NCbirths$gender != "Male") #gives the proportion of babies not assigned male
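To see why sum() gives a count and mean() gives a proportion, here is a minimal sketch on a small made-up vector of weights. The vector and its values are assumptions for illustration and are not taken from NCbirths.

```r
# Hypothetical weights in ounces (not the NCbirths data)
weights <- c(95, 120, 101, 88, 130)

# Logical vector: TRUE where the weight is over 100 ounces
heavy <- weights > 100
heavy         # FALSE TRUE TRUE FALSE TRUE

# TRUE counts as 1 and FALSE as 0, so:
sum(heavy)    # number of TRUE values: 3
mean(heavy)   # proportion of TRUE values: 3 / 5 = 0.6
```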
Applications of logical statements: subsets
We can combine logical statements with square brackets to subset data based on conditions.
Examples with NCbirths:
fem_weights <- NCbirths$weight[NCbirths$gender == "Female"]
With the line above, we created a vector called fem_weights that contains the weights of all the
female babies. We can combine multiple conditions using & (and) and | (or), but these will be
discussed in future labs.
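The subsetting idea can be tried on a small made-up data frame before using it on NCbirths. The sketch below is only an illustration: the data frame toy and its values are assumptions. It also previews combining conditions with the elementwise operators & and |, which are the forms that work on whole vectors.

```r
# Hypothetical toy data frame (not the NCbirths data)
toy <- data.frame(
  weight = c(95, 120, 101, 88, 130),
  gender = c("Female", "Male", "Female", "Male", "Female")
)

# Weights of the female babies only
fem <- toy$weight[toy$gender == "Female"]
fem         # 95 101 130

# & keeps entries where both conditions are TRUE
heavy_fem <- toy$weight[toy$gender == "Female" & toy$weight > 100]
heavy_fem   # 101 130

# | keeps entries where at least one condition is TRUE
either <- toy$weight[toy$gender == "Male" | toy$weight > 125]
either      # 120 88 130
```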
Good coding practices
Please consider implementing the following in your code:
1. Use the pound symbol (#) often to comment on different code sections. Consider using
them to label your exercise numbers and question parts, and to help describe what your
code does.
2. Use good spacing. Adding a space between arguments and inside of functions makes
your code easier to read. You can also skip lines for clarity.
3. Create as many objects as you like to make it easier to follow. For example, consider my
line above creating the fem_weights object. An alternative way to code this using best
practices is below:
## Create an object with the baby weights from NCbirths
baby_weight <- NCbirths$weight
## Create an object with the baby genders from NCbirths
baby_gender <- NCbirths$gender
## Create a logical vector to describe if the gender is female
is_female <- baby_gender == "Female"
## Create the vector of weights containing only females
fem_weights <- baby_weight[is_female]
Exercise 1
We will be working with data on college graduates obtained from the American Community Survey
2010-2012 Public Use Microdata Series. You can learn more about the data and a related
analysis from [The Economic Guide To Picking A College Major]
(https://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/).
a. Download the data ‘recent-grads.csv’ from CCLE Week 3 and read it into R. When you read
in the data, name your object “grads”. How many variables and observations does the data have?
b. The Bureau of Labor Statistics, U.S. Department of Labor reports that the unemployment rate
for college graduates was 2.3 percent in January 2022. What proportion of the majors had
unemployment rates lower than 2.3%?
c. Report the mean and standard deviation of the ‘Median’ earnings of the majors in the
‘Engineering’ major category.
d. Report the mean and standard deviation of the ‘Median’ earnings of all majors that are NOT
in the ‘Engineering’ major category. How are they different from the results in (c)?
e. Create a box plot for the ‘Median’ earnings of all observations in the data with a good title.
f. Based on what you see in part (e), describe the shape of the distribution. Does the mean seem
to be a good measure of center for the data? Report a more useful statistic for this data.
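The techniques this exercise relies on (subsetting by a category, summarizing each subset, and comparing the mean with the median) can be rehearsed on made-up data first. The sketch below is not a solution: the data frame toy, its column names category and earnings, and all values are assumptions chosen for illustration.

```r
# Hypothetical toy data; column names and values are made up
toy <- data.frame(
  category = c("A", "A", "B", "B", "B", "B"),
  earnings = c(60000, 70000, 30000, 32000, 35000, 110000)
)

# Earnings inside category A and outside category A
in_a  <- toy$earnings[toy$category == "A"]
not_a <- toy$earnings[toy$category != "A"]

mean(in_a)    # 65000
sd(in_a)
mean(not_a)   # 51750
sd(not_a)

# Box plot of all earnings with a descriptive title
boxplot(toy$earnings,
        main = "Earnings in the toy data",
        ylab = "Earnings (dollars)")

# One unusually large value pulls the mean above the median
mean(toy$earnings)
median(toy$earnings)   # 47500
```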
Exercise 2
The data here represent life expectancies (Life) and per capita income (Income) in 1974 dollars
for 101 countries in the early 1970s. The source of these data is: Leinhardt and Wasserman
(1979), New York Times (September 28, 1975, p. E-3). They also appear in Regression
Analysis by Ashish Sen and Muni Srivastava. You can access these data in R using:
life <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/countries_life.txt",
header = TRUE)
a. Construct a scatterplot of Life against Income. Note: Income should be on the horizontal axis.
How does income appear to affect life expectancy?
b. Construct the boxplot and histogram of Income. Describe the distribution based on shape,
center and variability. Are there any outliers found in the boxplot?
c. Report the center (typical value) of the ‘Income’ variable. Use an appropriate measure of
center.
d. Split the data set into two parts: One for which the Income is strictly below $1000, and one for
which the Income is at least $1000. Come up with your own names for these two objects.
e. Use the data for which the Income is at least $1000. Plot Life against Income and compute the
correlation coefficient. Describe the association of the two variables. Hint: use the function cor()
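Splitting a data frame at a threshold and then computing a correlation on one part can be sketched on made-up data as follows. The column names Life and Income match the description above, but the data frame toy, the object names low and high, and all values are assumptions for illustration only.

```r
# Hypothetical toy data; values are made up
toy <- data.frame(
  Income = c(200, 500, 900, 1000, 2500, 4000),
  Life   = c(45, 50, 58, 62, 70, 72)
)

# Rows with Income strictly below 1000, and rows with Income at least 1000
low  <- toy[toy$Income < 1000, ]
high <- toy[toy$Income >= 1000, ]

nrow(low)    # 3
nrow(high)   # 3

# Scatterplot and correlation for the higher-income part only
plot(high$Income, high$Life,
     xlab = "Income", ylab = "Life expectancy",
     main = "Toy data: Income at least 1000")
r <- cor(high$Income, high$Life)
r
```

Note the comma inside the square brackets: toy[condition, ] keeps whole rows of the data frame, whereas vector[condition] keeps elements of a single vector.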
Exercise 3
Use R to access the Maas river data. These data contain the concentrations of lead and zinc in
ppm at 155 locations on the banks of the Maas river in the Netherlands. You can read the data
into R as follows:
maas <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/soil.txt", header = TRUE)
a. Compute the summary statistics for lead and zinc using the summary() function.
b. Plot two histograms: one of lead and one of zinc. Describe the shapes of the two distributions.
c. Plot two histograms: one of log(lead) and one of log(zinc). How are they different from the
results in (b)?
d. Plot log(lead) against log(zinc) and compute the correlation coefficient. Describe the
association of the two variables.
e. According to CDC guidelines, lead-contaminated soil can pose a risk through direct ingestion,
uptake in vegetable gardens, or tracking into homes. Uncontaminated soil contains lead
concentrations less than 50 parts per million (ppm), but soil lead levels in many urban areas
exceed 200 ppm [AAP 1993]. The EPA’s standard for lead in bare soil is 400 ppm by weight in
play areas and 1200 ppm in non-play areas [EPA 2000a].
The level of risk for surface soil based on lead concentration in ppm is given in the table below:
Mean concentration (ppm)    Level of risk
Below 120                   Lead-free
Between 120 and 400         Lead-safe
Above 400                   Significant environmental lead hazard
Use techniques similar to last lab to give different colors and sizes to the lead concentration at
these 155 locations. You do not need to use the maps package to create a map of the area. Just
plot the points without a map.
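One possible way to turn the risk table into point colors and sizes is to start every point at the lowest level and then overwrite the entries that meet each condition. The sketch below uses made-up data: the data frame toy, its columns x, y, and lead, the chosen colors, and the point sizes are all assumptions, and the method from last lab may differ.

```r
# Hypothetical toy data; coordinates and lead values are made up
toy <- data.frame(
  x    = c(1, 2, 3, 4, 5),
  y    = c(2, 1, 4, 3, 5),
  lead = c(80, 150, 390, 450, 600)
)

# Logical vectors for the two higher risk levels in the table
is_safe   <- toy$lead >= 120 & toy$lead <= 400
is_hazard <- toy$lead > 400

# Start every point as "Lead-free", then overwrite by condition
point_col <- rep("green", nrow(toy))
point_cex <- rep(1, nrow(toy))
point_col[is_safe]   <- "orange"
point_cex[is_safe]   <- 1.5
point_col[is_hazard] <- "red"
point_cex[is_hazard] <- 2

# Plot the locations with color and size set by risk level
plot(toy$x, toy$y, col = point_col, cex = point_cex, pch = 19,
     xlab = "x", ylab = "y", main = "Toy lead concentrations by risk level")
legend("topleft", pch = 19, col = c("green", "orange", "red"),
       legend = c("Lead-free", "Lead-safe", "Hazard"))
```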
Exercise 4
The data for this exercise represent the approximate centers (given by longitude and latitude)
of each of the City of Los Angeles neighborhoods. See also the Los Angeles Times project
on the City of Los Angeles neighborhoods at:
http://projects.latimes.com/mappingla/neighborhoods/
You can access these data using:
LA <- read.table("http://www.stat.ucla.edu/~nchristo/statistics12/la_data.txt", header = TRUE)
a. Plot the data point locations. Use good formatting for the axes and title. Then add the outline
of LA County by typing:
library(maps) # load the maps package, which provides map()
map("county", "california", add = TRUE)
b. Do you see any relationship between income and school performance? Hint: Plot the variable
Schools against the variable Income and describe what you see. Ignore the data points on the plot
for which Schools = 0. Use what you learned about subsetting with logical statements to first
create the objects you need for the scatter plot. Then, create the scatter plot. Alternate methods
may only receive half credit.
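Dropping unwanted values with a logical statement before plotting can be sketched on made-up data as follows. The column names Schools and Income come from the hint above, but the data frame toy, the object names, and all values are assumptions for illustration.

```r
# Hypothetical toy data; values are made up
toy <- data.frame(
  Income  = c(30000, 45000, 60000, 80000, 52000),
  Schools = c(650, 0, 780, 850, 0)
)

# Logical vector: TRUE where Schools is not 0
keep <- toy$Schools != 0

# Build the two objects for the scatterplot from the kept entries only
income_kept  <- toy$Income[keep]
schools_kept <- toy$Schools[keep]

# Schools against Income: Income goes on the horizontal axis
plot(income_kept, schools_kept,
     xlab = "Income", ylab = "Schools",
     main = "Toy data: Schools against Income")
```

Using the same logical vector keep for both objects keeps the two vectors aligned, so each point still pairs an income with its own school value.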
