Lab 4: Simulation, Sampling, and the Central Limit Theorem
Stats 10: Introduction to Statistical Reasoning
Spring 2022
All rights reserved, Adam Chaffee, Michael Tsiang and Maria Cha, 2017-2020.
Do not post, share, or distribute anywhere or with anyone without explicit permission.
Objectives
1. Reinforce understanding of simulation and random sampling
2. Understand using confidence intervals to estimate population proportions
3. Demonstrate the Central Limit Theorem’s application to proportions
Collaboration Policy
In lab, you are encouraged to work in pairs or small groups to discuss the concepts on the
assignments. However, DO NOT copy each other’s work as this constitutes cheating. The work
you submit must be entirely your own. If you get stuck in lab, feel free to reach out to
other groups or talk to your TA.
Calculating normal and binomial distribution probabilities
The functions pnorm(), dbinom(), and pbinom() allow us to calculate theoretical probabilities
under an assumed distribution. pnorm() assumes a normal distribution with a given mean and sd,
and returns the probability of a value at or below some observation. The binomial functions
assume a binomial distribution with number of trials n (the size argument) and probability of
success p (the prob argument): dbinom() returns the probability of exactly the given number of
successes, and pbinom() returns the probability of that number or fewer. Try running the
examples below.
## Coin flipping scenario. Probability of getting 4 heads when 7 coins are tossed
dbinom(4, size = 7, prob = 0.5)
## Probability of getting 4 or fewer heads when 7 coins are tossed
pbinom(4, size = 7, prob = 0.5)
## Probability of getting a number less than 4 from a normal distribution with mean 2, sd of 0.7
pnorm(4, mean = 2, sd = 0.7)
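The following is a minimal sketch, reusing the coin-flip and normal examples above, of how these functions relate to each other. The object names (cumulative, summed, upper_binom, upper_norm, between_norm) are illustrative and not part of the lab.

```r
# P(X <= 4) for X ~ Binomial(7, 0.5), computed two ways:
# directly with pbinom(), and by summing dbinom() over 0, 1, ..., 4.
cumulative <- pbinom(4, size = 7, prob = 0.5)
summed <- sum(dbinom(0:4, size = 7, prob = 0.5))
all.equal(cumulative, summed)

# P(X >= 5) is the complement of P(X <= 4).
upper_binom <- 1 - pbinom(4, size = 7, prob = 0.5)

# Upper-tail normal probability: P(Y > 4) for mean 2 and sd 0.7.
upper_norm <- pnorm(4, mean = 2, sd = 0.7, lower.tail = FALSE)

# Probability between two values: P(1 < Y < 4).
between_norm <- pnorm(4, mean = 2, sd = 0.7) - pnorm(1, mean = 2, sd = 0.7)
```

Subtracting two cumulative probabilities, as in the last line, is the usual way to get the probability of a range of values.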
Exercise 1
It is reported that 36% of the US population has blood type O-positive. We randomly select 50
people from the US and want to model the number of them with blood type O-positive.
a. Write down n and p if we are to use the binomial distribution for the number of people with
blood type O-positive.
b. Calculate the mean and standard deviation of the number of people with blood type O-positive
in 50 randomly selected US people using the binomial model. Use R or a calculator.
c. Find the probability that there are exactly 17 people with blood type O-positive.
d. Find the probability that the number of people with blood type O-positive will be between 13
and 24.
e. Consider using the normal approximation to answer d. Is it reasonable to use the
approximation in this case? If so, compare the result with your answer to d.
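As a hedged illustration of the formulas involved, the sketch below uses a different, hypothetical scenario (100 tosses of a fair coin, not the blood type data). It treats "between 45 and 55" as inclusive of both endpoints, which is an assumption about the wording.

```r
# Hypothetical scenario: X = number of heads in 100 fair coin tosses.
n <- 100
p <- 0.5

# Binomial mean and standard deviation.
mu <- n * p                     # 50
sigma <- sqrt(n * p * (1 - p))  # 5

# Exact binomial probability P(45 <= X <= 55), endpoints included.
exact <- pbinom(55, size = n, prob = p) - pbinom(44, size = n, prob = p)

# Common rule of thumb for the normal approximation:
# expect at least 10 successes and at least 10 failures.
approx_ok <- (n * p >= 10) && (n * (1 - p) >= 10)

# Normal approximation using the binomial mean and sd.
approx <- pnorm(55, mean = mu, sd = sigma) - pnorm(45, mean = mu, sd = sigma)

c(exact = exact, approx = approx)
```

The two probabilities will not match exactly, because the normal curve is continuous while the binomial count is discrete.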
Simple Random Sampling
In Lab 3, you learned about how to simulate the roll of a die by sampling from the numbers 1 to
6 using the sample() function. We can also use the sample() function to conduct simple random
sampling from a population by sampling from the row numbers of a data frame. This is done in
two steps. We use NCbirths as an example:
(1) We will use sample() to randomly select n numbers between 1 and 1992 (the number of
babies in the NCbirths data frame). This represents choosing the babies based on ID numbers.
Note that we typically use the default argument replace = FALSE to ensure we get n unique IDs.
(2) We then use the selected numbers from Step (1) as an index for the rows of observations in
the NCbirths data frame that we want to extract as our sample.
As an example, try out the code below, which takes a simple random sample of size 5 from the
NCbirths data frame.
# Set the seed for reproducibility
set.seed(123)
# Select 5 numbers from 1 to 1992.
sample_index <- sample(1992, size = 5)
sample_index # Display the indices we sampled.
## [1] 573 1570 814 1757 1870
# Extract the rows in NCbirths that correspond to sample_index.
NCbirths[sample_index, ]
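If NCbirths is not loaded, the same two steps can be tried on any data frame. The sketch below assumes R's built-in mtcars data frame as a stand-in population. The name srs is illustrative.

```r
# Stand-in population: the built-in mtcars data frame.
N <- nrow(mtcars)

# Set the seed for reproducibility.
set.seed(123)

# Step 1: select 5 distinct row numbers from 1 to N (replace = FALSE by default).
sample_index <- sample(N, size = 5)

# Step 2: use the selected row numbers to extract the sampled rows.
srs <- mtcars[sample_index, ]

# Check that no row was selected twice (0 means no duplicates).
anyDuplicated(sample_index)
nrow(srs)
```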

Exercise 2
We revisit the college graduates data obtained from the American Community Survey 2010-2012
Public Use Microdata Series. Download the data file ‘new_recent_grads.csv’ from CCLE Week 7
and read it into R. When you read in the data, name your object “grads”. We assume the data
represent the population of interest in our study.
a. How many observations does the data have? Note that it represents the population size.
b. Report the mean ‘Median’ income from the population. Also report the proportion of
‘Engineering’ majors in the population.
c. Set the seed to 1245 and take a simple random sample of size 50 from the entire grads data
frame. Save the random sample as a separate R object, and print the first few lines to make sure
you saved it correctly.
d. Report the mean ‘Median’ income from the sample you took in c. Also report the proportion
of ‘Engineering’ majors in your sample. Compare the results with b.
e. Now, let’s generate confidence intervals for the population proportion using the sample
results. Are the three conditions for applying the CLT satisfied? Check the three conditions and
explain whether each is satisfied in the sample.
f. Produce 90%, 95%, and 99% confidence intervals for the true population proportion of
‘Engineering’ majors. You can use R and/or a calculator for this question, but please include code
or calculations to show your work.
g. Discuss whether each of the confidence intervals in f was able to capture the population
proportion.
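For reference, a confidence interval for a proportion has the form sample proportion plus or minus z* times the standard error. The sketch below uses hypothetical counts (12 successes in a sample of 50), not the grads data. The helper prop_ci() is an illustrative function written for this example, not a built-in R function.

```r
# Illustrative helper (not part of base R): normal-approximation
# confidence interval for a proportion, given x successes in n trials.
prop_ci <- function(x, n, level = 0.95) {
  phat <- x / n                          # sample proportion
  se <- sqrt(phat * (1 - phat) / n)      # estimated standard error
  z_star <- qnorm(1 - (1 - level) / 2)   # critical value, e.g. about 1.96 for 95%
  c(lower = phat - z_star * se, upper = phat + z_star * se)
}

# Hypothetical sample: 12 successes out of 50.
levels <- c(0.90, 0.95, 0.99)
intervals <- t(sapply(levels, function(lv) prop_ci(12, 50, level = lv)))
rownames(intervals) <- paste0(levels * 100, "%")
intervals
```

Higher confidence levels use a larger critical value, so the intervals get wider as the level increases.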
Simulations of samples and sample distributions using a for loop
We want to illustrate the sampling variability (also called sample-to-sample variability) of the
sample mean and the sample proportion. That is, when we take different random samples, how
does the sample mean of the Median income vary from sample to sample? How does the sample
proportion of engineering majors vary from sample to sample?
We can simulate sampling many (1000) random samples of size 50 from the population of recent
graduates data using a for loop. For each random sample, we can compute the
mean Median income and the proportion of the sample with an ‘Engineering’ major.
The code below is a for loop that simulates the sample proportions from 1000 random samples of
size 50. Please carefully read the comments for each line to understand the code. We will also
discuss the code in lab section.
# We first create objects for common quantities we will use for this exercise.
n <- 50 # The sample size
N <- nrow(grads) # The population size
M <- 1000 # Number of samples/repetitions
# Create vectors to store the simulated proportions from each repetition.
phats <- numeric(M) # for sample proportions
# Set the seed for reproducibility
set.seed(123)
# Always set the seed OUTSIDE the for loop.
# Now we start the loop. Let i cycle over the numbers 1 to M (i.e., iterate 1000 times).
for (i in seq_len(M)) {
  # The i-th iteration of the for loop represents a single repetition.
  # Take a simple random sample of size n from the population of size N.
  index <- sample(N, size = n)
  # Save the random sample in the sample_i data frame.
  sample_i <- grads[index, ]
  # Compute the proportion of the i-th sample with an engineering major.
  phats[i] <- mean(sample_i$Major_category == "Engineering")
}
Note that the replicate() function from the last lab could have been used here, but for loops are
much more versatile and can be used in a wider variety of settings.
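For comparison, here is a sketch of how the same kind of simulation might look with replicate(). It assumes a hypothetical stand-in population named pop with a Major_category column, so that it runs without the grads data. The names pop and phats_rep are illustrative.

```r
# Set the seed for reproducibility.
set.seed(123)

# Hypothetical stand-in population of 2000 graduates (not the grads data).
pop <- data.frame(
  Major_category = sample(c("Engineering", "Other"), size = 2000,
                          replace = TRUE, prob = c(0.2, 0.8))
)

n <- 50         # The sample size
N <- nrow(pop)  # The population size
M <- 1000       # Number of samples/repetitions

# replicate() evaluates the expression M times and collects the results.
phats_rep <- replicate(M, {
  index <- sample(N, size = n)
  mean(pop$Major_category[index] == "Engineering")
})

length(phats_rep)
```

With replicate() there is no need to create the storage vector or track the index i, but the loop body cannot easily fill several result vectors at once.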
Exercise 3 - Proportions
a. Run the entire chunk of code above to run a for loop that creates a vector of
sample proportions. Using the results, create a relative frequency histogram of the sampling
distribution of sample proportions.
Superimpose a normal curve on your histogram using the following instructions:
• If you use the histogram() function from the mosaic package, add the argument: fit =
"normal".
• If you use the hist() function from base R, add the argument: prob = TRUE, then run the
command: curve(dnorm(x, mean(phats), sd(phats)), add = TRUE).
b. What are the mean and standard deviation of the simulated sample proportions?
c. Do you think the simulated distribution of sample proportions is approximately normal?
Explain why or why not.
d. Using the theory-based method (i.e., normal approximation by invoking the Central Limit
Theorem), what would you predict the mean and standard deviation of the sampling distribution
of sample proportions to be? How close are these predictions to your answers from part b?
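As a hedged sketch of the theory-based prediction, the Central Limit Theorem gives a sampling distribution of sample proportions with mean p and standard deviation sqrt(p(1 - p)/n). The code below compares those values with a simulation from a hypothetical population (10,000 individuals, exactly 20% "Engineering"), not the grads data. The names pop_major, p_true, se_theory, and phats_sim are illustrative.

```r
# Hypothetical population: 10,000 individuals, exactly 20% Engineering.
pop_major <- rep(c("Engineering", "Other"), times = c(2000, 8000))
p_true <- mean(pop_major == "Engineering")  # 0.2

n <- 50    # The sample size
M <- 1000  # Number of samples/repetitions

# Theory-based predictions for the sampling distribution.
mean_theory <- p_true
se_theory <- sqrt(p_true * (1 - p_true) / n)

# Simulated sampling distribution, using the same loop pattern as above.
set.seed(123)
phats_sim <- numeric(M)
for (i in seq_len(M)) {
  index <- sample(length(pop_major), size = n)
  phats_sim[i] <- mean(pop_major[index] == "Engineering")
}

# Compare the theoretical values with the simulated ones.
rbind(theory = c(mean = mean_theory, sd = se_theory),
      simulated = c(mean = mean(phats_sim), sd = sd(phats_sim)))
```

Here the sample is a small fraction of the population, so sampling without replacement has little effect on the standard deviation.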

Exercise 4 - Means
a. Write a new for loop that creates a vector of sample means of the Median income. Use n = 50,
N = nrow(grads), and M = 1000 just like before, and set the seed to 1234.
b. Create a relative frequency histogram of the sampling distribution of sample means for
Median income. Superimpose a normal curve by following the instructions given in 3(a).
c. Do you think the simulated distribution of sample means is approximately normal? Explain
why or why not. If your answer was different from your answer to Exercise 3(c), why do you
think this is the case?
