MATH20461: Probability and Statistical Inference
Part 1: Plasma ferritin concentration study
In this coursework you will assess the effect of a collection of explanatory variables on the plasma ferritin concentration (Ferr) in 202 Australian athletes. The file “Sports Data CW 2023.csv” contains the data on the plasma ferritin concentration as well as a selection of demographic variables of 202 male and female athletes. In particular, the data set comprises observations on the following eleven variables:
Variable        Description
Sport           Type of sport
Sex             Male or female
LBM             Lean body mass
RCC             Red cell count
WCC             White cell count
Hc (%)          Hematocrit (Hc) is the volume percentage (vol%) of red blood cells in blood. It is normally 47% ±5% for men and 42% ±5% for women.[1]
Hg (g/dl)       Hemoglobin (Hg) is the protein contained in red blood cells that is responsible for delivery of oxygen to the tissues. The normal Hg level for males is 14 to 18 g/dl; that for females is 12 to 16 g/dl.
BMI             Body mass index = weight/height^2
SSF (mm)        Sum of skin folds
% Bfat          % body fat
Ferr (μmol/L)   Plasma ferritin concentration
Task 1: Using R, read the data into a data frame (called, e.g., Athletes) and:
Produce a table of summary statistics and draw appropriate plots to visually investigate the relationship between these eleven variables. Comment on your table and plots. Explain how you might use your plots to identify potential nonlinearity or multicollinearity in a linear regression model.
Explore the distribution of Ferr using a histogram and a Q-Q plot. Comment on them. Is the distribution of Ferr close to the normal distribution? If not, how would you recommend addressing this issue?
(14 marks)
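The exploratory steps in Task 1 could be sketched in R roughly as follows. This is only an illustration under stated assumptions: the data frame is simulated stand-in data (not the coursework file), it has just four columns, and the column names Sex, LBM, BMI and Ferr are assumed, so check names(Athletes) after reading the real CSV.

```r
# Stand-in data: simulated values only, NOT the coursework data.
# With the real file one would instead call something like
#   Athletes <- read.csv("Sports Data CW 2023.csv")
# The column names used below are assumptions.
set.seed(1)
n <- 202
Athletes <- data.frame(
  Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  LBM = rnorm(n, mean = 65, sd = 13),
  BMI = rnorm(n, mean = 23, sd = 3)
)
# Right-skewed response, chosen here only to make the plots interesting
Athletes$Ferr <- exp(3.5 + 0.01 * Athletes$LBM + rnorm(n, sd = 0.5))

# Summary statistics for every column
print(summary(Athletes))

# Scatterplot matrix of the numeric columns: curved point clouds hint at
# nonlinearity, tight linear bands between two predictors hint at
# multicollinearity
num_cols <- sapply(Athletes, is.numeric)
pairs(Athletes[, num_cols])

# Pairwise correlations as a numeric companion to the scatterplot matrix
print(round(cor(Athletes[, num_cols]), 2))

# Histogram and normal Q-Q plot of the response, raw and log-transformed
par(mfrow = c(2, 2))
hist(Athletes$Ferr, main = "Ferr", xlab = "Ferr")
qqnorm(Athletes$Ferr, main = "Q-Q plot: Ferr")
qqline(Athletes$Ferr)
hist(log(Athletes$Ferr), main = "log(Ferr)", xlab = "log(Ferr)")
qqnorm(log(Athletes$Ferr), main = "Q-Q plot: log(Ferr)")
qqline(log(Athletes$Ferr))
par(mfrow = c(1, 1))
```

The log transformation shown in the lower panels is one common choice for a right-skewed, strictly positive response; whether it suits the real Ferr values has to be judged from the real plots.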
Randomly divide the dataset into two sets, training (n1 = 141) and testing (n2 = 61) (see Appendix 1 for an explanation of how to do this).
Task 2: Use the training dataset to
(a) Write down the equation of a regression model with Ferr as the response and the other ten variables as predictors.
(b) Fit the model in (a), identify insignificant predictors and remove them from the model. Is the full model better than the smaller model? Use an appropriate test or score to support your argument.
(12 marks)
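One possible way to compare a full model with a smaller nested model is a partial F-test via anova(), optionally backed up by AIC. The sketch below uses simulated stand-in data with assumed column names (Sex, LBM, BMI, WCC, Ferr), and the choice of which predictors to drop is purely illustrative; with the real training set it would follow from the coefficient table of the full fit.

```r
# Stand-in training data: simulated values, NOT the coursework data.
# Column names are assumptions.
set.seed(2)
n <- 141
training <- data.frame(
  Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  LBM = rnorm(n, mean = 65, sd = 13),
  BMI = rnorm(n, mean = 23, sd = 3),
  WCC = rnorm(n, mean = 7, sd = 1.8)
)
training$Ferr <- 20 + 0.8 * training$LBM +
  15 * (training$Sex == "male") + rnorm(n, sd = 30)

# Full model: Ferr regressed on every other column
full <- lm(Ferr ~ ., data = training)
print(summary(full))  # coefficient table with t-tests at the 0.05 level

# Reduced model: BMI and WCC are dropped here for illustration only
reduced <- lm(Ferr ~ Sex + LBM, data = training)

# Partial F-test for nested models. H0: the dropped coefficients are all
# zero. A p-value above 0.05 gives no evidence that the full model is better.
cmp <- anova(reduced, full)
print(cmp)

# AIC as a complementary score (smaller is better)
print(AIC(reduced, full))
```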
(c)(i) Check the assumptions of constant variance, independence and normality of the errors, and of linearity, for the model in part (b). Do these assumptions hold for your model? If not, choose an appropriate transformation of the response variable and repeat steps (a)-(c)(i) for the transformed response variable.
(c)(ii) Check for outliers, observations with large leverage and influential points. How would you deal with any possible outliers, observations with large leverage or influential points? Comment on the presence of multicollinearity in the model. How would you recommend addressing possible multicollinearity in the model?
(16 marks)
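The diagnostic checks described above could be sketched with base R alone, as below. The data are simulated stand-ins with assumed column names, the cut-offs (2p/n for leverage, 4/n for Cook's distance, 2 for standardised residuals) are common rules of thumb rather than anything the coursework prescribes, and the variance inflation factors are computed from the inverse correlation matrix of the numeric predictors instead of an add-on package.

```r
# Stand-in training data: simulated values, NOT the coursework data.
set.seed(3)
n <- 141
training <- data.frame(
  LBM = rnorm(n, mean = 65, sd = 13),
  BMI = rnorm(n, mean = 23, sd = 3)
)
# SSF is made deliberately correlated with BMI to show a raised VIF
training$SSF <- 3 * training$BMI + rnorm(n, sd = 5)
training$Ferr <- exp(3 + 0.015 * training$LBM + rnorm(n, sd = 0.4))

# Model on the log scale (an assumed transformation for this sketch)
fit <- lm(log(Ferr) ~ LBM + BMI + SSF, data = training)

# Standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))

# Shapiro-Wilk test of normality of the residuals
sw <- shapiro.test(residuals(fit))
print(sw)

# Outliers, leverage and influence (rule-of-thumb thresholds)
p <- length(coef(fit))           # number of estimated coefficients
lev <- hatvalues(fit)            # leverage of each observation
cook <- cooks.distance(fit)      # influence of each observation
rstd <- rstandard(fit)           # standardised residuals
high_leverage <- which(lev > 2 * p / n)
influential <- which(cook > 4 / n)
outlying <- which(abs(rstd) > 2)

# Variance inflation factors for numeric predictors: the diagonal of the
# inverse of their correlation matrix. Values well above about 5 to 10
# are often read as a sign of multicollinearity.
X <- training[, c("LBM", "BMI", "SSF")]
vif <- diag(solve(cor(X)))
names(vif) <- colnames(X)
print(round(vif, 2))
```

This VIF shortcut only covers numeric predictors; factors such as Sport or Sex would need a generalised VIF or a different check.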
For the model obtained in part (c), determine which of the significant predictors has the largest estimated effect on Ferr. Is this effect also the most statistically significant? Interpret the effect of the explanatory variables on the response variable, Ferr.
Comment on the goodness of fit of the model and how you would improve the model.
(10 marks)
Task 3: Model evaluation
Use the testing dataset to evaluate your model by predicting Ferr in the testing subset (see Appendix 2 for how to do this). Using appropriate plots or statistical tests, show whether the predictions are close to the observed Ferr in the testing set. Comment on how you would improve the model accuracy.
(8 marks)
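A minimal sketch of the evaluation step is given below, assuming simulated stand-in data and an untransformed response. It uses predict() with interval = "prediction" as described in Appendix 2, then summarises the agreement between observed and predicted values with the root mean squared error, the share of observations inside the 95% prediction intervals, a paired t-test and an observed-versus-predicted plot. These summaries are one reasonable selection, not a prescribed list.

```r
# Stand-in data: simulated values, NOT the coursework data.
set.seed(4)
n <- 202
Athletes <- data.frame(
  LBM = rnorm(n, mean = 65, sd = 13),
  BMI = rnorm(n, mean = 23, sd = 3)
)
Athletes$Ferr <- 10 + 1.0 * Athletes$LBM + rnorm(n, sd = 25)

# Split as in Appendix 1
idx <- sample(1:nrow(Athletes), 141, replace = FALSE)
training <- Athletes[idx, ]
testing <- Athletes[-idx, ]

fit <- lm(Ferr ~ LBM + BMI, data = training)

# Matrix with columns "fit", "lwr", "upr" (95% prediction intervals)
pred <- predict(fit, newdata = testing, interval = "prediction")

# Root mean squared prediction error on the testing set
rmse <- sqrt(mean((testing$Ferr - pred[, "fit"])^2))

# Share of observed values inside their prediction interval
coverage <- mean(testing$Ferr >= pred[, "lwr"] &
                 testing$Ferr <= pred[, "upr"])

# Paired t-test: is the mean prediction error different from zero?
tt <- t.test(testing$Ferr, pred[, "fit"], paired = TRUE)

cat("RMSE:", round(rmse, 2), " coverage:", round(coverage, 2),
    " paired t-test p-value:", round(tt$p.value, 3), "\n")

# Observed vs predicted; points near the dashed line y = x are well predicted
plot(testing$Ferr, pred[, "fit"],
     xlab = "Observed Ferr", ylab = "Predicted Ferr")
abline(0, 1, lty = 2)
```

If the final model uses a transformed response such as log(Ferr), the predictions and interval limits would need to be back-transformed (for example with exp()) before comparing them with the observed Ferr.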
In both Tasks 2 and 3, use a significance level of 0.05.
Total: 60 marks
Part 2: Bayesian Inference
An unknown value θ has been transmitted to you over a noisy channel. Let us assume the noise, ε, is normally distributed with mean 0 and a known variance 4. Therefore, you will receive the value X that is modelled by N(θ, 4). Based on previous communications, your prior knowledge on θ is N(12, 9).
(a) Suppose a value is transmitted to you and you receive it as x = 13.25. Obtain the posterior distribution function for θ.
(b) Using R, plot the prior, likelihood and posterior distribution curves on the same plot. Explain how your belief about θ will be updated after adding information from the data to the prior information.
(c) Suppose the same value θ is transmitted to you n times. You receive these signals plus noise as x_1,…,x_n with sample mean x̄. Assuming θ ~ N(θ_0, σ_0^2), obtain a formula for the posterior mean and variance of the mean parameter.
(d) Suppose the same value θ is sent to you 20 times. You receive these signals plus noise as x_1,…,x_20 with sample mean x̄ = 11.85. Using the same prior and known variance σ^2 as in part (a), obtain the posterior distribution for θ. Plot the prior, likelihood and posterior on the same graph. Describe how the data changes your belief about the true value of θ.
(e) How do the posterior mean and variance change if more data is received? What is gained by sending the same signal multiple times? Answer this question by addressing the formulae you obtain in part (c).
Total: 30 marks
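The calculations in Part 2 can be checked numerically with a short R sketch. It relies on the standard conjugate result for a normal prior and a normal likelihood with known variance: the posterior precision is 1/σ_0^2 + n/σ^2 and the posterior mean is the precision-weighted average of θ_0 and x̄. The function names normal_posterior and plot_update are hypothetical helpers introduced here for illustration, and the plotting range of 2 to 22 is an arbitrary choice.

```r
# Conjugate normal update with known sampling variance sigma2.
# Prior: theta ~ N(prior_mean, prior_var); data: n values with mean xbar.
normal_posterior <- function(prior_mean, prior_var, xbar, sigma2, n = 1) {
  post_prec <- 1 / prior_var + n / sigma2   # precisions add
  post_var <- 1 / post_prec
  post_mean <- post_var * (prior_mean / prior_var + n * xbar / sigma2)
  list(mean = post_mean, var = post_var)
}

# Draw prior, normalised likelihood and posterior on one plot.
# As a function of theta, the likelihood is proportional to a
# N(xbar, sigma2 / n) density.
plot_update <- function(prior_mean, prior_var, xbar, sigma2, n, main) {
  post <- normal_posterior(prior_mean, prior_var, xbar, sigma2, n)
  theta <- seq(2, 22, length.out = 500)
  prior_d <- dnorm(theta, prior_mean, sqrt(prior_var))
  lik_d <- dnorm(theta, xbar, sqrt(sigma2 / n))
  post_d <- dnorm(theta, post$mean, sqrt(post$var))
  plot(theta, post_d, type = "l", lwd = 2,
       ylim = c(0, max(prior_d, lik_d, post_d)),
       xlab = expression(theta), ylab = "Density", main = main)
  lines(theta, prior_d, lty = 2)
  lines(theta, lik_d, lty = 3)
  legend("topright",
         legend = c("Prior", "Likelihood (normalised)", "Posterior"),
         lty = c(2, 3, 1), lwd = c(1, 1, 2))
  invisible(post)
}

# One observation x = 13.25, prior N(12, 9), known variance 4
one <- plot_update(12, 9, xbar = 13.25, sigma2 = 4, n = 1,
                   main = "One observation")
# Twenty observations with sample mean 11.85
many <- plot_update(12, 9, xbar = 11.85, sigma2 = 4, n = 20,
                    main = "Twenty observations")

print(unlist(one))   # mean 167.25/13, variance 36/13
print(unlist(many))  # mean 545.25/46, variance 9/46
```

In the second case the posterior variance is far smaller and the posterior mean sits almost on the sample mean, which is the behaviour the precision formula predicts as n grows.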
Appendix 1
Random division of the data set into two subsets in R
# randomly draw 141 row indices of the Athletes dataset for the training sample (n1 = 141)
n1 <- sample(1:nrow(Athletes), 141, replace = FALSE)
training <- Athletes[n1, ]
# the remaining 61 rows form the testing sample (n2 = 61)
testing <- Athletes[-n1, ]
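Because sample() draws different rows on every run, results that depend on the split are easier to reproduce if a seed is fixed first. The sketch below shows this on a stand-in data frame; the seed value 2023 is arbitrary and the id and Ferr columns are hypothetical placeholders, not the coursework data.

```r
# Stand-in data frame with 202 rows (placeholder values only)
Athletes <- data.frame(id = 1:202, Ferr = seq(10, 211, length.out = 202))

# Fix the random number generator so the same split is drawn on every run
set.seed(2023)  # arbitrary seed, any integer works
n1 <- sample(1:nrow(Athletes), 141, replace = FALSE)
training <- Athletes[n1, ]
testing <- Athletes[-n1, ]

# Sanity checks: sizes match n1 = 141 and n2 = 61, and no row is in both sets
print(c(training = nrow(training), testing = nrow(testing)))
print(length(intersect(training$id, testing$id)))
```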
Appendix 2
Prediction using R
You can use R to produce predicted values from a fitted linear model:
predict(model, newdata, interval = "prediction")
model      Object of class inheriting from "lm".
newdata    An optional data frame in which to look for variables with which to predict. In the current case, set newdata = testing.
interval   Type of interval calculation.