Question

MATH20461: Probability and Statistical Inference
Part 1: Plasma ferritin concentration study

In this coursework you will assess the effect of a collection of explanatory variables on the plasma ferritin concentration (Ferr) in 202 Australian athletes. The file “Sports Data CW 2023.csv” contains the data on the plasma ferritin concentration as well as a selection of demographic variables of 202 male and female athletes. In particular, the data set comprises observations on the following eleven variables:

Variable        Description
Sport           Type of sport
Sex             Male or female
LBM             Lean body mass
RCC             Red cell count
WCC             White cell count
Hc (%)          Hematocrit (Hc) is the volume percentage (vol%) of red blood cells in blood. It is normally 47% ±5% for men and 42% ±5% for women.[1]
Hg (g/dl)       Hemoglobin (Hg) is the protein contained in red blood cells that is responsible for delivery of oxygen to the tissues. The normal Hg level for males is 14 to 18 g/dl; that for females is 12 to 16 g/dl.
BMI             Body mass index = weight/height^2
SSF (mm)        Sum of skin folds
% Bfat          % body fat
Ferr (μmol/L)   Plasma ferritin concentration

Task 1: Using R, read the data into a data frame (called, e.g., Athletes) and:

(a) Produce a table of summary statistics and draw appropriate plots to visually investigate the relationships among these eleven variables. Comment on your table and plots. Explain how you might use your plots to identify potential nonlinearity or multicollinearity in a linear regression model.

(b) Explore the distribution of Ferr using a histogram and a Q-Q plot. Comment on them. Is the distribution of Ferr close to the normal distribution? If not, how would you recommend addressing this issue?

(14 marks)
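As a rough illustration of the exploratory steps in Task 1, the sketch below runs on simulated data, since the real file is not reproduced here. The column names and all numeric values are assumptions made for illustration only; the actual names produced by read.csv for headers such as "Hc (%)" may differ.

```r
# Simulated stand-in for the real file; all values and column names are hypothetical.
set.seed(1)
Athletes <- data.frame(
  Sex  = factor(sample(c("female", "male"), 202, replace = TRUE)),
  LBM  = rnorm(202, mean = 65, sd = 13),
  BMI  = rnorm(202, mean = 23, sd = 3),
  Ferr = rlnorm(202, meanlog = 4, sdlog = 0.6)  # right-skewed on purpose
)
# With the real data one would instead start from something like:
# Athletes <- read.csv("Sports Data CW 2023.csv")

num_vars <- c("LBM", "BMI", "Ferr")
summary(Athletes)                     # summary statistics for every column
pairs(Athletes[, num_vars])           # scatterplot matrix: curvature hints at nonlinearity
round(cor(Athletes[, num_vars]), 2)   # large predictor correlations hint at multicollinearity

# Histogram and normal Q-Q plot of the response, side by side
par(mfrow = c(1, 2))
hist(Athletes$Ferr, main = "Histogram of Ferr", xlab = "Ferr")
qqnorm(Athletes$Ferr); qqline(Athletes$Ferr)

# Shapiro-Wilk test as a numerical companion to the plots,
# before and after a log transformation (one candidate remedy for right skew)
sw_raw <- shapiro.test(Athletes$Ferr)
sw_log <- shapiro.test(log(Athletes$Ferr))
```

A log transformation is only one candidate here; whether it suits the real Ferr values has to be judged from the actual histogram and Q-Q plot.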

Randomly divide the dataset into two sets, training (n1 = 141) and testing (n2 = 61) (see Appendix 1 for an explanation of how to do this).

Task 2: Use the training dataset to:

(a) Write down the equation of a regression model with Ferr as the response and the other ten variables as predictors.

(b) Fit the model in (a), identify insignificant predictors and remove them from the model. Is the full model better than the smaller model? Use an appropriate test or score to support your argument.

(12 marks)
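One common way to compare a full model with a smaller nested model is a partial F-test, optionally alongside an information criterion such as AIC. The sketch below uses simulated training data with a handful of hypothetical predictors, so the variable names, coefficients and the choice of which terms to drop are assumptions, not results from the real data.

```r
# Simulated training set; names and values are hypothetical.
set.seed(2)
n <- 141
training <- data.frame(
  Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  LBM = rnorm(n, 65, 13),
  BMI = rnorm(n, 23, 3),
  WCC = rnorm(n, 7, 1.8)
)
# Assumed data-generating model: BMI and WCC have no effect on Ferr
training$Ferr <- 20 + 0.8 * training$LBM +
  15 * (training$Sex == "male") + rnorm(n, sd = 10)

full    <- lm(Ferr ~ Sex + LBM + BMI + WCC, data = training)
reduced <- lm(Ferr ~ Sex + LBM, data = training)

summary(full)$coefficients   # t-tests for individual coefficients
cmp <- anova(reduced, full)  # partial F-test: H0 is that the dropped terms are all zero
cmp
AIC(reduced, full)           # lower AIC indicates the preferred model
```

A p-value above 0.05 in the anova table would suggest that the dropped predictors do not improve the fit significantly, so the smaller model would be retained.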

(c) (i) Check the constant variance, independence and normality assumptions of the errors, and the linearity of the model in part (b). Do these assumptions hold for your model? If not, choose an appropriate transformation of the response variable and repeat steps (a)-(c)(i) for the transformed response variable.

(ii) Check for outliers, observations with large leverage and influential points. How would you deal with any possible outliers, observations with large leverage or influential points? Comment on the presence of multicollinearity in the model. How would you recommend addressing possible multicollinearity in the model?

(16 marks)
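The diagnostic checks in part (c) can be sketched with base R alone. The example below again uses simulated data with hypothetical column names, and the cut-offs (2p/n for leverage, 4/n for Cook's distance, 3 for standardized residuals) are common rules of thumb rather than fixed standards. The variance inflation factor is computed by hand so that no extra package is assumed.

```r
# Simulated data; names and values are hypothetical.
set.seed(3)
n <- 141
d <- data.frame(LBM = rnorm(n, 65, 13))
d$BMI  <- 10 + 0.2 * d$LBM + rnorm(n, sd = 1.5)       # deliberately correlated with LBM
d$Ferr <- exp(3 + 0.01 * d$LBM + rnorm(n, sd = 0.4))  # positive, right-skewed response
fit <- lm(log(Ferr) ~ LBM + BMI, data = d)

# Residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)

p    <- length(coef(fit))   # number of estimated coefficients
lev  <- hatvalues(fit)      # leverage of each observation
cook <- cooks.distance(fit) # influence of each observation
rstd <- rstandard(fit)      # standardized residuals

high_leverage <- which(lev > 2 * p / n)
influential   <- which(cook > 4 / n)
outliers      <- which(abs(rstd) > 3)

# VIF for LBM: 1 / (1 - R^2) from regressing LBM on the other predictors
vif_LBM <- 1 / (1 - summary(lm(LBM ~ BMI, data = d))$r.squared)
vif_LBM
```

Flagged observations are candidates for closer inspection, not automatic deletion; refitting without them and comparing the coefficients is one way to judge their impact.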

For the model obtained in part (c), determine which of the significant predictors has the largest estimated effect on Ferr. Is this effect also the most statistically significant? Interpret the effect of explanatory variables on the response variable, Ferr.

Comment on the goodness of fit of the model and how you would improve the model.

(10 marks)

Task 3: Model evaluation

Use the testing dataset to evaluate your model by predicting Ferr in the testing subset (see how to do this in Appendix 2). Using appropriate plots or statistical tests, show whether the predictions are close to the observed Ferr in the testing set. Comment on how you would improve the model accuracy.

(8 marks)
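A minimal sketch of the evaluation step, assuming a model fitted to log(Ferr) with a single hypothetical predictor on simulated data. It combines the split from Appendix 1 with the predict call from Appendix 2, back-transforms the predictions to the original scale, and summarizes accuracy by root mean squared error and prediction-interval coverage. The seed, the model and the metrics are illustrative choices.

```r
# Simulated data; names and values are hypothetical.
set.seed(4)
Athletes <- data.frame(LBM = rnorm(202, 65, 13))
Athletes$Ferr <- exp(3 + 0.015 * Athletes$LBM + rnorm(202, sd = 0.4))

# Split as in Appendix 1
idx      <- sample(1:nrow(Athletes), 141, replace = FALSE)
training <- Athletes[idx, ]
testing  <- Athletes[-idx, ]

fit  <- lm(log(Ferr) ~ LBM, data = training)
pred <- predict(fit, newdata = testing, interval = "prediction")  # log scale
pred_ferr <- exp(pred)  # back-transformed columns: fit, lwr, upr

# Root mean squared prediction error on the original scale
rmse <- sqrt(mean((testing$Ferr - pred_ferr[, "fit"])^2))

# Share of test observations inside their 95% prediction interval
coverage <- mean(testing$Ferr >= pred_ferr[, "lwr"] &
                 testing$Ferr <= pred_ferr[, "upr"])

# Observed vs predicted; points near the dashed line indicate good predictions
plot(testing$Ferr, pred_ferr[, "fit"],
     xlab = "Observed Ferr", ylab = "Predicted Ferr")
abline(0, 1, lty = 2)
```

Note that exponentiating a prediction made on the log scale estimates the conditional median of Ferr rather than its mean, which matters when comparing against observed values.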

In both Tasks 2 and 3, use a significance level of 0.05.

Total: 60 marks

Part 2: Bayesian Inference

An unknown value θ has been transmitted to you over a noisy channel. Let us assume the noise, ε, is normally distributed with mean 0 and known variance 4. Therefore, you will receive a value X that is modelled by N(θ, 4). Based on previous communications, your prior knowledge of θ is N(12, 9).

(a) Suppose a value is transmitted to you and you receive it as x = 13.25. Obtain the posterior distribution function for θ.

(b) Using R, plot the prior, likelihood and posterior distribution curves on the same plot. Explain how your belief about θ is updated after adding information from the data to the prior information.

(c) Suppose the same value θ is transmitted to you n times. You receive these signals plus noise as x_1,…,x_n with sample mean x̄. Assuming θ ~ N(θ_0, σ_0^2), obtain a formula for the posterior mean and variance of the mean parameter.

(d) Suppose the same value θ is sent to you 20 times. You receive these signals plus noise as x_1,…,x_20 with sample mean x̄ = 11.85. Using the same prior and known variance σ^2 as in part (a), obtain the posterior distribution for θ. Plot the prior, likelihood and posterior on the same graph. Describe how the data change your belief about the true value of θ.

(e) How do the posterior mean and variance change if more data are received? What is gained by sending the same signal multiple times? Answer this question by referring to the formulae you obtained in part (c).
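The standard conjugate result for a normal prior and normal data with known variance can be sketched in R as follows. The helper name normal_update and its argument names are hypothetical; the function adds the prior precision and the data precision, and weights the prior mean and the sample mean by those precisions. The plotting range is an arbitrary choice.

```r
# Posterior for theta with prior N(theta0, s0sq) and n observations
# from N(theta, sigsq), sigsq known (hypothetical helper).
normal_update <- function(xbar, n, sigsq, theta0, s0sq) {
  post_prec <- 1 / s0sq + n / sigsq                       # precisions add
  post_mean <- (theta0 / s0sq + n * xbar / sigsq) / post_prec
  list(mean = post_mean, var = 1 / post_prec)
}

one    <- normal_update(xbar = 13.25, n = 1,  sigsq = 4, theta0 = 12, s0sq = 9)
twenty <- normal_update(xbar = 11.85, n = 20, sigsq = 4, theta0 = 12, s0sq = 9)

# Prior, likelihood (as a function of theta) and posterior for the n = 20 case
theta <- seq(2, 22, length.out = 400)
plot(theta, dnorm(theta, twenty$mean, sqrt(twenty$var)), type = "l",
     xlab = expression(theta), ylab = "Density")          # posterior
lines(theta, dnorm(theta, 12, 3), lty = 2)                # prior N(12, 9)
lines(theta, dnorm(theta, 11.85, sqrt(4 / 20)), lty = 3)  # likelihood of the sample mean
legend("topright", legend = c("Posterior", "Prior", "Likelihood"), lty = 1:3)
```

As n grows, the term n / sigsq dominates the posterior precision, so the posterior variance shrinks and the posterior mean moves toward the sample mean.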

Total: 30 marks

Appendix 1

Random division of the data set into two subsets in R

# Randomly draw 141 row indices (without replacement) from the Athletes data frame
n1 <- sample(1:nrow(Athletes), 141, replace = FALSE)

# Training set: the n1 = 141 sampled rows
training <- Athletes[n1, ]

# Testing set: the remaining n2 = 61 rows
testing <- Athletes[-n1, ]

Appendix 2

Prediction using R

You can use R to produce predicted values for linear model fits, where model is a fitted lm object and newdata is a data frame containing the predictor columns:

predict(model, newdata, interval = "prediction")