STATS 330: Advanced Statistical Modelling
1. The dataset for this question concerns survival time for patients undergoing a particular type of liver operation. The dataset contains the following variables:
bcs blood clotting score.
pindex prognostic index.
enzyme_test enzyme function test score.
liver_test liver function test score.
age age, in years.
gender indicator variable for gender (0 = male, 1 = female).
alc_mod indicator variable for history of alcohol use (0 = None, 1 = Moderate).
alc_heavy indicator variable for history of alcohol use (0 = None, 1 = Heavy).
y survival time in days.
This dataset is included in the R package “olsrr”. To access the dataset:
• install the olsrr package.
• use library(olsrr) to access this package.
• the data set will be in a data frame called surgical.
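The steps above can be sketched as follows. This is a minimal sketch, assuming olsrr can be installed from CRAN and that the data frame is exported under the name surgical as stated; the explicit data() call is a precaution and may not be needed.

```r
# Run once if olsrr is not yet installed:
# install.packages("olsrr")

library(olsrr)

# Make the data frame available explicitly; harmless if it is already lazy-loaded.
data("surgical", package = "olsrr")

# Quick check that the variables listed above are present.
names(surgical)
```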
(a) [2 marks] Produce the output from str(surgical) and summary(surgical).
Note that the variables gender, alc_mod and alc_heavy are factors. Since their levels have already been coded as 0/1 indicator variables, we don’t need to declare them as factors in R. However, when we interpret their coefficients in a fitted model, we need to remember that they are indicator variables rather than numeric measurements.
(b) [4 marks] Create a box plot for survival time. Comment on what this plot and
the output from summary() indicate about the distribution of survival times.
(c) [8 marks] Create a pairs() plot. What do these plots tell you about the relationship between survival time and the other variables? Also comment on relationships between the explanatory variables.
(d) [8 marks] Fit the linear model for survival time y that uses all of the other variables in the dataset as explanatory variables. Include the output from summary()
and look at the usual set of diagnostic plots. Do these plots indicate any problems with this model?
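A minimal sketch of the fit in part (d) is shown below. It assumes that "the usual set of diagnostic plots" means the four default panels produced by plot() on an lm object; your course may use a different set. The object name fit_y is illustrative only.

```r
library(olsrr)
data("surgical", package = "olsrr")

# "y ~ ." uses every other column of the data frame as an explanatory variable.
fit_y <- lm(y ~ ., data = surgical)
summary(fit_y)

# Default lm diagnostics: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage, drawn in a 2 x 2 grid.
op <- par(mfrow = c(2, 2))
plot(fit_y)
par(op)  # restore the previous plotting layout
```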
(e) [5 marks] Now create a “gam” plot to investigate the relationship between each of
the numeric explanatory variables and survival time. Use this plot to evaluate
whether it is reasonable to model each of these relationships as being linear.
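One possible way to produce such a plot is with the mgcv package. This is an assumption, since the question does not name a package; the object name fit_gam is also illustrative. Each numeric variable enters through a smooth term s(), while the three indicator variables enter linearly because a smooth of a 0/1 variable is not meaningful.

```r
library(olsrr)
library(mgcv)
data("surgical", package = "olsrr")

# Smooth terms for the numeric variables, linear terms for the indicators.
fit_gam <- gam(
  y ~ s(bcs) + s(pindex) + s(enzyme_test) + s(liver_test) + s(age) +
    gender + alc_mod + alc_heavy,
  data = surgical
)

# One panel per smooth term, all on a single page.
plot(fit_gam, pages = 1)
```

A smooth that stays close to a straight line, relative to its confidence band, suggests that a linear term is adequate for that variable.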
(f) [12 marks] Now consider the possibility that using log(y) as the response will
improve the model. Repeat parts (d) and (e) for the model that uses log(y) as
the response.
(g) [3 marks] Based on the above results is it more appropriate to use y or log(y)
as the response? Explain your answer.
(h) [5 marks] For the model you selected in (g) use dredge() to search for the best
model according to the AICc criterion.
i. Which explanatory variables are included in the best model from this search?
ii. Are there any other models which are supported by the data almost as much
as the best model? Explain your answer.
iii. Based on the results of your search, divide the explanatory variables into three
groups: (i) those that should definitely be included as an explanatory variable, (ii) those that possibly could be included and (iii) those that should
not be included.
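The search in part (h) can be sketched as below, assuming dredge() is taken from the MuMIn package. The response log(y) is used purely as a placeholder; substitute whichever response you selected in (g). The object names and the cutoff of 2 AICc units are illustrative conventions, not part of the question.

```r
library(olsrr)
library(MuMIn)
data("surgical", package = "olsrr")

# dredge() requires na.action = "na.fail" so that every sub-model
# is fitted to exactly the same rows. Keep the old setting to restore it.
old_opts <- options(na.action = "na.fail")

# Placeholder response: replace log(y) with y if that is your choice in (g).
full_fit <- lm(log(y) ~ ., data = surgical)

# Fit every subset of the explanatory variables and rank by AICc.
aicc_table <- dredge(full_fit, rank = "AICc")

options(old_opts)

head(aicc_table)                 # best few models, with delta and weight
subset(aicc_table, delta < 2)    # models within 2 AICc units of the best
```

The delta column gives each model's AICc difference from the best model, which is the natural quantity to look at for sub-questions ii and iii.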
(i) [3 marks] Repeat the model search using BIC as the model selection criterion.
Does this change which model is identified as best?
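For part (i), a sketch under the same assumptions (dredge() from MuMIn, log(y) as a placeholder response, illustrative object names) only needs the rank argument changed:

```r
library(olsrr)
library(MuMIn)
data("surgical", package = "olsrr")

old_opts <- options(na.action = "na.fail")

# Placeholder response: use the response you selected in (g).
full_fit <- lm(log(y) ~ ., data = surgical)

# Same all-subsets search, ranked by BIC instead of AICc.
bic_table <- dredge(full_fit, rank = "BIC")

options(old_opts)

head(bic_table)
```

BIC penalises each additional parameter more heavily than AICc at this sample size, so it tends to favour models with the same number of terms or fewer.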
(j) [10 marks] For the best model from the search you did in (h), describe the
impact that each of the explanatory variables has on the expected survival time.
Note that you need to quantify the impact – simply stating that expected survival
time increases or decreases is not a sufficient answer.
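If the model selected in (g) uses log(y) as the response, each coefficient acts multiplicatively on the original scale: a one-unit increase in a variable with coefficient b multiplies survival time by exp(b), that is, a change of 100 * (exp(b) - 1) percent. The helper below is a hypothetical convenience function, and the coefficient value is made up for illustration; neither comes from the fitted model.

```r
# Hypothetical helper: percentage change in the response per one-unit
# increase in a variable, for a model fitted to log(response).
pct_change <- function(b) {
  100 * (exp(b) - 1)
}

# Made-up coefficient, for illustration only.
b_example <- 0.01
pct_change(b_example)

# Sanity checks: a zero coefficient means no change,
# and a coefficient of log(2) means a doubling (+100%).
pct_change(0)
pct_change(log(2))
```

For an indicator variable such as gender, "one-unit increase" means moving from the 0 group to the 1 group with the other variables held fixed. Applying the same transformation to the endpoints of confint() for the fitted model gives an interval for the percentage change.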