STATS 330: Advanced Statistical Modelling 1

1. The dataset for this question concerns survival time for patients undergoing a particular type of liver operation. The dataset contains the following variables:

bcs blood clotting score.

pindex prognostic index.

enzyme_test enzyme function test score.

liver_test liver function test score.

age age, in years.

gender indicator variable for gender (0 = male, 1 = female).

alc_mod indicator variable for history of alcohol use (0 = None, 1 = Moderate).

alc_heavy indicator variable for history of alcohol use (0 = None, 1 = Heavy).

y survival time in days.

This dataset is included in the R package “olsrr”. To access the dataset:

• install the olsrr package.

• use library(olsrr) to access this package.

• the data set will be in a data frame called surgical.
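The steps above can be sketched as follows. This is a minimal sketch, assuming an internet connection for the one-off install; the requireNamespace() guard and the explicit data() call are additions for convenience and are not part of the original instructions.

```r
# Install olsrr only if it is not already available (one-off step).
if (!requireNamespace("olsrr", quietly = TRUE)) {
  install.packages("olsrr")
}

# Load the package, then make the surgical data frame available explicitly.
library(olsrr)
data("surgical", package = "olsrr")

# Quick look at the first few rows.
head(surgical)
```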

(a) [2 marks] Produce the output from str(surgical) and summary(surgical).

Note that the variables gender, alc_mod and alc_heavy are factors. Since their levels have already been coded as 0/1 indicator variables, we don’t need to specify them as factors in R. However, when we come to interpret their coefficients for a fitted model, we need to remember that they are indicator variables rather than numeric variables.
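To illustrate the note above, the following sketch fits the same one-variable model twice, once with gender as stored and once wrapped in factor(). The single-predictor model is chosen purely for illustration and is not one of the models asked for in this question.

```r
library(olsrr)

# Same model, with gender treated as a 0/1 number and as a two-level factor.
fit_numeric <- lm(y ~ gender, data = surgical)
fit_factor  <- lm(y ~ factor(gender), data = surgical)

# The estimates coincide; only the coefficient label differs.
coef(fit_numeric)
coef(fit_factor)
all.equal(unname(coef(fit_numeric)), unname(coef(fit_factor)))
```

In both fits the gender coefficient is the estimated difference in mean survival time between the group coded 1 and the group coded 0.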

(b) [4 marks] Create a box plot for survival time. Comment on what this plot and the output from summary() indicate about the distribution of survival times.
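One possible way to produce the plot and summary for part (b) is sketched below. The horizontal layout and the axis label are illustrative choices, not requirements of the question.

```r
library(olsrr)

# Box plot of survival time; a long upper whisker or high outliers suggest right skew.
boxplot(surgical$y, horizontal = TRUE, xlab = "Survival time (days)")

# Five-number summary plus the mean, for comparing the mean with the median.
y_summary <- summary(surgical$y)
y_summary
```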

(c) [8 marks] Create a pairs() plot. What do these plots tell you about the relationship between survival time and the other variables? Also comment on relationships between the explanatory variables.
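A minimal sketch for part (c) follows. The correlation matrix is an optional extra, not asked for in the question, that can help when reading the scatterplots.

```r
library(olsrr)

# Scatterplot matrix of every pair of variables in the data frame.
pairs(surgical)

# Optional: pairwise correlations for the numeric columns, rounded for readability.
num_cols <- sapply(surgical, is.numeric)
cor_mat <- round(cor(surgical[, num_cols]), 2)
cor_mat
```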

(d) [8 marks] Fit the linear model for survival time y that uses all of the other variables in the dataset as explanatory variables. Include the output from summary() and look at the usual set of diagnostic plots. Do these plots indicate any problems with this model?
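The fit in part (d) might be set up as sketched below. It assumes that the "usual set of diagnostic plots" means the four default plots produced by plot() on an lm object; the course may use a different set.

```r
library(olsrr)

# y ~ . uses every other column of surgical as an explanatory variable.
fit_y <- lm(y ~ ., data = surgical)
summary(fit_y)

# Default lm diagnostics in a 2 x 2 grid: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage.
op <- par(mfrow = c(2, 2))
plot(fit_y)
par(op)
```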

(e) [5 marks] Now create a “gam” plot to investigate the relationship between each of the numeric explanatory variables and survival time. Use this plot to evaluate whether it is reasonable to model each of these relationships as being linear.
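A sketch of one way to produce the plot for part (e) is given below. It assumes the mgcv package is used, with default smooths s() on the five numeric variables and the three indicator variables entered as ordinary linear terms; the course may intend a different package or settings.

```r
library(olsrr)
library(mgcv)

# Smooth terms for the numeric variables, linear terms for the 0/1 indicators.
gam_y <- gam(y ~ s(bcs) + s(pindex) + s(enzyme_test) + s(liver_test) + s(age) +
               gender + alc_mod + alc_heavy,
             data = surgical)

# One panel per smooth; a roughly straight curve supports a linear term.
plot(gam_y, pages = 1)
```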

(f) [12 marks] Now consider the possibility that using log(y) as the response will improve the model. Repeat parts (d) and (e) for the model that uses log(y) as the response.
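The log-response versions of parts (d) and (e) could be sketched as follows, under the same assumptions as before: default lm diagnostic plots, and mgcv for the “gam” plot.

```r
library(olsrr)
library(mgcv)

# Linear model with log(y) as the response; the dot excludes y itself.
fit_log <- lm(log(y) ~ ., data = surgical)
summary(fit_log)

# Default diagnostic plots for the log-response model.
op <- par(mfrow = c(2, 2))
plot(fit_log)
par(op)

# Matching gam plot with log(y) as the response.
gam_log <- gam(log(y) ~ s(bcs) + s(pindex) + s(enzyme_test) + s(liver_test) +
                 s(age) + gender + alc_mod + alc_heavy,
               data = surgical)
plot(gam_log, pages = 1)
```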

(g) [3 marks] Based on the above results, is it more appropriate to use y or log(y) as the response?

(h) [5 marks] For the model you selected in (g) use dredge() to search for the best model according to the AICc criterion.

i. Which explanatory variables are included in the best model from this search?

ii. Are there any other models which are supported by the data almost as much as the best model? Explain your answer.

iii. Based on the results of your search, divide the explanatory variables into three groups: (i) those that should definitely be included as an explanatory variable, (ii) those that possibly could be included and (iii) those that should not be included.
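The search in part (h) might look like the sketch below. It assumes dredge() comes from the MuMIn package and, purely for illustration, uses the log(y) model as the full model; substitute whichever response was chosen in (g). The delta < 2 cut-off is a common rule of thumb, not something specified in the question.

```r
library(olsrr)
library(MuMIn)

# dredge() refuses to run unless missing values are set to cause an error.
old_opts <- options(na.action = "na.fail")

# Illustrative full model (assumption: log response; swap in y if chosen in (g)).
full <- lm(log(y) ~ ., data = surgical)

# All-subsets search; dredge() ranks by AICc by default.
model_set <- dredge(full)
head(model_set)

# Models within 2 AICc units of the best one.
subset(model_set, delta < 2)

# Sum of Akaike weights per variable, a rough guide for grouping variables.
sw(model_set)

options(old_opts)
```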

(i) [3 marks] Repeat the model search using BIC as the model selection criterion. Does this change which model is identified as best?
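For part (i), the same search can be re-ranked by BIC through the rank argument of dredge(). As before, this sketch assumes MuMIn and uses the log(y) model only as an illustrative full model.

```r
library(olsrr)
library(MuMIn)

old_opts <- options(na.action = "na.fail")

# Illustrative full model (assumption: log response).
full <- lm(log(y) ~ ., data = surgical)

# Same all-subsets search, ranked by BIC instead of AICc.
model_set_bic <- dredge(full, rank = "BIC")
head(model_set_bic)

options(old_opts)
```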

(j) [10 marks] For the best model from the search you did in (h), describe the impact that each of the explanatory variables has on the expected survival time. Note that you need to quantify the impact: simply stating that expected survival time increases or decreases is not a sufficient answer.
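One way to quantify the effects in part (j) is sketched below, assuming the log(y) model was selected. With a log response, a coefficient b corresponds to a multiplicative change of exp(b) in survival time (strictly, in its median) for a one-unit increase in the variable, or for an indicator switching from 0 to 1. If the untransformed y model was chosen instead, the coefficients are read directly as changes in days and no back-transformation is needed.

```r
library(olsrr)
library(MuMIn)

old_opts <- options(na.action = "na.fail")

# Illustrative full model (assumption: log response) and AICc search.
full <- lm(log(y) ~ ., data = surgical)
model_set <- dredge(full)

# Refit the top-ranked model from the search.
best <- get.models(model_set, subset = 1)[[1]]
summary(best)

# Percentage change in survival time per one-unit increase (intercept dropped).
pct_change <- 100 * (exp(coef(best)[-1]) - 1)
round(pct_change, 1)

# The same transformation applied to 95% confidence limits.
round(100 * (exp(confint(best)[-1, , drop = FALSE]) - 1), 1)

options(old_opts)
```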
