MAS8404 

Practical 5: Subset Selection and Cross Validation

This practical is designed to give you some experience working with subset selection methods in R and in writing your own R code to perform cross validation.

Work through the practical sheet and if you need any help, just ask.

  1. Diabetes data
    1.1. Background

These data were collected in a study concerning patients with diabetes. The response variable of interest was disease progression one year after taking baseline measurements on various clinical variables. For each of n = 442 patients, the data comprise a quantitative measure of disease progression (dis) and measurements on p = 10 baseline (explanatory) variables: age (age), sex (sex), body mass index (bmi), average blood pressure (map) and six blood serum measurements (tc, ldl, hdl, tch, ltg, glu). We will use multiple linear regression to develop a model for predicting disease progression on the basis of one or more of the baseline variables.

The data can be found in the diabetes data set which is part of the nclSLR package. They can then be loaded into R and inspected as follows:

 

## Load the nclSLR package
library(nclSLR)
## Load the data
data(diabetes)
## Check size
dim(diabetes)
## [1] 442  11
## Store n and p:
(n = nrow(diabetes))
## [1] 442
(p = ncol(diabetes) - 1)
## [1] 10
## Print first few rows
head(diabetes)
##            age         sex         bmi          map           tc
## 1  0.038075906  0.05068012  0.06169621  0.021872355 -0.044223498
## 2 -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## 3  0.085298906  0.05068012  0.04445121 -0.005670611 -0.045599451
## 4 -0.089062939 -0.04464164 -0.01159501 -0.036656447  0.012190569
## 5  0.005383060 -0.04464164 -0.03638469  0.021872355  0.003934852
## 6 -0.092695478 -0.04464164 -0.04069594 -0.019442093 -0.068990650
##           ldl          hdl          tch          ltg          glu dis
## 1 -0.03482076 -0.043400846 -0.002592262  0.019908421 -0.017646125 151
## 2 -0.01916334  0.074411564 -0.039493383 -0.068329744 -0.092204050  75
## 3 -0.03419447 -0.032355932 -0.002592262  0.002863771 -0.025930339 141
## 4  0.02499059 -0.036037570  0.034308859  0.022692023 -0.009361911 206
## 5  0.01559614  0.008142084 -0.002592262 -0.031991445 -0.046640874 135
## 6 -0.07928784  0.041276824 -0.076394504 -0.041180385 -0.096346157  97

We see that the response variable (dis) is stored in column 11 and that the explanatory variables appear in columns 1–10. It appears that the explanatory variables are not in their “raw” form, i.e. they seem to have been transformed. To investigate we can check the means and standard deviations of the explanatory variables:

 

colMeans(diabetes[, 1:10])
##           age           sex           bmi           map            tc
## -3.653192e-16  1.288091e-16 -8.008723e-16  1.289911e-16 -9.112498e-17
##           ldl           hdl           tch           ltg           glu
##  1.301526e-16 -4.557682e-16  3.865922e-16 -3.861479e-16 -3.402226e-16
apply(diabetes[, 1:10], 2, sd)
##        age        sex        bmi        map         tc        ldl
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
##        hdl        tch        ltg        glu
## 0.04761905 0.04761905 0.04761905 0.04761905

 

Although the explanatory variables have been transformed to have mean zero, the standard deviations are not equal to one. It turns out that, instead, each variable has been transformed so that the sum of the squared values is equal to one.

(Optional) exercise: can you write some R code to verify that this is the case?
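One possible check, as a minimal sketch: if each column has been scaled in this way, the column sums of the squared values should all equal 1.

## Sum of squared values of each standardised explanatory variable
colSums(diabetes[, 1:10]^2)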

We will work with these standardised explanatory variables.

    1.2. Exploratory data analysis

Before proceeding with our regression analysis, we will perform some exploratory analysis to get a feel for the relationships between the response and explanatory variables, and the relationships amongst the explanatory variables. We can produce a scatterplot matrix as follows:

 

pairs(diabetes)

 

which produces the plot in Figure 1. A few of the variables seem to show a weak linear relationship with the response, most notably bmi and ltg. For other variables, such as age and sex, there is little evidence of a linear relationship with the response. However, it is also clear that a few of the explanatory variables are themselves highly correlated, for example, the blood serum measurements tc and ldl. This immediately suggests that we may be able to omit some explanatory variables when we fit our multiple linear regression model. By executing the R expression

 

cor(diabetes)

 

compute the sample correlation matrix and confirm that our observations are borne out through the sample correlations.
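The full 11 x 11 matrix can be hard to scan, so it may help to round it first, for example:

## Round the correlation matrix to two decimal places for easier reading
round(cor(diabetes), 2)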

Figure 1: Scatterplot matrix for the diabetes data.

  2. Least squares
    2.1. Model fitting

Let yi denote disease progression for patient i and denote by xij the (standardised) measurement on the j-th baseline variable for patient i where j = 1,...,p and i = 1,...,n. (Here p = 10 and n = 442). Consider the multiple linear regression model:

                    yi = β0 + β1 xi1 + ... + βp xip + εi,    for i = 1, 2, ..., n.
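Fitting this model by least squares means choosing the coefficient estimates to minimise the residual sum of squares:

$$ \mathrm{RSS}(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}. $$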

We will begin by fitting the model via least squares using the lm function:

 

## Fit model using least squares
lsq_fit = lm(dis ~ ., data = diabetes)

 

We can then obtain a summary of the fitted model by passing the returned object to the summary function:

 

## Summarise fitted model
(lsq_summary = summary(lsq_fit))
##
## Call:
## lm(formula = dis ~ ., data = diabetes)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -155.829  -38.534   -0.227   37.806  151.355
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  152.133      2.576  59.061  < 2e-16 ***
## age          -10.012     59.749  -0.168 0.867000
## sex         -239.819     61.222  -3.917 0.000104 ***
## bmi          519.840     66.534   7.813 4.30e-14 ***
## map          324.390     65.422   4.958 1.02e-06 ***
## tc          -792.184    416.684  -1.901 0.057947 .
## ldl          476.746    339.035   1.406 0.160389
## hdl          101.045    212.533   0.475 0.634721
## tch          177.064    161.476   1.097 0.273456
## ltg          751.279    171.902   4.370 1.56e-05 ***
## glu           67.625     65.984   1.025 0.305998
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.15 on 431 degrees of freedom
## Multiple R-squared: 0.5177, Adjusted R-squared: 0.5066
## F-statistic: 46.27 on 10 and 431 DF, p-value: < 2.2e-16

 

The column of p-values (i.e. the Pr(>|t|) column) for the t-tests

                    H0 : βi = 0      versus      H1 : βi ≠ 0

confirms our observations from the exploratory analysis; if we consider the explanatory variables one-at-a-time, there are several which would not be needed in a model containing all others. We also see that the coefficient of determination, R2, is equal to 51.77%.

(Optional) exercise: how would this be interpreted?

Note that we can obtain the values ŷ1, ..., ŷn that our fitted model would predict for the explanatory variables on the 1st, ..., nth rows of our matrix of explanatory variables by using the predict function. We pass as arguments the object returned by the lm function, and the data frame containing the explanatory variables at which we want predictions. In this case we have:

 

yhat = predict(lsq_fit, diabetes)
## Print first few elements:
head(yhat)
##         1         2         3         4         5         6
## 206.11707  68.07235 176.88406 166.91797 128.45984 106.34909
## Compare with actual values:
head(diabetes$dis)
## [1] 151  75 141 206 135  97

 

    2.2. Cross validation

To get an assessment of the predictive performance of the least squares model, we can estimate the test error using cross validation. In this practical we will consider 10-fold cross validation. First we will set the seed of R’s random number generator so that repeating the analysis later will produce exactly the same results:

 

## Set the seed to make the analysis reproducible
set.seed(1)

 

Next we randomly divide the data into 10 folds of approximately equal size. We do this by sampling a fold from {1,2,... ,10} where each fold has the same probability of being selected. We can do this for each observation i = 1,...,n as follows:

 

## 10-fold cross validation
nfolds = 10
## Sample fold-assignment index
fold_index = sample(nfolds, n, replace = TRUE)
## Print first few fold-assignments
head(fold_index)
## [1]  3  4  6 10  3  9

 

So observation 1 is assigned to fold 3, observation 2 is assigned to fold 4, and so on.
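It can be reassuring to check how the observations are spread across the folds; they should be of approximately, though not exactly, equal size:

## Tabulate the number of observations assigned to each fold
table(fold_index)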

To estimate the average mean squared error (MSE) by general K-fold cross validation we will write a function:

 

reg_cv = function(X1, y, fold_ind) {
    Xy = data.frame(X1, y = y)
    nfolds = max(fold_ind)
    if(!isTRUE(all.equal(sort(unique(fold_ind)), 1:nfolds))) stop("Invalid fold partition.")
    cv_errors = numeric(nfolds)
    for(fold in 1:nfolds) {
        tmp_fit = lm(y ~ ., data = Xy[fold_ind != fold,])
        yhat = predict(tmp_fit, Xy[fold_ind == fold,])
        yobs = y[fold_ind == fold]
        cv_errors[fold] = mean((yobs - yhat)^2)
    }
    fold_sizes = numeric(nfolds)
    for(fold in 1:nfolds) fold_sizes[fold] = length(which(fold_ind == fold))
    test_error = weighted.mean(cv_errors, w = fold_sizes)
    return(test_error)
}

 

Let’s consider each part of the function in turn.

On lines 2 and 3 in the main body of the function we verify that the fold partition, fold_ind, contains a consecutive set of integers, starting at 1. On line 4, we create a vector to hold the average MSE computed over each fold. Then we loop over the folds.

For fold k = 1,2,... ,K, we first fit the model to the “training” data (i.e. all the data except that in the kth fold) using least squares. Next, we use the predict function to compute the values ŷi for the response variable that our fitted model would predict for the “test” set (i.e. the data in the kth fold). The mean squared difference between the predicted values ŷi for the test set and the actual observations yi that occurred gives an estimate of the average MSE from the kth fold.

After the for loop, on lines 11–12, we compute the number of observations assigned to folds 1,2,... ,K. Note that the which function operates on a vector of booleans (i.e. a vector of TRUEs and FALSEs), returning the indices of the TRUEs. So, for the fold partition we simulated earlier, for example,

which(fold_index == 3)
##  [1]   1   5  11  22  25  54  62  66  86  89  91 103 106 128 129 143 144
## [18] 159 160 161 181 195 201 202 204 209 221 227 246 257 263 275 277 290
## [35] 291 315 329 332 346 356 364 374 375 385 386 442

will return a vector telling us which observations are assigned to fold number 3. Hence the expression on line 12 gives the number of observations assigned to each of the K folds. Finally, on line 13, we compute the average MSE over the entire data set by averaging the K estimates from the K folds. To allow for the fact that the K estimates may have been computed using test sets (i.e. folds) of different sizes, the MSE estimates are weighted by the sizes of the folds.
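To see exactly what the weighted mean on line 13 computes, note that weighted.mean(cv_errors, w = fold_sizes) equals sum(fold_sizes * cv_errors) / sum(fold_sizes). A toy check, with hypothetical per-fold MSEs and fold sizes standing in for the quantities computed inside the function:

## Hypothetical values for illustration only
cv_errors_demo = c(3000, 3100, 2900)
fold_sizes_demo = c(150, 140, 152)
weighted.mean(cv_errors_demo, w = fold_sizes_demo)
sum(fold_sizes_demo * cv_errors_demo) / sum(fold_sizes_demo) ## same value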

We can now apply our function to the diabetes data as follows:

 

(lsq_final_mse = reg_cv(diabetes[, 1:10], diabetes[, 11], fold_index))
## [1] 3022.336

 

Make sure you understand what each part of the reg_cv function does. If you have any questions, just ask.

  3. Best subset selection

Both our exploratory data analysis and the results of the least squares model fit suggest that we do not need to include all the explanatory variables in our fitted model. If we can eliminate some of the explanatory variables we might be able to estimate the effects of those that remain more precisely, thereby improving the predictive performance of the model. To this end, we will consider best subset selection.

    3.1. Model fitting

We can perform best subset selection using the regsubsets function from the leaps R package. We begin by loading the package

 

library(leaps)

 

Next we use the regsubsets function to apply the algorithm, specifying method="exhaustive" to indicate that we want to perform best subset selection and nvmax=p to indicate that we want to consider models with 1,2,... ,p explanatory variables. The function identifies the best model that contains each number of predictors. Denote these models by M1,...,Mp. Again, we can summarise the results of the model fitting exercise using the summary function which prints the predictor variables in M1,...,Mp.

 

## Apply the best subset selection algorithm
bss_fit = regsubsets(dis ~ ., data = diabetes, method = "exhaustive", nvmax = p)
## Summarise the results
(bss_summary = summary(bss_fit))
## Subset selection object
## Call: regsubsets.formula(dis ~ ., data = diabetes, method = "exhaustive",
##     nvmax = p)
## 10 Variables (and intercept)
##     Forced in Forced out
## age     FALSE      FALSE
## sex     FALSE      FALSE
## bmi     FALSE      FALSE
## map     FALSE      FALSE
## tc      FALSE      FALSE
## ldl     FALSE      FALSE
## hdl     FALSE      FALSE
## tch     FALSE      FALSE
## ltg     FALSE      FALSE
## glu     FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
##           age sex bmi map tc  ldl hdl tch ltg glu
## 1  ( 1 )  " " " " "*" " " " " " " " " " " " " " "
## 2  ( 1 )  " " " " "*" " " " " " " " " " " "*" " "
## 3  ( 1 )  " " " " "*" "*" " " " " " " " " "*" " "
## 4  ( 1 )  " " " " "*" "*" "*" " " " " " " "*" " "
## 5  ( 1 )  " " "*" "*" "*" " " " " "*" " " "*" " "
## 6  ( 1 )  " " "*" "*" "*" "*" "*" " " " " "*" " "
## 7  ( 1 )  " " "*" "*" "*" "*" "*" " " "*" "*" " "
## 8  ( 1 )  " " "*" "*" "*" "*" "*" " " "*" "*" "*"
## 9  ( 1 )  " " "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 10 ( 1 )  "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"

 

The bss_summary object contains a lot of useful information. Examine its structure by executing the R expression

 

str(bss_summary)

 

For example, it contains the coefficient of determination for the models M1,...,Mp in the rsq component. We can extract these values using the usual $ notation:

 

bss_summary$rsq
## [1] 0.3439238 0.4594852 0.4800828 0.4920162 0.5086325 0.5148848 0.5162912
## [8] 0.5174714 0.5177180 0.5177494

 

As explained in lectures, we see that as more predictors are added to the model, R2 cannot decrease.
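A one-line check of this property on the fitted object:

## Successive differences in R-squared should all be non-negative
all(diff(bss_summary$rsq) >= 0)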

To compare the fits of the models M1,...,Mp, we can use adjusted R2, Mallows' Cp statistic or the BIC, which can be extracted as follows:

 

bss_summary$adjr2
## [1] 0.3424327 0.4570228 0.4765217 0.4873665 0.5029975 0.5081936 0.5084895
## [8] 0.5085563 0.5076705 0.5065603
bss_summary$cp
## [1] 148.352561  47.072229  30.663634  21.998461   9.148045   5.560162
## [7]   6.303221   7.248522   9.028080  11.000000
bss_summary$bic
## [1] -174.1108 -253.6592 -264.7407 -268.9126 -277.5210 -277.0899 -272.2819
## [8] -267.2702 -261.4049 -255.3424

 

The best models according to each criterion are therefore

 

(best_adjr2 = which.max(bss_summary$adjr2))
## [1] 8
(best_cp = which.min(bss_summary$cp))
## [1] 6
(best_bic = which.min(bss_summary$bic))
## [1] 5

 

For instance, adjusted-R2 suggests we use M8, i.e. the model with 8 explanatory variables.

To reconcile the differences implied by the different model choice criteria, it is helpful to plot how each criterion varies with the number of predictors:

## Create multi-panel plotting device
par(mfrow=c(2,2))
## Produce plots, highlighting optimal value of k
plot(1:p, bss_summary$adjr2, xlab="Number of predictors", ylab="Adjusted Rsq", type="b")
points(best_adjr2, bss_summary$adjr2[best_adjr2], col="red", pch=16)
plot(1:p, bss_summary$cp, xlab="Number of predictors", ylab="Cp", type="b")
points(best_cp, bss_summary$cp[best_cp], col="red", pch=16)
plot(1:p, bss_summary$bic, xlab="Number of predictors", ylab="BIC", type="b")
points(best_bic, bss_summary$bic[best_bic], col="red", pch=16)

This gives the plots in Figure 2. In all cases, the optimal point on the graph is at or close to 6, which suggests choosing M6. Referring to the output from the summary function, this model uses the variables sex, bmi, map, tc, ldl and ltg. The associated regression coefficients are:

 

coef(bss_fit, 6)
## (Intercept)         sex         bmi         map          tc         ldl
##    152.1335   -226.5106    529.8730    327.2198   -757.9379    538.5859
##         ltg
##    804.1923

Figure 2: Best subset selection for the diabetes data.

 

    3.2. Cross validation

In the previous section, we compared the models M1,...,Mp using criteria such as the BIC. An alternative approach is to use cross validation. Again, in this practical, we will focus on 10-fold cross validation.

To do this, we need to repeat our cross validation procedure for models fitted by least squares for each of M1,...,Mp. We then compare the average MSE for each of these models, and the one minimising the test error is judged “best”. To this end we need to wrap our function reg_cv in another which loops over the p models:

 

reg_bss_cv = function(X1, y, best_models, fold_index) {
    p = ncol(X1)
    test_errors = numeric(p)
    for(k in 1:p) {
        test_errors[k] = reg_cv(X1[, best_models[k,]], y, fold_index)
    }
    return(test_errors)
}

 

Here the argument best_models should be a (p × p) matrix of booleans whose (i, j)th entry is TRUE if Mi uses the jth predictor and FALSE otherwise. As we saw in the previous section, when we apply the summary function to the object returned by the regsubsets function, we get another object containing lots of useful components. One of these is the which component:

bss_summary$which
##    (Intercept)   age   sex  bmi   map    tc   ldl   hdl   tch   ltg   glu
## 1         TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2         TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## 3         TRUE FALSE FALSE TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
## 4         TRUE FALSE FALSE TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE
## 5         TRUE FALSE  TRUE TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
## 6         TRUE FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE
## 7         TRUE FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## 8         TRUE FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## 9         TRUE FALSE  TRUE TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## 10        TRUE  TRUE  TRUE TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

If we omit the first column for the intercept, this is a matrix of precisely the form we want. We can therefore compute the cross-validation error for each model M1,...,Mp as follows:

 

## Apply the function to the diabetes data
bss_mse = reg_bss_cv(diabetes[, 1:10], diabetes[, 11], bss_summary$which[, -1], fold_index)
## Identify model with the lowest error
(best_cv = which.min(bss_mse))
## [1] 6

 

i.e. the model with 6 explanatory variables minimises the test error.

We can also visualise how the test error varies with the number of explanatory variables and add it to our plots for the other model choice criteria:

## Create multi-panel plotting device
par(mfrow=c(2,2))
## Produce plots, highlighting optimal value of k
plot(1:p, bss_summary$adjr2, xlab="Number of predictors", ylab="Adjusted Rsq", type="b")
points(best_adjr2, bss_summary$adjr2[best_adjr2], col="red", pch=16)
plot(1:p, bss_summary$cp, xlab="Number of predictors", ylab="Cp", type="b")
points(best_cp, bss_summary$cp[best_cp], col="red", pch=16)
plot(1:p, bss_summary$bic, xlab="Number of predictors", ylab="BIC", type="b")
points(best_bic, bss_summary$bic[best_bic], col="red", pch=16)
plot(1:p, bss_mse, xlab="Number of predictors", ylab="10-fold CV Error", type="b")
points(best_cv, bss_mse[best_cv], col="red", pch=16)

 

Figure 3: Best subset selection for the diabetes data including comparison of cross validation errors.

This gives the plots in Figure 3. It still seems that a model with 6 explanatory variables would be a good choice.

Notice that if we were to perform cross validation again with a different set of 10 folds, our estimates of the test error would change. So if we wanted to compare the test error from models fitted with best subset selection to that obtained by fitting models using other methods, we would need to use the same set of folds.
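For instance, because lsq_final_mse and bss_mse above were computed using the same fold_index, their values can be compared directly:

## Compare the CV error of the full least squares model with the best
## six-predictor model from best subset selection (same folds throughout)
c(full_lsq = lsq_final_mse, best_subset_6 = bss_mse[best_cv])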

(Optional) exercise: repeat the model-fitting and cross validation above using forward or backward stepwise selection instead of best subset selection. How do the best-fitting models of each size compare to those obtained using best subset selection? How do the test errors compare?
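As a starting point, here is a minimal sketch (the object names fss_fit, fss_summary and fss_mse are illustrative): regsubsets also accepts method="forward" and method="backward", so the pipeline above can be reused with only the method argument changed.

## Forward stepwise selection; backward works analogously with method="backward"
fss_fit = regsubsets(dis ~ ., data = diabetes, method = "forward", nvmax = p)
fss_summary = summary(fss_fit)
## Cross-validate the best model of each size, reusing the same folds
fss_mse = reg_bss_cv(diabetes[, 1:10], diabetes[, 11], fss_summary$which[, -1], fold_index)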
