MAS8404 

Practical 5: Subset Selection and Cross Validation

This practical is designed to give you some experience working with subset selection methods in R and in writing your own R code to perform cross validation.

Work through the practical sheet and if you need any help, just ask.

  1. Diabetes data
    1.1. Background

These data were collected in a study concerning patients with diabetes. The response variable of interest was disease progression one year after taking baseline measurements on various clinical variables. For each of n = 442 patients, the data comprise a quantitative measure of disease progression (dis) and measurements on p = 10 baseline (explanatory) variables: age (age), sex (sex), body mass index (bmi), average blood pressure (map) and six blood serum measurements (tc, ldl, hdl, tch, ltg, glu). We will use multiple linear regression to develop a model for predicting disease progression on the basis of one or more of the baseline variables.

The data can be found in the diabetes data set which is part of the nclSLR package. They can then be loaded into R and inspected as follows:

 

## Load the nclSLR package
library(nclSLR)
## Load the data
data(diabetes)
## Check size
dim(diabetes)
## [1] 442  11
## Store n and p:
(n = nrow(diabetes))
## [1] 442
(p = ncol(diabetes) - 1)
## [1] 10
## Print first few rows
head(diabetes)
##            age         sex         bmi          map           tc
## 1  0.038075906  0.05068012  0.06169621  0.021872355 -0.044223498
## 2 -0.001882017 -0.04464164 -0.05147406 -0.026327835 -0.008448724
## 3  0.085298906  0.05068012  0.04445121 -0.005670611 -0.045599451
## 4 -0.089062939 -0.04464164 -0.01159501 -0.036656447  0.012190569
## 5  0.005383060 -0.04464164 -0.03638469  0.021872355  0.003934852
## 6 -0.092695478 -0.04464164 -0.04069594 -0.019442093 -0.068990650
##           ldl          hdl          tch          ltg          glu dis
## 1 -0.03482076 -0.043400846 -0.002592262  0.019908421 -0.017646125 151
## 2 -0.01916334  0.074411564 -0.039493383 -0.068329744 -0.092204050  75
## 3 -0.03419447 -0.032355932 -0.002592262  0.002863771 -0.025930339 141
## 4  0.02499059 -0.036037570  0.034308859  0.022692023 -0.009361911 206
## 5  0.01559614  0.008142084 -0.002592262 -0.031991445 -0.046640874 135
## 6 -0.07928784  0.041276824 -0.076394504 -0.041180385 -0.096346157  97

We see that the response variable (dis) is stored in column 11 and that the explanatory variables appear in columns 1–10. It appears that the explanatory variables are not in their “raw” form, i.e. they seem to have been transformed. To investigate we can check the means and standard deviations of the explanatory variables:

 

colMeans(diabetes[, 1:10])
##           age           sex           bmi           map            tc
## -3.653192e-16  1.288091e-16 -8.008723e-16  1.289911e-16 -9.112498e-17
##           ldl           hdl           tch           ltg           glu
##  1.301526e-16 -4.557682e-16  3.865922e-16 -3.861479e-16 -3.402226e-16
apply(diabetes[, 1:10], 2, sd)
##        age        sex        bmi        map         tc        ldl
## 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905 0.04761905
##        hdl        tch        ltg        glu
## 0.04761905 0.04761905 0.04761905 0.04761905

 

Although the explanatory variables have been transformed to have mean zero, the standard deviations are not equal to one. It turns out that, instead, each variable has been transformed so that the sum of the squared values is equal to one.

(Optional) exercise: can you write some R code to verify that this is the case?
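One possible check, as a minimal sketch: if each column has been scaled in this way, the column sums of the squared values should all equal 1.

## Sum of squared values of each standardised explanatory variable
colSums(diabetes[, 1:10]^2)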

We will work with these standardised explanatory variables.

    1.2. Exploratory data analysis

Before proceeding with our regression analysis, we will perform some exploratory analysis to get a feel for the relationships between the response and explanatory variables, and the relationships amongst the explanatory variables. We can produce a scatterplot matrix as follows:

 

pairs(diabetes)

 

which produces the plot in Figure 1. A few of the variables seem to show a weak linear relationship with the response, most notably bmi and ltg. For other variables, such as age and sex, there is little evidence of a linear relationship with the response. However, it is also clear that a few of the explanatory variables are themselves highly correlated, for example, the blood serum measurements tc and ldl. This immediately suggests that we may be able to omit some explanatory variables when we fit our multiple linear regression model. By executing the R expression

 

cor(diabetes)

 

compute the sample correlation matrix and confirm that our observations are borne out through the sample correlations.
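The full 11 x 11 matrix can be hard to scan, so it may help to round it first, for example:

## Round the correlation matrix to two decimal places for easier reading
round(cor(diabetes), 2)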

Figure 1: Scatterplot matrix for the diabetes data.

  2. Least squares
    2.1. Model fitting

Let yi denote disease progression for patient i and denote by xij the (standardised) measurement on the j-th baseline variable for patient i where j = 1,...,p and i = 1,...,n. (Here p = 10 and n = 442). Consider the multiple linear regression model:

                    yi = β0 + β1 xi1 + ... + βp xip + εi,    for i = 1, 2, ..., n.
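Fitting this model by least squares means choosing the coefficient estimates to minimise the residual sum of squares:

$$ \mathrm{RSS}(\beta_0, \beta_1, \ldots, \beta_p) = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}. $$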

We will begin by fitting the model via least squares using the lm function:

 

## Fit model using least squares
lsq_fit = lm(dis ~ ., data = diabetes)

 

We can then obtain a summary of the fitted model by passing the returned object to the summary function:

 

## Summarise fitted model
(lsq_summary = summary(lsq_fit))
##
## Call:
## lm(formula = dis ~ ., data = diabetes)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -155.829  -38.534   -0.227   37.806  151.355
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  152.133      2.576  59.061  < 2e-16 ***
## age          -10.012     59.749  -0.168 0.867000
## sex         -239.819     61.222  -3.917 0.000104 ***
## bmi          519.840     66.534   7.813 4.30e-14 ***
## map          324.390     65.422   4.958 1.02e-06 ***
## tc          -792.184    416.684  -1.901 0.057947 .
## ldl          476.746    339.035   1.406 0.160389
## hdl          101.045    212.533   0.475 0.634721
## tch          177.064    161.476   1.097 0.273456
## ltg          751.279    171.902   4.370 1.56e-05 ***
## glu           67.625     65.984   1.025 0.305998
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.15 on 431 degrees of freedom
## Multiple R-squared: 0.5177, Adjusted R-squared: 0.5066
## F-statistic: 46.27 on 10 and 431 DF, p-value: < 2.2e-16

 

The column of p-values (i.e. the Pr(>|t|) column) for the t-tests

                    H0 : βi = 0      versus      H1 : βi ≠ 0

confirms our observations from the exploratory analysis; if we consider the explanatory variables one-at-a-time, there are several which would not be needed in a model containing all others. We also see that the coefficient of determination, R2, is equal to 51.77%.

(Optional) exercise: how would this be interpreted?

Note that we can obtain the values ŷ1, ..., ŷn that our fitted model would predict for the explanatory variables on the 1st, ..., nth rows of our matrix of explanatory variables by using the predict function. We pass as arguments the object returned by the lm function, and the data frame containing the explanatory variables at which we want predictions. In this case we have:

 

yhat = predict(lsq_fit, diabetes)
## Print first few elements:
head(yhat)
##         1         2         3         4         5         6
## 206.11707  68.07235 176.88406 166.91797 128.45984 106.34909
## Compare with actual values:
head(diabetes$dis)
## [1] 151  75 141 206 135  97

 

    2.2. Cross validation

To get an assessment of the predictive performance of the least squares model, we can estimate the test error using cross validation. In this practical we will consider 10-fold cross validation. First we will set the seed of R’s random number generator so that repeating the analysis later will produce exactly the same results:

 

## Set the seed to make the analysis reproducible
set.seed(1)

 

Next we randomly divide the data into 10 folds of approximately equal size. We do this by sampling a fold from {1,2,... ,10} where each fold has the same probability of being selected. We can do this for each observation i = 1,...,n as follows:

 

## 10-fold cross validation
nfolds = 10
## Sample fold-assignment index
fold_index = sample(nfolds, n, replace = TRUE)
## Print first few fold-assignments
head(fold_index)
## [1]  3  4  6 10  3  9

 

So observation 1 is assigned to fold 3, observation 2 is assigned to fold 4, and so on.
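It can be reassuring to check how the observations are spread across the folds; they should be of approximately, though not exactly, equal size:

## Tabulate the number of observations assigned to each fold
table(fold_index)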

To estimate the average mean squared error (MSE) by general K-fold cross validation we will write a function:

 

reg_cv = function(X1, y, fold_ind) {
    Xy = data.frame(X1, y = y)
    nfolds = max(fold_ind)
    if(!isTRUE(all.equal(sort(unique(fold_ind)), 1:nfolds))) stop("Invalid fold partition.")
    cv_errors = numeric(nfolds)
    for(fold in 1:nfolds) {
        tmp_fit = lm(y ~ ., data = Xy[fold_ind != fold,])
        yhat = predict(tmp_fit, Xy[fold_ind == fold,])
        yobs = y[fold_ind == fold]
        cv_errors[fold] = mean((yobs - yhat)^2)
    }
    fold_sizes = numeric(nfolds)
    for(fold in 1:nfolds) fold_sizes[fold] = length(which(fold_ind == fold))
    test_error = weighted.mean(cv_errors, w = fold_sizes)
    return(test_error)
}

 

Let’s consider each part of the function in turn.

On lines 2 and 3 in the main body of the function we verify that the fold partition, fold_ind, contains a consecutive set of integers, starting at 1. On line 4, we create a vector to hold the average MSE computed over each fold. Then we loop over the folds.

For fold k = 1,2,... ,K, we first fit the model to the “training” data (i.e. all the data except that in the kth fold) using least squares. Next, we use the predict function to compute the values ŷi for the response variable that our fitted model would predict for the “test” set (i.e. the data in the kth fold). The mean squared difference between the predicted values ŷi for the test set and the actual observations yi that occurred gives an estimate of the average MSE from the kth fold.

After the for loop, on lines 11–12, we compute the number of observations assigned to folds 1,2,... ,K. Note that the which function operates on a vector of booleans (i.e. a vector of TRUEs and FALSEs), returning the indices of the TRUEs. So, for the fold partition we simulated earlier, for example,

which(fold_index == 3)
##  [1]   1   5  11  22  25  54  62  66  86  89  91 103 106 128 129 143 144
## [18] 159 160 161 181 195 201 202 204 209 221 227 246 257 263 275 277 290
## [35] 291 315 329 332 346 356 364 374 375 385 386 442

will return a vector telling us which observations are assigned to fold number 3. Hence the expression on line 12 gives the number of observations assigned to each of the K folds. Finally, on line 13, we compute the average MSE over the entire data set by averaging the K estimates from the K folds. To allow for the fact that the K estimates may have been computed using test sets (i.e. folds) of different sizes, the MSE estimates are weighted by the sizes of the folds.
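To see exactly what the weighted mean on line 13 computes, note that weighted.mean(cv_errors, w = fold_sizes) equals sum(fold_sizes * cv_errors) / sum(fold_sizes). A toy check, with hypothetical per-fold MSEs and fold sizes standing in for the quantities computed inside the function:

## Hypothetical values for illustration only
cv_errors_demo = c(3000, 3100, 2900)
fold_sizes_demo = c(150, 140, 152)
weighted.mean(cv_errors_demo, w = fold_sizes_demo)
sum(fold_sizes_demo * cv_errors_demo) / sum(fold_sizes_demo) ## same value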

We can now apply our function to the diabetes data as follows:

 

(lsq_final_mse = reg_cv(diabetes[, 1:10], diabetes[, 11], fold_index))
## [1] 3022.336

 

Make sure you understand what each part of the reg_cv function does. If you have any questions, just ask.

  3. Best subset selection

Both our exploratory data analysis and the results of the least squares model fit suggest that we do not need to include all the explanatory variables in our fitted model. If we can eliminate some of the explanatory variables we might be able to estimate the effects of those that remain more precisely, thereby improving the predictive performance of the model. To this end, we will consider best subset selection.

    3.1. Model fitting

We can perform best subset selection using the regsubsets function from the leaps R package. We begin by loading the package

 

library(leaps)

 

Next we use the regsubsets function to apply the algorithm, specifying method="exhaustive" to indicate that we want to perform best subset selection and nvmax=p to indicate that we want to consider models with 1,2,... ,p explanatory variables. The function identifies the best model that contains each number of predictors. Denote these models by M1,...,Mp. Again, we can summarise the results of the model fitting exercise using the summary function which prints the predictor variables in M1,...,Mp.

 

## Apply the best subset selection algorithm
bss_fit = regsubsets(dis ~ ., data = diabetes, method = "exhaustive", nvmax = p)
## Summarise the results
(bss_summary = summary(bss_fit))
## Subset selection object
## Call: regsubsets.formula(dis ~ ., data = diabetes, method = "exhaustive",
##     nvmax = p)
## 10 Variables (and intercept)
##     Forced in Forced out
## age     FALSE      FALSE
## sex     FALSE      FALSE
## bmi     FALSE      FALSE
## map     FALSE      FALSE
## tc      FALSE      FALSE
## ldl     FALSE      FALSE
## hdl     FALSE      FALSE
## tch     FALSE      FALSE
## ltg     FALSE      FALSE
## glu     FALSE      FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
##           age sex bmi map tc  ldl hdl tch ltg glu
## 1  ( 1 )  " " " " "*" " " " " " " " " " " " " " "
## 2  ( 1 )  " " " " "*" " " " " " " " " " " "*" " "
## 3  ( 1 )  " " " " "*" "*" " " " " " " " " "*" " "
## 4  ( 1 )  " " " " "*" "*" "*" " " " " " " "*" " "
## 5  ( 1 )  " " "*" "*" "*" " " " " "*" " " "*" " "
## 6  ( 1 )  " " "*" "*" "*" "*" "*" " " " " "*" " "
## 7  ( 1 )  " " "*" "*" "*" "*" "*" " " "*" "*" " "
## 8  ( 1 )  " " "*" "*" "*" "*" "*" " " "*" "*" "*"
## 9  ( 1 )  " " "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 10 ( 1 )  "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"

 

The bss_summary object contains a lot of useful information. Examine its structure by executing the R expression

 

str(bss_summary)

 

For example, it contains the coefficient of determination for the models M1,...,Mp in the rsq component. We can extract these values using the usual $ notation:

 

bss_summary$rsq
## [1] 0.3439238 0.4594852 0.4800828 0.4920162 0.5086325 0.5148848 0.5162912
## [8] 0.5174714 0.5177180 0.5177494

 

As explained in lectures, we see that as more predictors are added to the model, R2 cannot decrease.
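A one-line check of this property on the fitted object:

## Successive differences in R-squared should all be non-negative
all(diff(bss_summary$rsq) >= 0)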

To compare the fits of the models M1,...,Mp, we can use adjusted R2, Mallows' Cp statistic or the BIC, which can be extracted as follows:

 

bss_summary$adjr2
## [1] 0.3424327 0.4570228 0.4765217 0.4873665 0.5029975 0.5081936 0.5084895
## [8] 0.5085563 0.5076705 0.5065603
bss_summary$cp
## [1] 148.352561  47.072229  30.663634  21.998461   9.148045   5.560162
## [7]   6.303221   7.248522   9.028080  11.000000
bss_summary$bic
## [1] -174.1108 -253.6592 -264.7407 -268.9126 -277.5210 -277.0899 -272.2819
## [8] -267.2702 -261.4049 -255.3424

 

The best models according to each criterion are therefore

 

(best_adjr2 = which.max(bss_summary$adjr2))
## [1] 8
(best_cp = which.min(bss_summary$cp))
## [1] 6
(best_bic = which.min(bss_summary$bic))
## [1] 5

 

For instance, adjusted-R2 suggests we use M8, i.e. the model with 8 explanatory variables.

To reconcile the differences implied by the different model choice criteria, it is helpful to plot how each criterion varies with the number of predictors:

## Create multi-panel plotting device
par(mfrow=c(2,2))
## Produce plots, highlighting optimal value of k
plot(1:p, bss_summary$adjr2, xlab="Number of predictors", ylab="Adjusted Rsq", type="b")
points(best_adjr2, bss_summary$adjr2[best_adjr2], col="red", pch=16)
plot(1:p, bss_summary$cp, xlab="Number of predictors", ylab="Cp", type="b")
points(best_cp, bss_summary$cp[best_cp], col="red", pch=16)
plot(1:p, bss_summary$bic, xlab="Number of predictors", ylab="BIC", type="b")
points(best_bic, bss_summary$bic[best_bic], col="red", pch=16)

This gives the plots in Figure 2. In all cases, the optimal point on the graph is at or close to 6, which suggests choosing M6. Referring to the output from the summary function, this model uses the variables sex, bmi, map, tc, ldl and ltg. The associated regression coefficients are:

 

coef(bss_fit, 6)
## (Intercept)         sex         bmi         map          tc         ldl
##    152.1335   -226.5106    529.8730    327.2198   -757.9379    538.5859
##         ltg
##    804.1923

Figure 2: Best subset selection for the diabetes data.

 

    3.2. Cross validation

In the previous section, we compared the models M1,...,Mp using criteria such as the BIC. An alternative approach is to use cross validation. Again, in this practical, we will focus on 10-fold cross validation.

To do this, we need to repeat our cross validation procedure for models fitted by least squares for each of M1,...,Mp. We then compare the average MSE for each of these models, and the one minimising the test error is judged “best”. To this end we need to wrap our function reg_cv in another which loops over the p models:

 

reg_bss_cv = function(X1, y, best_models, fold_index) {
    p = ncol(X1)
    test_errors = numeric(p)
    for(k in 1:p) {
        test_errors[k] = reg_cv(X1[, best_models[k,]], y, fold_index)
    }
    return(test_errors)
}

 

Here the argument best_models should be a (p × p) matrix of booleans whose (i, j)th entry is TRUE if Mi uses the jth predictor and FALSE otherwise. As we saw in the previous section, when we apply the summary function to the object returned by the regsubsets function, we get another object containing lots of useful components. One of these is the which component:

bss_summary$which
##    (Intercept)   age   sex  bmi   map    tc   ldl   hdl   tch   ltg   glu
## 1         TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2         TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## 3         TRUE FALSE FALSE TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
## 4         TRUE FALSE FALSE TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE
## 5         TRUE FALSE  TRUE TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
## 6         TRUE FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE
## 7         TRUE FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## 8         TRUE FALSE  TRUE TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## 9         TRUE FALSE  TRUE TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## 10        TRUE  TRUE  TRUE TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

If we omit the first column for the intercept, this is a matrix of precisely the form we want. We can therefore compute the cross-validation error for each model M1,...,Mp as follows:

 

## Apply the function to the diabetes data
bss_mse = reg_bss_cv(diabetes[, 1:10], diabetes[, 11], bss_summary$which[, -1], fold_index)
## Identify model with the lowest error
(best_cv = which.min(bss_mse))
## [1] 6

 

i.e. the model with 6 explanatory variables minimises the test error.

We can also visualise how the test error varies with the number of explanatory variables and add it to our plots for the other model choice criteria:

## Create multi-panel plotting device
par(mfrow=c(2,2))
## Produce plots, highlighting optimal value of k
plot(1:p, bss_summary$adjr2, xlab="Number of predictors", ylab="Adjusted Rsq", type="b")
points(best_adjr2, bss_summary$adjr2[best_adjr2], col="red", pch=16)
plot(1:p, bss_summary$cp, xlab="Number of predictors", ylab="Cp", type="b")
points(best_cp, bss_summary$cp[best_cp], col="red", pch=16)
plot(1:p, bss_summary$bic, xlab="Number of predictors", ylab="BIC", type="b")
points(best_bic, bss_summary$bic[best_bic], col="red", pch=16)
plot(1:p, bss_mse, xlab="Number of predictors", ylab="10-fold CV Error", type="b")
points(best_cv, bss_mse[best_cv], col="red", pch=16)

 

Figure 3: Best subset selection for the diabetes data including comparison of cross validation errors.

This gives the plots in Figure 3. It still seems that a model with 6 explanatory variables would be a good choice.

Notice that if we were to perform cross validation again with a different set of 10 folds, our estimates of the test error would change. So if we wanted to compare the test error from models fitted with best subset selection to that obtained by fitting models using other methods, we would need to use the same set of folds.
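For instance, because lsq_final_mse and bss_mse above were computed using the same fold_index, their values can be compared directly:

## Compare the CV error of the full least squares model with the best
## six-predictor model from best subset selection (same folds throughout)
c(full_lsq = lsq_final_mse, best_subset_6 = bss_mse[best_cv])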

(Optional) exercise: repeat the model-fitting and cross validation above using forward or backward stepwise selection instead of best subset selection. How do the best-fitting models of each size compare to those obtained using best subset selection? How do the test errors compare?
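As a starting point, here is a minimal sketch (the object names fss_fit, fss_summary and fss_mse are illustrative): regsubsets also accepts method="forward" and method="backward", so the pipeline above can be reused with only the method argument changed.

## Forward stepwise selection; backward works analogously with method="backward"
fss_fit = regsubsets(dis ~ ., data = diabetes, method = "forward", nvmax = p)
fss_summary = summary(fss_fit)
## Cross-validate the best model of each size, reusing the same folds
fss_mse = reg_bss_cv(diabetes[, 1:10], diabetes[, 11], fss_summary$which[, -1], fold_index)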
