
MAS8404 Summative Assignment

1 Cluster Analysis

The ISLR gene expression dataset (Ch10Ex11.csv) consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group. This dataset is available from Canvas and was used in Practical 3.

## Read in data

filename = "H:/MAS8404/data/Ch10Ex11.csv"

## or, if the file is in the current working directory:

filename = "Ch10Ex11.csv"

gexpr = read.csv(filename, header=FALSE)

## Check size

dim(gexpr)

Before we begin to work with the data, note that, like the nci data studied in lectures, this gene expression data set is stored the “wrong” way around, with rows representing genes and columns representing tissue samples. Therefore we need to transpose gexpr to get our data matrix:

gexpr = t(gexpr)
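As a quick sanity check on the orientation, the effect of t() can be sketched on a small simulated table. The object names toy and toy_t and the simulated values are assumptions made for illustration only; they are not part of the assignment data. For the real data, with 40 samples and 1,000 genes, dim(gexpr) should read 40 1000 after transposing.

```r
## Toy stand-in: 5 "genes" (rows) by 3 "samples" (columns).
## The values are simulated, not taken from Ch10Ex11.csv.
set.seed(1)
toy = as.data.frame(matrix(rnorm(15), nrow=5, ncol=3))

## Before transposing: genes in rows, samples in columns
dim(toy)

## t() applied to a data frame returns a matrix
toy_t = t(toy)

## After transposing: samples in rows, genes in columns
dim(toy_t)
is.matrix(toy_t)
```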

1. This question concerns hierarchical clustering.

(a) Apply hierarchical clustering with single-linkage using correlation-based distance and plot the dendrogram. Do the genes separate the samples into the two groups?

(b) Repeat 1(a) using complete-linkage and average-linkage. Do your results depend on the type of linkage used?
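As a hint on the mechanics, the following is a minimal sketch of hierarchical clustering with a correlation-based distance, run on simulated data rather than on the gene expression data. The matrix x, the group shift and the labels in truth are assumptions made for illustration. Taking one minus the correlation between samples is one common choice of correlation-based distance, not the only one.

```r
## Simulated stand-in: 40 samples (rows) by 100 variables (columns).
## This is NOT the gene expression data; two groups exist by construction.
set.seed(1)
x = matrix(rnorm(40 * 100), nrow=40, ncol=100)
x[21:40, 1:10] = x[21:40, 1:10] + 3

## cor() correlates columns, so pass t(x) to correlate the samples (rows of x)
dd = as.dist(1 - cor(t(x)))

## Hierarchical clustering under three types of linkage
hc_single = hclust(dd, method="single")
hc_complete = hclust(dd, method="complete")
hc_average = hclust(dd, method="average")

## Dendrograms side by side
par(mfrow=c(1, 3))
plot(hc_single, main="Single linkage", xlab="", sub="")
plot(hc_complete, main="Complete linkage", xlab="", sub="")
plot(hc_average, main="Average linkage", xlab="", sub="")

## Cut a tree into two clusters and cross-tabulate against the known groups
truth = rep(c("group1", "group2"), each=20)
table(cutree(hc_complete, k=2), truth)
```

For the assignment, x would be replaced by the transposed matrix gexpr, and the known groups are the first 20 and second 20 samples.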

2. This question concerns the K-means algorithm.

(a) Apply the K-means algorithm for a range of values of K. On the basis of this analysis, do the genes separate the samples into the two groups?

(b) Suppose we choose to use K = 4 clusters to summarise the data. Produce a visual display of the four clusters in a two-dimensional plot. Do the clusters appear well separated?
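The two parts above can be sketched as follows, again on simulated data. The matrix x, the range of K, the choice nstart=20 and the use of the first two principal components for the two-dimensional display are all assumptions made for illustration; other displays are possible.

```r
## Simulated stand-in: 40 samples (rows) by 100 variables (columns); NOT the real data
set.seed(1)
x = matrix(rnorm(40 * 100), nrow=40, ncol=100)
x[21:40, 1:10] = x[21:40, 1:10] + 3

## K-means for a range of K, recording the total within-cluster sum of squares
Ks = 2:6
wss = sapply(Ks, function(K) kmeans(x, centers=K, nstart=20)$tot.withinss)
plot(Ks, wss, type="b", xlab="K", ylab="Total within-cluster sum of squares")

## Compare the K = 2 solution with the known groups
km2 = kmeans(x, centers=2, nstart=20)
truth = rep(c("group1", "group2"), each=20)
table(km2$cluster, truth)

## K = 4: display the clusters on the first two principal components
km4 = kmeans(x, centers=4, nstart=20)
pca = prcomp(x)
plot(pca$x[, 1:2], col=km4$cluster, pch=19, xlab="PC1", ylab="PC2",
     main="K-means clusters (K = 4)")
```

Because K-means starts from random assignments, using several random starts (nstart) and fixing the seed makes the result more stable and reproducible.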

2 Linear Regression

This question continues the analysis of the diabetes data that was started during Practical 5. Recall that our objective is to develop a model for predicting disease progression (dis) on the basis of one or more of the 10 baseline variables. If you need to remind yourself of the background to these data, please refer back to the practical sheet.

(a) Split the data into a training set and a validation set. For the purposes of this assignment, take the data from the first 350 patients as training data and the data from the remaining 442 - 350 = 92 patients as validation (test) data.

(b) Fit a multiple linear regression model (using all predictors) by least squares to the training data and compute the test error over the validation set. Report the test error.

(c) In Practical 5, using best subset selection, we identified a 6-predictor model as providing a good compromise between model fit and model complexity. Fit this model to the training data using least squares and compute the test error over the validation set. Report the test error.

(d) Ridge regression:

(i) Based on the training data:

I. Use cross-validation to identify an optimal value for the tuning parameter. Report the optimal value of the tuning parameter.

II. With the tuning parameter fixed at its optimal value, fit the model to all the training data and compute the test error over the validation data. Report the test error.

(ii) Based on the full data:

I. Use cross-validation to identify an optimal value for the tuning parameter. Report the optimal value of the tuning parameter.

II. Fit the model to all of the data and generate a plot which shows how the estimates of the regression coefficients change as the tuning parameter is increased. Note that this means you will have to fit the model for a range of values for the tuning parameter around its optimal value. Comment on the plot.

III. Report the estimated regression coefficients for the model associated with the optimal value of the tuning parameter. How do they compare to the estimated coefficients in the full model fitted by least squares (see Practical 5 for parameter estimates)?

(e) Compare the test errors. Which model do you think is best and why?
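To illustrate the workflow in parts (a), (b) and (d)(i), here is a minimal sketch on simulated data. Everything other than the response name dis and the 350/92 split is an assumption: the predictor names x1 to x10, the simulated values, the helper functions ridge_fit and ridge_predict, the grid of tuning parameter values and the use of 10 folds are all hypothetical. Ridge regression is fitted here by its closed-form solution on standardised predictors, minimising the residual sum of squares plus lambda times the sum of squared coefficients. Packages such as glmnet scale the penalty differently, so values of lambda are not directly comparable.

```r
## Simulated stand-in for the diabetes data: 442 rows, 10 predictors, response dis.
## Predictor names x1..x10 and all values are hypothetical.
set.seed(1)
n = 442; p = 10
X = matrix(rnorm(n * p), n, p, dimnames=list(NULL, paste0("x", 1:p)))
beta_true = c(3, -2, 1.5, 0, 0, 1, 0, 0, 0.5, 0)
dat = data.frame(dis=as.vector(X %*% beta_true + rnorm(n)), X)

## (a) First 350 rows for training, remaining 92 rows for validation
train = dat[1:350, ]
test = dat[351:n, ]

## (b) Least squares with all predictors; test error as mean squared error
fit_ls = lm(dis ~ ., data=train)
mse_ls = mean((test$dis - predict(fit_ls, newdata=test))^2)

## (d) Ridge regression by closed form on standardised predictors
ridge_fit = function(X, y, lambda) {
  mu = colMeans(X); s = apply(X, 2, sd)
  Xs = scale(X, center=mu, scale=s)
  b = solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, y - mean(y)))
  list(mu=mu, s=s, b=as.vector(b), b0=mean(y))
}
ridge_predict = function(fit, Xnew) {
  as.vector(fit$b0 + scale(Xnew, center=fit$mu, scale=fit$s) %*% fit$b)
}

Xtr = as.matrix(train[, -1]); ytr = train$dis
Xte = as.matrix(test[, -1]); yte = test$dis

## (d)(i) I. 10-fold cross-validation over a grid of tuning parameter values
lambdas = 10^seq(-2, 3, length.out=30)
folds = sample(rep(1:10, length.out=nrow(Xtr)))
cv_mse = sapply(lambdas, function(lam) {
  mean(sapply(1:10, function(k) {
    fit = ridge_fit(Xtr[folds != k, ], ytr[folds != k], lam)
    mean((ytr[folds == k] - ridge_predict(fit, Xtr[folds == k, ]))^2)
  }))
})
lambda_opt = lambdas[which.min(cv_mse)]

## (d)(i) II. Refit on all training data at the chosen value; test error
fit_ridge = ridge_fit(Xtr, ytr, lambda_opt)
mse_ridge = mean((yte - ridge_predict(fit_ridge, Xte))^2)

c(least_squares=mse_ls, ridge=mse_ridge)
```

The sketch does not cover the 6-predictor model in part (c) or the coefficient plot in part (d)(ii); those follow the same pattern of fitting on one set of rows and evaluating or inspecting the fit afterwards.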

