MAS8404 Summative Assignment

1 Cluster Analysis

The ISLR gene expression dataset (Ch10Ex11.csv) consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group. This dataset is available from Canvas and was used in Practical 3.

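## Read in data
## Full path to the file; the second assignment overrides it with a path
## relative to the working directory
filename = "H:/MAS8404/data/Ch10Ex11.csv"
filename = "Ch10Ex11.csv"
gexpr = read.csv(filename, header=FALSE)

## Check size (rows = genes, columns = tissue samples)
dim(gexpr)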

Before we begin to work with the data, note that, like the nci data studied in lectures, this gene expression data set is stored the “wrong” way around, with rows representing genes and columns representing tissue samples. We therefore need to transpose gexpr to obtain our data matrix:

gexpr = t(gexpr)

1. This question concerns hierarchical clustering.

(a) Apply hierarchical clustering with single-linkage using correlation-based distance and plot the dendrogram. Do the genes separate the samples into the two groups?

(b) Repeat 1(a) using complete-linkage and average-linkage. Do your results depend on the type of linkage used?

(c) Repeat 1(a) with single-linkage using Euclidean distance. Do your results depend on the distance metric used?
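As a reminder of the mechanics behind question 1, here is a minimal sketch in base R. It runs on simulated data rather than on gexpr: the matrix `x`, the size of the group shift and the object names are all assumptions made for illustration, so nothing in the output says anything about the real data set.

```r
# Simulated stand-in for the transposed data: 40 samples (rows) by 100 genes (columns).
# The shift applied to the second group is an arbitrary assumption.
set.seed(1)
x = matrix(rnorm(40 * 100), nrow = 40)
x[21:40, 1:20] = x[21:40, 1:20] + 2

# cor() correlates columns, so transpose to get sample-by-sample correlations,
# then turn similarity into a dissimilarity.
d_cor = as.dist(1 - cor(t(x)))
# Euclidean distance between samples, for comparison.
d_euc = dist(x)

# One tree per linkage / distance combination.
hc_single   = hclust(d_cor, method = "single")
hc_complete = hclust(d_cor, method = "complete")
hc_average  = hclust(d_cor, method = "average")
hc_euc      = hclust(d_euc, method = "single")

plot(hc_complete, main = "Complete linkage, correlation-based distance",
     xlab = "", sub = "")

# Cut a tree into two clusters and cross-tabulate against the known groups.
truth = rep(c("healthy", "diseased"), each = 20)
table(cutree(hc_complete, k = 2), truth)
```

Swapping `hc_complete` for one of the other trees in the last two calls gives the corresponding dendrogram and cross-tabulation.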

2. This question concerns the K-means algorithm.

(a) Apply the K-means algorithm for a range of values of K. On the basis of this analysis, do the genes separate the samples into the two groups?

(b) Suppose we choose to use K = 4 clusters to summarise the data. Produce a visual display of the four clusters in a two-dimensional plot. Do the clusters appear well separated?
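The two parts of question 2 can be sketched along the following lines, again on simulated data rather than gexpr. Plotting the clusters on the first two principal components is one possible choice of two-dimensional display, not the only one; the range of K and all object names are assumptions.

```r
# Simulated stand-in for the transposed data (40 samples by 100 genes).
set.seed(1)
x = matrix(rnorm(40 * 100), nrow = 40)
x[21:40, 1:20] = x[21:40, 1:20] + 2

# Total within-cluster sum of squares for a range of K.
# nstart = 20 reruns the algorithm from 20 random starts and keeps the best fit.
K_values = 2:6
wss = sapply(K_values, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
plot(K_values, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")

# K = 4 clusters displayed on the first two principal components.
km4 = kmeans(x, centers = 4, nstart = 20)
pcs = prcomp(x)$x[, 1:2]
plot(pcs, col = km4$cluster, pch = 19, xlab = "PC1", ylab = "PC2")
```

Because K-means starts from random centres, the cluster labels (and hence the colours) can change between runs unless the seed is fixed.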

2 Linear Regression

This question continues the analysis of the diabetes data that was started during Practical 5. Recall that our objective is to develop a model for predicting disease progression (dis) on the basis of one or more of the 10 baseline variables. If you need to remind yourself of the background to these data, please refer back to the practical sheet.

(a) Split the data into a training set and a validation set. For the purposes of this assignment, take the data from the first 350 patients as training data and the data from the remaining 442 - 350 = 92 patients as test data.

(b) Fit a multiple linear regression model (using all predictors) by least squares to the training data and compute the test error over the validation set. Report the test error.

(c) In Practical 5, using best subset selection, we identified a 6-predictor model as providing a good compromise between model fit and model complexity. Fit this model to the training data using least squares and compute the test error over the validation set. Report the test error.

(d) Ridge regression:

(i) Based on the training data:

I. Use cross-validation to identify an optimal value for the tuning parameter. Report the optimal value of the tuning parameter.

II. With the tuning parameter fixed at its optimal value, fit the model to all the training data and compute the test error over the validation data. Report the test error.

(ii) Based on the full data:

I. Use cross-validation to identify an optimal value for the tuning parameter. Report the optimal value of the tuning parameter.

II. Fit the model to all of the data and generate a plot which shows how the estimates of the regression coefficients change as the tuning parameter is increased. Note that this means you will have to fit the model for a range of values for the tuning parameter around its optimal value. Comment on the plot.

III. Report the estimated regression coefficients for the model associated with the optimal value of the tuning parameter. How do they compare to the estimated coefficients in the full model fitted by least squares (see Practical 5 for parameter estimates)?

(e) Compare the test errors. Which model do you think is best and why?
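The workflow in parts (a), (b) and (d)(i) can be sketched in base R as follows. Everything here is an assumption made for illustration: the data are simulated (only the sizes 442 and 350 and the response name dis come from the text above), ridge regression is fitted through its closed-form solution on standardised predictors rather than with a dedicated package, and the helper `ridge_fit`, the grid of tuning parameter values and the 10 folds are hypothetical choices. The tuning parameter is therefore not on the same scale as the one reported by other ridge implementations.

```r
# Simulated stand-in for the diabetes data: 442 patients, 10 predictors.
set.seed(1)
n = 442; p = 10
X = matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("x", 1:p)))
dat = data.frame(dis = drop(X %*% c(3, -2, rep(0, p - 2))) + rnorm(n), X)

# (a) First 350 rows for training, remaining 92 for validation.
train = dat[1:350, ]
valid = dat[351:n, ]

# (b) Least squares on the training data; test error = validation MSE.
fit_ls = lm(dis ~ ., data = train)
mse_ls = mean((valid$dis - predict(fit_ls, newdata = valid))^2)

# Hypothetical helper: closed-form ridge on standardised predictors.
# Returns a function that predicts the response for new predictor values.
ridge_fit = function(X, y, lambda) {
  mu = colMeans(X); s = apply(X, 2, sd)
  Z = scale(X, center = mu, scale = s)
  beta = solve(crossprod(Z) + lambda * diag(ncol(X)), crossprod(Z, y - mean(y)))
  function(Xnew) mean(y) + drop(scale(Xnew, center = mu, scale = s) %*% beta)
}

Xtr = as.matrix(train[, -1]); ytr = train$dis
Xva = as.matrix(valid[, -1]); yva = valid$dis

# (d)(i) 10-fold cross-validation over a grid of tuning parameter values.
lambdas = 10^seq(-2, 3, length.out = 30)
folds = sample(rep(1:10, length.out = nrow(Xtr)))
cv_err = sapply(lambdas, function(lam) {
  mean(sapply(1:10, function(k) {
    pred = ridge_fit(Xtr[folds != k, ], ytr[folds != k], lam)(Xtr[folds == k, ])
    mean((ytr[folds == k] - pred)^2)
  }))
})
lambda_best = lambdas[which.min(cv_err)]

# Refit on all the training data at the chosen value and compute the test error.
mse_ridge = mean((yva - ridge_fit(Xtr, ytr, lambda_best)(Xva))^2)
c(least_squares = mse_ls, ridge = mse_ridge)
```

Setting the tuning parameter to zero in `ridge_fit` reproduces the least squares predictions, which is a convenient sanity check on the helper.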

 

 
