
Computer Science

Machine Learning

Assignment 2

Deadline: May 11, 2022 at 23:55h.

Upload your report (report.pdf), your implementation (lsq_regression.py) and your figure file (figures.pdf) to the TeachCenter. Please do not zip your files. Please use the provided framework file for your implementation.

  1. Least Squares and Double Descent Phenomenon (17P)

The goal in this assignment is to learn about linear least squares regression and the double descent phenomenon, shown in Figure 1. In the classical learning setting, a U-shaped risk curve can be observed, indicating a high test error while the training error is very low, i.e. the model does not generalize well to new data. However, a highly over-parameterized model with a large capacity allows the test error to go down again in a second descent ("double descent"), which can sometimes be observed in over-parameterized deep learning settings.


Figure 1: The double descent risk curve, which shows, in addition to the classical U-shaped risk curve, the decreasing test error for high-capacity function classes. Figure taken from [1].

 

 

We want to reproduce this phenomenon in the context of linear least squares regression with random features. Assume we have input data x = {x_1, ..., x_N}, with x_n ∈ R^d being d-dimensional samples uniformly distributed on the unit sphere such that ‖x_n‖_2 = 1. Furthermore, consider associated targets y = {y_1, ..., y_N} with y_n ∈ R according to

y_n = \frac{1}{4} + \frac{1}{d}\left(\mathbf{1}_d^T x_n\right)^2 + \mathcal{N}(0, \sigma^2),                                                                            (1)

where 1_d ∈ R^d is a d-dimensional vector of ones.
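Purely as an illustration (not part of the required hand-in), such a dataset could be generated as in the following sketch. Points uniformly distributed on the unit sphere are obtained by normalizing Gaussian samples; the target expression follows the reconstruction of eq. (1) above and should be checked against the original assignment sheet. The function name make_dataset is only used for this example.

```python
import numpy as np

def make_dataset(N, d, sigma, rng):
    """Sketch: N samples uniformly on the unit sphere in R^d with noisy targets."""
    x = rng.standard_normal((N, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # enforce ||x_n||_2 = 1
    s = x @ np.ones(d)                              # 1_d^T x_n for every sample
    # Target as in the reconstruction of eq. (1) above; verify against the sheet.
    y = 0.25 + s**2 / d + rng.normal(0.0, sigma, size=N)
    return x, y

rng = np.random.default_rng(0)
x_train, y_train = make_dataset(N=200, d=5, sigma=2.0, rng=rng)
x_test, y_test = make_dataset(N=50, d=5, sigma=2.0, rng=rng)
```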

Our goal is to model the underlying relation between training input data x and targets y and apply this to new, unseen test data by minimizing the regularized least squares error function

 

 

 

w^* = \arg\min_{w \in \mathbb{R}^M} \; \frac{1}{2} \sum_{n=1}^{N} \left( w^T \phi(x_n) - y_n \right)^2 + \frac{\lambda}{2} \| w \|_2^2.                                        (2)

 

The (possibly) non-linear transform φ(·) lifts the input data x_n to a higher-dimensional space with M features. Here, we are going to use random features v = {v_1, ..., v_M} on the unit sphere, with v_m ∈ R^d, and a nonlinear activation function such that

 

\phi(x_n) = \frac{1}{\sqrt{M}} \Big[ \max(v_m^T x_n, 0) \Big]_{m=1,\dots,M}                                        (3)
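A minimal numpy sketch of this feature transform, assuming the 1/√M scaling shown in eq. (3); V stacks the random feature vectors v_m row-wise, and the function names are only for illustration:

```python
import numpy as np

def sample_unit_sphere(M, d, rng):
    """Draw M random vectors uniformly on the unit sphere in R^d."""
    V = rng.standard_normal((M, d))
    return V / np.linalg.norm(V, axis=1, keepdims=True)

def relu_features(X, V):
    """Random ReLU features of eq. (3): maps X (N x d) to Phi (N x M)."""
    return np.maximum(X @ V.T, 0.0) / np.sqrt(V.shape[0])
```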

Evaluation metrics for predictions ŷ can be obtained using the mean squared error

L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2.                                                              (4)
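Eq. (4) translates directly into numpy, for instance as this small sketch:

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error as in eq. (4)."""
    return np.mean((y - y_hat) ** 2)
```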

 

 

Tasks

    1. Rewrite eq. (2) in pure matrix/vector notation, such that there are no sums left in the final expression. Use Φ = φ(x) for the feature transform which can be computed prior to the optimization. Additionally, state the matrix/vector dimensions of all occurring variables.
    2. Analytically derive the optimal parameters w from eq. (2).
    3. Give an analytic expression to compute predictions yˆ  given w.

 

    4. This can also be interpreted as a small feedforward neural network with one hidden layer for an input x ∈ R^d and output ŷ ∈ R. Draw a simple schematic of this neural network and include exemplary labels of its neurons and connections.
    5. Create a training dataset comprised of input data x = {x_1, ..., x_N} and corresponding targets y = {y_1, ..., y_N} with N = 200, d = 5 and σ = 2 according to eq. (1). In the same manner, create a test dataset with N_t = 50 for both test input data and test targets.

    6. Generate M = 50 d-dimensional random feature vectors v = {v_1, ..., v_M} on the unit sphere.
    7. Implement the computation of w from the training data using a QR decomposition. Further, compute the mean squared error denoted in eq. (4) for both the training and test data based on the optimal parameters w.
    8. Use λ = 1 × 10⁻⁸ to reproduce the double descent behaviour. Run this experiment for a number of feature vectors M ∈ {10k + 1 | k ∈ {0, 1, 2, ..., 60}} and save the training and test loss in each run. For each M, do the experiment r = 5 times to obtain averaged scores.

    9. Plot both the averaged (over the r = 5 runs) train and test errors depending on the number of feature vectors M in the same plot. Include the standard deviation of each setting in addition to the averaged loss. Give an interpretation of your results.
    10. Repeat the same experiment for λ ∈ {1 × 10⁻⁵, 1 × 10⁻³} and explain the influence of λ. Include the resulting curves containing train and test error for each λ in two additional subplots.

 

Implementation details

 

To efficiently solve a system Ax = b, QR decomposition of the matrix A into an orthogonal matrix Q and an upper triangular matrix R can be used instead of direct matrix inversion (see numpy.linalg.qr). This can be computed as follows:

A = QR
z = Q^T b
Rx = z  ⟹  x = R⁻¹z    (solve this with numpy.linalg.solve)
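As a sketch, this recipe could be wrapped as follows; here A and b stand for whatever system matrix and right-hand side arise in Task 1, and the function name is only for illustration:

```python
import numpy as np

def solve_via_qr(A, b):
    """Solve A x = b via a QR decomposition instead of explicit matrix inversion."""
    Q, R = np.linalg.qr(A)        # A = Q R with Q orthogonal, R upper triangular
    z = Q.T @ b                   # z = Q^T b
    return np.linalg.solve(R, z)  # solve R x = z (R is upper triangular)
```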

  • If you want to plot the standard deviation σ in addition to an averaged curve, use matplotlib.pyplot.fill_between, where y1 and y2 denote µ − σ and µ + σ, respectively. You can set an alpha value for blending.
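A minimal plotting sketch, assuming numpy arrays Ms, mean_err and std_err of matching length (names chosen only for this example):

```python
import matplotlib.pyplot as plt

def plot_mean_std(Ms, mean_err, std_err, label):
    """Plot an averaged curve with a band of +/- one standard deviation."""
    plt.plot(Ms, mean_err, label=label)
    plt.fill_between(Ms, mean_err - std_err, mean_err + std_err, alpha=0.3)
```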

Implement the double descent phenomenon in task12 of lsq_regression.py. Do not use any other libraries than numpy and matplotlib.

 

 


 

  2. Dual Representation (8P)

The linear least squares problem from Task 1 can be reformulated in its dual representation, where an equivalent solution can be obtained.

The dual problem is given by

a^* = \arg\min_{a \in \mathbb{R}^N} \; \frac{1}{2} a^T K a + \frac{\lambda}{2} \| a + y \|_2^2,                                                                              (5)

introducing the kernel matrix K = ΦΦ^T ∈ R^{N×N}.

Having knowledge of either the feature transform φ(x) or the corresponding kernel function k(x, x′) = φ(x)^T φ(x′) allows us to operate very flexibly in either the primal or the dual domain.

Our random ReLU features approximate an ArcCos kernel of the following form

 

 

k(x, x') = \frac{\|x\| \|x'\|}{2\pi d} \left( \sin(\theta) + (\pi - \theta)\cos(\theta) \right),                                                                                               (6)

with θ = \cos^{-1}\!\left( \frac{x^T x'}{\|x\| \|x'\|} \right). Bear in mind that we designed the input data such that ‖x‖ = 1. Also note that the corresponding feature vectors in their explicit form do not have to be known.
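A sketch of evaluating eq. (6) between the rows of two data matrices; the clipping only guards the arccos against floating-point round-off, and the function name is chosen for this example:

```python
import numpy as np

def arccos_kernel(X, Xp, d):
    """ArcCos kernel of eq. (6) between rows of X (n x d) and Xp (m x d)."""
    norms = np.linalg.norm(X, axis=1)[:, None] * np.linalg.norm(Xp, axis=1)[None, :]
    cos_theta = np.clip((X @ Xp.T) / norms, -1.0, 1.0)   # numerical safety for arccos
    theta = np.arccos(cos_theta)
    return norms / (2.0 * np.pi * d) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
```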

Similar to Task 1, the dual solution can be obtained in closed form and can subsequently be used to make predictions for unseen test data x. The relation between the primal solution w required for making new predictions and the dual variable a is as follows:

 

w = -\frac{1}{\lambda} \Phi^T a.         (7)

 

Tasks

    1. Analytically compute the optimal parameters a from eq. (5). State the dimension of the resulting matrix that has to be inverted in the process and compare it to the one required in Task 1. When is it favourable to use the primal and when the dual solution?
    2. Give an analytic expression to compute predictions ŷ given a using eq. (7), such that you only rely on K and do not need to compute the features Φ explicitly.
    3. For the train data x compute the kernel matrix as given in eq. (6). Repeat the same process for the test data, ensuring that the resulting kernel matrices are of dimensionality R^{N×N} and R^{N_t×N}, respectively.
    4. Implement the computation of a and report the mean squared error on the train and test data, using λ = 1 × 10⁻⁸.
    5. Use exactly the same datasets as in Task 1. For the train data x, compare the kernel K and ΦΦ^T. For different numbers of features M ∈ {10, 200, 800}, evaluate both terms and plot the row n = 10 from both resulting N × N matrices in one plot. Describe the influence of M. Compute for each M the mean absolute error between both 1D arrays, i.e.

MAE(K_n, (\Phi\Phi^T)_n) = \frac{1}{N} \sum_{i=1}^{N} \left| (K_n)_i - (\Phi\Phi^T)_{n,i} \right|.

Compare train and test errors obtained with the primal solution for each setting of M with the dual solution.

 

Implement the dual problem also in task12 of lsq_regression.py. Again, do not use any other libraries than numpy and matplotlib.

 

References

[1] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.

 
