Neural network
The Santa Fe data set is obtained from a chaotic laser, which can be described as a nonlinear dynamical system. Given are 1000 training data points. The aim is to predict the next 100 points (it is forbidden to include these points in the training set!). The training data are stored in lasertrain.dat and are shown in Figure 2a. The test data are contained in laserpred.dat and shown in Figure 2b.
Figure 2: The Santa Fe laser data: (a) training set (1000 points) and (b) test set (100 points), plotted against discrete time k.
Exercise
Train an MLP with one hidden layer after standardizing the data set. The training is done in feedforward mode:
$$\hat{y}_k = w^T \tanh\left(V\,[y_{k-1};\, y_{k-2};\, \ldots;\, y_{k-p}] + \beta\right). \qquad (4)$$
In order to make predictions, the trained network is used in an iterative way as a recurrent network:
$$\hat{y}_k = w^T \tanh\left(V\,[\hat{y}_{k-1};\, \hat{y}_{k-2};\, \ldots;\, \hat{y}_{k-p}] + \beta\right). \qquad (5)$$
To format the data you can use the provided function getTimeSeriesTrainData. Make sure you understand what the function does by trying it out on a small self-made toy example. To predict the test set you will have to write a for loop that includes the predicted value from the previous time step in the input vector used to predict the next time step (a sketch is given below). Investigate the model performance for different lags and numbers of neurons. Explain clearly how you tune the parameters and what their influence is on the final prediction. Which combination of parameters gives the best performance (RMSE) on the test set?
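A minimal sketch of what such an iterative prediction loop could look like in MATLAB is given below. It assumes the provided getTimeSeriesTrainData returns lagged input columns with their targets; the lag p, the number of hidden neurons and all variable names are illustrative choices, not prescribed values.

```matlab
% Illustrative sketch: train an MLP on lagged inputs and predict the test
% set in free-run mode (feeding predictions back into the input window).
trainData = load('lasertrain.dat');
testData  = load('laserpred.dat');

mu = mean(trainData); sig = std(trainData);
zTrain = (trainData - mu) / sig;            % standardize with training statistics

p = 50;                                     % lag (tuning parameter, example value)
[X, Y] = getTimeSeriesTrainData(zTrain, p); % provided function: lagged inputs and targets

net = feedforwardnet(30);                   % one hidden layer, example size
net = train(net, X, Y);

% Free-run prediction: the window ordering must match the convention used
% by getTimeSeriesTrainData (here assumed oldest value first).
window = zTrain(end-p+1:end);
yPred = zeros(length(testData), 1);
for k = 1:length(testData)
    yPred(k) = net(window);                 % one-step-ahead prediction
    window = [window(2:end); yPred(k)];     % slide window, append the prediction
end
yPred = yPred * sig + mu;                   % undo standardization
rmse = sqrt(mean((yPred - testData).^2));
```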
Long short-term memory network
Long Short-Term Memory networks, usually just called "LSTMs", are a special kind of RNN capable of learning long-term dependencies [2]. LSTMs contain information outside the normal flow of the recurrent network in a gated cell. Information can be stored in, written to, or read from a cell, much like data in a computer's memory. The cell makes decisions about what to store, and when to allow reads, writes and erasures, via gates that open and close. Those gates act on the signals they receive and, similar to the neural network's nodes, they block or pass on information based on its strength and importance, which they filter with their own sets of weights. Those weights, like the weights that modulate the input and hidden states, are adjusted via the recurrent network's learning process. That is, the cells learn when to allow data to enter, leave or be deleted through the iterative process of making guesses, backpropagating the error, and adjusting the weights via gradient descent.
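For reference, one common formulation of such a gated cell (most modern implementations add a forget gate to the original architecture of [2]) reads, with σ the logistic sigmoid and ⊙ element-wise multiplication:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) &&\text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) &&\text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) &&\text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) &&\text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)
\end{aligned}
$$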
Demo
Study the following example, where an LSTM is built to predict the monthly cases of chickenpox, by running
openExample('nnet/TimeSeriesForecastingUsingDeepLearningExample').
Exercise
Based on the previous demo, try to model the Santa Fe data set.
• Train the LSTM model and explain the design process. Discuss how the model looks, the parameters that you tune, ... What is the effect of changing the lag value for the LSTM network?
• Afterwards, try to predict the test set. Use the predictAndUpdateState function to predict time steps one at a time and update the network state at each prediction. For each prediction, use the previous prediction as input to the function (see the sketch after this list).
• Compare results of the recurrent neural network with the LSTM. Which model do you prefer and why?
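A minimal sketch of such a closed-loop prediction is given below; it assumes an LSTM net trained with trainNetwork on the standardized training sequence zTrain (row vector), with mu, sig and numPredictions defined as in the demo. These names and the warm-up pass over the training data are illustrative assumptions.

```matlab
% Illustrative sketch: closed-loop forecasting with predictAndUpdateState.
net = resetState(net);
net = predictAndUpdateState(net, zTrain);                % warm up the state on the training data

[net, yPred] = predictAndUpdateState(net, zTrain(end));  % first prediction from the last known value
for k = 2:numPredictions
    % feed the previous prediction back in as the next input
    [net, yPred(:, k)] = predictAndUpdateState(net, yPred(:, k-1));
end
yPred = yPred * sig + mu;                                % undo standardization
```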
[2] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
Write a report of maximum 3 pages (including text and figures) to discuss the exercises on the neural network and the long short-term memory network.
Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are a deep learning technique that uses the concept of local connectivity. In a normal multilayer neural network all nodes from subsequent layers are connected; we call these models fully connected. The idea is that in a lot of datasets, points that are close to each other are likely to be much more strongly related than points that are further away. For example, in image datasets the data points represent pixels: pixels that are close are likely to represent the same part of the image, while pixels that are further away can represent different parts.
Exercise
Run the script CNNex.m and try to understand what is happening.
Take a look at the layers of the downloaded CNN and answer the following questions:
• Take a look at the first convolutional layer (layer 2) and at the dimension of the weights (size(convnet.Layers(2).Weights)). What do these weights represent?
• Inspect layers 1 to 5. If you know that a ReLU and a Cross Channel Normalization layer do not affect the dimension of the input, what is the dimension of the input at the start of layer 6 and why?
• What is the dimension of the inputs before the final classification part of the network (i.e. before the fully connected layers)? How does this compare with the initial dimension? Briefly discuss the advantage of CNNs over fully connected networks for image classification.
The script CNNDigits.m runs a small CNN on the handwritten digits dataset. Use this script to investigate some CNN architectures. Try out different numbers of layers, combinations of different kinds of layers, dimensions of the weights, etc. Briefly discuss your results. Be aware that some architectures will take a long time to train!
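As a starting point for such experiments, a small architecture could look as follows. This is only an illustrative sketch (the layer sizes, training options and the datastore names imdsTrain / imdsValidation are assumptions), not the architecture used in CNNDigits.m.

```matlab
% Illustrative sketch of a small CNN for 28x28 grayscale digit images.
layers = [
    imageInputLayer([28 28 1])
    convolution2dLayer(3, 8, 'Padding', 'same')    % 8 filters of size 3x3
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)              % halves the spatial dimensions
    convolution2dLayer(3, 16, 'Padding', 'same')
    batchNormalizationLayer
    reluLayer
    fullyConnectedLayer(10)                        % one output per digit class
    softmaxLayer
    classificationLayer];

options = trainingOptions('sgdm', ...
    'MaxEpochs', 4, ...
    'ValidationData', imdsValidation, ...
    'Plots', 'training-progress', ...
    'Verbose', false);

net = trainNetwork(imdsTrain, layers, options);
```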
Write a report of 1-2 pages (including text and figures) to discuss the CNN exercise.
Generative Adversarial Networks
Introduction
Generative adversarial networks (GANs) are a class of algorithms used in unsupervised machine learning, implemented by a system of two neural networks competing with each other in a zero-sum game framework. One neural network, called the generator, generates new data instances, while the other, the discriminator, evaluates them for authenticity; i.e. the discriminator decides whether each instance of data belongs to the actual training dataset or not.
To summarize, here are the steps a GAN takes for an image generation example:
1. The generator takes in random numbers and returns an image.
2. This generated image is fed into the discriminator together with a batch of images taken from the actual dataset.
3. The discriminator takes in both real and fake images and returns probabilities, a number between 0 and 1, with 1 representing a prediction of authenticity and 0 representing fake.
4. Update the weights of the competing neural networks.
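Formally, these steps correspond to the generator G and the discriminator D playing the standard two-player minimax game, where x denotes real data and z the random input noise:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big].$$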
Exercises
Upload the file DCGAN.ipynb and go through the code. Afterwards, answer the following questions:
1. Select one class from the CIFAR dataset and train a deep convolutional generative adversarial network (DCGAN). Take into account the architecture guidelines from Radford et al. [5]. Make sure that you train the model long enough, such that it is able to generate "real" images. Monitor the loss and accuracy of the generator vs. the discriminator, and comment on the stability of the training. Explain this in the context of the GAN framework.
Optimal transport
Optimal transport (OT) [6] theory can be informally described using the words of Gaspard Monge (1746-1818): A worker with a shovel in hand has to move a large pile of sand lying on a construction site. The goal of the worker is to construct with all that sand a target pile with a prescribed shape (for example, that of a giant sand castle). Naturally, the worker wishes to minimize her total effort, quantified for instance as the total distance or time spent carrying shovels of sand. People interested in OT cast that problem as that of comparing two probability distributions: two different piles of sand of the same volume. They consider all of the many possible ways to morph, transport or reshape the first pile into the second, and associate a "global" cost to every such transport, using the "local" consideration of how much it costs to move a grain of sand from one place to another. In OT, one analyzes the properties of that least costly transport, as well as its efficient computation. An example of the computation of OT and displacement interpolation between two 1-D measures is shown in Figure 3.
A common problem that is solved by OT is the assignment problem. Suppose that we have a collection of n factories, and a collection of n stores which use the goods that the factories produce. Suppose that we have a cost function c, so that c(x, y) is the cost of transporting one shipment of goods from factory x to store y. For simplicity, we ignore the time taken to do the transporting, and a factory can only deliver complete goods (no splitting of goods). Let us introduce some notation so we can formally state this as an optimization problem. Let r be the vector containing the amount of goods every store needs. Similarly, k denotes the vector of how many goods every factory produces. Often r and k represent marginal probability distributions, hence their values sum to one. We wish to find the optimal transport plan, whose total cost is equal to:

$$d_M(r,k) = \min_{P \in U(r,k)} \sum_{i,j} P_{ij} M_{ij}, \qquad (1)$$

where M is the cost matrix, U(r,k) is the set of all possible ways to match factories with stores, and P_{ij} quantifies the amount of goods that is transported from factory i to store j. This is called the optimal transport between r and k. It can be solved relatively easily using linear programming. The optimum, d_M(r,k), is called the Wasserstein metric. It is a distance between two probability distributions, sometimes also called the earth mover's distance, as it can be interpreted as how much 'dirt' you have to move to change one 'landscape' (distribution) into another (see Monge's original problem).
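To illustrate how small instances of (1) can be solved with linear programming, the sketch below uses MATLAB's linprog on made-up toy marginals; the vectorization of P and all numerical values are illustrative choices.

```matlab
% Illustrative sketch: solve the optimal transport problem (1) as a linear program.
n = 4;
rng(0);
M = rand(n);                        % M(i,j): cost of shipping from factory i to store j (toy values)
k = ones(n, 1) / n;                 % factory production (marginal), sums to one
r = [0.1; 0.2; 0.3; 0.4];           % store demand (marginal), sums to one

% Vectorize P column-wise; the equality constraints encode the marginals:
% column sums of P must equal r, row sums of P must equal k.
Acol = kron(eye(n), ones(1, n));
Arow = kron(ones(1, n), eye(n));
Aeq  = [Acol; Arow];
beq  = [r; k];

p = linprog(M(:), [], [], Aeq, beq, zeros(n*n, 1), []);
P = reshape(p, n, n);               % optimal transport plan
dM = M(:)' * p;                     % Wasserstein (earth mover) cost
```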
Consider a slightly modified form of optimal transport:
$$d_M^{\lambda}(r,k) = \min_{P \in U(r,k)} \sum_{i,j} P_{ij} M_{ij} - \frac{1}{\lambda}\, h(P), \quad \text{with} \quad h(P) = -\sum_{i,j} P_{ij} \log(P_{ij}). \qquad (2)$$
This is called the Sinkhorn distance, where the second term denotes the information entropy of P. One can increase the entropy by making the distribution more homogeneous, i.e. giving everybody a more equal share of goods. The parameter λ determines the trade-off between the two terms: trying to give every store only goods from the closest factory (lowest value in the cost matrix) or encouraging equal distributions. This is similar to regularization in, for example, ridge regression. Just as a tiny bit of shrinkage of the parameters can lead to improved performance in machine learning problems, the Sinkhorn distance is also observed to work better than the Wasserstein distance on some problems. This is because we use a very natural prior on the distribution matrix P: in the absence of a cost, everything should be homogeneous.
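The regularized problem (2) can be solved with simple matrix-scaling (Sinkhorn) iterations instead of a linear program. A minimal sketch under the same toy setup as above (the value of λ and the fixed iteration count are arbitrary illustrative choices):

```matlab
% Illustrative sketch: Sinkhorn iterations for the entropy-regularized problem (2).
n = 4;
M = rand(n);                        % cost matrix (toy values)
k = ones(n, 1) / n;                 % factory marginal
r = [0.1; 0.2; 0.3; 0.4];           % store marginal

lambda = 20;                        % regularization strength (example value)
K = exp(-lambda * M);               % element-wise kernel
u = ones(n, 1);
for it = 1:200                      % fixed number of scaling iterations for simplicity
    v = r ./ (K' * u);              % rescale so column sums match the store marginal r
    u = k ./ (K * v);               % rescale so row sums match the factory marginal k
end
Plam = diag(u) * K * diag(v);       % regularized transport plan
dSink = sum(sum(Plam .* M));        % transport-cost term of (2)
```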
In many situations the primary interest is not to obtain the optimal transportation map. Instead, we are often interested in using the
optimal transportation cost as a statistical divergence between two probability distributions. A statistical divergence is a function
that takes two probability distributions as input and outputs a non-negative number that is zero if and only if the two distributions
are identical. Statistical divergences such as the KL divergence are frequently used in statistics and machine learning as a way
of measuring dissimilarity between two probability distributions. For example, suppose you want to compare different recipes,
where every recipe is a set of different ingredients. There is a meaningful distance or similarity between two ingredients, but how
do you compare the recipes themselves? Using optimal transport boils down to finding the effort needed to turn one recipe into
another.
Figure 3: Example of the computation of OT between two 1-D measures. The third figure shows the displacement interpolation
between the two using OT.
[5] Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).
[6] Peyré, Gabriel, and Marco Cuturi. "Computational optimal transport." Foundations and Trends® in Machine Learning 11.5-6 (2019): 355-607.
Exercises
Upload the file OT.ipynb and go through the code. Afterwards, answer the following question:
1. Upload your own images (of equal size) using the Files tab. Afterwards, transfer the colors between the two images using the provided notebook. Show the results and explain how the color histograms are transported. How is this different from non-optimal color swapping (e.g. just swapping the pixels)?
Upload the file WGAN.ipynb and go through the code. Afterwards, try to answer the following question:
1. Train a fully connected minimax GAN and a Wasserstein GAN on the MNIST dataset. Compare the performance of the two GANs over the different iterations. Do you see an improvement in stability and quality of the generated samples? Elaborate on the knowledge you have gained about optimal transport and the Wasserstein distance.
Write a report of maximum 2 pages (including text and figures) to discuss the
exercises of GAN and Optimal transport.