# Neural networks

This tutorial covers the implementation of neural networks, starting with a single neuron and working up to a multilayer perceptron applied to audio classification.

# Reference slides

Download the slides

- The artificial neuron
- Neural networks
- Architecture zoo

# Tutorial

In this tutorial, we will cover a more advanced classification algorithm through the use of *neural networks*. The tutorial starts by performing a simple **single neuron** discrimination of two random distributions. Then, we will study the typical **XOR problem** by using a more advanced 2-layer **perceptron**. Finally, we generalize the use of neural networks in order to perform classification on a given set of audio files.

To simplify your work, we provide the following set of files, which you should find in the `02_Neural_Networks` folder.

| File | Explanation |
|---|---|
| `plot3view.m` | Plots a 3-dimensional view of data points |
| `plotBoundary.m` | Plots the decision boundary of a single neuron with 2-dimensional inputs |
| `plotBoundarySurface.m` | Plots the decision surface of a single neuron with 3-dimensional inputs |
| `plotPatterns.m` | Plots 2-dimensional input patterns |
| `plotPatterns3D.m` | Plots 3-dimensional input patterns |
| `xorAns.dat` | Class values for the XOR problem |
| `xorPats.dat` | Point values for the XOR problem |

## 2.1 - Single neuron

For the first part of the tutorial, we will use the simplest classification model possible in a neural network setting, a single neuron. We briefly recall here that, given an input vector $\mathbf{x} \in \mathbb{R}^{n}$, a single neuron computes the function

$$y = \sigma\left(\sum_{i=1}^{n} w_{i} \, x_{i} + b\right)$$

with $\mathbf{w} \in \mathbb{R}^{n}$ a weight vector, $b$ a bias and $\sigma(\cdot)$ an *activation function*. Therefore, if we consider the *threshold* activation function ($\sigma(z) = 1$ if $z > 0$), a single neuron simply performs an *affine transform* and then a *linear* discrimination of the space. Geometrically, a single neuron computes a hyperplane that separates the space. In order to learn, we have to adjust the weights and know “how wrong we are”. To do so, we consider that we know the desired output $d$ of a system for a given example (e.g. a predicted value for a regression system, a class value for a classification system). Therefore, we define the loss function over a whole dataset of $N$ examples as

$$\mathcal{L} = \sum_{k=1}^{N} \left(d_{k} - y_{k}\right)^{2}$$

where $d_{k}$ and $y_{k}$ are the desired and computed outputs for the $k$-th example. In order to know how to change the weights based on the value of the errors, we need to know “how to change them to make it better”. Therefore, we should compute the set of derivatives of the error with respect to each parameter

$$\frac{\partial \mathcal{L}}{\partial w_{i}} \quad \text{and} \quad \frac{\partial \mathcal{L}}{\partial b}$$
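The neuron output and the dataset loss described above can be sketched in a few lines. This is a minimal Python illustration (the tutorial's own scripts are MATLAB), not the provided code: the function names, the toy points and the choice of returning 0 when the threshold is not reached are all assumptions made for the example.

```python
def threshold(z):
    """Threshold activation: 1 if z > 0; returning 0 otherwise is an assumption."""
    return 1.0 if z > 0 else 0.0

def neuron_output(x, w, b, activation=threshold):
    """Compute activation(sum_i w_i * x_i + b) for one input vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(z)

def squared_loss(inputs, desired, w, b, activation=threshold):
    """Sum over the dataset of (desired - computed output)^2."""
    return sum((d - neuron_output(x, w, b, activation)) ** 2
               for x, d in zip(inputs, desired))

# Hypothetical toy data: the line x1 + x2 = 1 separates the two classes.
inputs = [(0.0, 0.0), (0.2, 0.3), (1.0, 1.0), (0.9, 0.8)]
desired = [0.0, 0.0, 1.0, 1.0]

# Weights aligned with the separating line, then the same weights flipped.
print(squared_loss(inputs, desired, w=[1.0, 1.0], b=-1.0))
print(squared_loss(inputs, desired, w=[-1.0, -1.0], b=1.0))
```

The first call uses a hyperplane that separates the toy classes, the second uses the opposite orientation, so comparing the two printed values shows how the loss measures “how wrong we are”.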

**Exercise**

- Compute the derivatives of the error with respect to the weights of a single neuron
- Compute the derivative with respect to the bias as well

**Solution**

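Given that we simply have a (Euclidean) error criterion on a single neuron for the time being, the update of the weights for a single example $\mathbf{x}$, with desired output $d$ and computed output $y$, can be simply computed by

$$w_{i} \leftarrow w_{i} + \eta \, (d - y) \, x_{i}$$

with $\eta$ the *learning rate* parameter (which controls the size of the update steps, and absorbs the constant factor coming from the squared error).

Similarly, for the bias, we simply have to compute

$$b \leftarrow b + \eta \, (d - y)$$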
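As an illustration of these update rules, here is a minimal Python sketch (the tutorial's own scripts are MATLAB) that trains a single threshold neuron on a small, hand-made linearly separable dataset. The names `predict` and `train_neuron`, the toy points, the learning rate and the stopping rule are assumptions made for this example, not part of the provided code.

```python
import random

def predict(x, w, b):
    """Threshold neuron: 1 if w.x + b > 0, else 0 (the 0 branch is an assumption)."""
    return 1.0 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0.0

def train_neuron(inputs, desired, eta=0.1, max_epochs=1000, seed=0):
    """Apply w_i <- w_i + eta * (d - y) * x_i and b <- b + eta * (d - y)."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in inputs[0]]  # small random weights
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, d in zip(inputs, desired):
            err = d - predict(x, w, b)
            if err != 0.0:
                mistakes += 1
                w = [wi + eta * err * xi for wi, xi in zip(w, x)]
                b += eta * err
        if mistakes == 0:  # every example is classified correctly: stop
            break
    return w, b

# Hypothetical toy data: the line x1 + x2 = 1 separates the two classes.
inputs = [(0.0, 0.0), (0.2, 0.3), (1.0, 1.0), (0.9, 0.8)]
desired = [0.0, 0.0, 1.0, 1.0]

w, b = train_neuron(inputs, desired)
print(w, b, [predict(x, w, b) for x in inputs])
```

Because the toy classes are linearly separable, the loop stops as soon as one full pass over the data produces no error; on a non-separable problem it would simply run until `max_epochs`.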

We will start by training a single neuron to learn how to perform this discrimination on a linear problem (so that a single neuron is enough to solve it). To produce such problems, we provide a script that draws a set of random 2-dimensional points, then chooses a random line in this space that acts as the linear frontier between 2 classes (hence defining a linear 2-class problem). The variables that will be used by your code are the following.

**Exercise**

- Update the loop so that it computes the forward propagation error
- Update the loop to perform learning (based on back-propagation)
- Run the learning procedure, which should produce a result similar to the display below.
- Perform multiple re-runs of the learning procedure (re-launching produces different datasets)
- What observations can you make on the learning process?
- (Optional) Change the input patterns, and confirm your observations.
- (Optional) Incorporate the bias in the weights to obtain a **vectorized** code.

**Expected output**

## 2.2 - 2-layer XOR problem

In most cases, classification problems are far from being linear. Therefore, we need more advanced methods to be able to compute non-linear class boundaries. The advantage of neural networks is that the same principle can be applied in a *layer-wise* fashion. This allows us to further divide the space into sub-regions (as seen in the course). We will try to implement the 2-layer *perceptron* that can provide a solution to the infamous XOR problem. The idea is now to connect the outputs of the first neurons to a set of other neurons. Therefore, if we take back our previous formulation, we have the same output for the first neuron(s) $y$, which we will now denote $\mathbf{y}^{(1)}$. Then, we feed these outputs to a second layer of neurons, which gives

$$y^{(2)}_{j} = \sigma\left(\sum_{i} w^{(2)}_{ji} \, y^{(1)}_{i} + b^{(2)}_{j}\right)$$

Finally, we will rely on the same loss as in the previous exercise, but the outputs used are $\mathbf{y}^{(2)}$ instead of $y$. As in the previous case, we now need to compute the derivatives of the weights and biases for several layers $l$. However, you should see that some form of generalization might be possible for any number of layers.
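To make the layer-wise idea concrete, the following Python sketch (the tutorial's own scripts are MATLAB) feeds the outputs of a first layer of threshold neurons into a second layer. The weights are hand-picked for illustration rather than learned, and the names `step`, `layer` and `two_layer` are assumptions of this example.

```python
def step(z):
    """Threshold activation: 1 if z > 0, else 0 (the 0 branch is an assumption)."""
    return 1.0 if z > 0 else 0.0

def layer(inputs, weights, biases):
    """Outputs of one layer: step(sum_i w_ji * input_i + b_j) for each neuron j."""
    return [step(sum(w * v for w, v in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

# Hand-picked (not learned) parameters, chosen only to illustrate the principle.
W1 = [[1.0, 1.0],    # hidden neuron 1 fires when x1 OR x2 is active
      [1.0, 1.0]]    # hidden neuron 2 fires when x1 AND x2 are active
b1 = [-0.5, -1.5]
W2 = [[1.0, -1.0]]   # output neuron: "OR but not AND"
b2 = [-0.5]

def two_layer(x):
    """Feed the first-layer outputs y1 into the second layer and return y2."""
    y1 = layer(x, W1, b1)
    y2 = layer(y1, W2, b2)
    return y2[0]

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, two_layer(x))
```

Each hidden neuron still draws a single line in the input space; it is the second layer combining the two half-planes that yields the XOR region, which no single neuron can produce.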

**Exercise**

- Compute the derivatives for the last layer specifically
- Define a generalized derivative for any previous layer

**Solution**

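In order to *propagate* the derivatives, we can simply rely on the chain rule. Writing $z^{(l)}_{j} = \sum_{i} w^{(l)}_{ji} \, y^{(l-1)}_{i} + b^{(l)}_{j}$ and $y^{(l)}_{j} = \sigma(z^{(l)}_{j})$ (with $\mathbf{y}^{(0)} = \mathbf{x}$), we have

$$\frac{\partial \mathcal{L}}{\partial w^{(l)}_{ji}} = \frac{\partial \mathcal{L}}{\partial y^{(l)}_{j}} \, \frac{\partial y^{(l)}_{j}}{\partial z^{(l)}_{j}} \, \frac{\partial z^{(l)}_{j}}{\partial w^{(l)}_{ji}}$$

Therefore, for a single example with desired output $\mathbf{d}$, we can compute the derivative for the last layer as

$$\delta^{(2)}_{j} = \frac{\partial \mathcal{L}}{\partial z^{(2)}_{j}} = -2 \left(d_{j} - y^{(2)}_{j}\right) \sigma'\left(z^{(2)}_{j}\right), \qquad \frac{\partial \mathcal{L}}{\partial w^{(2)}_{ji}} = \delta^{(2)}_{j} \, y^{(1)}_{i}, \qquad \frac{\partial \mathcal{L}}{\partial b^{(2)}_{j}} = \delta^{(2)}_{j}$$

And for any previous layer, we rely on the development of the chain rule giving

$$\delta^{(l)}_{i} = \left(\sum_{j} w^{(l+1)}_{ji} \, \delta^{(l+1)}_{j}\right) \sigma'\left(z^{(l)}_{i}\right), \qquad \frac{\partial \mathcal{L}}{\partial w^{(l)}_{ik}} = \delta^{(l)}_{i} \, y^{(l-1)}_{k}, \qquad \frac{\partial \mathcal{L}}{\partial b^{(l)}_{i}} = \delta^{(l)}_{i}$$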
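The chain rule can be turned into code almost literally. The Python sketch below (the tutorial's own scripts are MATLAB) assumes a sigmoid activation and a squared-error loss for a single example, and works for any number of layers. The function names, the random initialization range, the learning rate and the number of epochs are illustrative assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Return all activations [y0 = x, y1, ..., yL]; layers is a list of (W, b)."""
    ys = [list(x)]
    for W, b in layers:
        prev = ys[-1]
        ys.append([sigmoid(sum(w * v for w, v in zip(row, prev)) + bj)
                   for row, bj in zip(W, b)])
    return ys

def loss(x, d, layers):
    """Squared error between desired output d and the last-layer output."""
    return sum((dk - yk) ** 2 for dk, yk in zip(d, forward(x, layers)[-1]))

def backward(x, d, layers):
    """Gradients (dL/dW, dL/db) for every layer, obtained with the chain rule."""
    ys = forward(x, layers)
    # Last layer: dL/dz = -2 (d - y) * sigma'(z), with sigma'(z) = y (1 - y).
    delta = [-2.0 * (dk - yk) * yk * (1.0 - yk) for dk, yk in zip(d, ys[-1])]
    grads = []
    for l in range(len(layers) - 1, -1, -1):
        W, _ = layers[l]
        prev = ys[l]
        grads.append(([[dj * v for v in prev] for dj in delta], list(delta)))
        if l > 0:
            # Previous layer: send delta back through W, then through sigma'.
            delta = [sum(W[j][i] * delta[j] for j in range(len(W)))
                     * prev[i] * (1.0 - prev[i]) for i in range(len(prev))]
    return grads[::-1]

# Illustrative training run on the XOR patterns (2 hidden neurons, 1 output).
rng = random.Random(0)

def init(n_in, n_out):
    return ([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [rng.uniform(-1, 1) for _ in range(n_out)])

layers = [init(2, 2), init(2, 1)]
patterns = [((0, 0), [0.0]), ((0, 1), [1.0]), ((1, 0), [1.0]), ((1, 1), [0.0])]
eta = 0.5
before = sum(loss(x, d, layers) for x, d in patterns)
for _ in range(2000):
    for x, d in patterns:
        for (W, b), (gW, gb) in zip(layers, backward(x, d, layers)):
            for j in range(len(W)):
                for i in range(len(W[j])):
                    W[j][i] -= eta * gW[j][i]
                b[j] -= eta * gb[j]
after = sum(loss(x, d, layers) for x, d in patterns)
print("loss before:", before, "loss after:", after)
```

A useful habit when writing such code is to compare each analytic derivative against a finite difference of `loss`; whether the run above fully solves XOR depends on the random initialization, which is precisely what the re-run questions of the exercise explore.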

We provide the prototypical set of XOR values in `xorPat.mat`, along with their class values in `xorAns.mat`. The variables that will be used by your code are the following.

**Exercise**

- Update the forward propagation and error computation (compared to desired).
- Update the back-propagation part to learn the weights of both layers.
- Run the learning, which should produce a result similar to that displayed below.
- Perform multiple re-runs of the learning procedure (re-launching with different initializations)
- What observations can you make on the learning process?
- What happens if you initialize all weights to zeros?
- (Optional) Implement the *sparsity* constraint in your neural network.
- (Optional) Implement the *weight decay* constraint in your network.
- (Optional) Add the *momentum* to the learning procedure.

**Expected output**

**Optional questions**

*Weight decay* constraint

As nothing constrains the weights in the network, weight vectors that differ only by a multiplicative factor can be equivalent, which can stall the learning (and lead to exploding weights). *Weight decay* regularizes the learning by penalizing weights with too large an amplitude. The idea is to add this constraint as a term in the final loss (which puts an indirect “pressure” on the learning process). Therefore, the final loss will be defined as

$$\mathcal{L}_{\text{final}} = \mathcal{L} + \lambda \sum_{l} \sum_{j,i} \left(w^{(l)}_{ji}\right)^{2}$$

where the parameter $\lambda$ controls the relative importance of the two terms.
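A minimal Python sketch of this idea follows (the tutorial's own scripts are MATLAB). It assumes the penalty is the sum of squared weights, a common choice, so that each weight contributes an extra `2 * lam * w` to its gradient. The function names and the numbers are illustrative only.

```python
def weight_penalty(W):
    """Sum of squared entries of one weight matrix (list of rows)."""
    return sum(w * w for row in W for w in row)

def decayed_loss(data_loss, weight_matrices, lam):
    """Data loss plus lam times the squared-weight penalty over all layers."""
    return data_loss + lam * sum(weight_penalty(W) for W in weight_matrices)

def decayed_gradient(grad_W, W, lam):
    """Gradient of the penalized loss: data gradient plus 2 * lam * W."""
    return [[g + 2.0 * lam * w for g, w in zip(g_row, w_row)]
            for g_row, w_row in zip(grad_W, W)]

# Illustrative values: one layer with two weights and a zero data gradient.
W = [[1.0, -2.0]]
print(decayed_loss(0.5, [W], lam=0.1))
print(decayed_gradient([[0.0, 0.0]], W, lam=0.1))
```

With a zero data gradient, the remaining term pulls every weight toward zero in proportion to its size, which is the “pressure” mentioned above; biases are left out of the penalty in this sketch.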

*Momentum* in learning

Usually, in complex problems, the gradient can be very noisy and, therefore, the learning might oscillate widely. In order to reduce this problem, we can *smooth* the successive gradient updates by retaining the values of the gradient at each iteration and then performing an update based on the latest gradient $g^{t}$ and the gradient at the previous iteration $g^{t-1}$. Therefore, a gradient update is applied as

$$g^{t}_{\text{final}} = g^{t} + m \, g^{t-1}$$

with $m$ the momentum parameter, which controls the amount of gradient smoothing.
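The smoothing described above can be sketched as follows in Python (the tutorial's own scripts are MATLAB). The update combines the latest gradient with the gradient of the previous iteration, as in the text; the toy quadratic loss, the function names and the values of the learning rate `eta` and momentum `m` are assumptions made for the example.

```python
def momentum_step(w, grad, prev_grad, eta, m):
    """Apply w <- w - eta * (grad + m * prev_grad), element-wise."""
    return [wi - eta * (g + m * pg) for wi, g, pg in zip(w, grad, prev_grad)]

def gradient(w):
    """Gradient of the toy loss sum_i (w_i - 3)^2 (illustrative only)."""
    return [2.0 * (wi - 3.0) for wi in w]

w = [0.0, 10.0]
prev = [0.0 for _ in w]   # no previous gradient at the first iteration
for _ in range(100):
    g = gradient(w)
    w = momentum_step(w, g, prev, eta=0.1, m=0.5)
    prev = g              # retain the gradient for the next iteration
print(w)
```

Many libraries instead accumulate a running “velocity” over all past gradients; the version here keeps only one previous gradient in order to stay close to the description above.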

## 2.3 - 3-layer audio classification

Finally, we will tackle a complete audio classification problem and try to perform neural network learning on a set of audio files. The data structure will be the same as the one used for parts 1 and 2. As discussed during the course, even though a 2-layer neural network can provide non-linear boundaries, it cannot produce “holes” inside those regions. In order to obtain an improved classification, we will now rely on a 3-layer neural network. The modifications to the code of section 2.2 should be minimal, as the back-propagation for the new layer is similar to that of the two others. We do not develop the math here, as it is simply a re-application of the previous rules with an additional layer (whose derivatives you should have generalized in the previous exercise).

However, up until now, we have only addressed *binary classification* problems, but this time we need to obtain a decision rule for multiple classes. Therefore, we cannot rely on simply computing the distance between desired patterns and the obtained binary value. The idea here is to rely on *softmax regression*, by considering classes as a vector of probabilities. The desired answers will therefore be considered as a set of *probabilities* $d_{k}$ over the $K$ classes, where the desired class is $1$ and the others are $0$ (called *one-hot* representation). Then, the cost function will rely on the softmax formulation

$$\mathcal{L} = -\sum_{k=1}^{K} d_{k} \log p_{k}$$

Therefore, we compute the output of the softmax by taking, with $z_{k}$ the input of the softmax layer for class $k$,

$$p_{k} = \frac{e^{z_{k}}}{\sum_{j=1}^{K} e^{z_{j}}}$$

By taking derivatives, we can show that the gradient of the softmax layer is

$$\frac{\partial \mathcal{L}}{\partial z_{k}} = p_{k} - d_{k}$$
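The softmax output, its cost and its gradient can be sketched as below in Python (the tutorial's own scripts are MATLAB). The sketch assumes a cross-entropy cost against a one-hot desired vector, for which the gradient with respect to the softmax inputs is the difference between predicted and desired probabilities. The function names and the example scores are illustrative.

```python
import math

def softmax(z):
    """Turn a list of scores z into probabilities that sum to 1."""
    m = max(z)                       # subtracting the max avoids overflow
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, d):
    """Cost -sum_k d_k * log(p_k) for a one-hot desired vector d."""
    return -sum(dk * math.log(pk) for dk, pk in zip(d, softmax(z)))

def softmax_gradient(z, d):
    """Gradient of the cost with respect to the scores: p_k - d_k."""
    return [pk - dk for pk, dk in zip(softmax(z), d)]

# Illustrative scores for 3 classes, with the second class as desired answer.
z = [2.0, 1.0, 0.1]
d = [0.0, 1.0, 0.0]
print(softmax(z))
print(cross_entropy(z, d))
print(softmax_gradient(z, d))
```

The gradient is negative only for the desired class, so a gradient step raises that score and lowers the others; it then propagates to the earlier layers with the same chain rule as before.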

**Exercise**

- Based on the previous neural network, upgrade the code to a 3-layer neural network
- Implement the *softmax regression* on top of your 3-layer network
- Use the provided code to perform classification on a pre-defined set of features
- As previously, change the set of features to assess their different accuracies
- Evaluate the neural network accuracy for all features combinations
- What happens if the learning rate is too large? What is this phenomenon called?
- (Optional) Perform a more advanced visualization of the learning process.

**Expected output**