Mathematics of Multi-Layer Perceptron

What I present below is a very concise and self-contained summary of everything you need to know to implement (or understand others' implementations of) the algorithm for training a Multi-Layer Perceptron (MLP) through gradient descent.  It also equips you for an in-depth understanding of generalisations of this fundamental algorithm.

I call this the "MLP cheatsheet".  You can also listen to this lecture on YouTube:


MLP cheatsheet

Everything you need to know is summarised in the following chart (open the image in a separate window for a larger resolution).  I'll take you through the elements of this chart bit by bit below.

MLP configuration

The cheatsheet above describes an MLP network with 'd' inputs and 'm' outputs.  That is, we are building a network function as follows:
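Since the chart's formulas appear in this text only as an image, the renderings below are my reconstruction of them in LaTeX; they should match the chart up to notation.  The network function has the shape

    f \colon \mathbb{R}^d \to \mathbb{R}^m, \qquad f(x_1, \dots, x_d) = \big( f_1(x_1, \dots, x_d), \dots, f_m(x_1, \dots, x_d) \big).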

Indices

The mathematical representation of an MLP requires multi-index notation.  For the sake of consistency, I use 'k' for the index of a layer and 'j' for the index of a neuron within that layer.
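In the reconstructed formulas below I also use 's' for the index of a neuron in a neighbouring layer, and I write w_{js}^{(k)} for the weight connecting neuron 's' of layer k-1 to neuron 'j' of layer k.  This convention is my assumption; the chart may draw the same indices slightly differently.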

Neurons

An MLP network has two types of neurons: "input" neurons (k = 0) and "computed" neurons (k > 0).

Input neurons

The network has 'd' input neurons, each represented by a function b of several variables, as follows:
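A rendering consistent with the description (reconstruction): the input-layer functions simply return the corresponding input coordinate,

    b_j^{(0)}(x_1, \dots, x_d) = x_j, \qquad j = 1, \dots, d.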

Computed neurons

The network has 'p' layers of computed neurons, each neuron represented by two functions of several variables, a and b:
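A standard rendering consistent with the rest of the text (my reconstruction; bias terms are not written out and, if the chart uses them, they can be absorbed as weights of an extra constant input):

    a_j^{(k)} = \sum_{s} w_{js}^{(k)} \, b_s^{(k-1)}, \qquad
    b_j^{(k)} = \sigma\big(a_j^{(k)}\big), \qquad
    \sigma(t) = \frac{1}{1 + e^{-t}}.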

Network function

The network function is given by the 'b' functions of the last ('p'-th) layer:
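That is (reconstruction):

    f_j(x) = b_j^{(p)}(x), \qquad j = 1, \dots, m.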

Forward propagation

The forward propagation formula (FPF) is the recursive process in which, for a given fixed value of the inputs 'x', the value of each function 'a' and 'b' is computed one after another until the last layer is reached and the value of the network function 'f' is found.


The recursive formulae above depend on a set of parameters 'w' (the weights), which are fixed for the duration of the FPF computation.  That is, the functions 'a' and 'b' and the network function 'f' also depend on the set of weights, but this dependence is not explicitly reflected in the notation.

Depending on available programming tools and libraries, every step of the FPF computation may be implemented directly or with the help of matrix multiplication operations.
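For illustration, here is a minimal NumPy sketch of the FPF under the formulation reconstructed above (logistic sigmoid activation, no bias terms).  The function and variable names are mine, not part of the cheatsheet.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def forward(weights, x):
        """Forward propagation (FPF).

        weights -- list of p matrices; weights[k-1] is W^(k) with shape (n_k, n_{k-1})
        x       -- input vector of length d
        Returns the lists of 'a' and 'b' values for every layer;
        b[0] is the input layer and b[-1] is the network output f(x).
        """
        a = [None]                        # the input layer has no 'a' values
        b = [np.asarray(x, dtype=float)]  # b^(0) = x
        for W in weights:
            a_k = W @ b[-1]               # a_j^(k) = sum_s w_js^(k) * b_s^(k-1)
            b_k = sigmoid(a_k)            # b_j^(k) = sigma(a_j^(k))
            a.append(a_k)
            b.append(b_k)
        return a, b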

Backward propagation

Backward propagation (backpropagation) computes the following partial derivatives, for every computed neuron:
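The chart's definition of the deltas is not reproduced in this text.  A definition consistent with the surrounding description (deltas depend only on 'x' and 'w', and the output-layer deltas are obtained by differentiating the sigmoid) is, as my assumption, one delta per pair of output 'i' and computed neuron 'j' in layer 'k':

    \delta_{ij}^{(k)} = \frac{\partial f_i}{\partial a_j^{(k)}} = \frac{\partial b_i^{(p)}}{\partial a_j^{(k)}}.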
As above, the delta functions also depend on the set of weights.  That is, the backpropagation computation (BPF) finds the values of the deltas for given fixed values of the inputs 'x' and weights 'w'.

Deltas for output layer

BPF starts with finding the deltas for the output layer (k = p), which is a direct differentiation of the sigmoid function:
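In the convention above this reads (reconstruction), using \sigma'(t) = \sigma(t)\,(1 - \sigma(t)):

    \delta_{ij}^{(p)} = \begin{cases} b_i^{(p)}\big(1 - b_i^{(p)}\big), & i = j, \\ 0, & i \neq j. \end{cases}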

Deltas for computed neurons (other than output neurons)

This is the first hard mathematical part of the entire algorithm.  The formula itself is the following recursive relation:
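With the notation above, the relation can be written as (reconstruction):

    \delta_{ij}^{(k)} = b_j^{(k)}\big(1 - b_j^{(k)}\big) \sum_{s} w_{sj}^{(k+1)} \, \delta_{is}^{(k+1)}.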
This is another place where matrix multiplication libraries can simplify the coding.
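Continuing the NumPy sketch above, and still assuming the reconstructed convention delta_ij^(k) = \partial f_i / \partial a_j^{(k)}, the recursion becomes a couple of matrix products (a sketch of mine, not code from the cheatsheet):

    def backward(weights, b):
        """Backpropagation (BPF): deltas delta_ij^(k) = d f_i / d a_j^(k).

        weights -- list of p matrices; weights[k-1] is W^(k) with shape (n_k, n_{k-1})
        b       -- list of 'b' values returned by forward()
        Returns a list where delta[k] is an (m, n_k) matrix, for k = 1, ..., p.
        """
        p = len(weights)
        delta = [None] * (p + 1)
        # output layer: direct differentiation of the sigmoid
        delta[p] = np.diag(b[p] * (1.0 - b[p]))
        # earlier layers: the recursive relation, expressed via matrix products
        for k in range(p - 1, 0, -1):
            # weights[k] is W^(k+1) in this indexing
            delta[k] = (delta[k + 1] @ weights[k]) * (b[k] * (1.0 - b[k]))
        return delta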

Proof of deltas recursive relation

In order to
  • understand the reason/proof for the above recursive relation; or
  • generalise this relation to other network topologies, activation functions, etc.,
you need an in-depth understanding of the Chain Rule for functions of several variables.  Unless you are confident with the Chain Rule, you can jump to the next section below.

For those of you who know the Chain Rule, I point out the following key fact: a minor perturbation of one 'a' value induces changes in the 'b' value in the same layer and in all of the 'a' and 'b' functions in the layers to the right.

Hence, the Chain Rule gives:
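The identities themselves are shown in the chart; a version consistent with the recursion above would be (reconstruction):

    \frac{\partial f_i}{\partial a_j^{(k)}} = \sum_{s} \frac{\partial f_i}{\partial a_s^{(k+1)}} \cdot \frac{\partial a_s^{(k+1)}}{\partial b_j^{(k)}} \cdot \frac{\partial b_j^{(k)}}{\partial a_j^{(k)}}, \qquad
    \frac{\partial a_s^{(k+1)}}{\partial b_j^{(k)}} = w_{sj}^{(k+1)}, \qquad
    \frac{\partial b_j^{(k)}}{\partial a_j^{(k)}} = b_j^{(k)}\big(1 - b_j^{(k)}\big).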

The latter two identities deliver the recursive relation for the deltas mentioned above.

Training set

I denote the training set as follows:
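A typical notation matching the rest of the text (reconstruction; 'N' denotes the number of training pairs):

    T = \big\{ (x_t, y_t) \big\}_{t=1}^{N}, \qquad x_t \in \mathbb{R}^d, \quad y_t \in \mathbb{R}^m.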

Error function over training set

For the given training set, the error function is as follows:
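A standard choice consistent with the 'differentiation of a dot product' step below (reconstruction):

    E(w) = \frac{1}{2} \sum_{t=1}^{N} \big\| f(x_t) - y_t \big\|^2
         = \frac{1}{2} \sum_{t=1}^{N} \sum_{i=1}^{m} \big( f_i(x_t) - y_{ti} \big)^2.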
Note that the error function no longer depends on the inputs 'x' but only on the set of weights 'w'.

Weight derivatives of error function

This is the second part where "heavy" mathematics is needed.  For the given set of weights, the weight derivatives of the error function are as follows:
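Under the conventions reconstructed above:

    \frac{\partial E}{\partial w_{js}^{(k)}} = \sum_{t=1}^{N} \sum_{i=1}^{m} \big( f_i(x_t) - y_{ti} \big) \, \delta_{ij}^{(k)}(x_t) \, b_s^{(k-1)}(x_t).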


Again, you can jump to the next section if you struggle with Several Variable Calculus.

Otherwise, the first step here is based on the following differentiation of a dot product:
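With the squared-error function written above, this step gives (reconstruction):

    \frac{\partial E}{\partial w_{js}^{(k)}} = \sum_{t=1}^{N} \sum_{i=1}^{m} \big( f_i(x_t) - y_{ti} \big) \, \frac{\partial f_i}{\partial w_{js}^{(k)}}(x_t).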
The second step is another application of the Chain Rule (the first line below shows the propagation of a minor perturbation of the weight w_{js}^{(k)}):
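A reconstruction of these two lines: a perturbation of w_{js}^{(k)} reaches the outputs only through a_j^{(k)}, and \partial a_j^{(k)} / \partial w_{js}^{(k)} = b_s^{(k-1)}, hence

    \frac{\partial f_i}{\partial w_{js}^{(k)}}
        = \frac{\partial f_i}{\partial a_j^{(k)}} \cdot \frac{\partial a_j^{(k)}}{\partial w_{js}^{(k)}}
        = \delta_{ij}^{(k)} \, b_s^{(k-1)}.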

Training, i.e., finding optimal weights

Training starts with choosing random values for every weight and then repeating, a few thousand times, the weight adjustment step called an "epoch", described below:

Epoch

For the given set of weights, the epoch consists of the following steps:
  1. For every training input 'x', forward propagation (FPF) is used to find the value of every 'a' and 'b' function and, by implication, the value of the network function.
  2. For every training input 'x', backpropagation (BPF) is used to find every 'delta' value.
  3. Use the values from the preceding steps to find the weight derivatives of the error function.
  4. Complete the epoch by adjusting every weight as follows (here gamma is a small positive number called the training rate; some implementations assume gamma = 1):
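In the reconstructed notation, the adjustment rule is presumably the usual gradient-descent step

    w_{js}^{(k)} \leftarrow w_{js}^{(k)} - \gamma \, \frac{\partial E}{\partial w_{js}^{(k)}}.

Tying the earlier sketches together, one epoch might look like the following (a hedged sketch; 'forward' and 'backward' are the hypothetical helpers defined above, and the squared-error function reconstructed earlier is assumed):

    import numpy as np

    def epoch(weights, training_set, gamma=0.1):
        """One weight-adjustment step ('epoch') of gradient descent."""
        grads = [np.zeros_like(W) for W in weights]
        for x, y in training_set:
            _, b = forward(weights, x)            # step 1: FPF
            delta = backward(weights, b)          # step 2: BPF
            residual = b[-1] - np.asarray(y, dtype=float)    # f(x) - y
            for k in range(1, len(weights) + 1):             # step 3: dE/dw
                # dE/dw_js^(k) += sum_i (f_i - y_i) * delta_ij^(k) * b_s^(k-1)
                grads[k - 1] += np.outer(residual @ delta[k], b[k - 1])
        for k in range(len(weights)):             # step 4: adjust the weights
            weights[k] -= gamma * grads[k]
        return weights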





