Back-propagation explained
A walk-through of the famous algorithm and the maths behind it
1. Introduction
Below you can find an example of a Feed-Forward Neural Network:
Our NN consists of the input layer, one hidden layer with 3 nodes, and the output layer with 2 nodes. You can also see the weights and biases of the network, the sigmoid activation functions, as well as the ground truth outputs for a specific training input sample. Finally, we present the NN structure expressed using matrix multiplications.
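As a rough sketch of that matrix form (written in numpy; the input size of 2 and the numeric values below are placeholders of mine, not the ones shown in the figure), the forward computation looks like this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Shapes follow the figure: a hidden layer with 3 nodes and an output
# layer with 2 nodes. The input size (2) and the weight/bias values
# below are placeholders, not the values from the figure.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal((3, 1))  # input -> hidden
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal((2, 1))  # hidden -> output

x = np.array([[0.5], [0.1]])   # input column vector
a1 = sigmoid(W1 @ x + b1)      # hidden activations
a2 = sigmoid(W2 @ a1 + b2)     # the two network outputs
```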
2. History and purpose of the Back-propagation algorithm
As we can see in reference [1]: “The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn’t fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams”.
Our main goal when training a NN is to find the weights and biases that minimize the total error. This can be done with the "gradient descent" algorithm, i.e. by slightly modifying the weights and biases in the direction opposite to the gradient, in order to reduce the overall cost function.
The problem that the back-propagation algorithm solves is that it provides a fast way to calculate these gradients, i.e. the partial derivatives of the cost function with respect to all the weights and biases.
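For reference, the gradient descent update for a single weight w and bias b, with learning rate η, is:

$$ w \rightarrow w' = w - \eta \frac{\partial C}{\partial w}, \qquad b \rightarrow b' = b - \eta \frac{\partial C}{\partial b} $$

and back-propagation is what gives us the partial derivatives ∂C/∂w and ∂C/∂b cheaply.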
3. The Cost (Error) function
As we mentioned before, our goal is to adjust the weights and biases such that the cost function is minimized (or at least the cost is reduced in every iteration). For our example, we will consider the quadratic cost function, or Mean Squared Error (MSE):
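In the notation of [1], this cost has the form:

$$ C(w, b) = \frac{1}{2n} \sum_x \lVert y(x) - a^L(x) \rVert^2 $$

where n is the number of training samples, y(x) is the ground truth output for sample x and a^L(x) is the output of the network.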
The above cost function is a sum over individual costs of all input training samples x, and a good property is that in order to minimize it, we just have to minimize the individual errors across all samples.
Here we need to note that in practice we perform the updates of weights in batches of training samples, i.e. we compute the Δw by taking the average of gradients across the training samples x of each batch.
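As a minimal sketch of such a mini-batch update (the grad(x, y) helper is hypothetical; it stands for whatever routine returns the per-sample gradients, e.g. back-propagation itself):

```python
# Hypothetical helper: grad(x, y) returns (dC/dw, dC/db) for one sample (x, y).
def update_mini_batch(w, b, batch, eta, grad):
    grad_w = sum(grad(x, y)[0] for x, y in batch) / len(batch)  # average dC/dw over the batch
    grad_b = sum(grad(x, y)[1] for x, y in batch) / len(batch)  # average dC/db over the batch
    return w - eta * grad_w, b - eta * grad_b                   # one gradient descent step
```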
Another very important property of the cost function C is that it is expressed as a function of the NN's outputs a (which are themselves functions, namely sigmoids, of the weights and biases from previous layers); this fact hints at the backward nature of the algorithm.
4. Common pitfalls in understanding
Before we go to the actual algorithm, I would like to point out some common pitfalls in understanding (at least for myself 😎) regarding the activation functions, cost functions and metrics. For that reason we will examine the case [3] of a binary classification Neural Network using the popular Keras framework:
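A minimal sketch along the lines of [3] (the layer sizes, optimizer and input dimension here are illustrative, not necessarily those of the tutorial):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

model = Sequential([
    Input(shape=(60,)),              # 60 input features (illustrative)
    Dense(60, activation='relu'),    # hidden layer
    Dense(1, activation='sigmoid'),  # output squashed into [0, 1]
])
# The loss (binary_crossentropy) is the only thing the network optimizes;
# 'accuracy' is just reported as a metric.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```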
The important thing here is to clearly distinguish between the activation functions and the cost function. In the above example, what evaluates the NN's loss is the binary_crossentropy and nothing else; the sigmoid is selected as the activation function of the last layer because it produces continuous output values in the [0, 1] range that the cost function can interpret as probabilities, and we could have used tanh instead of sigmoid there. Finally, the metric does not play any role in the NN structure; it is just an alternative measurement of the error/accuracy. We could choose the area under the ROC curve (AUC) as the metric, but this would not mean that our network is able to directly optimize it.
Always remember that our Neural Network can optimize the weights and biases only with respect to the cost function.
5. Step by step Back-propagation examples
In order to understand back-propagation we will study some specific weights of our example NN and try to calculate the partial derivatives of the cost function C with respect to them.
In the below figure we can see the cost function expressed through the NN outputs for a given input vector x:
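For our NN with two outputs, and using the conventional 1/2 factor as in [2], the per-sample cost reads (a₁, a₂ are the outputs, y₁, y₂ the ground truths):

$$ C_x = \frac{1}{2}\Big[(y_1 - a_1)^2 + (y_2 - a_2)^2\Big] $$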
Here it is important to understand that the cost (or loss) function depends only on the weights and biases for a given set of data set points, because inputs and ground truth values (labels) are fixed.
Let's now go to a weight example and try to find the partial derivative of the cost function C with respect to w₇:
As we can see above, only the 1st output of the NN is affected by w₇, which means that the partial derivative of the cost term related to the 2nd output is 0.
In the below figure we can see that the 1st term of the cost function C, viewed as a function of the weight w₇, is a nested function, i.e. C = C(sigmoid(zeta(w₇))), so in order to find its partial derivative we must use the chain rule:
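Writing it out (z₁ denotes the weighted input of the 1st output node, a₁ its activation, y₁ its ground truth, and aʰ the hidden activation multiplied by w₇; the indexing is illustrative since the figure is not reproduced here):

$$ \frac{\partial C_1}{\partial w_7} = \frac{\partial C_1}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_7} = (a_1 - y_1)\,\sigma'(z_1)\,a^{h} $$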
Note that all the values needed above are available from the feed-forward step of the back-propagation algorithm, as we will see in section 6.
Now we'll go for another example, this time in the hidden layer, and try to find the partial derivative of the cost function C with respect to the weight w₁:
As we can see above, both outputs of the NN are affected by w₁, and thus both terms of the cost function C depend on this weight. For simplicity we will take the 1st term of the cost function C with respect to w₁, which again is a nested function:
C = C(sigmoid(zeta(sigmoid(zeta(w₁))))), so in order to find the partial derivative we must use the chain rule:
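Schematically (with illustrative indexing: w₁ multiplies the input x₁ and feeds a hidden node with weighted input zʰ and activation aʰ, which reaches the 1st output node through some weight w_out):

$$ \frac{\partial C_1}{\partial w_1} = (a_1 - y_1)\,\sigma'(z_1)\,w_{out}\,\sigma'(z^{h})\,x_1 $$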
If we carefully check the algorithm's equations in section 6 below, we will see that they agree with the calculations above.
6. The Back-propagation algorithm
Below you can find the complete algorithm, taken from reference [1]. We will not give any proof of the equations here (you can find one in the book), but you can check that they are valid by applying them to the examples of section 5:
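In compact form, the equations of [1] are (δˡ is the error vector of layer l, ⊙ the element-wise product, σ the activation function):

$$ \delta^L = \nabla_a C \odot \sigma'(z^L), \qquad \delta^l = \big((w^{l+1})^T \delta^{l+1}\big) \odot \sigma'(z^l), \qquad \frac{\partial C}{\partial b^l_j} = \delta^l_j, \qquad \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k\,\delta^l_j $$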
Regarding steps 1 and 2, it is easy to see in our Figure 1 how to calculate the outputs using matrix multiplications; these two steps together are called the "forward pass".
Step 3 is the calculation of the error vector in the output layer; it involves the derivative of the cost function with respect to the outputs a and the derivative of the sigmoid function with respect to its weighted input z.
Step 4, also called the "backward pass", is the calculation of the error vectors of the intermediate layers, based on the errors already computed for the layers after them.
Step 5 gives the desired gradients for each weight and bias in the network; their calculation is straightforward given the quantities obtained during the forward and backward passes.
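Putting steps 1 to 5 together for our two-layer example, a compact numpy sketch (a simplification of the full implementation discussed in section 7; the variable names are mine) could be:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

def backprop(x, y, W1, b1, W2, b2):
    # Steps 1-2 (forward pass): keep the weighted inputs z and activations a.
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    # Step 3: error vector of the output layer (quadratic cost => dC/da = a - y).
    delta2 = (a2 - y) * sigmoid_prime(z2)
    # Step 4 (backward pass): error vector of the hidden layer.
    delta1 = (W2.T @ delta2) * sigmoid_prime(z1)
    # Step 5: gradients of the cost with respect to weights and biases.
    grad_W2, grad_b2 = delta2 @ a1.T, delta2
    grad_W1, grad_b1 = delta1 @ x.T, delta1
    return grad_W1, grad_b1, grad_W2, grad_b2
```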
7. Back-propagation implementation/code
I have adapted the simplest Neural Network from the original code of Michael Nielsen's repo so that it works with Python 3.8; you can find it here:
https://github.com/dimleve/neural-networks-and-deep-learning-Python-3
You can find the algorithm implementation in network.py.
References
[1] Michael Nielsen “Neural Networks and Deep Learning”:
http://neuralnetworksanddeeplearning.com/
[2] Matt Mazur “A Step by Step Backpropagation Example”:
https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
[3] Jason Brownlee “Binary Classification Tutorial with the Keras Deep Learning Library”:
https://machinelearningmastery.com/binary-classification-tutorial-with-the-keras-deep-learning-library/