How Does Backpropagation Work?

Deriving the mathematics behind backpropagation.

Now that we know how neural networks function, and have built up some intuition about how to train them with backpropagation, it's time to look at how we can describe it mathematically.

The best way to learn how something works mathematically is to understand the why behind it, which is why I want you to go through the extra trouble of deriving the mathematics yourself.

Warning: This essay is going to be heavy on the math. If you're allergic to math, you might want to check out the more intuitive version (coming soon), but with that said, backpropagation is really not as hard as you might think.

Backpropagation is not as hard as you might think

This article assumes familiarity with forward propagation, and neural networks in general. If you haven't already, I recommend reading What is a Neural Network first.

Recall that the weights in a neural network are updated by minimizing an error function that describes how wrong the neural network's current hypothesis is.

If x is the input to the first layer, and L is the number of layers in the network, then a^{(L)}(x) is the network's hypothesis: h(x). Finally, let m be the number of examples, n^{(k)} the number of neurons in layer k, and let y(x) be the correct answer given the input x.
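To make the notation concrete, here is one way you might lay out the parameters in NumPy. This is purely a sketch: the layer sizes and the names theta and bias are my own assumptions, not anything fixed by the essay.

```python
import numpy as np

# Hypothetical layer sizes: n^(0) = 3 inputs, n^(1) = 4 hidden neurons, n^(2) = 2 outputs.
layer_sizes = [3, 4, 2]

rng = np.random.default_rng(0)

# theta[k] has shape (n^(k), n^(k-1)): row i holds the weights feeding neuron i of layer k.
theta = [rng.standard_normal((layer_sizes[k], layer_sizes[k - 1]))
         for k in range(1, len(layer_sizes))]

# One bias per neuron in each layer past the input.
bias = [np.zeros(layer_sizes[k]) for k in range(1, len(layer_sizes))]
```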

In order for an error function to be suitable for backpropagation, the average error should be computable using:

E = \frac{1}{m} \sum_{i=1}^{m} E(x_i,y_i)

Where E(x,y) is the error function for a specific example.

This is necessary if the backpropagation procedure is to update the weights on the basis of more than one example, which results in a more direct route toward convergence and is generally preferred.

Furthermore, we assume that the error function can be written as a function of the network's hypothesis h(x) and the correct answer y(x).

One simple error function that satisfies these requirements, and which you probably already know, is the mean squared error (MSE) defined as:

E(x,y) = (y-x)^2

For a single example, and for multiple examples:

E = \frac{1}{m} \sum_{i=1}^{m} (y_i-x_i)^2

We see that not only does it satisfy the averaging constraint, but it also only depends on the hypothesis (noted as x), and the actual answer (noted as y).

For notational simplicity, for the rest of the essay, we will omit the function variables, so E(x,y) becomes E.
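Before moving on, here is a minimal NumPy sketch of the averaged MSE above. The function name and the example values are my own, purely for illustration.

```python
import numpy as np

def mse(hypotheses, targets):
    """Average squared error over m examples: E = (1/m) * sum (y_i - x_i)^2."""
    hypotheses = np.asarray(hypotheses, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean((targets - hypotheses) ** 2)

# Example: three predictions against three targets.
print(mse([0.9, 0.2, 0.4], [1.0, 0.0, 0.5]))  # -> approximately 0.02
```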


Backpropagation

Recall that we use backpropagation to find each individual weight's contribution to the error function, which is then used during gradient descent to update the weights.

Backpropagation is just figuring out how awful each weight is

In other words, backpropagation attempts to find:

\frac{\partial E}{\partial \theta_{i,j}^{(k)}}

In order to find this, we introduce a new variable \delta_i^{(k)}, which is the error-sum of neuron i in layer k; sometimes called the delta error. The delta error is defined as:

\delta_i^{(k)}=\frac{\partial E}{\partial z_i^{(k)}}

Recall that z_i^{(k)} is the raw output signal of neuron i in layer k, before the activation function has been applied.

During backpropagation, we will find a way of computing the delta error, and translate it to \frac{\partial E}{\partial \theta_{i,j}^{(k)}}.

We will derive \frac{\partial E}{\partial \theta_{i,j}^{(k)}} in three steps using the three equations summarized below:

  1. Find a way of computing \delta^{(L)} to initialize the process.
  2. Find a way of computing \delta^{(k)} given \delta^{(k+1)}.
  3. Find a way of computing \frac{\partial E}{\partial \theta_{i,j}^{(k)}} given \delta^{(k)}.

Equation 1

Recall that we can differentiate composite functions using the chain rule by:

\begin{aligned} f\big(g(x)\big)'&=f'\big(g(x)\big) \cdot g'(x) \Leftrightarrow \\ \frac{df}{dx} &= \frac{\partial f}{\partial g(x)} \cdot \frac{dg}{dx} \end{aligned}

The same principle holds for nested composite functions.

Using the chain rule, we can reformulate \delta_i^{(L)} in terms of the partial derivative of the activation function, a_i^{(L)}=g\big(z_i^{(L)}\big):

\begin{aligned} \delta_i^{(L)} &= \frac{\partial E}{\partial z_i^{(L)}} \Leftrightarrow \\ \delta_i^{(L)} &= \frac{\partial E}{\partial a_i^{(L)}} \cdot \frac{\partial a_i^{(L)}}{\partial z_i^{(L)}} \end{aligned}

Since:

\begin{aligned} a_i^{(L)} &= g\big(z_i^{(L)}\big) \Leftrightarrow \\ \frac{\partial a_i^{(L)}}{\partial z_i^{(L)}} &= g'\big(z_i^{(L)}\big) \end{aligned}

We can simplify the above to:

\delta_i^{(L)} = \frac{\partial E}{\partial a_i^{(L)}} \cdot g'\big(z_i^{(L)}\big)

We can vectorize the simplified equation by collecting each \frac{\partial E}{\partial a_i^{(L)}} in a gradient vector for the layer, \nabla_a^{(L)}. Similarly, we can collect the raw outputs z_i^{(L)} into a vector of all the raw outputs in the layer, z^{(L)}.

By doing so, we find the first equation:

\begin{aligned} \delta^{(L)} = \nabla_a^{(L)} \odot g'\big(z^{(L)}\big) &&(1) \end{aligned}

Where \odot is the Hadamard product; elementwise multiplication.
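As a concrete illustration, here is a small NumPy sketch of equation (1). The sigmoid activation, the squared error summed over the output neurons, and the variable names are my own assumptions for the example, not requirements of the derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Raw outputs of the last layer and the target for one example.
z_L = np.array([0.5, -1.2])
a_L = sigmoid(z_L)
y = np.array([1.0, 0.0])

# For E = sum_i (y_i - a_i)^2, the partial derivative dE/da_i = 2 * (a_i - y_i).
grad_a_L = 2.0 * (a_L - y)

# Equation (1): delta^(L) = grad_a E * g'(z^(L)), elementwise.
delta_L = grad_a_L * sigmoid_prime(z_L)
print(delta_L)
```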

Equation 2

While equation (1) describes the error of the last layer in terms of z^{(L)} and a^{(L)}, equation (2) describes the error of a layer in terms of the error of the layer in front of it. [1]

In order to achieve this, we rewrite \delta_i^{(k)}=\frac{\partial E}{\partial z_i^{(k)}} in terms of the next layer, k+1:

\delta_i^{(k+1)}=\frac{\partial E}{\partial z_i^{(k+1)}}

Once again, we use the chain rule.

\begin{aligned} \delta_i^{(k)}&=\frac{\partial E}{\partial z_i^{(k)}}\\ &= \sum_\beta^{n^{(k+1)}} \frac{\partial E}{\partial z_\beta^{(k+1)}} \frac{\partial z_\beta^{(k+1)}}{\partial z_i^{(k)}} \end{aligned}

Where n^{(k+1)} is the number of neurons in layer k+1.

Since \frac{\partial E}{\partial z_\beta^{(k+1)}} = \delta_\beta^{(k+1)}, we can rewrite the above as:

\begin{aligned} \delta_i^{(k)} = \sum_\beta^{n^{(k+1)}} \frac{\partial z_\beta^{(k+1)}}{\partial z_i^{(k)}} \cdot \delta_\beta^{(k+1)} &&(2.1) \end{aligned}

This works because the error from the earlier layers is carried over to the neurons in the later layers, which is also why we sum over all the neurons in layer k+1. Equation (2.1) can be interpreted as the total error caused by z_i^{(k)}, expressed in terms of the z_\beta^{(k+1)}.

We know from forward propagation that:

z_\beta^{(k+1)} = \sum_i^{n^{(k)}} \big( \theta_{\beta,i}^{(k+1)} \cdot a_i^{(k)} \big) + b_\beta^{(k+1)}

And since a_i^{(k)}=g\big(z_i^{(k)}\big), we can rewrite the above as:

z_\beta^{(k+1)} = \sum_i^{n^{(k)}} \Big( \theta_{\beta,i}^{(k+1)} \cdot g\big(z_i^{(k)}\big) \Big) + b_\beta^{(k+1)}

By differentiating z_\beta^{(k+1)} with respect to z_i^{(k)}, we find:

\frac{\partial z_\beta^{(k+1)}}{\partial z_i^{(k)}} = \theta_{\beta,i}^{(k+1)} \cdot g'\big(z_i^{(k)}\big)

By substituting this expression into equation (2.1), we find:

\delta_i^{(k)} = \sum_\beta^{n^{(k+1)}} \theta_{\beta,i}^{(k+1)} g'\big(z_i^{(k)}\big) \delta_\beta^{(k+1)}

If this is not obvious, I do encourage you to spend some time going through the equations in order to convince yourself that this is correct.

Finally, by vectorizing the above, we arrive at the final form for equation (2):

\begin{aligned} \delta^{(k)} = \big( \theta^{(k+1)^{T}} \delta^{(k+1)} \big) \odot g'\big( z^{(k)} \big) && (2) \end{aligned}
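To see equation (2) in code, here is a minimal NumPy sketch of a single backward step. The shapes and the choice of sigmoid derivative are my own assumptions for illustration.

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

rng = np.random.default_rng(0)

# Hypothetical shapes: layer k has 4 neurons, layer k+1 has 2 neurons.
theta_next = rng.standard_normal((2, 4))  # theta^(k+1), one row per neuron in layer k+1
delta_next = rng.standard_normal(2)       # delta^(k+1)
z_k = rng.standard_normal(4)              # raw outputs z^(k)

# Equation (2): delta^(k) = (theta^(k+1)^T delta^(k+1)) * g'(z^(k)), elementwise.
delta_k = (theta_next.T @ delta_next) * sigmoid_prime(z_k)
print(delta_k.shape)  # (4,)
```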

Equation 3

Equation (3) is derived in the exact same way as equations (1) and (2) using the chain rule, so I'll simply state the final form:

\begin{aligned} \frac{\partial E}{\partial \theta_{i,j}^{(k)}} = a_j^{(k-1)} \delta_i^{(k)} && (3) \end{aligned}
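If you want to verify equation (3) yourself, the chain-rule steps are short. This is my own spelling-out of the derivation the essay skips, using \alpha as the summation index:

\begin{aligned} \frac{\partial E}{\partial \theta_{i,j}^{(k)}} &= \frac{\partial E}{\partial z_i^{(k)}} \cdot \frac{\partial z_i^{(k)}}{\partial \theta_{i,j}^{(k)}} = \delta_i^{(k)} \cdot \frac{\partial z_i^{(k)}}{\partial \theta_{i,j}^{(k)}} \\ z_i^{(k)} &= \sum_\alpha^{n^{(k-1)}} \big( \theta_{i,\alpha}^{(k)} \cdot a_\alpha^{(k-1)} \big) + b_i^{(k)} \Rightarrow \frac{\partial z_i^{(k)}}{\partial \theta_{i,j}^{(k)}} = a_j^{(k-1)} \end{aligned}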

In equation (3), we see that an individual weight's contribution to the error function is equal to the scaled error it sends forward in the network.

Intuitively, this should make sense.

If we think of the error as throwing balls at a target: the percentage of balls missing the target is the delta error, the rate of throwing is the activation, and the total number of balls missing the target is those two multiplied together - which is exactly what we do.

a_j^{(k-1)} is how much a neuron is stimulated - how strong the output is, or the rate of throwing.

\delta_i^{(k)} is our throwing accuracy, or rather, in a team of athletes trying to hit the target, how much an individual contributes to the overall number of balls that didn't hit the target.

Finally, we can confirm that this also works for the bias unit, where a_0^{(k-1)}=1. It should just give us \delta_i^{(k)}, as there's no activation coefficient:

\frac{\partial E}{\partial \theta^{(k)}_{i,0}} = 1 \cdot \delta_i^{(k)} = \delta_i^{(k)}

Which we see it does.
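In code, equation (3) for a whole layer is just an outer product between the layer's delta error and the previous layer's activations. Here is a small NumPy sketch with hypothetical shapes and values, including the bias gradient.

```python
import numpy as np

# Hypothetical shapes: layer k has 2 neurons, layer k-1 has 4.
delta_k = np.array([0.3, -0.1])           # delta^(k)
a_prev = np.array([1.0, 0.5, 0.2, 0.8])   # a^(k-1)

# Equation (3): dE/dtheta_{i,j}^(k) = a_j^(k-1) * delta_i^(k), for every (i, j) pair.
grad_theta = np.outer(delta_k, a_prev)    # shape (2, 4), same shape as theta^(k)

# For the bias unit the "activation" is 1, so the gradient is just delta itself.
grad_bias = delta_k
print(grad_theta)
```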

Conclusion

Using these three equations, we can now describe the algorithm for backpropagation in a feedforward neural network.[2]

  1. We use equation (1) to calculate the delta error of the last layer.
\delta^{(L)} = \nabla_a^{(L)} \odot g'\big(z^{(L)}\big)
  2. We use the delta error of the last layer to initialize a recursive process of calculating the delta error of all the previous layers using equation (2):
\delta^{(k)} = \big( \theta^{(k+1)^{T}} \delta^{(k+1)} \big) \odot g'\big( z^{(k)} \big)
  3. We use the delta errors with equation (3) to calculate the derivative of the error function with respect to each weight in the neural network, which can be used in gradient descent:
\frac{\partial E}{\partial \theta_{i,j}^{(k)}} = a_j^{(k-1)} \delta_i^{(k)}

Finally, equations (1) and (2) can be combined into one recursive equation:

\delta^{(k)} = \begin{cases} \begin{aligned} \nabla_a^{(L)} \odot g'\big(z^{(L)}\big) &\quad\text{if } k=L\\ \big( \theta^{(k+1)^{T}} \delta^{(k+1)} \big) \odot g'\big( z^{(k)} \big) &\quad\text{otherwise} \end{aligned} \end{cases}
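To tie everything together, here is a minimal, self-contained NumPy sketch of the full procedure for a small feedforward network. The sigmoid activation, the squared error, and all variable names are my own assumptions for illustration, not part of the essay.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(theta, bias, x, y):
    """Return dE/dtheta and dE/dbias for one example using equations (1)-(3)."""
    # Forward pass: store raw outputs z^(k) and activations a^(k).
    a, zs = [x], []
    for W, b in zip(theta, bias):
        z = W @ a[-1] + b
        zs.append(z)
        a.append(sigmoid(z))

    # Equation (1): delta of the last layer, using E = sum (y - a)^2, so dE/da = 2(a - y).
    delta = 2.0 * (a[-1] - y) * sigmoid_prime(zs[-1])

    grads_theta = [None] * len(theta)
    grads_bias = [None] * len(bias)
    for k in reversed(range(len(theta))):
        # Equation (3): outer product of this layer's delta with the previous activations.
        grads_theta[k] = np.outer(delta, a[k])
        grads_bias[k] = delta
        if k > 0:
            # Equation (2): push the delta error one layer back.
            delta = (theta[k].T @ delta) * sigmoid_prime(zs[k - 1])
    return grads_theta, grads_bias

# Tiny example: 3 inputs -> 4 hidden neurons -> 2 outputs.
rng = np.random.default_rng(0)
theta = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bias = [np.zeros(4), np.zeros(2)]
g_theta, g_bias = backprop(theta, bias, x=np.array([0.1, 0.5, -0.3]), y=np.array([1.0, 0.0]))
print([g.shape for g in g_theta])  # [(4, 3), (2, 4)]
```

The gradients returned by this sketch are exactly what a gradient descent step would subtract (scaled by a learning rate) from the weights and biases.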

And that's it. You now know everything there is to know about how backpropagation works.

Don't worry if you don't immediately understand it; that's normal. Put this essay away, come back after a couple of days to review it, and do a couple of exercises.

Do this a couple of times, and your brain should start to pick it up, and you will become more comfortable with backpropagation.


  1. The fact that the algorithm moves backwards through the layers of the network is what "back" refers to in backpropagation. ↩︎

  2. It turns out that the same general principle also applies to backpropagation in other architectures, such as convolutional neural networks. ↩︎
