# How Does Backpropagation Work?

Deriving the mathematics behind backpropagation.

Now that we know how neural networks function, and have built up some intuition about how to train them with backpropagation, it's time to look at how we can describe it mathematically.

The best way to learn *how* something works mathematically is to understand *why* it works, which is why I want you to go through the extra trouble of deriving the mathematics.

Warning: This essay is going to be heavy on the math. If you're allergic to math, you might want to check out the more intuitive version (coming soon), but with that said, backpropagation is really not as hard as you might think.


This article assumes familiarity with forward propagation, and neural networks in general. If you haven't already, I recommend reading What is a Neural Network first.

Recall that the weights in a neural network are updated by minimizing an error function that describes how wrong the neural network's current hypothesis is.

If

In order for an error function to be suitable for backpropagation, the average error should be computable using:

Where

This is necessary if the backpropagation procedure is to update the weights on the basis of more than one example, which results in a more direct route toward convergence and is generally preferred.
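As a sketch of what this averaging requirement can look like, assume there are $n$ training examples and write $E_x$ for the error on a single example $x$. Both symbols are introduced here for illustration and may differ from the notation used elsewhere in this essay.

$$
E = \frac{1}{n} \sum_{x} E_x
$$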

Furthermore, we assume that the error function can be written as a function of the network's hypothesis

One simple error function that satisfies these requirements, and which you probably already know, is the mean squared error (MSE) defined as:

For a single example, and for multiple examples:

We see that not only does it satisfy the averaging constraint, but it also only depends on the hypothesis (noted as

For notational simplicity, for the rest of the essay, we will omit the function variables, so
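To make the mean squared error described above concrete, here is a minimal Python sketch. The function names `squared_error` and `mean_squared_error` are hypothetical, the numbers are made up, and the normalization (dividing by the number of outputs) is an assumption; some texts use a factor of $\frac{1}{2}$ instead.

```python
def squared_error(hypothesis, target):
    """Error for one example: mean of the squared differences over the outputs."""
    return sum((h - y) ** 2 for h, y in zip(hypothesis, target)) / len(target)


def mean_squared_error(hypotheses, targets):
    """Error for many examples: the average of the per-example errors."""
    errors = [squared_error(h, y) for h, y in zip(hypotheses, targets)]
    return sum(errors) / len(errors)


# Two examples with two output neurons each (made-up numbers).
hypotheses = [[1.0, 0.0], [0.5, 0.5]]
targets = [[1.0, 1.0], [0.5, 0.0]]

single = squared_error(hypotheses[0], targets[0])
batch = mean_squared_error(hypotheses, targets)
print(single, batch)
```

Note that `mean_squared_error` only ever sees the hypotheses and the targets, and that it is a plain average of per-example errors, which are the two properties the text asks of an error function.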

## Backpropagation

Recall that we use backpropagation to find each individual weight's contribution to the error function, which gradient descent then uses when updating the weights.

Backpropagation is just figuring out how awful each weight is

In other words, backpropagation attempts to find:

In order to find this, we introduce a new variable *delta error*. The delta error is defined as:

Recall that

During backpropagation, we will find a way of computing the delta error, and translate it to

We will derive

- Find a way of computing $\delta^{(L)}$ to initialize the process.
- Find a way of computing $\delta^{(k)}$ given $\delta^{(k+1)}$.
- Find a way of computing $\frac{\partial E}{\partial \theta_{i,j}^{(k)}}$ given $\delta^{(k)}$.
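For reference, one common way of writing the quantities involved in these three steps is sketched below. The symbols $z_i^{(k)}$ (the weighted input to neuron $i$ in layer $k$), $a_i^{(k)}$ (its activation), and $g$ (the activation function), as well as the index order of $\theta_{i,j}^{(k)}$ (from neuron $j$ in layer $k-1$ to neuron $i$ in layer $k$), are assumptions made here for illustration and may not match the notation used elsewhere in this essay.

$$
z_i^{(k)} = \sum_j \theta_{i,j}^{(k)} \, a_j^{(k-1)},
\qquad
a_i^{(k)} = g\left(z_i^{(k)}\right),
\qquad
\delta_i^{(k)} = \frac{\partial E}{\partial z_i^{(k)}}
$$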

### Equation 1

Recall that we can differentiate composite functions using the chain rule by:

The same principle holds for nested composite functions.
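For reference, the chain rule for a composite function and for a nested composite function can be written as follows. The function names $f$, $u$, and $v$ are placeholders chosen here.

$$
\frac{d}{dx} f\big(u(x)\big) = f'\big(u(x)\big) \, u'(x),
\qquad
\frac{d}{dx} f\Big(u\big(v(x)\big)\Big) = f'\Big(u\big(v(x)\big)\Big) \, u'\big(v(x)\big) \, v'(x)
$$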

Using the chain rule, we can reformulate

Since:

We can simplify the above to:

We can vectorize the simplified equation by collecting

By doing so, we find the first equation:

Where
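One way to write out this derivation is sketched below. It assumes that $\delta_j^{(L)} = \partial E / \partial z_j^{(L)}$, that $a_j^{(L)} = g(z_j^{(L)})$ for an activation function $g$ applied to each neuron separately, and that $\odot$ denotes elementwise multiplication. These symbols are assumptions and may differ from the notation used elsewhere in this essay.

$$
\delta_j^{(L)}
= \sum_i \frac{\partial E}{\partial a_i^{(L)}} \frac{\partial a_i^{(L)}}{\partial z_j^{(L)}}
= \frac{\partial E}{\partial a_j^{(L)}} \, g'\left(z_j^{(L)}\right)
$$

The sum collapses because $a_i^{(L)}$ depends only on $z_i^{(L)}$, so every term with $i \neq j$ is zero. Collecting the components into vectors gives:

$$
\delta^{(L)} = \nabla_{a^{(L)}} E \odot g'\left(z^{(L)}\right)
$$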

### Equation 2

While equation ^{[1]}

In order to achieve this, we rewrite

Once again, we use the chain rule.

Where

Since

This works because the error from the previous layers is carried over to the neurons in the later layers. This is also why we sum over all the neurons. Equation

We know from forward propagation that:

And since

By differentiating

By substituting this expression in equation

If this is not obvious, I do encourage you to spend some time going through the equations in order to convince yourself that this is correct.

Finally, by vectorizing the above, we arrive at the final form for equation
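A sketch of how this recursion can be written out follows. It assumes $\delta_j^{(k)} = \partial E / \partial z_j^{(k)}$ and $z_i^{(k+1)} = \sum_j \theta_{i,j}^{(k+1)} \, g(z_j^{(k)})$, with the bias term left out for brevity. These symbols and the index order are assumptions and may differ from the notation used elsewhere in this essay.

$$
\delta_j^{(k)}
= \sum_i \frac{\partial E}{\partial z_i^{(k+1)}} \frac{\partial z_i^{(k+1)}}{\partial z_j^{(k)}}
= \sum_i \delta_i^{(k+1)} \, \theta_{i,j}^{(k+1)} \, g'\left(z_j^{(k)}\right)
$$

In vector form, with $\odot$ denoting elementwise multiplication:

$$
\delta^{(k)} = \left( \left(\theta^{(k+1)}\right)^{T} \delta^{(k+1)} \right) \odot g'\left(z^{(k)}\right)
$$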

### Equation 3

Equation

In equation

Intuitively, this should make sense.

If we think of the error as throwing balls at a target, where the delta error is the percentage of balls missing the target and the activation is the rate of throwing, then the total number of balls missing the target is the two multiplied together, which is exactly what we do.

Finally, we can confirm that this also works for the bias unit where

Which we see it does.
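The step from the delta error to the weight derivative can be sketched as follows, assuming $z_i^{(k)} = \sum_j \theta_{i,j}^{(k)} \, a_j^{(k-1)}$ and $\delta_i^{(k)} = \partial E / \partial z_i^{(k)}$. The symbols and index order are assumptions and may differ from the notation used elsewhere in this essay.

$$
\frac{\partial E}{\partial \theta_{i,j}^{(k)}}
= \frac{\partial E}{\partial z_i^{(k)}} \frac{\partial z_i^{(k)}}{\partial \theta_{i,j}^{(k)}}
= \delta_i^{(k)} \, a_j^{(k-1)}
$$

If the bias unit is treated as an extra activation that is always one, $a_0^{(k-1)} = 1$, the same expression reduces to:

$$
\frac{\partial E}{\partial \theta_{i,0}^{(k)}} = \delta_i^{(k)}
$$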

## Conclusion

Using these three equations, we can now describe the algorithm for backpropagation in a feed forward layer.^{[2]}

- We use equation $(1)$ to calculate the delta error of the last layer.

- We use the delta error of the last layer to initialize a recursive process of calculating the delta error of all the previous layers using equation $(2)$.

- We use the delta errors with equation $(3)$ to calculate the derivative of the error function with respect to each weight in the neural network, which can be used in gradient descent.
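The three steps can be sketched in plain Python as shown below. This is an illustration under stated assumptions, not a reference implementation: the function names (`forward`, `error`, `backprop`) are hypothetical, the activation is assumed to be the sigmoid, the error is assumed to be half the summed squared difference for a single example, and the biases are stored separately from the weights. The final lines compare one backpropagated derivative against a finite-difference estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, biases, x):
    """Return the activations of every layer; activations[0] is the input."""
    activations = [x]
    for W, b in zip(weights, biases):
        prev = activations[-1]
        z = [sum(w * a for w, a in zip(row, prev)) + b_i for row, b_i in zip(W, b)]
        activations.append([sigmoid(v) for v in z])
    return activations


def error(weights, biases, x, y):
    """Assumed error function: half the summed squared difference."""
    out = forward(weights, biases, x)[-1]
    return 0.5 * sum((a - t) ** 2 for a, t in zip(out, y))


def backprop(weights, biases, x, y):
    """Return (weight_grads, bias_grads) with the same shapes as the inputs."""
    acts = forward(weights, biases, x)
    # Step 1: delta error of the last layer. For the sigmoid, g'(z) = a * (1 - a).
    delta = [(a - t) * a * (1 - a) for a, t in zip(acts[-1], y)]
    w_grads, b_grads = [], []
    for k in reversed(range(len(weights))):
        prev = acts[k]  # activations feeding into the layer that delta belongs to
        # Step 3: dE/dtheta_ij = delta_i * a_j, and the bias derivative is delta_i.
        w_grads.insert(0, [[d * a for a in prev] for d in delta])
        b_grads.insert(0, list(delta))
        if k > 0:
            # Step 2: move the delta error one layer back through the weights.
            W = weights[k]
            delta = [
                sum(W[i][j] * delta[i] for i in range(len(delta)))
                * prev[j] * (1 - prev[j])
                for j in range(len(prev))
            ]
    return w_grads, b_grads


# A tiny 2-2-1 network with made-up weights and a single made-up example.
weights = [[[0.1, -0.2], [0.4, 0.3]], [[0.5, -0.6]]]
biases = [[0.0, 0.1], [-0.1]]
x, y = [1.0, 0.5], [1.0]

w_grads, b_grads = backprop(weights, biases, x, y)

# Finite-difference estimate of the derivative for one first-layer weight.
eps = 1e-6
weights[0][1][0] += eps
e_plus = error(weights, biases, x, y)
weights[0][1][0] -= 2 * eps
e_minus = error(weights, biases, x, y)
weights[0][1][0] += eps  # restore the original weight
numeric = (e_plus - e_minus) / (2 * eps)

print(w_grads[0][1][0], numeric)
```

If the two printed values agree to several decimal places, the backward pass is consistent with the error function. The same comparison can be repeated for any other weight or bias.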

Finally, equation

And that's it. You now know everything there is to know about how backpropagation works.

Don't worry if you don't immediately understand it; that's normal. Put this essay away, come back after a couple of days to review it, and do a couple of exercises.

Do this a couple of times, and your brain should start to pick it up, and you will become more comfortable with backpropagation.

[1] The fact that the algorithm moves backwards through the layers of the network is what "back" refers to in backpropagation.

[2] It turns out that the same general principle also applies to backpropagation in other architectures such as convolutional neural networks.