How Does Backpropagation Work?

Deriving the mathematics behind backpropagation.

Now that we know how neural networks function, and have built up some intuition about how to train them with backpropagation, it's time to look at how we can describe it mathematically.

The best way to learn how something works mathematically is to understand the why behind it, which is why I want you to go through the extra trouble of deriving the mathematics yourself.

Warning: This essay is going to be heavy on the math. If you're allergic to math, you might want to check out the more intuitive version (coming soon), but with that said, backpropagation is really not as hard as you might think.

Backpropagation is not as hard as you might think

This article assumes familiarity with forward propagation, and neural networks in general. If you haven't already, I recommend reading What is a Neural Network first.

Recall that the weights in a neural network are updated by minimizing an error function that describes how wrong the neural network's current hypothesis is.

If x is the input to the first layer, and L is the number of layers in the network, then a^{(L)}(x) is the network's hypothesis: h(x). Finally, let m be the number of examples, n^{(k)} the number of neurons in layer k, and let y(x) be the correct answer given the input x.
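To make the notation concrete, here is one way you might lay out the parameters in NumPy. This is purely a sketch: the layer sizes and the names theta and bias are my own assumptions, not anything fixed by the essay.

```python
import numpy as np

# Hypothetical layer sizes: n^(0) = 3 inputs, n^(1) = 4 hidden neurons, n^(2) = 2 outputs.
layer_sizes = [3, 4, 2]

rng = np.random.default_rng(0)

# theta[k] has shape (n^(k), n^(k-1)): row i holds the weights feeding neuron i of layer k.
theta = [rng.standard_normal((layer_sizes[k], layer_sizes[k - 1]))
         for k in range(1, len(layer_sizes))]

# One bias per neuron in each layer past the input.
bias = [np.zeros(layer_sizes[k]) for k in range(1, len(layer_sizes))]
```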

In order for an error function to be suitable for backpropagation, the average error should be computable using:

E = \frac{1}{m} \sum_{i=1}^{m} E(x_i,y_i)

Where E(x,y) is the error function for a specific example.

This is necessary if the backpropagation procedure is to update the weights on the basis of more than one example, which results in a more direct route toward convergence and is generally preferred.

Furthermore, we assume that the error function can be written as a function of the network's hypothesis h(x) and the correct answer y(x).

One simple error function that satisfies these requirements, and which you probably already know, is the mean squared error (MSE) defined as:

E(x,y) = (y-x)^2

For a single example, and for multiple examples:

E = \frac{1}{m} \sum_{i=1}^{m} (y_i-x_i)^2

We see that not only does it satisfy the averaging constraint, but it also only depends on the hypothesis (noted as x), and the actual answer (noted as y).

For notational simplicity, for the rest of the essay, we will omit the function variables, so E(x,y) becomes E.
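Before moving on, here is a minimal NumPy sketch of the averaged MSE above. The function name and the example values are my own, purely for illustration.

```python
import numpy as np

def mse(hypotheses, targets):
    """Average squared error over m examples: E = (1/m) * sum (y_i - x_i)^2."""
    hypotheses = np.asarray(hypotheses, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean((targets - hypotheses) ** 2)

# Example: three predictions against three targets.
print(mse([0.9, 0.2, 0.4], [1.0, 0.0, 0.5]))  # -> approximately 0.02
```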


Backpropagation

Recall that we use backpropagation to find each individual weight's contribution to the error function, which is then used during gradient descent to update the weights.

Backpropagation is just figuring out how awful each weight is

In other words, backpropagation attempts to find:

\frac{\partial E}{\partial \theta_{i,j}^{(k)}}

In order to find this, we introduce a new variable \delta_i^{(k)}, which is the error-sum of neuron i in layer k; sometimes called the delta error. The delta error is defined as:

\delta_i^{(k)}=\frac{\partial E}{\partial z_i^{(k)}}

Recall that z_i^{(k)} is the raw output signal of neuron i in layer k, before the activation function has been applied.

During backpropagation, we will find a way of computing the delta error, and translate it to \frac{\partial E}{\partial \theta_{i,j}^{(k)}}.

We will derive \frac{\partial E}{\partial \theta_{i,j}^{(k)}} in three steps using the three equations summarized below:

  1. Find a way of computing \delta^{(L)} to initialize the process.
  2. Find a way of computing \delta^{(k)} given \delta^{(k+1)}.
  3. Find a way of computing \frac{\partial E}{\partial \theta_{i,j}^{(k)}} given \delta^{(k)}.

Equation 1

Recall that we can differentiate composite functions using the chain rule by:

\begin{aligned} f\big(g(x)\big)'&=f'\big(g(x)\big) \cdot g'(x) \Leftrightarrow \\ \frac{df}{dx} &= \frac{\partial f}{\partial g(x)} \cdot \frac{dg}{dx} \end{aligned}

The same principle holds for nested composite functions.

Using the chain rule, we can reformulate \delta_i^{(L)} in terms of the partial derivative of the activation function, a_i^{(L)}=g\big(z_i^{(L)}\big):

\begin{aligned} \delta_i^{(L)} &= \frac{\partial E}{\partial z_i^{(L)}} \Leftrightarrow \\ \delta_i^{(L)} &= \frac{\partial E}{\partial a_i^{(L)}} \cdot \frac{\partial a_i^{(L)}}{\partial z_i^{(L)}} \end{aligned}

Since:

\begin{aligned} a_i^{(L)} &= g\big(z_i^{(L)}\big) \Leftrightarrow \\ \frac{\partial a_i^{(L)}}{\partial z_i^{(L)}} &= g'\big(z_i^{(L)}\big) \end{aligned}

We can simplify the above to:

\delta_i^{(L)} = \frac{\partial E}{\partial a_i^{(L)}} \cdot g'\big(z_i^{(L)}\big)

We can vectorize the simplified equation by collecting each \frac{\partial E}{\partial a_i^{(L)}} in a gradient vector for the layer, \nabla_a^{(L)}. Similarly, we can collect the raw outputs z_i^{(L)} into a vector of all the raw outputs in the layer, z^{(L)}.

By doing so, we find the first equation:

\begin{aligned} \delta^{(L)} = \nabla_a^{(L)} \odot g'\big(z^{(L)}\big) &&(1) \end{aligned}

Where \odot is the Hadamard product; elementwise multiplication.
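As a concrete illustration, here is a small NumPy sketch of equation (1). The sigmoid activation, the squared error summed over the output neurons, and the variable names are my own assumptions for the example, not requirements of the derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Raw outputs of the last layer and the target for one example.
z_L = np.array([0.5, -1.2])
a_L = sigmoid(z_L)
y = np.array([1.0, 0.0])

# For E = sum_i (y_i - a_i)^2, the partial derivative dE/da_i = 2 * (a_i - y_i).
grad_a_L = 2.0 * (a_L - y)

# Equation (1): delta^(L) = grad_a E * g'(z^(L)), elementwise.
delta_L = grad_a_L * sigmoid_prime(z_L)
print(delta_L)
```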

Equation 2

While equation (1) describes the error of the last layer in terms of z^{(L)} and a^{(L)}, equation (2) describes the error of a layer in terms of the error of the layer in front of it. [1]

In order to achieve this, we rewrite \delta_i^{(k)}=\frac{\partial E}{\partial z_i^{(k)}} in terms of the next layer, k+1:

\delta_i^{(k+1)}=\frac{\partial E}{\partial z_i^{(k+1)}}

Once again, we use the chain rule.

\begin{aligned} \delta_i^{(k)}&=\frac{\partial E}{\partial z_i^{(k)}}\\ &= \sum_\beta^{n^{(k+1)}} \frac{\partial E}{\partial z_\beta^{(k+1)}} \frac{\partial z_\beta^{(k+1)}}{\partial z_i^{(k)}} \end{aligned}

Where n^{(k+1)} is the number of neurons in layer k+1.

Since \frac{\partial E}{\partial z_\beta^{(k+1)}} = \delta_\beta^{(k+1)}, we can rewrite the above as:

\begin{aligned} \delta_i^{(k)} = \sum_\beta^{n^{(k+1)}} \frac{\partial z_\beta^{(k+1)}}{\partial z_i^{(k)}} \cdot \delta_\beta^{(k+1)} &&(2.1) \end{aligned}

This works because the error from the earlier layers is carried over to the neurons in the later layers, which is also why we sum over all the neurons in layer k+1. Equation (2.1) can be interpreted as the total error caused by z_i^{(k)}, expressed in terms of the z_\beta^{(k+1)}.

We know from forward propagation that:

z_\beta^{(k+1)} = \sum_i^{n^{(k)}} \big( \theta_{\beta,i}^{(k+1)} \cdot a_i^{(k)} \big) + b_\beta^{(k+1)}

And since a_i^{(k)}=g\big(z_i^{(k)}\big), we can rewrite the above as:

z_\beta^{(k+1)} = \sum_i^{n^{(k)}} \Big( \theta_{\beta,i}^{(k+1)} \cdot g\big(z_i^{(k)}\big) \Big) + b_\beta^{(k+1)}

By differentiating z_\beta^{(k+1)} with respect to z_i^{(k)}, we find:

\frac{\partial z_\beta^{(k+1)}}{\partial z_i^{(k)}} = \theta_{\beta,i}^{(k+1)} \cdot g'\big(z_i^{(k)}\big)

By substituting this expression into equation (2.1), we find:

\delta_i^{(k)} = \sum_\beta^{n^{(k+1)}} \theta_{\beta,i}^{(k+1)} g'\big(z_i^{(k)}\big) \delta_\beta^{(k+1)}

If this is not obvious, I do encourage you to spend some time going through the equations in order to convince yourself that this is correct.

Finally, by vectorizing the above, we arrive at the final form for equation (2):

\begin{aligned} \delta^{(k)} = \big( \theta^{(k+1)^{T}} \delta^{(k+1)} \big) \odot g'\big( z^{(k)} \big) && (2) \end{aligned}
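To see equation (2) in code, here is a minimal NumPy sketch of a single backward step. The shapes and the choice of sigmoid derivative are my own assumptions for illustration.

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

rng = np.random.default_rng(0)

# Hypothetical shapes: layer k has 4 neurons, layer k+1 has 2 neurons.
theta_next = rng.standard_normal((2, 4))  # theta^(k+1), one row per neuron in layer k+1
delta_next = rng.standard_normal(2)       # delta^(k+1)
z_k = rng.standard_normal(4)              # raw outputs z^(k)

# Equation (2): delta^(k) = (theta^(k+1)^T delta^(k+1)) * g'(z^(k)), elementwise.
delta_k = (theta_next.T @ delta_next) * sigmoid_prime(z_k)
print(delta_k.shape)  # (4,)
```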

Equation 3

Equation (3) is derived in the exact same way as equations (1) and (2) using the chain rule, so I'll simply state the final form:

\begin{aligned} \frac{\partial E}{\partial \theta_{i,j}^{(k)}} = a_j^{(k-1)} \delta_i^{(k)} && (3) \end{aligned}
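If you want to verify equation (3) yourself, the chain-rule steps are short. This is my own spelling-out of the derivation the essay skips, using \alpha as the summation index:

\begin{aligned} \frac{\partial E}{\partial \theta_{i,j}^{(k)}} &= \frac{\partial E}{\partial z_i^{(k)}} \cdot \frac{\partial z_i^{(k)}}{\partial \theta_{i,j}^{(k)}} = \delta_i^{(k)} \cdot \frac{\partial z_i^{(k)}}{\partial \theta_{i,j}^{(k)}} \\ z_i^{(k)} &= \sum_\alpha^{n^{(k-1)}} \big( \theta_{i,\alpha}^{(k)} \cdot a_\alpha^{(k-1)} \big) + b_i^{(k)} \Rightarrow \frac{\partial z_i^{(k)}}{\partial \theta_{i,j}^{(k)}} = a_j^{(k-1)} \end{aligned}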

In equation (3), we see that an individual weight's contribution to the error function is equal to the scaled error it sends forward in the network.

Intuitively, this should make sense.

If we think of the error as throwing balls at a target: the percentage of balls missing the target is the delta error, the rate of throwing is the activation, and the total number of balls missing the target is those two multiplied together - which is exactly what we do.

a_j^{(k-1)} is how much a neuron is stimulated - how strong the output is, or the rate of throwing.

\delta_i^{(k)} is our throwing accuracy, or rather, in a team of athletes trying to hit the target, how much an individual contributes to the overall number of balls that didn't hit the target.

Finally, we can confirm that this also works for the bias unit, where a_0^{(k-1)}=1. It should just give us \delta_i^{(k)}, as there's no activation coefficient:

\frac{\partial E}{\partial \theta^{(k)}_{i,0}} = 1 \cdot \delta_i^{(k)} = \delta_i^{(k)}

Which we see it does.
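In code, equation (3) for a whole layer is just an outer product between the layer's delta error and the previous layer's activations. Here is a small NumPy sketch with hypothetical shapes and values, including the bias gradient.

```python
import numpy as np

# Hypothetical shapes: layer k has 2 neurons, layer k-1 has 4.
delta_k = np.array([0.3, -0.1])           # delta^(k)
a_prev = np.array([1.0, 0.5, 0.2, 0.8])   # a^(k-1)

# Equation (3): dE/dtheta_{i,j}^(k) = a_j^(k-1) * delta_i^(k), for every (i, j) pair.
grad_theta = np.outer(delta_k, a_prev)    # shape (2, 4), same shape as theta^(k)

# For the bias unit the "activation" is 1, so the gradient is just delta itself.
grad_bias = delta_k
print(grad_theta)
```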

Conclusion

Using these three equations, we can now describe the algorithm for backpropagation in a feedforward neural network.[2]

  1. We use equation (1) to calculate the delta error of the last layer.
\delta^{(L)} = \nabla_a^{(L)} \odot g'\big(z^{(L)}\big)
  2. We use the delta error of the last layer to initialize a recursive process of calculating the delta error of all the previous layers using equation (2):
\delta^{(k)} = \big( \theta^{(k+1)^{T}} \delta^{(k+1)} \big) \odot g'\big( z^{(k)} \big)
  3. We use the delta errors with equation (3) to calculate the derivative of the error function with respect to each weight in the neural network, which can be used in gradient descent:
\frac{\partial E}{\partial \theta_{i,j}^{(k)}} = a_j^{(k-1)} \delta_i^{(k)}

Finally, equations (1) and (2) can be combined into one recursive equation:

\delta^{(k)} = \begin{cases} \begin{aligned} \nabla_a^{(L)} \odot g'\big(z^{(L)}\big) &\quad\text{if } k=L\\ \big( \theta^{(k+1)^{T}} \delta^{(k+1)} \big) \odot g'\big( z^{(k)} \big) &\quad\text{otherwise} \end{aligned} \end{cases}
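To tie everything together, here is a minimal, self-contained NumPy sketch of the full procedure for a small feedforward network. The sigmoid activation, the squared error, and all variable names are my own assumptions for illustration, not part of the essay.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(theta, bias, x, y):
    """Return dE/dtheta and dE/dbias for one example using equations (1)-(3)."""
    # Forward pass: store raw outputs z^(k) and activations a^(k).
    a, zs = [x], []
    for W, b in zip(theta, bias):
        z = W @ a[-1] + b
        zs.append(z)
        a.append(sigmoid(z))

    # Equation (1): delta of the last layer, using E = sum (y - a)^2, so dE/da = 2(a - y).
    delta = 2.0 * (a[-1] - y) * sigmoid_prime(zs[-1])

    grads_theta = [None] * len(theta)
    grads_bias = [None] * len(bias)
    for k in reversed(range(len(theta))):
        # Equation (3): outer product of this layer's delta with the previous activations.
        grads_theta[k] = np.outer(delta, a[k])
        grads_bias[k] = delta
        if k > 0:
            # Equation (2): push the delta error one layer back.
            delta = (theta[k].T @ delta) * sigmoid_prime(zs[k - 1])
    return grads_theta, grads_bias

# Tiny example: 3 inputs -> 4 hidden neurons -> 2 outputs.
rng = np.random.default_rng(0)
theta = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bias = [np.zeros(4), np.zeros(2)]
g_theta, g_bias = backprop(theta, bias, x=np.array([0.1, 0.5, -0.3]), y=np.array([1.0, 0.0]))
print([g.shape for g in g_theta])  # [(4, 3), (2, 4)]
```

The gradients returned by this sketch are exactly what a gradient descent step would subtract (scaled by a learning rate) from the weights and biases.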

And that's it. You now know everything there is to know about how backpropagation works.

Don't worry if you don't immediately understand it; that's normal. Put this essay away, come back after a couple of days to review it, and do a couple of exercises.

Do this a couple of times, and your brain should start to pick it up, and you will become more comfortable with backpropagation.


  1. The fact that the algorithm moves backwards through the layers of the network is what "back" refers to in backpropagation. ↩︎

  2. It turns out that the same general principle also applies to backpropagation in other architectures, such as convolutional neural networks. ↩︎
