Gradient Descent: How Machines Learn
Okay so you have an objective function which you want to optimize. But how do you find the extremum of the function?
Meet gradient descent.
Gradient descent is an iterative algorithm that slowly finds a local minimum of a function. The algorithm is a first-order optimization method since it uses the first-derivative - the gradient of the function.
It works by finding the gradient of the parameters at each step and updating them a little bit in a direction anti-parallel to the gradient (and parallel if maximizing the function - called gradient ascent).
Let be the objective function paramterized by which we want to minimize.
where is a parameter controlling how fast we update the parameters usually a low number of the order of .
The algorithm assumes that the function is continuous which is not a unreasonable requirement, and it will find the global minimum if the function is convex. However, in the real world, many functions are not neat and convex, so the algorithm will only find a local minimum which can cause problems if you need the global minimum (which is a much harder problem). Luckily, for properly designed deep neural networks, a local minimum is often good enough, and there are some arguments that the global minimum will severely overfit the data since models are very expressive and data is noisy.
To better understand why this algorithm works, it can be helpful to visualize what happens at every step.
Initial guess: (The initial -value )
Using the interactive visualization above, we can see that gradient descent works similarly to "placing a ball on the curve" and let it roll down the curve to the lowest point. Moreover, we see that it is sensitive to where we place the "ball". For example, if we place the ball at with the polynomial function, then it won't reach the global minimum.
We also see that the ball moves more quickly when the descent is steeper, and slows down as it reaches its destination and the curve is flatter. This is expected as the amount we update the parameters is proportional to the gradient.