# An Unheard Serenade of Cyanidic Thought

Translation: I answer your machine learning questions

In between writing essays, I often answer questions over on Quora. Last Friday, I requested questions about machine learning which I now answer.

### What is a formal definition of machine learning?

A system whose performance with respect to some error function increases with experience.

### In what ways are machine learning overrated?

That depends on the rating system.

For example, it’s my impression that most people who know and use machine learning have a quite accurate and balanced view of the possibilities, and more importantly the limitations of machine learning.

The problem arises when authorities who may not have adequate experience such as most of mainstream media (and me) start making claims and predictions without being very careful about not misinterpreting or misrepresenting their sources.

Moreover, even an objective reporting on the events may lead to an incorrect picture of the capabilities.

For example, when chatbots, self-driving cars, and spam filtering are all collected under the same ‘machine learning’ umbrella term, it’s not unreasonable to think that they are the same problem, and to extrapolate that we’re just a few years away from superintelligence.

This is strengthened by a marketing incentive to showcase only prime examples and steer away from fringe cases. This is especially seen in natural language based “virtual assistants”, where a small number of ‘advanced queries’ for which they have been specifically optimized are showcased, misleading people into assuming that it’s a representative sample. False representativity is a problem not just for machine learning, but for nefarious claims in general.

In reality, even just making a robot do something without screwing everything else up is really difficult. There’s also currently no real solution for adversarial examples.

In conclusion, it’s possible that the average technologically interested, but not necessarily literate, person tends to overestimate the capabilities of machine learning. In particular, people tend to overestimate the generalizability of current machine learning algorithms, and compare it to that of humans.

### Can a neural network really do anything with the right weights and hyperparameters?

No, neural networks cannot do everything and anything.

Neural networks are good at mapping continuous geometric relationships given a large amount of labeled data. In fact, a sufficiently large network can approximate any continuous relationship, given enough data to train it.

This ability has led to some incredible achievements in many fields.

However, neural networks still have several limitations among which are:

- Adversarial examples
- Biased/Untruthful training data
- Weak generalization
- Limited mapping
- Require a lot of human-annotated training data
- Limited transfer learning

Furthermore, there are things that cannot be done using neural networks; not now, not ever. For example:

- Anything discontinuous
- Anything for which data cannot be collected or simulated
- Symbolic manipulation (at least without a supporting agent)
- Beat entropy; no new novel information (not even by synthesising information with generative networks)
- Neural networks do not imply P=NP with forward propagation being P.
- This implies that neural networks cannot perfectly encode NP problems in finite space.

### For a beginner who is not a coder, is it necessary to learn Python before learning machine learning or data science?

You don’t need to know Python in order to learn about machine learning, or data science.

In fact, you don’t even need to know a programming language in order to understand the theory.

But if you want to actually implement, or use anything you learn, you will need to learn some programming eventually.

However, while Python is probably the most used programming language in the data science community, there are a few good alternatives, and many less-used (worse) alternatives.

The largest competitor is R. R is focused on statistical computing, and while R is Turing complete, there’s less focus on general purpose programming in the R community, which means you’ll be at a disadvantage compared to Python when doing anything but data science. (That is not necessarily a bad thing though.)

Some years ago, MATLAB was quite popular, but it’s not open source, and requires a commercial license to use, so it has fallen out of favor recently.

Today, most popular programming languages have some sort of machine learning library available. The most popular library, TensorFlow, however, is fully supported in Python, with more languages coming.

If your question instead was if it is necessary to learn the basics of any language before learning machine learning (assuming you don’t know the theory very well), I until recently would have said no as most programming languages are sufficiently simple to learn while you’re learning something else.

But when I mentored some people in learning both machine learning and a programming language at the same time, I noticed that they confused programming, machine learning, and the custom libraries they were using.

This led to more than a few inefficiencies that could have been avoided had they learned the basics of the respective domains beforehand, which is why I now recommend getting familiar with the basics of programming before applying data science concepts.

### How should I choose between becoming a full stack JavaScript developer and a machine learning engineer?

Do a few personal web projects with JavaScript, then do a few machine learning projects.

Which did you enjoy the most?

What kind of projects do you want to work on next; is that another web project, or a machine learning project?

Even if you focus on one, that doesn’t mean that you can never do the other.

Some projects may even require you to be able to do both; for example, making an interface to a machine learning algorithm, or a product that uses machine learning on the backend.

### What should I concentrate on after learning machine learning if I am not interested in deep learning?

That’s a bit like saying that you’re a woodworker and not interested in using a hammer; you only want to use saws.

Deep learning is very tightly coupled with neural networks as it is from the deepness they get nonlinearity, and therefore their performance. (technically, it’s through the nonlinear activation functions neural networks get their nonlinearity, but they usually sit between layers of neurons).

So if you’re already familiar with more “traditional” machine learning approaches such as SVMs, decision trees, and naive Bayes, as well as the various ensemble methods, and feel like you want something more, there’s not a whole lot left other than learning more about feature selection, feature engineering, and data pipelining.

After all, if you already know everything about machine learning, except for neural networks, there’s only neural networks left to learn.

Another possibility could be to use what you know as a tool, like programming, or a hammer, to solve problems that have significance beyond just being a machine learning problem. However, by refusing to use deep learning, and neural networks, you might find yourself outcompeted on complex problems.

A third possibility is to look at AI approaches outside the machine learning umbrella even though they may not be as powerful/efficient; maybe you will discover something that can be used with other techniques really well.

### What does 'deep' mean in machine learning?

In machine learning, specifically neural networks, “deep” simply refers to the number of hidden layers.

A hidden layer is a layer that is neither the input layer, nor the final output layer, but an intermediate layer that accepts input from the previous layer, and sends its output to a new layer. (that was a lot of layers)

The exact number of “hidden layers” required for a network to be deep is somewhat debatable.

Luckily, the term “deep” in deep neural networks is mostly used as a marketing term akin to how “cloud” is really just a fancy word for somebody else’s computer, so it’s really up to you when you want to use it.

In theory, you could call any network with at least 1 hidden layer deep.

In practice, anything more than 2–3 hidden layers is typically classified as being deep.

### What type of optimization is needed to train a neural network? And what is your favorite course for the types of optimization we need for NN?

To get started, all you need to know is gradient descent.

Gradient descent is quite simple to get your head around if you know basic calculus.

You know that /$f'(x)/$ can geometrically be interpreted as the slope of /$f/$ at any point /$x/$.

The gradient, usually denoted as /$\nabla/$, can likewise be interpreted as the slope of a multivariate function (a function with multiple inputs) in any given combination of the inputs.

For example:

/$$\nabla f(x,y) = \begin{pmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{pmatrix}/$$

Where /$\frac{\partial f}{\partial x}/$ is just a fancy way of writing /$f'(x)/$ when /$f/$ has more than one input. Strictly speaking, the /$\partial/$ symbol means “partial”, and shows that /$x/$ is just one out of multiple variables.

So to compute the slope of a multivariate function, you find the gradient by taking the partial derivative of each component, /$x/$ and /$y/$ in this case, and mash them together into a single vector.
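As a small worked example (the function is chosen only for illustration), take /$f(x,y) = x^2 + 3xy/$. Differentiating with respect to each variable while treating the other as a constant gives:

/$$\nabla f(x,y) = \begin{pmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{pmatrix} = \begin{pmatrix} 2x + 3y \\ 3x \end{pmatrix}/$$

At the point /$(1, 2)/$ this evaluates to /$\nabla f(1,2) = (8, 3)/$.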

Some people I’ve explained this to in the past weren’t entirely convinced that the gradient will indeed point towards the direction of steepest ascent.

While there’s a formal proof of this being the case, one way of strengthening your intuition is to draw a 3D mesh, and decide on a starting point. Then you want to find the slope of the x and y components respectively. This can be done by simply drawing the tangent lines.

If you move a small unit of length in the direction of both slopes, you should see that you’re moving in the direction of steepest ascent.

Gradient descent uses this fact in reverse.

Assume we have a neural network with the weights /$\theta/$ which is a vector. Assume further that we have defined a loss function with respect to the weights of the network: /$loss(\theta)/$.

Using this, gradient descent is defined as:

/$$\theta_j := \theta_j - \alpha \cdot \frac{\partial loss(\theta)}{\partial \theta_j}/$$

Where /$\theta_j/$ is a particular weight, /$:=/$ is the assignment operator (/$=/$ from programming), /$\alpha/$ is the learning rate - just a constant to determine the size of our step on the graph.

We use subtraction, as opposed to addition, because we want to descend as fast as possible, which is why we go in the opposite direction of steepest ascent. (This works if the loss function is differentiable.)

Gradient descent is repeated for all /$j/$ simultaneously.
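To make the update rule concrete, here is a minimal sketch in plain Python. The loss function, its gradient, the starting point, and the learning rate are all made up for illustration; a real network would have far more weights and a far messier loss.

```python
def gradient_descent(grad, theta, alpha=0.1, steps=100):
    """Repeatedly step against the gradient; grad(theta) returns a list of partials."""
    for _ in range(steps):
        g = grad(theta)  # evaluate all partial derivatives first
        # simultaneous update: every theta_j is computed from the same, old theta
        theta = [t - alpha * gj for t, gj in zip(theta, g)]
    return theta


# Hypothetical loss: loss(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2
def loss_gradient(theta):
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]


theta_min = gradient_descent(loss_gradient, [0.0, 0.0])
print(theta_min)  # close to [3, -1], the minimum of the hypothetical loss
```

Computing the whole gradient before touching any weight is what "simultaneously" means in practice.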

This should give you a good idea of how gradient descent works, and lay the foundation for you to understand more sophisticated optimization algorithms used for neural networks many of which are just variations on gradient descent.

When developing new algorithms, knowing about optimization as a mathematical discipline in general will become helpful. I’m not familiar with any courses that teach this specifically, so I can’t make any recommendations.

### How do you calculate gradients in a feed forward neural network using matrices?

Let's first make sure that we understand what the gradient is.

The gradient is a vector of all the partial derivatives of the loss function with respect to the different feature weights or parameters.

For simplicity’s sake, let's assume a neural network with a single layer.

A non-regularized loss function could be defined as the mean squared error:

/$$cost(\theta) = \frac{1}{2m}\sum_{i=1}^{m}{(\theta^{T} x^{(i)} - y^{(i)})^2}/$$

Where /$m/$ is the number of training examples, the sum runs over all training examples /$i/$, /$\theta^{T}/$ is the weight vector transposed, /$x^{(i)}/$ is the training feature vector, and /$y^{(i)}/$ is the correct answer for that example. The factor /$\frac{1}{2}/$ is a convention that cancels the 2 produced by differentiating the square.

The term: /$\theta^{T}x^{(i)}/$ is therefore the hypothesis of what the correct answer is, so if we were using an activation function, we would wrap that around this.

To find the gradient, or the partial derivative, for each weight /$\theta_j/$ in the weight vector, you can modify the above to:

/$$\frac{\partial cost(\theta)}{\partial \theta_j} = \frac{1}{m}\sum{(\theta^{T} x^{(i)} - y^{(i)})}x_j^{(i)}/$$

Where, again, subscript /$j/$ denotes the feature of a vector.

If done for all features, you will end up with a gradient vector which you can use with gradient descent. Each weight can then be updated using:

/$$\theta_j := \theta_j - \alpha \cdot \frac{\partial cost(\theta)}{\partial \theta_j}/$$

Where /$:=/$ is the assignment operator (basically /$=/$ from programming), and /$\alpha/$ is the learning rate. (one thing to note is that in order for gradient descent to work correctly, all weights must be updated simultaneously)

As for how you can implement it in code, you can use the map-reduce pattern to run the operation across the whole vector.
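With matrices, the map and reduce steps collapse into two matrix products. The following is a sketch using NumPy, assuming a squared-error cost whose partial derivatives are the ones given above; the function names and the toy data are my own and only there for illustration.

```python
import numpy as np


def cost_gradient(theta, X, y):
    """Gradient of the squared-error cost for a single linear layer.

    X has shape (m, n): one row per training example.
    theta has shape (n,), y has shape (m,).
    """
    m = X.shape[0]
    errors = X @ theta - y      # "map": theta^T x^(i) - y^(i) for every example i
    return (X.T @ errors) / m   # "reduce": entry j is (1/m) * sum_i errors_i * x_j^(i)


def gradient_descent_step(theta, X, y, alpha):
    # all weights are updated simultaneously from the same gradient
    return theta - alpha * cost_gradient(theta, X, y)


# Toy data generated from y = 1 + 2*x (the first column is a constant bias feature)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = np.zeros(2)
for _ in range(2000):
    theta = gradient_descent_step(theta, X, y, alpha=0.1)
print(theta)  # approaches [1, 2]
```

An activation function would add one more factor (its derivative) to `errors`, but the shape of the computation stays the same.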

### Is automatic differentiation in deep learning the same as numerical differentiation?

No, well, kind of.

In theory, for any given point, the derivative of the function in that point whether using numerical or automatic differentiation should give the same answer. [1]

How the two methods arrive at the answer varies however.

Numerical differentiation uses an approach similar to what you may have learned in school:

/$$f'(x_{0})\approx\frac{f(x_{0}+h)-f(x_{0})}{h}/$$

Where /$x_{0}/$ is a point at which the function is differentiable, and /$h/$ is a small step size arbitrarily close to /$0/$.
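In code, numerical differentiation is only a couple of lines. This sketch uses `math.sin` as a stand-in function and a step size I picked arbitrarily; in practice choosing /$h/$ is a trade-off between truncation error and floating point rounding.

```python
import math


def forward_difference(f, x0, h=1e-6):
    """Approximate f'(x0) with the forward difference quotient."""
    return (f(x0 + h) - f(x0)) / h


approx = forward_difference(math.sin, 0.0)
exact = math.cos(0.0)  # the true derivative of sin at 0
print(approx, exact)
```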

Autodifferentiation is similar to, but not the same as, symbolic differentiation (which is how you’d properly differentiate a function by hand) in that both use a set of techniques to find an expression for /$f'(x)/$.

Specifically, automatic differentiation works with arithmetic operations such as addition, multiplication, and division, along with functions like log, sin, cos, and exp. By applying the chain rule, which is essentially a way of breaking down a complex function into less complex ones, automatic differentiation reduces the function to a composite of these primitives, computes their derivatives, and combines them to get the derivative of the entire function. It does this using approximately the same number of operations as the original function (which is a neat thing for efficiency). This also makes it a good method of computing partial derivatives, which are useful for gradient descent.

The chain rule, in the simplest case, can be written as:

/$$f(x)=h(g(x)) \implies f'(x)=h'(g(x))\cdot g'(x)/$$

In other words, you’re asking: how much does /$f/$ change when /$x/$ changes?

The answer is that the change in /$f/$ is directly related to the change in /$h(g(x))/$, but in order to find how much /$f/$ changes when we change /$x/$, we have to figure out how much /$g/$ changes when we change /$x/$. We do this by computing /$g'(x)/$.

Now we can say that /$f/$ changes with respect to /$x/$ as much as /$h/$ changes with respect to /$g/$ per the amount of which /$g/$ changes with respect to /$x/$.

If you like, you can visualize it as interconnected cogs of different sizes rotating. You are essentially asking how many degrees of rotation for cog /$h/$ does 10 degrees of rotation for cog /$g/$ result in? And how many degrees of rotation for cog /$f/$ do those 10 degrees result in?

Mathematically, this can be written as:

/$$f'=h'(g)\cdot g'(x)/$$

Which is just a shortened version of what we had above.

The same reasoning is used with automatic differentiation, but there are often a lot more cogs in the machine, so you will have to apply the same principle for many layers which is why working it out on paper is often tedious.
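To show the cogs turning, here is a toy forward-mode automatic differentiation sketch built on so-called dual numbers. The `Dual` class and the helper names are my own invention for this example; deep learning frameworks typically use reverse mode (backpropagation) instead, but forward mode is shorter to write down and applies the same chain rule.

```python
import math


class Dual:
    """A value together with its derivative with respect to the input x."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def sin(d):
    # chain rule: (sin g)' = cos(g) * g'
    return Dual(math.sin(d.value), math.cos(d.value) * d.deriv)


def derivative(f, x):
    """Evaluate f and f' at x in a single pass."""
    return f(Dual(x, 1.0)).deriv


# f(x) = sin(x * x) + 3 * x, so by hand f'(x) = 2x * cos(x^2) + 3
f = lambda x: sin(x * x) + 3 * x
print(derivative(f, 2.0))
```

Every primitive carries its own derivative rule, and the composite derivative falls out of evaluating the function once.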

Compared to numerical differentiation, automatic differentiation is a lot faster, and more precise, especially for higher order derivatives.

In fact numerical differentiation is so slow that before automatic differentiation became widely used in machine learning libraries, programmers would often symbolically differentiate the loss function by hand.

Compared to symbolic differentiation, automatic differentiation is usually more efficient, and better at calculating partial derivatives when there are many inputs (which is usually the case with neural networks).

That is why automatic differentiation is pretty much universally used in all serious deep learning frameworks, and the only real reason for not using it would be that numerical differentiation is simpler to implement; in which case you probably would want to symbolically differentiate the loss function, and hard-code it anyway.

However, not all smooth functions can be analytically differentiated. Take for instance the following function:

/$$f(x)=\begin{cases}e^{\frac{-1}{x}},& \text{if } x> 0\\0,&\text{if } x\leq 0 \end{cases}/$$

While this is a smooth function over the real number line, as seen below, the function is not analytic.

While in deep learning, this is usually not a problem as loss functions tend to be, at least partially, chosen for their good behavior, if you were to encounter such a function, you would need to have something to fall back on. I’d imagine that numerical differentiation, or semi auto differentiation is often used here, but I don’t know.

[1] This is assuming that the interval is an infinitesimal.

### How should I fill category missing values to apply machine learning?

There are a few techniques for handling missing data.

The simplest way is probably just to discard samples that have missing values. However, this quickly becomes wasteful if you have a lot of samples with missing values. This technique is seldom used in practice.

Another method is to encode missing values as a neutral value, so you can still use the training examples, but attempt to ignore the missing data. (At other times, the fact that data is missing can itself be significant, in which case a similar approach can be taken with a dedicated value.)

For example, if you use one-hot encoding for your strings, you could give the missing values a zero vector (a vector filled with zeros), or if you have normalized numerical values such that the mean is at 0, you can encode the missing features as being 0. [1]

Some frameworks even let you use dedicated null types to encode missing data, such as NumPy’s `nan`.
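The passive encodings described above can be sketched with NumPy as follows. The feature names and values are placeholders I made up; the point is only to show a zero vector for a missing category and a 0 for a missing standardized number.

```python
import numpy as np

# Hypothetical numerical feature with missing entries marked as NaN
ages = np.array([20.0, np.nan, 30.0, 40.0, np.nan])

# Standardize using only the observed values (nan-aware mean and std)
standardized = (ages - np.nanmean(ages)) / np.nanstd(ages)

# Encode missing entries as 0, which is the mean after standardization
filled = np.where(np.isnan(standardized), 0.0, standardized)

# Hypothetical categorical feature: None marks a missing category
categories = ["red", "green", "blue"]
colors = ["red", None, "blue"]
one_hot = np.array([[1.0 if c == cat else 0.0 for cat in categories]
                    for c in colors])
print(filled)
print(one_hot)  # the row for the missing value is all zeros
```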

The above are all passive methods, but you can also impute the data which can sometimes give better results, but is often more difficult as well.

The simplest form of imputation is just to guess what the data might be, and fill it in. However, this has several problems: it requires quite a bit of domain knowledge to determine what a reasonable guess might be, and you risk encoding your own biases into the data.

Though not applicable to your case, if you have a known upper and lower bound for a feature, you can also fill in missing values with the middle option. So on a scale from 1 to 10, you can fill in missing values with 5.

A more advanced way of imputing values is by running regression analysis in order to predict the missing values, so given the other features, what will the missing values be. Then your training set will be the samples without the missing value and your production set will be the samples with the missing values. This way, you’re essentially using machine learning to heal your dataset, so you can use machine learning. (I usually remove the value I actually wanted to predict before doing imputation).
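Here is a rough sketch of regression-based imputation, using ordinary least squares from NumPy as the regressor. The tiny dataset is invented, and a linear model is simply the easiest thing to show; any regressor that suits your data could take its place.

```python
import numpy as np

# Hypothetical dataset: column 1 follows 2 * column 0 + 1, with two gaps
data = np.array([[1.0, 3.0],
                 [2.0, 5.0],
                 [3.0, np.nan],
                 [4.0, 9.0],
                 [5.0, np.nan]])

missing = np.isnan(data[:, 1])

# "Training set": rows where the value is present (second column is the intercept)
A = np.column_stack([data[~missing, 0], np.ones((~missing).sum())])
coef, *_ = np.linalg.lstsq(A, data[~missing, 1], rcond=None)

# "Production set": rows where the value is missing
B = np.column_stack([data[missing, 0], np.ones(missing.sum())])
imputed = data.copy()
imputed[missing, 1] = B @ coef
print(imputed)
```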

The most advanced, and popular, method is called “multiple imputation”. It is essentially the above, but instead of a single prediction you generate multiple plausible values for each missing entry, based on its correlation with the other features. The results obtained from the resulting datasets are then combined, so that the uncertainty of the imputed values is carried into the synthesized dataset.

Which one you should choose depends on your project, timeline, and data structure. There’s no silver bullet.

[1] Some people would classify this as imputation.

### Is max pooling in a neural network architecture considered a form of regularization?

It depends on what you mean by regularization.

Max pooling doesn’t help keep the weights small like L1 and L2 regularization on the loss function.

It does, however, help prevent a network from overfitting.

It’s worth noting though that max pooling works best for convolutional layers, which have multidimensional weights with spatial relevance; this is why nearby weights can be combined, as they encode approximately the same information.
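For reference, max pooling itself has no trainable weights; it only keeps the largest activation in each window, which shrinks what later layers have to fit. A minimal NumPy sketch, assuming a single 2D feature map with even height and width and a 2x2 window:

```python
import numpy as np


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D array with even height and width."""
    h, w = feature_map.shape
    # split the map into non-overlapping 2x2 blocks, then take each block's maximum
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))


feature_map = np.array([[1, 2, 5, 6],
                        [3, 4, 7, 8],
                        [9, 8, 1, 0],
                        [7, 6, 3, 2]])
print(max_pool_2x2(feature_map))
# [[4 8]
#  [9 3]]
```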

### Is the number of nodes in a hidden layer more than the input layer? Is this a problem? What can be learned in such neural networks?

It depends.

Mathematically, a network will function just fine with more nodes in the hidden layer than in the input layer.

However, due to overfitting, you often want fewer weights in your hidden layer than in the input/previous layers, as it helps promote generalizability. There are, as with anything, more than a few notable exceptions to this rule where you want to scale up the number of weights.

A rule of thumb I’ve seen circulating forums which is a good starting point for many problems is the upper bound for number of hidden neurons as given by:

/$N_h = \frac{N_s}{\alpha \cdot (N_i+N_o)}/$

/$N_h/$ = number of hidden neurons.

/$N_i/$ = number of input neurons.

/$N_o/$ = number of output neurons.

/$N_s/$ = number of samples in training data set.

/$\alpha/$ = an arbitrary scaling factor, usually 2-10.
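Plugging in numbers makes the rule of thumb easier to judge. The function name and the example problem below are made up for illustration:

```python
def max_hidden_neurons(n_samples, n_inputs, n_outputs, alpha=2):
    """Upper bound on hidden neurons according to the rule of thumb above."""
    return n_samples / (alpha * (n_inputs + n_outputs))


# Hypothetical problem: 10,000 samples, 20 inputs, 5 outputs
print(max_hidden_neurons(10_000, 20, 5, alpha=2))   # 200.0
print(max_hidden_neurons(10_000, 20, 5, alpha=10))  # 40.0
```

So for this made-up problem the bound lands somewhere between 40 and 200 hidden neurons, depending on how conservative you want to be.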

### How do I avoid overfitting with only 100 samples and 1000 features?

Wanting to fit more features than you have samples is a futile business.

If you’re using neural networks, you can try to aggressively deploy regularization techniques such as dropout.

You may also want to try and reduce the number of features by selecting the n most important features, or by combining them using PCA.
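As a sketch of the PCA route, here is a bare-bones version built on NumPy's SVD. The random data is only a placeholder with the same shape as your problem, and `pca_reduce` is my own helper; scikit-learn ships a ready-made `PCA` class that does the same job.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X of shape (samples, features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))   # placeholder: 100 samples, 1000 features
X_reduced = pca_reduce(X, n_components=10)
print(X_reduced.shape)  # (100, 10)
```

How many components to keep is a judgment call; 10 here is arbitrary.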

If your problem fits it, you might want to try and use SVMs which usually work relatively well when you have many features compared to training examples.

But this probably won’t solve your problem. You should, really, go out and gather more data.

### Is it possible to implement machine learning/intelligence to automation testing? If so, how?

As in generating input that will cause an exception?

That may be possible using an adversarial attacker.

However, it’s not unlikely that it will just learn to give you a generally faulty input such as a string when you expect an integer, or integers that will cause an overflow instead of more intrinsic errors in your implementation.

You may be able to circumvent this to an extent by only raising exceptions when it’s an actual error in the program logic, but how do you determine what’s an error, and what’s going to check the error checking code? (You end up with traditional unit testing with tacked-on ML.)

It’s worth noting, however, that this is not entirely dissimilar to random unit testing such as QuickCheck in Haskell, but this requires the programmer to code in a specific way, and doesn’t work for everything.

### What are tutorials or examples to design regression neural network with Tensorflow?

Regression networks can be made pretty much like how you would make a classification network; just don’t apply the softmax function at the end to allow for continuous values that extend beyond 0 and 1.

There are a few details to keep in mind when doing regression, but just refraining from applying the softmax function at the end (and having training data that supports regression: i.e. the output being a single continuous value) should be enough to create a regression network.
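The difference between the two output layers can be shown without any framework. This NumPy sketch uses made-up activations and weights; in a real regression network you would also typically switch the loss to something like mean squared error.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


# Classification head: softmax squeezes the outputs into (0, 1) and makes them sum to 1
logits = np.array([2.0, -1.0, 0.5])     # hypothetical final-layer pre-activations
class_probabilities = softmax(logits)

# Regression head: a single linear unit without softmax, so the output is unbounded
hidden = np.array([0.5, -1.2, 3.0])     # hypothetical last hidden activations
weights = np.array([4.0, 1.0, 2.0])     # hypothetical output weights
bias = 0.1
prediction = hidden @ weights + bias
print(class_probabilities.sum(), prediction)
```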

### Is there a paper on removing text from images using machine learning or deep learning?

I don’t know if there’s a paper for that specific problem, but you can find papers for a lot of the subproblems.

One way of breaking the problem down is:

- Detect bounding boxes for the text
- Naïvely remove it, or otherwise mark it as not a part of the image.
- Use image reconstruction to interpolate the missing data.

Bonus points if you get one network to do all the steps in one go.

If you have access to any kind of image library, creating the training data is trivial as you can just superimpose the text on the source image to create the signal the model will receive.

Do reuse the same source image with different text to increase the amount of training data available, and to help the network detect the text in the first place.

Ideally the text size, color, rotation, position, and wording should be randomized.

### Is there a way to extract the underlying function of a trained neural network?

The neural network, specifically forward propagation after training (prediction) is the underlying function.

As I understand it, what is meant by neural networks being black boxes is exactly the lack of an intuitive interpretation of the weights, and thereby the function which happens in most cases.

However, there has been some research into “whitening” the black box, for example by visualizing convolutional filter weights. This helps understand what features are being matched.

Knowledge about this in turn helps the intuition behind adversarial examples, where the convnet representation of an object, say a cat, can be added to an image of another object, say a hammer, to make the classifier believe with great certainty that the object is a cat even though any human would say it's a hammer.

### How can I do data preprocessing during image classification with artificial neural networks when images have different sizes?

The simplest method is just to scale the images so that they have the correct size.

This can be done by either up/downscaling or adding padding to enforce a uniform resolution.
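The padding variant is easy to sketch with NumPy. The helper below is hypothetical and assumes 2D grayscale images that are no larger than the target size; larger images would have to be downscaled or cropped first.

```python
import numpy as np


def pad_to_size(image, height, width):
    """Zero-pad a 2D grayscale image (no larger than the target) to height x width."""
    pad_h = height - image.shape[0]
    pad_w = width - image.shape[1]
    # split the padding as evenly as possible between the two sides
    return np.pad(image, ((pad_h // 2, pad_h - pad_h // 2),
                          (pad_w // 2, pad_w - pad_w // 2)))


images = [np.ones((20, 28)), np.ones((28, 15))]   # placeholder images
batch = np.stack([pad_to_size(img, 28, 28) for img in images])
print(batch.shape)  # (2, 28, 28)
```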

While there may be methods to dynamically alter the size of the network, e.g. by sometimes combining weights, this technique (image transformation) has worked well for me in the past, and the fancy techniques generally do the same, just in a more complicated way.

### Can I use a CNN fine-tuned on ResNet50 to classify human images into 7 emotions? What are some better transfer learning approaches to do this?

I’m not sure if I’m the best person to answer this question as I’ve very little practical experience with transfer learning, but since I was asked directly, and the question remains unanswered, I’ll have a go at it.

Assuming you want to use pre-trained weights, for which I could only find weights for the imagenet dataset, there’s a lot of work done which won’t help you.

For example, you probably only care about human faces to classify expressions[1], so it being able to detect musical instruments is completely useless to you.

While having a too complex model for your problem is bad enough, it being able to classify so many different things may mean that the resolution for each category is low.

That is it may not have features required to understand the subtleties of facial expressions, so you’d have to learn these in your own conv-net. Note that it’s possible that there’s some overlap between categories which you can utilize, but I don’t know how often it comes up in practice. (It’s here my relatively sparse experience comes to show)

Assuming that the above is true, the ResNet50 will essentially serve as a very complicated human face detection network.

With that said, some ideas for what you could do:

- Train the ResNet yourself on your dataset and hope the architecture will work well on that as well.
- Depending on your setup, it may not be feasible, which is why it’s not the preferred option.
- You can use another network trained specifically for face recognition as your base network. This is my preferred method.
- You can forget transfer learning, and train a network from the ground up which is more expensive than transfer learning, but probably less so than training ResNet yourself.
- You can go ahead with the ResNet trained on ImageNet, or if possible, on something more relevant to your problem, and hope the cross section of the many categories work well for further segmentation of facial features.

[1] I’m using expressions instead of emotions as a person may purposefully look happy when they’re sad; something you cannot measure from their facial expression alone assuming they are adequately skilled at it.

### Why doesn't my neural network for XOR work?

The error function graph looks symptomatic of a learning rate that is too high, such that when it approaches a minimum, it oversteps it by a little and has to reverse, which causes it to overstep the minimum even more, but in the opposite direction.

Furthermore, you may also have an implementation bug as gradient descent is supposed to converge once the error is 0.

A correct error function would decrease on every iteration with high velocity in the beginning, and will flatten out towards the end. It would also be inversely proportional to the accuracy function like depicted below.

### What would happen if we put Machine Learning on the output of a Randomizer and give it all the information that the Randomizer uses?

Your question sounds something like using machine learning on the output of a pseudo random number generator in order to predict whether it was truly random or not.

As far as I know, there have been no attempts at doing that, but there exists a series of statistical methods, such as chi-squared testing, that can be used to analyse the quality of a random distribution.
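To give an idea of what such a test looks like, here is a sketch of Pearson's chi-squared statistic against a uniform distribution, in plain Python. The two generators, the seed, and the bin count are made up for the demonstration; a proper test would compare the statistic to a critical value (roughly 16.9 at the 5% level for 9 degrees of freedom).

```python
import random


def chi_squared_statistic(samples, n_bins):
    """Pearson chi-squared statistic against a uniform distribution over n_bins."""
    counts = [0] * n_bins
    for s in samples:
        counts[s] += 1
    expected = len(samples) / n_bins
    return sum((c - expected) ** 2 / expected for c in counts)


rng = random.Random(42)
uniform = [rng.randrange(10) for _ in range(10_000)]
# deliberately broken generator: the value 9 is heavily over-represented
biased = [min(rng.randrange(12), 9) for _ in range(10_000)]

print(chi_squared_statistic(uniform, 10))  # small for a good generator
print(chi_squared_statistic(biased, 10))   # very large for the biased one
```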

These methods, however, cannot be classified as machine learning techniques.

Another interpretation of the question is with emphasis on “all the information the randomizer uses” which would mean the algorithm, and, ever so importantly, the seed value. In this case it’d be trivial to predict the next number, and the one after that, and after that…
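With Python's built-in `random` module the point is easy to demonstrate; no learning is required at all. The seed value here is arbitrary.

```python
import random

seed = 1234  # hypothetical seed that the "randomizer" was initialized with

generator = random.Random(seed)
observed = [generator.random() for _ in range(5)]

# Anyone who knows the algorithm and the seed can replay the exact sequence
replica = random.Random(seed)
predicted = [replica.random() for _ in range(5)]

print(predicted == observed)  # True
```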

### Let's say that I want to build a decision tree ensemble that can make predictions very quickly. Is there a library that can help me build such an efficient ensemble?

Sklearn is such a library.

You can use their random forest classifier or regressor ensemble class which is highly optimized (though only runs on the CPU).

The basic setup will look something like this:

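```
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# placeholder data; replace with your own features and labels
features, labels = make_classification(n_samples=200, n_features=10, random_state=0)
# split into train and test data
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.25, random_state=0)
# initialize the classifier (these parameters are only illustrative)
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
# train the classifier
clf.fit(features_train, labels_train)
# compute accuracy
acc = clf.score(features_test, labels_test)
print(acc)
```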

In fact, I've written an entire essay on how to use sklearn.

### How can I explain the fact that with learning rate equals 1 I get much better accuracy than learning rate equals to 0.01 in CNN?

You’re likely overstepping local minima to converge on something closer to the global minimum.

Consider the following graph:

Although your model probably has more dimensions, let this graph represent the decision space of a three dimensional model. (This is because humans generally suck at visualizing more than three dimensions, but the math doesn’t give a shit, so we are all good.)

Imagine you are standing somewhere on the graph above. Let’s for the sake of demonstration say that your initial position is at the top of the red uprise around point (0,2,6).

Currently, your loss function value is super high (6) - this is what you want to minimize.

Take a moment to look at the graph. If we want to minimize our loss function (the z value), where would the optimal convergence point be?

The answer is somewhere around (0,-2,-6), so this is where we want our algorithm to end up.

It’d be great if there were a GOTO(global_minima) function, but there isn’t. In fact, all we actually know is the function that produced this graph (the loss function) - the graph is just an educational tool.

Now, I won’t go too much into other optimization algorithms as this is primarily about gradient descent, but I’ll briefly mention that the naïve approach of trying random values Monte Carlo style won’t be very efficient as you’re assuming there’s no relationship between good and bad values.

For the Monte Carlo method to be efficient, the graph should look something like this where the dots are points on the graph (with random values):

Gradient descent is a better alternative to our naïve approach. Gradient descent works by taking a step towards a lower point on the graph, and continuing to do so until you reach somewhere where you can no longer take a step down. This is where the algorithm converges.

The length of each step is what we call the learning rate.

Let’s try to run this algorithm on our graph above.

(I recognize that this was poorly drawn. The distance between the points is supposed to be constant)

We see that the algorithm converges at (-1,2,-4), and not (0,-2,-6) which we determined would be the optimal outcome.

To answer your question, the reason you see an improved accuracy when you increase the learning rate is that it oversteps some of the local minima, and therefore converges closer to the global minimum (not spatially necessarily, but in value).
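The effect can be reproduced on a one dimensional toy problem. The loss function, starting point, and learning rates below are chosen by me purely so that the behaviour shows up; they say nothing about your CNN, and a rate that is too large can just as easily diverge.

```python
def loss(x):
    # toy loss with a local minimum near x = 1.13 and the global minimum near x = -1.30
    return x**4 - 3 * x**2 + x


def loss_derivative(x):
    return 4 * x**3 - 6 * x + 1


def descend(x, learning_rate, steps=1000):
    for _ in range(steps):
        x = x - learning_rate * loss_derivative(x)
    return x


x_small = descend(2.0, learning_rate=0.01)  # creeps down into the nearby local minimum
x_large = descend(2.0, learning_rate=0.1)   # first step jumps over it, ends in the global one
print(x_small, loss(x_small))
print(x_large, loss(x_large))
```

Both runs start from the same point; only the step length differs.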

Mathematically, gradient descent works by taking the derivative of each variable (each dimension) to determine which way it should step. If done numerically, this is the equivalent of feeling with your feet which way will make you go down the fastest, and updating the position based on the learning rate.

`position = position + learning_rate * direction_vector`

The algorithm can therefore be classified as greedy: it has no concept of the overall landscape, and thus of the global minimum, so it is limited to making the locally best choice at each position.

When using gradient descent, you’re only guaranteed to converge on a local minimum; not the global minimum. Though in practice, gradient descent rarely converges. Instead, it’s trained for n iterations, or until it doesn’t improve the accuracy anymore.

It’s possible to minimize the chance of converging on only a local minimum by training the network multiple times using different initialization points; the point on the high dimensional graph from where you start your descent.

The problem with this technique is that it requires a lot of computing power, and doesn’t necessarily improve your results (there’s still no guarantee that it will find a better local minimum).

Furthermore, by introducing an element of randomness, you give up the determinism which may be a problem in some scenarios.

Other ways of minimizing the odds of convergence on local minima include using the momentum method, and stochastic gradient descent. There are a few others as well, but they all fall into the category of optimizers.