Numerical Optimization is one of the central techniques in Machine Learning. For many problems it is hard to figure out the best solution directly, but it is relatively easy to set up a loss function that measures how good a solution is - and then minimize the parameters of that function to find the solution.
I ended up writing a bunch of numerical optimization routines back when I was first trying to learn javascript. Since I had all this code lying around anyway, I thought that it might be fun to provide some interactive visualizations of how these algorithms work.
The cool thing about this post is that the code is all running in the browser, meaning you can interactively set hyper-parameters for each algorithm, change the initial location, and change what function is being called to get a better sense of how these algorithms work.
All the code for this post is up on github if you want to check it out, it has both the minimization functions as well as all of the visualizations.
Nelder-Mead
Pretend for a minute that you don't remember any calculus, or even any basic algebra. You're given a function and told that you need to find the lowest value.
One simple thing to try would be to sample two points relatively near each other, and just repeatedly take a step down away from the largest value:
\(=\) \(\log{(1 + \left|x\right|^{2+\sin x})}\)
The obvious problem in this approach is using a fixed step size: it can't get closer to the true minima than the step size so it doesn't converge. It also spends too much time inching towards the minima when it's clear that the step size should be larger.
To overcome these problems, the Nelder-Mead method dynamically adjusts the step size based off the loss of the new point. If the new point is better than any previously seen value, it expands the step size to accelerate towards the bottom. Likewise if the new point is worse it contracts the step size to converge around the minima.
The usual settings are to half the step size when contracting and double the step size when expanding. For the 1 dimensional case above, this works like a galloping search doubling in size until it's bracketing the minima, when it switches to contracting and then does a binary search.
This method can be easily extended into higher dimensional examples, all thats required is taking one more point than there are dimensions - and then reflecting the worst point around the rest of the points to take a step down. Take a look at this contour plot to see how this works in 2 dimensions:
\(=\) \(x^2 + y^2 + x \sin y + y \sin x\)
Click anywhere in this graph to restart with a new initial location. This method will generate a triangle
at that spot and then flip flop towards the minima at each iteration, expanding
or contracting as necessary according to the settings.
While this method is incredibly simple, it actually works fairly well on low dimensional functions. It can even minimize non-differentiable functions like \( f\left(x\right) = \left|\left\lfloor x \right\rfloor - 50\right| \), which all of the rest of the methods I'm going to talk about would fail at.
The biggest downside to any direct search method like this is that they all start to perform terribly with higher dimensional functions. For 1 and 2 dimensional examples like above, Nelder-Mead performs well - but machine learning models can grow to millions if not billions of parameters, and this method won't work on even simple problems with more than a dozen or so parameters.
One of the problems is in figuring out the direction to go: this isn't too hard in a 2-dimensional space, but gets exponentially more difficult as the number of dimensions grow.
Gradient Descent
One possible direction to go is to figure out what the gradient \(\nabla F(X_n) \) is at the current point, and take a step down the gradient towards the minimum. The gradient can be calculated by symbolically differentiating the loss function, or by using automatic differentiation likeTorch and TensorFlow does. Using a fixed step size \(\alpha\) this means updating the current point \(X_n\) by:
$$ X_{n+1} = X_n - \alpha \nabla F(X_n) $$
This leads to taking the path of the steepest descent, which looks vaguely like a ball rolling down a hill towards one of the local minima:
\(=\) \((x^2 + y - 11)^2 + (x+y^2 -7)^2\)
The problem with this method is in setting the learning rate. If you set the rate too low Gradient Descent takes forever to find the solution, taking many tiny steps towards the solution. Setting the learning rate too high and it will wildly oscillate around the minima without converging. Even worse, the best learning rate changes from function to function so there isn't a single value that makes for a good default.
A line search can modify the learning rate at each iteration so that both the loss always is descending, which prevents it from overshooting the minima - and also making sure that the gradient is flattening out sufficiently , which prevents taking too many tiny steps. Enabling line search here leads to fewer iterations with the downside that each iteration has might have to sample extra function points.
Even when using a line search, Gradient Descent still struggles with functions like Rosenbrocks Function. The problem is that sometimes the best direction to take isn't along the gradient, and you need to also consider the curvature of the function as well.
Conjugate Gradient
The Conjugate Gradient method tries to estimate the curvature of the function being minimized by including the previous search direction with the current gradient to come up with a new better search direction.
Jonathan Shewchuk wrote a great paper called An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, where he describes in depth how the Conjugate Gradient Method works. The only problem with it is that it's substantially longer than this entire post is - and it doesn't even start talking about the non-linear case until the forty-second page (which gives me nightmares of what the agonizingly painful version he had to learn from must have been like).
While the theory behind this might be a little involved, the math is pretty simple. The initial search direction \( S_0 \) is the same as in gradient descent. Subsequent search directions \(S_n\) are computed by :
$$\beta = \frac{\nabla X_n^T(\nabla X_n - \nabla X_{n-1}) }{\nabla X^T_{n-1} \nabla X_{n-1}} $$ $$S_n = \nabla X_n + \beta S_{n-1} $$
The current location \(X_n\) is updated with this search direction, using a step size \(\alpha\) computed by line search:
$$ X_{n+1} = X_n + \alpha S_n $$
You can see the progress of this method below. The actual direction taken is in red, with the gradients at each iteration being represented by a yellow arrow. In certain cases the search direction being used is almost 90 degrees to the gradient, which explains why Gradient Descent had such problems on this function:
\(=\) \((1-x)^2 + 100 (y - x^2) ^2\)
Another Example
The examples so far have only been on 1 or 2 dimensional functions, which aren't very interesting to optimize for. They also haven't been working on actual data - which is the normal case in most machine learning problems. I thought as a final example, it would be fun to look at how these algorithms do on a multi-dimensional scaling problem.
The challenge here is to convert a matrix of distances between some points into coordinates for each point that best approximate the required distances. One way of doing this is to minimize a function like:
$$loss = \sum_i \sum_j ((X_{i} - X_{j})^T(X_{i} - X_{j}) - D_{ij}^2) ^2$$
Where \(X_{i}\) are the coordinates of each point that we want to minimize for and \(D_{ij}\) is the desired distance between each point.
The data I'm using here is the distances between major North American cities, and the goal is to use that data to build a map of those cities. Passing in the distances between 20 cities involves minimizing a function with 40 parameters:
This visualization shows a couple things pretty clearly:
- The Nelder-Mead method totally fails with more than 10 cities, and even before that struggles compared to the other methods.
- Gradient descent works well if you have an appropriate learning rate, but the optimal learning rate changes with the number of cities. Setting the learning rate too high causes this not to converge, and too low makes it take forever.
- Using a line search with gradient descent leads to a zig-zag pattern, which is smoothed out by using the Conjugate Gradient method.
Further Reading
If you've read this far you've probably figured out that this whole post was just an excuse for me to mess around with some javascript code in part of my misguided attempts to learn the language. Everything I've covered here has been said before, usually by people much more eloquent than I am.
Nocedai and Wright have written an excellent book on numerical optimization that was my reference for most of this. While it is a great resource, there are a couple of other techniques not covered that I quickly wanted to mention.
Sebastian Ruder wrote an excellent overview of gradient descent methods that go into more depth, especially in the case of stochastic gradient descent on large sparse models like those used to train deep neural networks.
One cool derivative free optimization method is Bayesian Optimization. Eric Brochu, Mike Vlad Cora and Nando de Freitas wrote a great introduction to Bayesian Optimization. An interesting application of Bayesian Optimization is in hyper-parameter tuning - there is even a company SigOpt that is offering Bayesian Optimization as a service for this purpose.
Published on 25 November 2016