Recently, I ran across a fantastic article, “Why Train When You Can Optimize?”, which introduced me to the Nelder-Mead optimization algorithm. It's a lovely algorithm, and I couldn't wait to create an interactive version. First, though: what does “optimization” mean in this context?
An “optimization” algorithm takes some mathematical function as its input, and tries to find values for the parameters which make the output either as large or as small as possible. If you are like many computer programmers, your first impression might be that you are unlikely to ever use such an algorithm in your own programs. But optimization is a much more general and useful technique than it might seem. The article mentioned above gives a great example: a drawing program which detects when the user is trying to draw a straight line and replaces their jittery line with a perfectly straight one. (If you know other examples of good uses for optimization outside science and engineering, please let me know!)
Obviously, finding inputs for some function which give you the largest or smallest output value can be done without a special algorithm. You could just use brute force: test many inputs and pick the best one. But if the space of possible inputs is large, that could be too slow.
Optimization algorithms typically avoid exhaustively searching the input space by starting at some arbitrary point, then repeatedly searching for a nearby point which is better, until it hits a maximum or minimum and can't find any better point.
Many such algorithms require that you know how to calculate the derivative (slope) of the function at any given point; but Nelder-Mead doesn't need any derivatives, and combined with its general simplicity, this makes it easy to apply.
Now let me show you how Nelder-Mead works. Rather than starting with one test point and iteratively improving it, Nelder-Mead starts with test points, when the input space has dimensions. (Or, in other words, when there are different input variables whose values need to be found.) For example, if your function has two parameters, there will be 3 starting points, which will form a triangle in the 2-dimensional plane of possible inputs:
(From here on, all examples will be 2-dimensional, but the algorithm generalizes naturally to any number of dimensions. Further, we will assume that we are searching for a minimum rather than a maximum.)
Nelder-Mead repeatedly transforms the triangle of test points, replacing the worst point with a better one. This causes the triangle to move across the plane in whichever direction the function's value is dropping, and then contract around a local minimum when it finds one. When the triangle becomes small enough, then the algorithm terminates. Like this:
At each iteration, Nelder-Mead will apply one of four possible transformations to the triangle. Let's see them one by one (try dragging around the points or adjusting the coefficient values if you like):
The magnitude of each transformation can be tuned by adjusting a coefficient. As shown in the above visualizations, the default coefficient values are 1.0, 2.0, 0.5, and 0.5.
To help get a feel for how these transformations can be used to explore the input space and find a minimum, let's play a game. The below square represents the space of input values for a function . Just as Nelder-Mead only computes the function value at the corners of its triangle, I will only show you its value at those three points. Click on any three points to start, then click any of the five transformation buttons to transform your triangle. Once you contract the triangle to a sufficiently small size (or reach 50 iterations), I'll reveal the contours of the graph. You “win” only if your best point is close enough to a minimum point.
This game is quite difficult; don't be surprised if you hardly ever “win”. The Nelder-Mead algorithm doesn't always reach a minimum point, either; or at least, not in a reasonable number of iterations. Sometimes it gets close to a minimum point... and then moves very, very slowly towards it.
For that reason, when implementing Nelder-Mead, you need to limit the number of iterations so it doesn't run for too long. Rather than running Nelder-Mead for a huge number of iterations, you will probably get better results by restarting it several times, with different starting points, and then picking the best overall solution found.
You may be wondering why the algorithm works the way it does. Here are some interesting questions to think about (click to reveal possible answers):
Why does the algorithm use a triangle rather than a single test point?
Why does the size of the triangle change? Why not use a fixed-size triangle which flips, rotates, and slides around the plane?
The final piece of the algorithm, which I haven't described yet, is how it chooses which transformation to use on each iteration. Here is the procedure:
- Find the reflection point (the point which Reflect would move the worst point to) and compute the function's value there.
- If the reflection point is better than the second-best point, but not better than the best point, then do Reflect.
- Otherwise, if the reflection point is better than the best point, then check the expansion point (the one which Expand would move to). If the expansion point is better than the reflection point, do Expand. If not, do Reflect.
- Otherwise, if the reflection point is worse than the second-best point but not worse than the worst point, check the outside contraction point. If it's better than the worst point, do Contract Outside. If not, do Shrink.
- Finally, if the reflection point is worse than the worst point, check the inside contraction point. If it's better than the worst point, do Contract Inside. If not, do Shrink.
Does that seem to make sense? Perhaps these comments might make it more understandable:
While we expect that our function's graph has some kind of curved surface, Nelder-Mead can't “see” the curve; it only knows the value at its three test points (as you experienced when playing the above game). With only three numbers to work with, the best guess Nelder-Mead can make is that it should move the worst point in the direction of the better two. And in the absence of other information, a reasonable default is to move the worst point just far enough to maintain the size and shape of the triangle (that's what Reflect does). If the default was to move it more or less than that, then the triangle would tend to grow and grow or shrink and shrink, even when there was no reason to do so.
However, that guess isn't always right. Even if the triangle is sitting on a slope, it is possible that Reflect might overshoot the base of the slope and start going up the opposite slope. If the reflection point is worse than the existing points, that indicates that we are going too far and need to back off, perhaps by using Contract Inside or Contract Outside.
On the other hand, if the reflection point is better than all the existing points, that strongly suggests that the triangle really is on a slope and that Reflect really is moving in the right direction. In that case, we can try to go even further in the same direction with Expand. This not only moves the triangle further in a good direction, it also enlarges the triangle, which means the following steps will be bigger. In effect, as long as Nelder-Mead keeps picking good directions and each successive point is better and better than the previous ones, it will “accelerate downhill”. That helps the algorithm to move more quickly towards a minimum and converge in a smaller number of iterations.
As for Shrink, the original paper on the Nelder-Mead algorithm explained that Shrink is necessary for the algorithm to avoid getting stuck in some (rare) situations. One example is below. Try various combinations of Refresh, Expand, Contract Inside, and Contract Outside to see if you can get the triangle to close in on the minimum point (which is marked in red). Then try Shrink and see how it helps.
This last visualization will show all the points which Nelder-Mead considers on each iteration, and why it chooses the move which it does. Click the 'Next' button to move forward. Reflection points will be shown in ⬤ blue, expansion points in ⬤ orange, contraction points in ⬤ pink, and shrink points in ⬤ green.
Special thanks to Justin Meiners for the article which inspired this post, to Peter Collingridge for his helpful post on making SVG elements draggable, to Mike Bostock for D3.js, to John Nelder and Peter Mead for their lovely algorithm, and... to you for reading all the way to the end!
 The version of the algorithm which searches for minimum points is all we need anyways, since if we want to find a maximum point, we can just search for a minimum of the negated function instead. ⏎
 In this post by Enrico Schumann, the performance of a single run of Nelder-Mead with N iterations is compared empirically with 4 runs of N/4 iterations each, over varying values of N. The version which restarts 4 times completely dominates the one which doesn't restart. ⏎
 To be precise, the right term is “gradient”. ⏎