Interpreting the Elo Ratings for Python Global Optimizers

Microprediction · Published in Geek Culture · Mar 3, 2023


I just created some documentation for HumpDay and I’m refreshing this post about the Elo rating methodology a little so I can reference it.

Figure: a subset of the optimizer Elo ratings (refer to the live ratings for current values).

HumpDay continues my quixotic quest, begun in Comparing Global Optimization Packages, to locate the best, practical derivative-free global optimizer packages that can be easily used from Python. HumpDay provides a simple, unified calling syntax into some — but by no means all — functionality from a dozen of the most popular Python global optimization packages.

Optimizer Elo Ratings

The sister repository optimizer-elo-ratings uses this unified calling syntax to assess performance. That isn't an easy task, but HumpDay contains a growing list of challenges that I hope will become more representative over time. It uses the classic functions you are familiar with, such as griewank, rastrigin, and their ilk. See objectives/classic.py for the full list.
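
To make the convention concrete, any of those objectives can be evaluated directly at a point in the unit hypercube. A quick sanity check (using the salomon_on_cube objective that appears later in this post) looks like this:

from humpday.objectives.classic import salomon_on_cube

# Classic objectives take a single point in [0,1]^n and return a scalar to minimize.
print(salomon_on_cube([0.5, 0.5, 0.5]))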

In this post, I explain the Elo rating methodology for optimizers in more detail than you likely care for. The ratings power the humpday/comparison utilities (just open this notebook in Colab) and leverage the simplicity of a common calling syntax.

You can quickly do your own bespoke evaluations of optimizers from different packages and, thereby, arrive at one or more likely to serve your purposes, not mine.

Popular Python Optimization Packages

One has quite a choice these days, as a long table in this older post makes clear. The pain of choosing an optimizer centers on the fact that between scipy.optimize, dlib, ax-platform, hyperopt, nevergrad, optuna, bayesopt, platypus, pymoo, pySOT and skopt (and more by the time you read this) there is barely a convention in common.

The HumpDay package also makes amends for some notable omissions in my initial look, Comparing Python Global Optimizers (post). I somehow left out two optimizers provided by Facebook: ax-platform and nevergrad. Sorry folks! You'll also find consistent signatures for scikit-optimize (a.k.a. skopt), ultraopt, dlib and bayesian-optimization that didn't appear previously. Also FreeLunch.

In HumpDay you will find more than fifty different optimization strategies spanning the usual suspects inspired by nature, Bayesian surrogate methods, homology, Parzen and hybrid approaches (not to mention, as they say on the radio, all the hits from the sixties, seventies and eighties — pattern search ain’t broken). Within each package I’ve tried to span a reasonable set of choices. Each is rendered as a simple function, even simpler than the scipy signature:

best_val, best_x = strategy( objective, n_trials, n_dim )
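
To make that signature concrete, here is what a deliberately naive strategy conforming to it might look like (an illustrative sketch, not one of the strategies shipped in HumpDay):

import random

def random_strategy(objective, n_trials, n_dim):
    # Pure random search: evaluate the objective at uniform points in [0,1]^n_dim
    # and keep the lowest value found, along with its location.
    best_val, best_x = float('inf'), None
    for _ in range(n_trials):
        x = [random.random() for _ in range(n_dim)]
        val = objective(x)
        if val < best_val:
            best_val, best_x = val, x
    return best_val, best_x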

Cube or Simplex

I was minimalist here, so objective is a function taking the hyper-cube as its domain. At least for me there isn’t a huge loss of generality, for two reasons:

  1. The domain can usually be mapped to a hypercube in some fashion (a minimal rescaling sketch follows this list). Indeed, the case has been made that for Lipschitz/Hölder functions one doesn't even need to preserve dimension. See Lera and Sergeyev (pdf) for one example, though this is a whole sub-field.
  2. I introduced a new way to map the simplex to the cube which you can read about here.
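
If your objective is naturally defined on a rectangular box rather than the unit cube, the mapping in reason 1 can be as simple as an affine rescaling. Here is a hand-rolled sketch, separate from the ready-made transforms that HumpDay ships:

def on_cube(box_objective, lows, highs):
    # Wrap an objective defined on the box [lows[i], highs[i]] so that the
    # wrapped version accepts points u in the unit hypercube [0,1]^n instead.
    def wrapped(u):
        x = [lo + (hi - lo) * ui for ui, lo, hi in zip(u, lows, highs)]
        return box_objective(x)
    return wrapped

# e.g. cube_objective = on_cube(my_box_objective, lows=[-5.0, -5.0], highs=[5.0, 5.0])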

I hope that, as far as your views on optimizers go, the standardization provided by the HumpDay package makes it operationally trivial for you to form at least a weak prior as to what might be useful.

Now down to nuts and bolts … suppose you have an objective function that captures the essence of many similar tasks you face, or a collection of the same. To run one optimizer:

from humpday.optimizers.nevergrad import nevergrad_ngopt_cube
from humpday.objectives.classic import salomon_on_cube
best_val, best_x, feval_count = nevergrad_ngopt_cube(salomon_on_cube, n_trials=100, n_dim=3, with_count=True)

You can then switch seamlessly from nevergrad to ax-platform to dlib or whatever you can find in humpday/optimizers. Or, to test all the strategies from one package against some pre-canned objective functions, you can write something like:

from humpday.optimizers.pysotcube import PYSOT_OPTIMIZERS
from humpday import CLASSIC_OBJECTIVES

for objective in CLASSIC_OBJECTIVES:
    for optimizer in PYSOT_OPTIMIZERS:
        print((optimizer.__name__, optimizer(objective, n_trials=100, n_dim=3, with_count=True)))

Similarly one can, given enough time, test all strategies from all optimization packages:

from humpday import OPTIMIZERS
from humpday import CLASSIC_OBJECTIVES

for objective in CLASSIC_OBJECTIVES:
    for optimizer in OPTIMIZERS:
        print((optimizer.__name__, optimizer(objective, n_trials=100, n_dim=3, with_count=True)))

Obviously, you could replace CLASSIC_OBJECTIVES with your own list of objective functions. To emphasize, an objective is a function taking just one argument — a single list or vector describing a point in the hypercube [0,1]^n. I dare say functools.partial may be your friend, and there are a few trivial transforms provided to squish domains as you need to.
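
For example, if your objective carries extra fixed parameters, functools.partial pins them down so the optimizer sees a one-argument function on the cube. A hypothetical sketch (my_parametrized_objective and its penalty argument are made up for illustration):

from functools import partial

def my_parametrized_objective(u, penalty=1.0):
    # u is a point in the unit hypercube; penalty is a fixed hyper-parameter.
    return sum((ui - 0.3) ** 2 for ui in u) + penalty * abs(u[0] - u[-1])

# Fix penalty so the result takes a single argument, as HumpDay expects.
objective = partial(my_parametrized_objective, penalty=0.1)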

Top Performers

The Elo ratings, presented with no expense spared at optimizer-elo-ratings, are an attempt to get around the assessment problem caused by “naughty” optimizers — those who insist on performing more objective function evaluations than they are instructed to use.

For readers of my previous post, which tried to get at optimizer performance in a different way, there are some familiar characters here and no huge surprises — though I repeat the caveat that some techniques (e.g. surrogate) might be favored by the somewhat stylized objective functions I am using here.

The flexibility of other techniques (e.g. parallelism, discrete optimization, nested domains, et cetera) isn't allowed to shine. My intent isn't to downplay the cleverness of these packages' design features.

As more real-world objective functions are introduced, it will be interesting to see how the Elo ratings change. I refer you to the current leaderboard.

The new Facebook inclusions are doing pretty well — which is a relief given that I’ve not been entirely helpful promoting their signature time series package (ahem). Facebook organized an open competition last year to improve nevergrad and IOHprofiler (link) and I’m sure the optimizers will get even better over time. They also cater to a wide range of dimensions and domains.

However, at the risk of being booted from the nevergrad Facebook group my new go-to is dlib. It uses the approach developed by Malherbe and Vayatis (paper) as noted above, which is annoyingly elegant. There is a nice video on the blog and a very clear explanation. Not only is the performance extremely good, but the run time is likely to be negligible for any real world, expensive objective functions. The library won’t let you run problems with more than 34 dimensions, but that seems to be the only limitation. The only drawback I can think of (for my particular domain) is constantly wishing I had come up with it myself.

Another good one is freelunch. Almost as good as its name.

Elo ratings are flawed in the same way that Value at Risk is flawed. I don’t suggest that “most likely to get the lowest value for a fixed number of function evaluations” is everyone’s cost function. There are some other things to be aware of too, such as the fact that some optimizers, including SHGO, really struggle to limit themselves to a fixed number of evaluations and that can hurt them in head-to-head matches when playing Black. Whether that penalty is considered fair depends on your needs. I’ll go into a little bit more detail on the methodology in a moment, to illuminate these points.

Recommendation

First, let me mention that you can poke around in humpday/comparison for utilities that will generate recommended optimizers. These are only as good as the category Elo ratings behind them, which need more time to equilibrate. But say you want some quick advice on which optimizers perform best on 8-dimensional problems when given 130 function evaluations to play with, and perhaps you also want to limit the total computation time of the optimizer (as distinct from the objective function) to 5 minutes. Use:

suggest(n_dim=8, n_trials=130, n_seconds=300)

The category Elo ratings are in the process of being computed by dimension and function evaluation limit. I’ve tried to warm them up a little, but they are really intended to be a free byproduct of testing.

Go up one directory to find more, here. For some other time saving utilities I’d refer you to the HumpDay README.md as that is likely to be better maintained than this blog article. For instance:

import math
import time

from humpday import recommend

def my_objective(u):
    time.sleep(0.01)
    return u[0] * math.sin(u[1])

recommendations = recommend(my_objective, n_dim=21, n_trials=130)

As this particular function is very fast, some optimizers will be culled from the list. There's no point spending a long time updating a surrogate function if you can sample the objective many times instead.

Hopefully this speeds your search or encourages you to try some packages you would otherwise overlook. I suppose, in addition, you could continue to use HumpDay after you find your favorite optimizer, though that comes with some pip dependency baggage and serious limitations, so I don't really recommend it.

The intent is modest here: to provide you with a quick shortlisting of Python optimization packages that might be suitable to your bespoke task, and some templates to get you over the hump (groan) of moving from one syntax to another. At minimum it might provide you with some assurance that in continuing with the one you know and like, you’re not missing much.

How Matches are Officiated

Now to the details of the Elo ratings.

If you are familiar with Elo ratings for chess, and I’m guessing that many of you are, then the only question you will ask is, “How are matches arranged between optimizers?”. That is laid out in the code, and you can even watch games occurring (I’m kidding … I think) or inspect the history. But here’s the English version.

  1. We choose an objective function at random from the list. See the previous article Comparing Python Global Optimizers for an explanation of some of these.
  2. We select a random dimension n_dim from the Fibonacci numbers (because … no reason) and a maximum number of function evaluations, referred to as n_trials in the code.
  3. We choose two optimization strategies at random: one to play White, the other Black. To be clear, an optimization strategy comprises a choice of optimization package, such as optuna, together with the fixing of any other choices or parameters that are required (in this case a choice of sampler, such as the Tree-structured Parzen Estimator, or TPE).
  4. The White player is instructed to minimize the objective function using at most n_trials function evaluations. If it goes over, we try to fool the optimizer by lowering the parameter that (supposedly) controls the number of function evaluations (but evidently doesn't quite do so). Each time we fail, we reduce the parameter by 30%, and we repeat until the actual number of function evaluations, call it white_feval_count, is no more than n_trials.
  5. Assuming success, it is now Black's turn. We perform a similar procedure where n_trials is swapped out for white_feval_count, should they differ. This time we allow Black to exceed white_feval_count by ten percent, and we also give Black finer increments and more attempts.
  6. Hopefully both optimizers obey our command one way or another and, in doing so, minimize the function using roughly the same number of function evaluations. A winner is determined according to the lowest function minimum found; if the proportional difference is very small, a draw is declared. (A rough code sketch of this procedure follows the list.)
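
Here is a rough sketch of that officiating procedure in code. It is a simplification of what the repository actually does: the retry counts, the 30% and 10% adjustments, and the draw threshold below are illustrative, and white and black stand for any two HumpDay-style strategies.

def officiate(objective, white, black, n_dim, n_trials, draw_tol=1e-3):
    # Step 4: White must come in at or under n_trials actual evaluations.
    # Each time it overshoots, shrink the requested budget by 30% and retry.
    budget = n_trials
    white_val, white_feval_count = None, None
    for _ in range(5):
        val, x, count = white(objective, n_trials=int(budget), n_dim=n_dim, with_count=True)
        if count <= n_trials:
            white_val, white_feval_count = val, count
            break
        budget *= 0.7
    if white_val is None:
        return 'void'   # White never constrained itself

    # Step 5: Black must roughly match White's actual count, with 10% slack,
    # finer budget reductions and more attempts.
    budget = white_feval_count
    black_val = None
    for _ in range(10):
        val, x, count = black(objective, n_trials=int(budget), n_dim=n_dim, with_count=True)
        if count <= 1.1 * white_feval_count:
            black_val = val
            break
        budget *= 0.9
    if black_val is None:
        return 'void'   # Black never constrained itself

    # Step 6: lowest minimum wins; near-ties are declared draws.
    if abs(white_val - black_val) <= draw_tol * max(abs(white_val), abs(black_val), 1.0):
        return 'draw'
    return 'white' if white_val < black_val else 'black'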

Sometimes a match will be declared void because step 4 or step 5 fails, meaning that one optimizer cannot constrain itself appropriately. This occurs if the White player exceeds n_trials repeatedly, or if the Black player exceeds white_feval_count by more than 10%, despite strong encouragement. A match is also void if either algorithm crashes. There is no other penalty, at present.

The rest is just standard Elo updating, but I'll save you from re-watching The Social Network if you are rusty on that topic (is it mildly ironic that we are judging Facebook algorithms with the algorithm that catalyzed The Facebook?). The Elo system assumes that if Black's prior rating exceeds White's rating by a difference of d, then White can expect to score

E = 1 / (1 + 10^(d/F))

points, where a win counts as 1.0, a draw 0.5 and a loss 0.0, and F is a constant called the F-factor. If White actually scores p points, that leads to a performance differential of

δ = p − E

for White and −δ for Black. Both players' ratings are updated by an amount equal to this performance differential multiplied by the so-called K-factor. The F-factors and K-factors in use at the time of writing are set in the optimizer-elo-ratings code.

We take the liberty of using a larger F-factor than in chess. The rationale is that a single optimization is somewhere between an entire game of chess and a hand of poker.
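
In code, the rating update described above amounts to a few lines. A minimal sketch of the standard update, with the F-factor and K-factor passed in rather than hard-coded (the values in the usage comment are illustrative, not necessarily the ones used in the repository):

def update_elo(white_rating, black_rating, white_points, f_factor, k_factor):
    # Expected score for White, given the rating gap d = Black - White.
    d = black_rating - white_rating
    expected = 1.0 / (1.0 + 10.0 ** (d / f_factor))
    # Performance differential: actual points minus expected points.
    delta = white_points - expected
    # Both ratings move by K times the differential, in opposite directions.
    return white_rating + k_factor * delta, black_rating - k_factor * delta

# Example: White (1800) beats Black (1900), so white_points=1.0.
# new_white, new_black = update_elo(1800, 1900, 1.0, f_factor=1000, k_factor=25)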

Interpreting Absolute Optimizer Elo Ratings

Unlike chess Elo ratings, the optimizer Elo scale is anchored. The random optimizer’s rating is reset to 1600 after every game. For the avoidance of doubt, the random optimizer is the optimizer that chooses uniformly on the hypercube n_trials times, returning the lowest value found.

This presumably limits ratings drift over time, and also makes it relatively easy to interpret an optimizer rating in an absolute sense. For example, using an F-factor of 1000 we can read off the expected points gained in a game when an optimizer is compared to random search. If draws are considered rare, this is essentially equivalent to the probability that an optimizer will help, rather than harm, when you employ it.

As also noted in the previous post, that particular measure may not be the best statistic. For example, you might be interested in the average time taken by an optimizer to perform reasonably well, as we were in that article. Or you might be interested in some tail-weighted metric that penalizes cases where it never finds the minimum. Such metrics might lead you more towards optimizers that try to find all minima, and Elo ratings won't always correlate with them.

But with those caveats and more, here's an interpretation. Assuming no draws, the probability that an optimizer rated R performs better than random search is 1/(1 + 10^((1600 − R)/1000)).
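
Concretely, with the random optimizer anchored at 1600 and an F-factor of 1000 as above, converting an absolute rating into a probability of beating random search is a one-liner (a sketch under those assumptions):

def prob_beats_random(rating, anchor=1600.0, f_factor=1000.0):
    # Expected score against the anchored random optimizer, ignoring draws.
    return 1.0 / (1.0 + 10.0 ** ((anchor - rating) / f_factor))

# For example, a rating of roughly 2077 gives prob_beats_random(2077) of about 0.75,
# i.e. beating random search three times for every loss.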

Now as computed, the Elo rating is more a function of matches against other optimizers than it is against the random optimizer, so take it with a grain of salt. But notice that the very best optimization strategies are better than random search three times more often than not. That might be significant, depending on the economic cost of your search.

Random search isn’t dreadful, and it is well appreciated that randomly choosing points from the space to be evaluated is superior to grid search, something we see all too often (see the paper by Bergstra and Bengio for a theoretical treatment). On the other hand, random search is certainly easy to implement, so one might reasonably be concerned if an optimizer is assigned a rating below 1600 — as seems to be the case for the “portfolio” algorithm in the ax-platform at present, and for several marsupials in the platypus package.

But again, there is a lot of noise in the ratings (it is an exercise for the reader to estimate how much) and, as new objective functions are introduced that are better suited to evolutionary algorithms, relative assessments will probably change. There is another reason why some optimizers might lose out to the random search on the Elo scale that the reader may have noticed. If they are not good at controlling their function evaluation budget, then the head-to-head procedure, as outlined above, may result in them using fewer function evaluations than their opponent — at least when they play Black.

Interpreting Relative Optimizer Elo Ratings

There are theoretical flaws in the Elo system and inaccuracies in its underpinning assumptions (logarithmic odds ratios). Yet the chess world carries on and so shall we. I hope that the relative differences in the Elo ratings of optimizers can provide some rough guidance, allowing you to weigh possible performance differences against a myriad of other considerations: design, ease of use, responsiveness to GitHub issues, generality of the search domain, ease of modification and so forth.

Here's an oddly familiar relationship between differences in optimizer Elo ratings and the likelihood that one optimizer will find a lower minimum than the other: for a rating difference of Δ, that likelihood is roughly 1/(1 + 10^(−Δ/F)), just as in chess (with F in place of the usual 400).

I would not dismiss relatively small differences as economically insignificant if each evaluation corresponds to a lengthy simulation (or worse, some physical operation). However, I think this might also grant some of you the liberty to choose your package based on non-performance criteria in many instances and sleep well — say if the rating differential amounts to what might be won or lost in a single game (+/- 30 points initially, then +/- 15 later on) or two.

This exercise has convinced me further that download statistics and other measures of popularity are dreadful regressors for performance. Hopefully, more optimizers will be assigned Elo ratings in the future and again, if you have a suggestion for these lists, please file an issue or, better yet, a pull request.

Contributing

It should be easy to make helpful PRs:

  • Just grab your favorite optimization package from those that I've missed thus far.
  • Shape it as I have the others in the optimizers directory, and make a pull request.
  • Or … find bugs. Special thanks to Robert Kern for finding some horrendous ones already.
  • Or … suggest better choices of method or hyper-parameters within a package.

Interesting objective functions on the hypercube are also in demand, and you can shove them here. Incidentally, there are a couple of related projects that might pique your interest too:

  • The embarrassingly library, discussed here, which is my somewhat experimental hack for robustness. It modifies the objective function only, so it is uncoupled in an engineering sense from the optimizer itself.
  • An ongoing effort to boost Elo ratings for time-series algorithms, judged out of sample on live data, was part of the motivation for HumpDay. HumpDay grew out of a sub-module there, but I carved it out in the interest of maintainability.

Anyway, I hope HumpDay saves you some time, simple as it is. To ensure articles like this are in your thread consider following microprediction on LinkedIn. We’re trying to make powerful bespoke AI free to the world, and you are welcome to use it or contribute in large or small ways.
