Elo Rating in Pure SQL

Recently, I’ve been working on this website where you can rate on your favorite Minecraft mob. It was a very interesting project, and there are actually quite many noteworthy things I could talk about here. However, for this article, I would like to take a closer look at how the rating is implemented.

So buckle in, let’s get started.

The Elo System

The ranking of the mobs is established by using the Elo rating system. Now, it’s been a minute since my last statistics class, so I’m not an expert in any of this, but here’s the basic idea of how it works:

Each player is assigned a rating, and the probability of a win directly depends on the difference in rating. The scale of the ratings is essentially arbitrary, but for chess – what Arpad Elo originally designed the system for – a difference of 300 points corresponds to a win-probability of ~85% (which is also pretty close to the standard deviation).

$E_A = \frac{1}{1 + 10^{(R_b - R_a)/400}}$

$R_A$ and $R_B$ are the current ratings of players A and B. $E_A$ is the expected score of player A, i.e. the probability of player A winning the game. The expected score of player B is, of course, $E_B = 1 - E_A$ .

(The magic number 400 in the formula just defines the scale of the ratings. Chess typically uses 400 for historical reasons, but others are also used, for example 200 in Swiss Table Tennis.)

The new rating for the players then depends on the expected score $E_A$ and the actual score $S_A$ (in the case of chess either 0, ½ or 1) following this formula:

$R'_A = R_A + K \cdot (S_A - E_A)$

The K-factor represents the maximum number of points that can be lost or gained in any one match. This value is often adjusted based on how established a player is.

Additionally, the Elo rating is also used for pairing. As in: Games between equally matched players are more interesting than a game between Magnus Carlsen ( $R_A = 2832$ ) and a chess beginner ( $R_B = 400$ ), where the outcome is essentially determined from the start ( $E_B \approx 10^{-6}$ ).

The Naive Approach

As you might have already read on the About page, I built a very similar website back in high school. Back then, I took the naive approach of just fetching the current ratings once a winner was determined, calculating the new ratings, and updating them in the database.

This has two disadvantages:

It doesn’t scale well with many concurrent users. My original implementation didn’t use database transactions. But even with transactions, the entries for the participants would have to be locked, which is generally not that great from a performance standpoint.
The second issue – probably the more important one – is, that with this solution I’m unable to remove any games after the fact. So, let’s say that there are some erroneous games (maybe somebody was spamming the website) and I want to remove them. Even when I have a list of all previous games, I’d have to recalculate everything from the beginning to determine the current rating.

All this effectively boils down to the fact that the ratings recursively depend on each other. To calculate the rating of player A after game g, I need to know the rating of A at g-1, as well as the rating of opponent B at g-1. But for that, I need to know A’s rating at g-2, as well as A’s previous opponent C’s rating at g-2, and B’s rating at g-2 as well as B’s previous opponent D’s rating at g-2, and so on.

Recursion On Demand

So, here’s the idea: Why not store just the games, and calculate the ratings (only when needed) directly in the database?

SQL with CTE (Common Table Expressions) actually supports so-called “recursive queries”. Personally, I think “inductive queries” would have been a better name, since their structure resembles inductive proofs in mathematics.

There are usually two parts in a recursive query: The anchor term (also called initial term; this corresponds to the base case of an inductive proof), and the actual recursive term (this corresponds to the induction step).

The anchor term defines an initial set of records that are part of the overall result set. The recursive term refers to this result set and can add records based on the existing ones.

To illustrate how this works, here’s an example of a recursive query that executes a Turing machine. (This is just the table structure and the query executing the machine. You can find a full example over on my GitHub. The simplest way to try it out is to copy the complete source code into an online PostgreSQL environment, like the one from OneCompiler.)

CREATE TABLE input (
  input BYTEA
);

CREATE TABLE program (
  state INT,
  read INT,
  next_state INT,
  write INT,
  move INT,
  accept BOOL DEFAULT false,
  reject BOOL DEFAULT false
);

WITH RECURSIVE run_state (iteration, state, position, band, accept, reject) AS (
  SELECT 
    0, 0, 0, input, false, false
  FROM input
  UNION ALL
  SELECT 
    run_state.iteration + 1,
    program.next_state,
    run_state.position + program.move,
    set_byte(run_state.band, run_state.position, program.write),
    program.accept,
    program.reject
  FROM run_state
  INNER JOIN program
    ON  program.state = run_state.state
    AND program.read = get_byte(run_state.band, run_state.position)
  WHERE 
        NOT run_state.accept 
    AND NOT run_state.reject
)
SELECT 
  CASE
    WHEN run.accept THEN 'Input was accepted in ' || run.iteration || ' iterations.'
    WHEN run.reject THEN 'Input was rejected in ' || run.iteration || ' iterations.'
    ELSE '[insert confused meme here]'
  END
FROM run_state AS run
WHERE 
     run.accept
  OR run.reject;
Code language: SQL (Structured Query Language) (sql)

As you can see, the recursive term of the CTE refers back to the CTE itself to get the last state of the machine and calculate the next one. This is essentially what we want to do when calculating the player ratings.

JSON To The Rescue

My first idea was to have each mob rating as its own record in the CTE. In the sense that each match generates two records – one for each mob fighting in the match. That actually worked for a while because my randomly generated test data just happened not to require more than one reference to each match-mob combination.
Welp, lesson learned: Use realistic test data. 😅

Turns out: PostgreSQL doesn’t allow of the same record to be consumed recursively more than once. Kind of makes sense. A pity, though, I thought my solution was pretty clean.

Anyway, after talking to a friend of mine who does SQL magic for a living (Hi Xanatos 👋), we came up with the idea of using non-scalar data structures in the CTE. Originally, I tried using MySQL for this project, but it sucks balls when it comes to complex recursive stuff, since you can’t call the recursion in a subquery. So, at this point I was already using PostgreSQL, which just happens to natively support objects in the form of JSON and, even better, JSONB.

The basic setup I ended up with was the following:

There is a table for the mobs (stuff like name, image, …), and a table for the matches (Which two mobs are fighting, and who won?). Importantly, the matches table has a serial/auto-increment surrogate key – higher surrogate key means newer entry.

The actual logic that calculates the rating history for all mobs is a CTE in a view. Each row of the result contains a reference to a match, as well as a JSONB object with all ratings at the time of that match.

The anchor term of the CTE generates a JSONB object that contains a key for each mob available, and the initial ratings as values (We’ll come back to this later), as well as a “last update” reference (= virtual foreign key against the matches table) that is set to 0 (a value smaller than any generated key in the matches table).

The recursive term gets the latest record in the CTE, and searches for the “next match”. The “next match” is whatever match has the smallest ID greater than the “last update” reference. Based on which mobs are fighting in this match, the expected outcome is calculated. Based on who won, the new ratings are calculated. Lastly, the new ratings are combined with the existing JSONB object – this is the next record in the recursion.

Okay, now we have all the data we need. However, the format is not very useful. So, there is another view that filters the result of the rating history view to just get the latest entry (the one with the highest “last update” reference), and deconstructs the JSONB object into individual records consisting of the mob ID and the rating.

Perfect, now we have one view that always gives us the current rating just from the matches table. We can add or remove entries in the table, and all the ratings will change correspondingly.

Please note, that this is somewhat simplified. In reality, there are a few other small details to consider, just because of the way SQL works. I’ll spare you the complete view – it is quite long, after all. If you are nevertheless interested, feel free to check out the source code on GitHub – you can find the DDL file in /migrations/0001_mobsMatchesAndRating.sql.

Performance & Caching

Okay, so basically we are doing a pretty brutal amount of computation on every single request. On a system that’s optimized for data access, not calculation, no less. What about performance? Surely, this isn’t very fast, right?

Well, yes and no. It’s not exactly blazing fast, but it’s not as bad as you might think. (Ironically, the switch to JSONB improved the performance quite a bit. 😅)

On my local test database with 80 mobs and ~1500 matches, a full calculation of all the ratings takes a little under 1 second. I’m sure, my setup is not ideal here, but the core issue is that the runtime (probably) scales linearly with the number of matches. So, 15 000 matches would already be 10s – that’s too long for a web application.

The solution: We cache the rating history, and seed the recursive query with the cache instead of starting from scratch. All the results stay the same while being a lot more snappy. The cache can be updated once a day or so.

This, of course, kind of breaks one of our goals: We can no longer simply delete matches. We’d need to delete all cache entries after the deleted match. But, I guess, at least we don’t get any wrong results while we are recalculating the cache – the visitors will just have to wait a bit longer.

Conclusion

Personally, I think SQL is a pretty fantastic technology. The mere fact I was able to teach a relational database to calculate Elo ratings is amazing. (I guess, this kinda serves as evidence for why I wrote that SQL is a programming language in my post about HTML.) If only it would be more readable and easier to debug… 😂

Anyway, I hope I could share a bit of the fun had while working on this project. There are actually a lot more fun details in this project that I’d like to talk about, like the custom HTMX extension for image preloading, or the minimalistic software architecture, or the dynamic favicon (that I’m quite proud of), …
However, I thought the Elo calculation in SQL – the thing that kind of prompted the project to start with, is more interesting for a broader audience.

Have a wonderful day and see you later!
(Oh, and don’t forget to check out Mobmash – currently Cat and Axolotl are the lead.)

Yours,
Sigma

EDIT: I just found out that Elliot Chance did a very similar thing back in 2018. You can find his article here.

My pet axolotl Sisu saying “Bye”.