
Cubing, Statistics, and the WCA Database: 101

2025-09-28

This article is mostly aimed at people with some statistical background to introduce them to cubing and the primary statistical questions therein. However, it also has a brief explanation of key statistical concepts aimed at cubers.

In 1982, at the peak of the Rubik's craze, 19 competitors representing 19 countries on four continents gathered in Budapest for the Rubik's Cube World Championship. Minh Thai from the United States won the event, solving the Rubik's Cube with a best time of 22.95 seconds. The craze soon died down, and it wasn't until 2003 that another formal in-person competition was organized, born out of an online community that had grown substantially in the prior years. With the success of the 2003 World Championship, more competitions sprouted up around the world, and the World Cube Association was soon formed to standardize and track records from these competitions. Today, over 250,000 people have recorded over 25 million solves at over 10,000 competitions in over 120 countries.

If you're not familiar with speedcubing (also called just cubing), the 2003 World Championship introduced many elements of the discipline that may seem unexpected. For one, competitors could compete in many events: while speedsolving the regular 3x3x3 cube was the main focus, competitors also solved 4x4x4 and 5x5x5 cubes, the shapeshifting Square-1, the tetrahedral Pyraminx and dodecahedral Megaminx, the Siamese cube (which resembles two 3x3x3 cubes stuck together), and the unique puzzles of Rubik's Clock, Rubik's Magic, and Rubik's Master Magic. They could also solve cubes in different ways: a few took on the 3x3x3 cube one-handed, some tried to solve it in as few moves as possible, and two bold competitors even took it on blindfolded, with one competitor also attempting the 4x4x4 and 5x5x5 blindfolded. With a few changes, these form the basis of the modern list of speedcubing events: 2x2x2, 6x6x6, 7x7x7, and Skewb (a cube that turns along diagonal axes) have been added, along with a 3x3x3 multi-blind format in which competitors have an hour to memorize and solve as many cubes as possible; the Magic puzzles have been phased out; Siamese never really made the cut; and a format for solving the 3x3x3 with only one's feet has come and gone.

The 2003 World Championship also introduced a multi-round advancement system. In the most popular event, 3x3x3, the field of 83 competitors was whittled down to 32 in the first round, and just 8 proceeded from the second round to the final. 4x4x4 and 5x5x5 had two rounds, and the rest had only the final. Today, competitions can use multiple rounds freely, and frequently do, especially for quicker events and events with many competitors. In 2026, the rounds system is expected to change to allow for more solving, especially by newer competitors, but the details are still being worked out.

Finally, another important concept from the 2003 World Championship is the introduction of formats. Formats are how competitors within a round are ranked. In 1982, competitors had three attempts and their best attempt was taken. This is known as the best-of-three format, and is still used for blindfolded solving (although 3x3x3 Blindfolded is expected to switch formats in 2026). 3x3x3 Multi-Blind and 3x3x3 Fewest Moves, being hour-long events, can also use best-of-one and best-of-two formats. In 2003, 4x4x4 and 5x5x5 used a mean-of-three format, where the competitors make three attempts and the average of those three is used. Today, it is used for 6x6x6, 7x7x7, and 3x3x3 Fewest Moves. The most important format, though, was only used in the finals for 3x3x3 back in 2003: average-of-five. This is what's known as a "trimmed mean" or "Olympic average": the slowest and fastest times are removed and the average of the remaining three solves is used to rank the competitors. One useful note here is that all events have both a "single" and an "average" time, except for 3x3x3 Multi-Blind (other blindfolded events record average times as the mean of three times in a best-of-three round, even though the averages aren't used for rankings in the competition). Both single and average times are recorded and tracked on the WCA website and are used for personal, national, continental, and world records.
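The average-of-five computation can be sketched in a few lines. This is a minimal Python illustration; the DNF handling follows the WCA convention that a single DNF counts as the worst attempt (and so gets trimmed), while two or more DNFs make the whole average a DNF.

```python
def average_of_five(times):
    """Trimmed mean of five attempts: drop best and worst, average the rest.

    times: list of five floats (seconds), with None representing a DNF.
    Per WCA convention, one DNF counts as the worst attempt and is trimmed;
    two or more DNFs make the whole average a DNF (returned as None).
    """
    assert len(times) == 5
    if sum(t is None for t in times) >= 2:
        return None
    # Sort with DNFs last, then trim one result from each end.
    ordered = sorted(times, key=lambda t: (t is None, t))
    middle = ordered[1:4]
    # Rounded to the centisecond for display.
    return round(sum(middle) / 3, 2)

# Example: the single DNF is trimmed away as the worst attempt.
print(average_of_five([10.21, 9.87, None, 11.03, 10.55]))  # 10.6
```

Note how forgiving the format is: one bad solve (or one DNF) simply disappears, which is much of why average-of-five became the standard for most events.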

As the WCA was created in part to track records, one question has naturally long arisen: "Who's the fastest speedcuber?" While various methods have been proposed to tackle the question of ability across the range of WCA events, most notably Kinch Ranks and Sum of Ranks, even getting a solid answer for a single event is surprisingly difficult. Typically, the world record holder is said to be the fastest, but as the WCA tracks both single and average records for each event, there can be and often are two record holders for any given event. At the time of writing, only eight of the 16 possible events have both single and average WRs held by the same person, but this changes frequently. For some events, either single or average is considered more representative of overall skill, and therefore considered the more important record. For example, heavily scramble-dependent (and therefore luck-based) events such as 2x2x2 focus more on averages, while the difficulty of obtaining averages in the big blind events means that singles are a better representation. To some extent, this is reflected by the round advancement criteria specified by the WCA, but this is not always a perfect match, and for many events it's much more debatable which ranking is best. The CubeStats model, which I will introduce in a later post, is inspired by this question but does not directly answer it. It is instead designed to answer the more statistically rigorous question of "What will be a speedcuber's next result?" While this will be explained in more depth in the post covering the model, it may be worth going into some detail about what this question means, why it's worth asking, and how it differs from the aforementioned rankings.

In statistics, most events of interest are considered random events, that is, there is some component of them which cannot be predicted ahead of time. In cubing, we might consider a single solve to be a random event. We cannot predict how good the scramble will be, how well-prepared the competitor will be, which continuations they will be able to see in the inspection period, whether or not they will lock up, and so on. We could consider each of these components to be random events that influence the solve, but as we have no direct data on any of them (with the notable exception of the scramble), it would be much more difficult to model. When we measure a random event in some way, that measurement is called a random variable. In our case, that random variable is the time of the solve (with the exception of 3x3x3 Fewest Moves and 3x3x3 Multi-Blind). If we look at each possible value that the random variable can take and the probability that it takes those values, we have the probability distribution of the random variable. Once a distribution is defined, we can ask many questions about the random variable. For instance, we could ask what the probability of achieving a time below 10 seconds is, or what the average of a very large number of repeated attempts would be. Indeed, in some sense, a distribution contains all possible information about its random variable.
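For intuition, the empirical distribution of past solves gives a crude estimate of such probabilities. Here is a toy sketch with made-up times (not real WCA data):

```python
from statistics import mean

# Hypothetical sample of one cuber's recent 3x3x3 times, in seconds.
times = [9.41, 10.02, 8.87, 11.30, 9.95, 10.64, 9.12, 10.88, 9.70, 10.15]

# Empirical estimate of P(next solve < 10 s): the fraction of
# past solves under 10 seconds.
p_sub10 = sum(t < 10 for t in times) / len(times)

# The sample mean estimates the long-run average of repeated attempts.
print(p_sub10, round(mean(times), 2))  # 0.5 10.0
```

With only ten solves this estimate is very noisy, which is one motivation for the model-based approach described next.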

Unfortunately, outside of particular, usually rather synthetic, cases (such as rolling a fair die), it is impossible to know the distribution of the random variable of interest precisely. As such, much of statistics in practice revolves around ways of estimating the distribution, or at least properties of it, from known data. Probably the most common approach is to assume that the distribution belongs to a distribution family, where each distribution can be defined by a few numbers (parameters), often just one or two. A distribution family can be selected by several methods — from first principles, manual inspection of the available data, or certain more rigorous statistical methods — but the details are for another blog post. Using this approach, the problem of determining a distribution for a speedcuber's next result breaks down into a choice of distribution family (which will likely be a single choice for all cubers, at least within one event) and a choice of parameters for that distribution (which will be specific to that cuber). From there, different evaluation methods are available to tell us how good our choices were.
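As a concrete illustration of the family-plus-parameters approach, here is a sketch that fits a lognormal family by maximum likelihood: for a lognormal, the MLE is just the mean and standard deviation of the log-times. The choice of lognormal here is an assumption for illustration, not a claim about which family actually fits solve times best.

```python
import math
from statistics import mean, pstdev

def fit_lognormal(times):
    """MLE for a lognormal family: mu and sigma of the log-times."""
    logs = [math.log(t) for t in times]
    return mean(logs), pstdev(logs)

def lognormal_cdf(x, mu, sigma):
    """P(T < x) under the fitted lognormal distribution."""
    return 0.5 * (1 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))

# Toy data: estimate this cuber's chance of a sub-10 solve under the model.
times = [9.41, 10.02, 8.87, 11.30, 9.95, 10.64, 9.12, 10.88, 9.70, 10.15]
mu, sigma = fit_lognormal(times)
print(round(lognormal_cdf(10.0, mu, sigma), 3))
```

Unlike the raw empirical fraction, the fitted model also assigns sensible probabilities to times the cuber has never recorded, such as a sub-8 solve.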

If we have good estimates of each cuber's distribution of solve times, we can do some simple analyses, like comparing each cuber's estimated average solve time or finding the cuber most likely to break the WR. We can also go a bit further and simulate a whole competition by simulating each cuber's solves over the competition, advancing the competitors through the rounds, and seeing who comes out on top in the end. By repeating this process many times (Monte Carlo simulation), we can see who's most likely to win or podium in each event of the competition. There are almost certainly more interesting questions to ask that I haven't thought of; let me know if you have any!
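A minimal Monte Carlo sketch of that idea, with made-up lognormal parameters for three hypothetical cubers (a real version would model whole multi-round competitions, DNFs, and so on):

```python
import random

# Hypothetical (mu, sigma) lognormal parameters for three cubers' times.
cubers = {"A": (2.20, 0.10), "B": (2.25, 0.08), "C": (2.30, 0.12)}

def simulate_ao5(mu, sigma, rng):
    """Simulate one average-of-five: drop best and worst of five solves."""
    solves = sorted(rng.lognormvariate(mu, sigma) for _ in range(5))
    return sum(solves[1:4]) / 3

def win_probabilities(cubers, trials=20_000, seed=42):
    """Estimate each cuber's chance of the best ao5 in a single round."""
    rng = random.Random(seed)
    wins = {name: 0 for name in cubers}
    for _ in range(trials):
        results = {name: simulate_ao5(mu, sigma, rng)
                   for name, (mu, sigma) in cubers.items()}
        wins[min(results, key=results.get)] += 1
    return {name: w / trials for name, w in wins.items()}

print(win_probabilities(cubers))
```

Even in this toy setup, the fastest cuber on paper ("A", whose median time is about exp(2.20) ≈ 9.0 s) wins well short of 100% of the time, which is exactly the kind of nuance a distributional model captures and a single ranking number hides.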

In every step of the process, it's helpful to have a large amount of data to work with. Fortunately, the WCA database provides exactly that: as mentioned earlier, there are over 250,000 competitors and over 25 million solves to analyze. The WCA provides this information in two formats, known as the "results export" and the "developer export". The results export is easier to work with, being in TSV format, and contains most key data; the developer export, however, has more information available. I've typically done my WCA data analysis using the results export, but for CubeStats, I knew I wanted as much information as possible to work with. The developer export turned out to be fairly simple to import: you just need a MariaDB server to pipe the export file to.

Once imported, though, it's not the easiest database to get your bearings in. There are over a hundred tables, about half of which are completely redacted and blank, and many of the remaining tables have certain columns redacted. There's good reason for these redactions: they would contain private personal information, internal operating values that the public wouldn't benefit from, or other data that's not otherwise public and has no particular reason to be. The tables and columns are probably left in so that the schema matches that of the real database, available on the WCA's GitHub. That's particularly important because, unlike the results export, which should change format only rarely, the developer export schema will change regularly as the internal database updates. For anyone looking to use the developer export, you can monitor schema changes via the schema_sha1 value in ar_internal_metadata. Finally, here's a short list of tables that aren't in the results export that I think look the most likely to be useful for various analyses:

  • concise_average_results and concise_single_results
  • registrations and registration_competition_events
  • rounds
  • users
  • user_roles and roles_metadata_councils, roles_metadata_delegate_regions, roles_metadata_officers, and roles_metadata_teams_committees
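For comparison, the results export is just TSV files readable with the standard library. The column names below are a small assumed subset of the export's Results table layout, shown here with an inline stand-in for the real file:

```python
import csv
import io

# Tiny stand-in for the export's Results TSV; the real file has more
# columns, and this subset of names is assumed for illustration.
sample = """personId\teventId\tbest\taverage
2024CHAP08\t333\t812\t945
2024CHAP08\t222\t301\t387
"""

# Times in the export are stored in centiseconds; negative values
# encode DNF/DNS results.
with io.StringIO(sample) as f:
    rows = list(csv.DictReader(f, delimiter="\t"))

best_333 = [int(r["best"]) / 100 for r in rows if r["eventId"] == "333"]
print(best_333)  # [8.12]
```

For a real analysis you would open the downloaded TSV file directly instead of the `io.StringIO` stand-in.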

With all that said, even with lots of data, there are many problems that impede a solid analysis. I'll address some of them when I introduce the model, but here are a few things to consider:

  • Results in 3x3x3 Fewest Moves are fundamentally different from times, and should potentially be modelled completely differently.
  • Results in 3x3x3 Multi-Blind are even more different, with three components to consider (the number of cubes attempted, the number of cubes solved, and the amount of time taken). Each of these components, even in isolation, presents unique challenges.
  • Competitors can DNF a solve, and DNFs are very common in some events. A good model should consider DNFs in some way.
  • Competitors may also record a DNS, for various reasons. DNSs are fairly rare in all events, but may still need to be considered.
  • Competitors may also receive time penalties, in increments of two seconds, which are added to their time. Receiving a penalty is not itself recorded (although it can be observed in a few special cases); only the final time, with penalty included, is. A model might need to consider the effect of penalties.
  • Times are recorded to the centisecond, but are, unusually, rounded down (truncated) to the nearest centisecond rather than rounded to the nearest value. This may cause an approximately 5-millisecond bias in some methods, although in other cases it can cancel out, depending on interpretation.
  • The skill of an individual competitor tends to improve over time, sometimes even drastically over a short window. Results from a year ago probably shouldn't influence our estimation of how good the competitor is now if they've clearly improved since then.
  • Relatedly, if it's been a while since a cuber has competed, we shouldn't necessarily expect them to come back in at the same skill level as they were last at. There are a lot of different ideas to consider here, from trying to model how cubers tend to shift over these gaps to simply reducing confidence after a longer time away.
  • Advancements in hardware and technique mean that older results and newer results aren't directly comparable, even for the same competitor, over a span of enough years.
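As an aside on the truncation point above, the roughly 5-millisecond bias is easy to see numerically. A quick sketch with synthetic times:

```python
import random

rng = random.Random(0)

# Synthetic "true" times with uniformly distributed sub-centisecond parts.
true_times = [10 + rng.random() for _ in range(100_000)]

# The timer truncates (rounds down) to the centisecond.
recorded = [int(t * 100) / 100 for t in true_times]

# Average gap between true and recorded time: about half a centisecond,
# i.e. roughly 5 ms, always in the direction of making times look faster.
bias = sum(t - r for t, r in zip(true_times, recorded)) / len(true_times)
print(round(bias * 1000, 2))  # bias in milliseconds, close to 5
```

Whether this matters depends on the question: comparisons between truncated times are largely unaffected, but estimates of true underlying solve times inherit the shift.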

Made by Ethan Chapman (2024CHAP08)