TrueSkill™ Ranking System: Details
TrueSkill™ Ranking System

The TrueSkill™ ranking system is a skill based ranking system for Xbox Live developed at Microsoft Research.

Overview

Most games have at their root a metric for judging whether the game’s goals have been met. In the case of matches involving two or more players (“multiplayer matches”), this often includes ways of ranking the skills of match participants. This encourages competition between players, both to “win” individual matches, and to have their overall skill level recognised and acknowledged in a broader community. Players may wish to evaluate their skills relative to people they know or relative to potential opponents they have never played, so they can arrange interesting matches. We term a match “uninteresting” if the chances of winning for the participating players are very unbalanced – very few people enjoy playing a match they cannot win or cannot lose. Conversely, matches which have a relatively even chance of any participant winning are deemed “interesting” matches.

Many ranking systems have been devised over the years to enable leagues to compare the relative skills of their members. A ranking system typically comprises three elements:

  • A module to track the skills of all players based on the game outcomes between players ("Update").
  • TA module to arrange interesting matches for its members (“Matchmaking”).
  • A module to recognise and potentially publish the skills of its members (“Leaderboards”).

In particular, the ELO ranking system has been used successfully by a variety of leagues organised around two-player games, such as world football league, the US Chess Federation or the World Chess Federation, and a variety of others. In video games many of these leagues have game modes with more than two players per match. ELO is not designed to work under these circumstances. In fact, no popular skill-based ranking system is available to support these games. Many one-off ranking systems have been built and are in use for these games, but none of them is general enough to be applied to such a great variety of games.

How to Represent Skills

The TrueSkill ranking system is a skill-based ranking system designed to overcome the limitations of existing ranking systems, and to ensure that interesting matches can be reliably arranged within a league. It uses a technique called Bayesian inference for ranking players.

Rather than assuming a single fixed skill for each player, the system characterises its belief using a bell-curve belief distribution (also referred to as Gaussian) which is uniquely described by its mean μ (speak [mju:]) (“peak point”) and standard deviation σ (speak [sigma])(“spread”). An exemplary belief is shown in the figure on the right. Note that the area under the skill belief distribution curve within a certain range corresponds to the belief that the player’s skill will lie in that range. For example, the green area in the figure on the right is the belief that the player's skill is within level 15 and 20. As the system learns more about a player’s skill, σ has the tendency to become smaller, more tightly bracketing that player’s skill. Another way of thinking about the μ and σ values is to consider them as the “average player skill belief” and the “uncertainty” associated with that assessment of their skill.

Since the TrueSkill ranking system uses a Gaussian belief distribution to characterise a player’s skill, all mean skills (that is, μ's) will always lie within ± 4 times the initial σ (more precisely with probability 99.99%). Experimental data tracking roughly 650,000 players over 2.8 million games support this claim: Not a single μ ever happened to be outside the range ± 4 times the initial σ and 99.99% of the μ's happen to be even within ± 3 times the initial σ.

Interestingly, the TrueSkill ranking system can do all calculations using an initial uncertainty of 1, because then μ and σ can be scaled to any other range by simply multiplying them. For example, suppose all calculations are done with an initial μ of 3 and σ of 1. If one wishes to express player’s skill as one of 50 “levels”, multiply μ and σ by 50/6 = 8.3 because almost all μ's happen to be within ± 3 times the initial σ.

The intuition is that the greater the difference between two player’s μ values – assuming their σ value are similar – the greater the chance of the player with the higher μ value performing better in a game. This principle holds true in the TrueSkill ranking system. But, this does not mean that the players with the larger μ's are always expected to win, but rather that their chance of winning is higher than that of the players with the smaller μ's. The TrueSkill ranking system assumes that the performance in a single match is varying around the skill of the player, and that the game outcome (relative ranking of all players participating in a game) is determined by their performance. Thus, the skill of a player in the TrueSkill ranking system can be thought of as the average performance of the player over a large number of games. The variation of the performance around the skill is, in principle, a configurable parameter of the TrueSkill ranking system.

How to Update Skills

The TrueSkill ranking system will base its update of μ and σ on the game outcome (relative ranking of all teams) only; it merely assumes that the outcome is due to some unobserved performance that varies around the skill of a player. If one is playing a point based game and the winner beats all the other players by a factor of ten, that player’s victory will be scored no differently than if they had only won by a single point. Every match provides the system with more information about each player’s skill belief, usually driving σ down.

Before starting to determine the new skill beliefs of all participating players for a new game outcome, the TrueSkill ranking system assumes that the skill of each player may have changed slightly between the current and the last game played by each player. The mathematical consequence of making such an assumption is that the skill uncertainty σ will be slightly increased, the amount of which is, in principle, a configurable parameter of the TrueSkill ranking system. It is this parameter that both allows the TrueSkill system to track skill improvements of gamers over time and ensures that the skill uncertainty σ never decreases to zero ("maintaining momentum").

In order to determine the new skill beliefs of all the participating players for a new game outcome, the TrueSkill ranking system needs to determine the probability of the observed game outcome for given skills of the participating players and weight it by the probability of the corresponding skill beliefs. This is done by averaging over all possible performances (weighted by their probabilities) and deriving the game outcome from the performances: The player with the highest performance is the winner; the player with the second highest performance is the first runner up, and so on. If two players' performances are very close together, then the TrueSkill ranking system considers the outcome between these two players a draw. The larger the margin which defines a draw in a given league, the more likely a draw is to occur, according to the TrueSkill ranking system. The size of this margin is a configurable parameter of the TrueSkill ranking system and is adjusted based on the game mode. For example, a street race in Project Gotham Racing 3 can never end in a draw (thus the parameter is set to zero) whereas a Capture-the-Flag game in Perfect Dark Zero can easily end in a draw.

By virtue of the above weighting technique (which is also known as Bayes' Law), the system arrives at a new skill belief for every player participating in the game. These skill beliefs are not Gaussian anymore. Hence, the TrueSkill ranking system determines the best Gaussian approximation. As a result, given players' μ values increase for each opponent they out-performed, and decreases for each opponent they lost against. The following table gives before and after values for μ and σ for each (hypothetical) participant in an 8-player match.

Name  Outcome  Pre-Game μ  Pre-Game σ  Post-Game μ  Post-Game σ 
 Alice 1st  25  8.3  36.771  5.749 
Bob  2nd  25  8.3  32.242  5.133 
Chris  3rd  25  8.3  29.074  4.943 
Darren  4th  25  8.3  26.322  4.874 
Eve  5th  25  8.3  23.678  4.874 
Fabien  6th   25 8.3  20.926  4.943 
George  7th  25  8.3  17.758  5.133 
Hillary  8th  25  8.3  13.229  5.749 

One can see that σ values – the uncertainty in the skill for each player – is lower after the match, substantially more so for the players on the 4th and 5th rank (Darren and Eve) Those two players have the property that they are "bracketed" by a maximal number of players in terms of defeat: They were defeated by 3 (or 4) players and they defeated 4 (or 3) other players. In contrast, the first player (Alice) is simply known to be better than the 7 other players which does not constraint her skill from above: She may be even better than level 36.771. This is reflected in the larger uncertainty of 5.749.

The simplest case for an TrueSkill ranking system update is a two-person match. Suppose we have players A(lice) and B(ob), with μ and σ values (μA,σA) and (μB,σB), respectively. Once the game has finished, the update algorithm determines the winner (Alice or Bob) and loser (Bob or Alice) and applies the following update equations (we disregard the possibility of a draw for the sake of simplicity here):

In these equations, the only unknown is β2 which is the variance of the performance around the skill of each player. Moreover, ε is the aforementioned draw margin which depends on the game mode. But what do the functions v(.,.) and w(.,.) look like? Instead of giving the exact definitions, let us have a look at plots of these functions for varying values of ε/c:

There are a few observations about these update equations:

    • Similarly to the ELO system, in the mean skill update equation the winner gets a multiple of v((μwinner-μloser)/c,ε/c) added to the mean skill and the loser gets a multiple of v((μwinner-μloser)/c,ε/c) subtracted from the mean skill. However, in contrast to ELO the weighting factors are roughly proportional to the uncertainty of the winner/loser vs. the total sum of uncertainties (2β2 is the uncertainty due to the performance variation around the skill and σ2winner+σ2loser is the uncertainty about their true skills). Hence, only if Alice and Bob have the same uncertainty, the TrueSkill ranking system's mean skill update equation reduces to an ELO update equation. Note that the TrueSkill ranking system's update equation for the mean skill is thus not guaranteed to be zero sum.
    • The uncertainty of both players (regardless of win/loss/draw) is going to decrease by the factor 1-σ2player/c2 * w((μwinner-μloser)/c,ε/c). Again, the player with the larger uncertainty gets the bigger decrease.
    • The change in the mean skill, v((μwinner-μloser)/c,ε/c), and the decrease factor in the uncertainty, 1-σ2player/c2 * w((μwinner-μloser)/c,ε/c), are close to zero if the game outcome was not surprising.

      Win/Loss
      If the winner had the much bigger mean skill relative to the total uncertainty (thus (μwinner-μloser) > ε) then a win cannot buy the winner extra mean skill points or remove any uncertainty. The opposite is true if the game outcome was surprising: If the winner had the smaller mean skill (μwinner-μloser) < ε) , mean points proportional to μloser-μwinner get added/subtracted to/from the winner/loser.

      Draw
      If both player had similar mean skills upfront (thus |μwinner-μloser| < ε) then both player are already close enough together and no mean skill point update needs to be made; hence the uncertainty is not reduced. However, if one player was thought to be much stronger by the TrueSkill ranking system before the game (let's say,μwinner-μloser > ε) then his mean skill will be decreased and the other player's mean skill will be increased which, in effect, brings their two mean skill closer together.

The mean skill update equations of the TrueSkill ranking system are similar to the update equations of the ELO algorithm. The key difference is that a variable K factor is used for both players mainly depending on the ratio of the uncertainties of the two players. Hence, playing against a very certain player in the TrueSkill ranking system allows the uncertain player to move up or down in larger steps than in the case when playing against another uncertain player.

But how does the TrueSkill ranking system incorporate the game outcome of a team match? In this case, the team's skill is assumed to be the sum of the skills of the players. The algorithm determines the sum of the skills of the two teams and uses the above two equations where (μwinner,σ2winner) and (μwinner,σ2loser) are the mean skills and skill variances of the winning and losing team, respectively.

The update equations for more than two teams are not possible to write down as they require numerical integration (the above plots have been obtained by using the same numerical integration code). In this case the TrueSkill ranking system iterates two team update equations between all teams on neighbouring ranks, that is, the 1st versus the 2nd team, the 2nd team versus the 3rd team and so on. If you want to learn more about this variant of the TrueSkill ranking algorithm, please follow this link to our interactive ranking calculator.

How to Match Players

Matchmaking is an important service provided by gaming leagues. It allows participants to find team-mates and opponents who are reasonably close to their own skill level. As a consequence, it is likely that the match will be interesting, as all participants have roughly the same chances of winning.

TrueSkill ranking system's skill beliefs are based upon probabilistic outcome models and thus enable players to be compared for relative chance of drawing. The more even the skills of match participants, the more likely it is that this configuration of players will end up in a draw, and the more interesting and fun the match will be for every participant. For example, for two players A(lice) and B(ob) with skill beliefs (μA,σA) and (μB,σB), the (re-scaled) chance of drawing is given by:

This number is always between 0 and 1 where 0 indicates the worst possible match and 1 the best possible match. Even if two players have identical μ values, uncertainty σ affects the quality of the match; if either of the σ values σA or σB is large, then the match quality criterion is significantly smaller than 1!

How to Build a Leaderboard

Using the two parameters μ and σ which characterise a belief in a player’s skill the TrueSkill ranking system ranks players using the so-called conservative skill estimate = μ - k*σ. This estimate is called conservative because it is a conservative approximation of the player’s skill: it is extremely likely the players actual skill is higher than the conservative estimate. The bigger the value of k the more conservative the estimate; a common value of k is 3.

How to Proceed From Here

If you still want to know more about the TrueSkill ranking system, you can go and check out: