Plain English Introduction to Dynamic Grading
by Louis Nel
"Introduction to Dynamic Grading"1 has until now been the only available description of this grading system. Its purposes included giving a complete definition, so that readers could judge the system or program a computer to implement it. That comprehensiveness makes it less than ideal for readers who merely want to interpret its output. The present article is written with the latter kind of reader in mind. Accordingly, it avoids formulas and technical details (available in the mentioned article) and focuses on the underlying ideas.
I’ve been asked many questions about Dynamic Grading and they all ultimately involve win probability. So I want to begin by illustrating this central concept. Imagine a bag containing 70 white beads and 30 black beads, thoroughly shuffled. It should be intuitively clear that if a bead is randomly drawn it will more likely be white than black. In fact the chances are 70 out of 100 (100 = 70 + 30). This illustrates probability: the probability of drawing a white bead is 70/100 or 0.7 or 70%. A probability is a number between 0 and 1, so it can be expressed as a percentage or as a fraction.
The same bag of beads also illustrates win probability. Suppose player A wins if a white bead is drawn and B wins if a black one is drawn. So the statement "A has win probability 70% over B" can be visualised as the situation represented when 70 of the beads are white and the remaining 30 are black. We are tacitly assuming that in a game played between players X and Y there is an underlying probability that X will beat Y. This underlying probability exists but is never known precisely. The main task of a grading system is to estimate it as accurately as possible.
A ranking system is not necessarily a grading system. A grading system is a ranking system such that a grade difference gd(X,Y) = G(X) – G(Y) between the ranking statistics (grades) of two players X and Y determines wp(X,Y), the win probability of X over Y, via a known formula. All grading systems discussed here use the same grading scale and the same gd(X,Y) to wp(X,Y) correspondence. It is illustrated by the table opposite.
This correspondence is the heart and soul of a grading system. It will come up again and again in the discussion to follow. It should be clear from the table that the correspondence is most sensitive near a grade difference of 0. For example, when the grade difference increases by 50 points from 40 to 90, the win probability rises by about 5 percentage points, from 54.6% to 60.2%; but the grade difference has to increase from 480 to 640, i.e. by 160 points, for the win probability to rise from 90.1% to 95.0% (also an increase of about 5 percentage points).
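The exact correspondence formula is defined in the technical article. The table values quoted above are consistent with a logistic curve on a 500-point scale, so a reader who wants to reproduce them could use the following sketch (a reconstruction from the quoted values, not the official definition):

```python
def win_probability(grade_diff):
    """Win probability of the higher-graded player, as a fraction.

    A logistic curve on a 500-point scale; a reconstruction that
    reproduces the table values quoted in the text, e.g. a grade
    difference of 90 gives about 60.2%.
    """
    return 1.0 / (1.0 + 10.0 ** (-grade_diff / 500.0))

# grade difference  40 -> ~0.546    90 -> ~0.602
# grade difference 480 -> ~0.901   640 -> ~0.950
```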
The only way in which the grade of a player is ever used is by comparison with the grades of other players. So grade differences matter; the value of a grade as such does not. If we were to add 1000 to the grade of every player in the system, that would disturb neither the ranking order nor any grade difference. Indeed 2190 - 2100 = 3190 - 3100 = 90. So the grading scale is to some extent arbitrary. It is chosen once and for all. Once a grading scale is chosen, starting grades are assigned so that grade differences involving a new player reflect win probabilities over existing known players as accurately as possible.
It follows from the foregoing that grades are estimates of relative performance levels. If G(A) = 2100 and G(B) = 2000 then the grade difference G(A) – G(B) = 2100 – 2000 = 100. This indicates (via the above table) that the win probability of A over B is estimated to be slightly above 60%. Since this is above 50%, it means that A has a higher performance level than B.
If all players always performed at a constant level, then accurate starting grades would be enough to maintain a grading system. However, performance levels change all the time. So to maintain a grading system as effectively as possible grades need to be adjusted so that grade differences remain, as closely as possible, in correspondence with win probabilities via the known correspondence formula. The preceding sentence states the one and only purpose of grade adjustments.
Grade adjustments differ from one grading system to the next. Some systems do them game by game, others event by event. In fact, a grading system is defined by how its grade adjustments are done.
The grading systems now to be discussed are destined to provide the raw material out of which Dynamic Grading (DG) is built. So any understanding of how they work is relevant to an understanding of DG.
We call a grading system simple when there is a specified positive number M, the modulator of the system, such that grade adjustments amount to addition or subtraction of the adjustment quantity M * LWP after every game, where LWP denotes the Loser’s Win Probability. The adjustment quantity is added to the winner’s grade and subtracted from the loser’s grade. The modulator is the same for all players in all events. Since LWP can be obtained by applying the correspondence formula to the difference (Loser’s Grade) – (Winner’s Grade) before the game, the adjustment quantity can be calculated. We denote the simple grading system with modulator M by IdxM and denote the grade of player X by Idx(X), with M understood.
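Assuming a logistic correspondence on a 500-point scale (consistent with the table values quoted in the text, though the official formula is in the technical article), one post-game adjustment in IdxM might be coded as follows; the function names are illustrative:

```python
def lwp(winner_grade, loser_grade):
    # Loser's Win Probability: the correspondence formula applied to
    # (Loser's Grade) - (Winner's Grade) before the game.
    return 1.0 / (1.0 + 10.0 ** (-(loser_grade - winner_grade) / 500.0))

def adjust(winner_grade, loser_grade, M=20):
    # One adjustment in the simple system IdxM: the quantity M * LWP
    # is added to the winner's grade and subtracted from the loser's.
    q = M * lwp(winner_grade, loser_grade)
    return winner_grade + q, loser_grade - q
```

For example, if a 2100 player beats a 2000 player under Idx20, the loser's win probability is about 0.387, so both grades move by roughly 7.7 points.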
The system Idx20 is maintained along with DG on the Butedock website. Readers who would like to see numerical illustrations of its grade adjustments will find a plentiful supply on that website.
Simple grading systems have a pleasant self-correcting feature. Suppose Idx(X) is larger than it ought to be. Then, when X loses a game, the difference (Loser’s Grade) – (Winner’s Grade) = Idx(X) – (Winner’s Grade) is larger than it ought to be, so the adjustment M * LWP subtracted from Idx(X) is larger than it would have been had the grade been accurate. When X wins a game, the difference (Loser’s Grade) – (Winner’s Grade) = (Loser’s Grade) – Idx(X) is smaller than it ought to be, so the adjustment M * LWP added to Idx(X) is smaller than it would have been had the grade been accurate. This succession of reduced additions and augmented subtractions continues until Idx(X) is no longer larger than it ought to be. Similar reasoning applies when Idx(X) is smaller than it ought to be.
This self-correcting feature may give the impression that the system will converge to a state of accurate grades. That is not so. In fact, a non-convergence feature is present: grades do not necessarily keep becoming more accurate as more games are played. Indeed, if Idx(X) is higher than it ought to be and X happens to win his next game (something no grading system can control), then after that game Idx(X) will be even more inaccurate than before, because the adjustment quantity is added. And if X proceeds to win several games in a row in that state, its value could stray well above what it ought to be. Similarly, it could stray well below. Of course, X will not keep winning, and the self-correcting feature will be at work, so eventually the grade will return to a reasonably accurate value. Thus grade accuracy has a natural tendency to fluctuate randomly. When the modulator M is large, the fluctuations are more pronounced than when M is small.
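The fluctuation can be seen in a toy simulation. Everything here is an assumption for illustration only: a logistic 500-point correspondence, an opponent pool with accurate grades spread around 2000, and a player whose true level never changes. Even with an accurate starting grade, the Idx grade wanders:

```python
import random

def wp(gd):
    # Assumed logistic correspondence on a 500-point scale.
    return 1.0 / (1.0 + 10.0 ** (-gd / 500.0))

def simulate(true_grade=2000.0, start_grade=2000.0, M=20, games=2000, seed=1):
    """Track one player's Idx grade over a run of games against an
    opponent pool whose grades are assumed accurate. The player's
    true performance level is held constant, yet the grade drifts
    randomly around it rather than converging."""
    rng = random.Random(seed)
    g = start_grade
    for _ in range(games):
        opp = rng.gauss(2000.0, 150.0)   # assumed opponent spread
        p_win = wp(true_grade - opp)     # true (unknown) win probability
        if rng.random() < p_win:
            g += M * wp(opp - g)         # winner gains M * LWP
        else:
            g -= M * wp(g - opp)         # loser drops M * LWP
    return g
```

Re-running with a larger M tends to produce visibly wider excursions, in line with the remark above.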
In view of the non-convergence feature it should be clear that grades are at best approximations to the current performance level of the player. It follows that the ranking order implied by small grade differences, in particular, is shrouded in uncertainty. It is impossible to give a threshold beyond which grade differences become significant. The implied ranking order gradually becomes more certain as the grade differences become larger. There are situations in which grade accuracy can be expected to be more certain; we return to these at the end of this article.
Being confronted with a whole family of simple grading systems IdxM (one for each choice of modulator M) the question naturally arises: which is preferable? For example, is Idx20 preferable to Idx50? In search of an answer one might look at ranking lists produced by Idx20 and Idx50 and, based on a presumed knowledge of the players, judge which gives the more appropriate ranking. Which of us can presume to know all players that well?
There is a more effective, more objective method: a reality check. When the system Idx20 estimates the grade difference between A and B to be 90 points, it is implicitly saying that the win probability of A over B is slightly more than 60%. Does this agree with direct observation of real game results? We could check by looking at all games over the last 10 years (say) in which the higher-rated player had a win probability between 60% and 61%, and counting the number of such games that the higher-rated player actually won. This gives us the number of correct predictions by Idx20 for that batch of test games. Is the fraction of observed wins close to 60%? We could then repeat the same procedure for the system Idx50. This gives a direct and objective way of telling which of the two systems has the more credible grade differences of about 90 points. But what if one system is better at grade differences around 90 points but worse at 50 or 200 or 400? To address this we could proceed more systematically: subdivide the probability interval 50% to 100% into a large number of small subintervals, e.g. 50 to 51, 51 to 52, etc., and perform the above reality check for each subinterval just as we did for the subinterval 60 to 61. Then an averaging process can distill a single number out of the resulting list. This gives an objective comparison of Idx20 with Idx50, not subject to the human fallibility and prejudices that may attend a comparison of ranking lists.
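The reality check for one probability subinterval can be sketched as follows, using hypothetical (predicted win probability, result) records; the real test would draw these from a decade of actual game data:

```python
def reality_check(games, lo=0.60, hi=0.61):
    """Calibration check for one probability subinterval.

    `games` is a list of (predicted_wp, higher_rated_won) pairs, where
    predicted_wp is the system's pre-game win probability for the
    higher-rated player. Returns (observed win rate, game count) for
    the predictions falling in [lo, hi); the observed rate should be
    close to the midpoint of the interval if the system is accurate.
    """
    batch = [won for wp, won in games if lo <= wp < hi]
    if not batch:
        return None, 0
    return sum(batch) / len(batch), len(batch)

# Hypothetical records: three predictions land in the 60-61% bin,
# of which two came true, so the observed rate for that bin is 2/3.
records = [(0.605, True), (0.602, False), (0.608, True), (0.55, True)]
rate, n = reality_check(records)
```

Looping this over all the small subintervals from 50 to 100 and averaging the deviations gives the single summary number described above.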
The preceding paragraph roughly outlined the chi-squared test long used in statistics. In practice we have used a transform of the chi-squared statistic to get an equivalent Grade Deviation statistic (GDev) on a more convenient scale. General users of DG need not be familiar with the precise definition of GDev. However, it should be known that the derivation of DG from the family of simple grading systems is based on objective testing to see which choices yield the smallest GDev. The choices are never based on somebody’s gut feeling.
GDev measures how much the observed grade differences deviate (over the test games used) from what they ought to be. A perfect system would give GDev = 0.00, which is unattainable in practice. The table to follow reports the GDev statistic for a selection of systems. The first five are simple grading systems; among them Idx24 gives the smallest (the best) GDev. The system IdxCF50 is the system obtained by incorporating Class Factors, i.e. for prestigious events the modulator is increased to 60 and for Class 3 events (like plate games) it is reduced to 40. CGS is the previous official system for Association Croquet. It incorporates Class Factors as well as a smoothing of the grade. These additional measures in the grade adjustment process clearly do not serve the central purpose of grade adjustments stated above. As the GDev statistic shows, they undermine the correspondence rather than uphold it.
The non-convergence feature suggests that small modulators are preferable because they would not blow grades off course as much as larger ones. On the other hand the self-correcting feature calls for a modulator large enough to enable quick correction when necessary. The latter consideration is of particular interest for handling rapid improvers. Indeed, if a player’s grade is 300 points too low (which frequently happens in case of a rapid improver) and the modulator is small (like M = 10) it would take unduly long for the player’s grade to catch up. This situation is one of the problems that faces any grading system.
The above table indicated Idx24 as the best general choice among simple grading systems because it yielded the smallest GDev. Its modulator M = 24 is presumably larger than it needs to be for slow changes in performance level while smaller than it needs to be for rapid improvers. This situation suggests creation of a grading system that uses a variable modulator: one that is small when dealing with a constant performance level and large when dealing with a rapidly changing one. That is the underlying idea of Dynamic Grading.
It is all very well to say the system should use a large modulator when dealing with a rapid improver, but how is the system to know that it is dealing with one? The Performance Deviation Trend (pdt) statistic is introduced for this purpose. It expresses the extent to which the player, over the course of the preceding 37 games, performed better or worse than expected. When pdt > 0 the performance is better, when pdt < 0 it is worse.
Determination of pdt is relatively complicated. It is based on the difference between Observed Wins and Expected Wins over the preceding 37 games (e.g. when a player has win probability of 0.35 in a game it is counted as 0.35 Expected Wins). The general reader need not be concerned with the technical details. A rapid improver becomes recognisable as a player with a large positive pdt. The fact that the preceding 37 games are used (rather than just the preceding 9 or the preceding 90) is not coincidental: it is experimentally determined to give the best results, i.e. to yield the smallest GDev statistic for the resulting Dynamic Grading system.
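As a simplified sketch of the observed-minus-expected ingredient only (the published pdt is more involved and is expressed on a points scale):

```python
def observed_minus_expected(recent_games):
    """Core ingredient of the pdt statistic (simplified sketch; the
    published definition is more involved and scales the result onto
    a points scale).

    `recent_games` is a list of (win_probability, won) pairs for the
    player's games, most recent last; the article uses the last 37.
    A win counts as 1 Observed Win; a game with win probability 0.35
    contributes 0.35 Expected Wins.
    """
    recent = recent_games[-37:]
    observed = sum(1 for wp, won in recent if won)
    expected = sum(wp for wp, won in recent)
    return observed - expected
```

A rapid improver keeps winning games the grades said he should lose, so Observed Wins run well ahead of Expected Wins and the difference is large and positive.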
With the pdt-statistic in place the variable modulator M_X of player X is determined after each game as an expression in terms of the pdt, which is also adjusted after each game. This personalised modulator M_X increases continuously (even smoothly) from a prescribed minimum of 16 (attained when pdt = 0) and it approaches a maximum of 35.2 when pdt becomes large positive or large negative. The parameters 16 and 35.2 are also experimentally determined (to yield the smallest GDev). They depend on the nature of the sport and the tournament culture e.g. the average disparity between players encountered in tournaments. (One should expect the corresponding parameters for Golf Croquet to be different). Dynamic Grade adjustments are ultimately rather similar to those of simple grading systems: the adjustment quantity M * LWP merely becomes replaced by M_W * LWP for the winner W and M_L * LWP for the loser L.
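The precise formula for M_X is in the technical article. The following sketch merely mirrors the stated properties (smooth, minimum of 16 at pdt = 0, approaching 35.2 for large positive or negative pdt); the functional form and the `scale` parameter are assumptions, not the published curve:

```python
import math

def modulator(pdt, m_min=16.0, m_max=35.2, scale=100.0):
    """Illustrative personalised modulator M_X as a function of pdt.

    Assumed shape only: a smooth bell-style blend that equals m_min
    at pdt = 0 and approaches m_max as |pdt| grows, matching the
    qualitative description in the text. The real DG curve and the
    `scale` constant may differ.
    """
    return m_max - (m_max - m_min) * math.exp(-(pdt / scale) ** 2)
```

The adjustment then mirrors the simple systems: the winner W gains M_W * LWP and the loser L drops M_L * LWP, with each M computed from that player's own pdt.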
As pointed out in the discussion of rapid improvers, grade adjustments in a simple grading system will often be too large or too small, not least because a single modulator is chosen for all players over all games. That the variable personalised modulators of DG give more appropriate adjustments can be seen from its GDev = 0.90, compared with the 0.99 attained by Idx24 and the 2.64 attained by CGS. Let us hope that future refinements of the post-game adjustment procedure will lead to a DG with even better grade difference accuracy.
It was noted above that grades (of any system) are at best approximations of what they purport to represent. There are circumstances in which the user could be more confident or less confident about the accuracy of a grade.
Since grades are based on game results, long periods of inactivity should be seen as a cause of uncertainty about grade accuracy. In Bayesian Grading (a forerunner of Dynamic Grading in that it also, implicitly, used personalised adjustment quantities) periods of inactivity were used as the basis for an increase in these personalised adjustment quantities. Since this was the only basis for such increases, it was not as effective as DG, but it was already an improvement over CGS. In Dynamic Grading the pdt statistic is a useful pointer. For a reasonably active player a pdt near zero (within about 50 points of zero, above or below) is indicative of a reasonably reliable grade. A pdt far from zero is indicative of uncertainty.
All told, on any given rank list a few players will inevitably be incorrectly ranked. It is impossible to know who they are. That is a problem we just have to put up with. Administrators are generally ready to point out players deemed incorrectly ranked, often in disagreement with one another. That is another problem we just have to put up with.
All rights reserved © 2012-2017