Bayesian Ranking for Croquet
by Louis Nel, 13 October 2006
The purpose of this article is to introduce a new ranking system, codenamed BR, and to explain how it works. This is followed by a discussion of its suitability for croquet world ranking, in comparison with that of the currently used Computer Grading System (CGS). (The system BR has been submitted to the WCF Ranking Review Committee for consideration).
Selection of a ranking system is not easy. It is somewhat like the selection process that leads to marriage. It is practically impossible to give all potential candidates a try, but when you become aware of attractive attributes of a particular candidate, you are motivated to find out more about that one ... So what are the attractive attributes of BR?
Any ranking system needs to use simplifying assumptions. An underlying assumption at work in post-game adjustment of most systems is that the player is performing at the level represented by his Index, or whatever the rating statistic is called. In BR, performance is rated in terms of a Bell Curve rather than a number. Such a curve is characterized by two parameters, its Mean and its Standard Deviation (SD). So a player's data on the ranking list will include numbers like 2347 (say) for his Mean and 61 (say) for his SD. The corresponding underlying assumption applicable to such a player is that, with 68% certainty, he is performing within 61 points of 2347. Thus the SD provides a numerical measure of uncertainty about the player’s performance level. This is more elaborate than the time-honored typical rating, but it is also more realistic. Indeed, in the real world nobody can be certain to perform at a given precise level. Another underlying assumption seen in many systems is that rating data are not adjusted after long absences from tournaments. In BR, the SD is adjusted after a period of absence according to the number of inactive days. It is more realistic to recognize the increased uncertainty that results from inactivity. (Many readers may have an initial skepticism about this. One tends to think about a particular player who showed no signs of rustiness after several months. One does not think about the numerous unseen players. The numerical evidence of improved rating accuracy that results when the temporal adjustments are included ought to convince the doubters.)
Post-game adjustment of ranking data is a crucial element of any ranking system. It is not uncommon to find this procedure defined in terms of a user-determined parameter (“Step-size” or “Class Factor”) that regulates the size of the adjustment according to a player-invariant formula. Where present, it has substantial influence on how the system functions and (not surprisingly) the chosen parameters are sometimes controversial. In BR there is no such user-determined parameter. The adjustment algorithm is based on that celebrated statistical procedure known as Bayes’ Rule. Adjustment size is automatically determined by the Bell Curve of the player, so it varies with the player. Disparity between the opponents (calculated more elaborately in BR) implicitly has an important role, but so does the SD. The effect is to make relatively small adjustments in case of established players (who typically have small SD) and relatively large adjustments in case of new players (who start off with large SD).
What is the scope of ranking? The performance of a player depends on many things – talent, shooting skills, tactical skills, psychological skills, physical state, mental state and so on. Ranking—as considered in this article—does not evaluate any of these important things directly. It is concerned only with their combined effect as reflected in game results. Ranking evaluates nothing in absolute terms, only how each player performs relative to the rest of the population.
Different rankings for different kinds of performance can be derived from the same set of game results. Whatever the kind, players are ranked on the basis of their rating – a real number maintained by the system, for each player, to represent performance level. The rating algorithm determines what kind of performance is being measured. In this article we focus firstly on a rating for Recent Performance. It is the natural starting point. It is supplemented by a rating for Performance over a Period.
Ranking lists are mainly used by three groups: by the General Public to monitor competitive standings of players, by Event Organizers to place players into classes or blocks or seed them into a knock-out ladder, and by Team Selectors as a source of information about candidates for selection, sometimes for an event several months into the future. These different uses make different demands. So users ought to have a clear idea of what each rating is attempting to measure. This will enable them to know better how to use (and also how not to use) the published ranking lists.
Accordingly, the General Public may find Period Performance ranking lists informative as an ongoing historical assessment. This ranking effectively turns the games played during the assessment period (12 months, say) into a virtual world championship. Of course, the General Public will also be looking at Recent Performance ranking for an indication of how things are currently shaping up. Event Organizers will naturally be interested in Recent Performance. And what will Team Selectors look at? If only there were a rating focused specifically on expected player performance a few months from now … However, such a rating is unheard of. The Recent Performance rating is possibly the closest to what is needed, but it is an approximation at best.
While most subsections are written with the general croquet player in mind, those marked “(Computational Details)” are meant for readers who happen to know a little mathematics. Others can blissfully ignore these subsections without risk of losing the thread.
In BR, Recent Performance of a player becomes rated in terms of a Bell Curve, called the Rating Curve of the player. ("Bell Curve" is the popular name for the statistical concept Normal Probability Density Function.) Here, for example, are plots of the Rating Curves of two real players, Ada and Bob (not real names) immediately before a game they played.
The abscissas marked on the x-axis, i.e. the Performance Axis (horizontal reference line at the bottom on which performance level is represented) are 10 points apart on the familiar croquet performance scale, starting with the value 2110 on the left (value not shown). The height of the Rating Curve at level x relates to the probability that the player will perform approximately at level x.
Let us express this more precisely. The Performance Interval determined by two points on the Performance Axis is the set of all values that lie between these two points. For example, the Performance Interval [2358, 2397] determined by the end points 2358 and 2397 is the set of all values x not below 2358 and not above 2397. We call attention to Performance Intervals in order to express the following remarkable geometric property of Rating Curves. The area below a Rating Curve and vertically above the Performance Interval represents the probability that the player's performance will be somewhere in that interval (orange area below). For example, if the area enclosed above the performance interval [2358, 2397] is 0.75 in size, then 0.75 is the probability that the performance level will lie between the points 2358 and 2397.
In view of the symmetry, the point on the Performance Axis below the highest point happens to be a point such that the area to its left is equal to the area to its right. So it represents a point such that the performance is as likely to be above it as below it. That point is called the Mean of the Rating Curve. (There is a formula for its calculation applicable also to non-symmetric curves.) The Mean determines the horizontal position of the Rating Curve on the Performance Axis.
It can be seen that Ada's curve is lower at its Mean than Bob's and generally flatter. The Standard Deviation (SD) of a Rating Curve gives a numerical indication of its flatness. Namely, the area below the Rating Curve and above the Performance Interval [Mean – SD, Mean + SD] is approximately 0.68 (0.6827 to four decimals). In other words, a player will, with probability 0.68, perform at a level that lies within SD points of his Mean. One does not even need a calculator to apply this useful rule. For example, a player with Mean = 2350 and SD = 55 has 68% probability of performing at a level that lies within 55 points of 2350. Likewise a player with Mean = 2415 and SD = 70 will have 68% probability of performing within the interval [2415 – 70, 2415 + 70] = [2345, 2485]. The probability in this rule always stays at 68%; only the width of the relevant Performance Interval varies.
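The 68% rule and the area interpretation can be checked numerically. The following is a minimal sketch, assuming only that a Rating Curve is the Normal density with the given Mean and SD; the function names are illustrative and not part of BR.

```python
import math

def normal_cdf(x, mean, sd):
    """Probability that a Normal performance level falls at or below x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def interval_probability(lo, hi, mean, sd):
    """Area under the Rating Curve above the Performance Interval [lo, hi]."""
    return normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)

def one_sd_interval(mean, sd):
    """The Performance Interval [Mean - SD, Mean + SD]."""
    return (mean - sd, mean + sd)

lo, hi = one_sd_interval(2415, 70)
print(lo, hi)                                             # 2345 2485
print(round(interval_probability(lo, hi, 2415, 70), 4))   # 0.6827
```

The same `interval_probability` function gives the probability for any other Performance Interval, not only the one-SD interval.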
According to the preceding facts, a player’s SD indicates uncertainty about his performance level. The smaller the SD, the smaller the uncertainty. This helps us to interpret ranking lists.
There is a formula for computation of SD. It is of interest to the programmer rather than the general reader.
A Rating Curve is completely specified by its Mean and Standard Deviation (it is explicitly known as function in terms of these two parameters). So in practice the Rating Curve of a player need never be drawn - all relevant information is encoded in the two numbers Mean and SD. Curves are drawn here only because a picture is worth a thousand words. The pair of numbers (Mean, SD) will be called the Rating Data of the player.
The Initial Mean and SD of a player are to be assigned by the Ranking Officer as an estimate in the light of available evidence. In all numerical trials reported here, the initial Mean was set equal to the initial Grade of the player and the Initial SD was set at 350. This value 350 is not a guess. It is an important choice which influences eventual efficiency. It was determined by optimization experiments.
For a first glimpse of how BR works, let us look at top 20 lists at end-April and early August 2006. These listings are based on all World Ranking games played since January 1985. For each new player the Ranking Officer assigned an initial Index (= initial Grade) for CGS use. That initial Index served as initial Mean and each initial SD was assigned the value 350. At the start of each event the SD of each participating player is given an automatic temporal update by BR – an update which depends on the number of days since the last event of that player (see 1.3 for more about that).
Ranking list end-Apr 2006
Let us say something about the SD – the built-in uncertainty indicator of BR. This is something new to croquet ranking. The initial SD of 350 generally drops rapidly at first, normally below 200 within the first 10 games. The rate of decrease then slows down and the SD eventually stabilizes around 60. It is automatically increased by the system after an absence (see subsection 1.3). So for the northern hemisphere players the relatively large SD increase after the dormant winter period will still show its effect in the first spring tournaments. The Last Event of 01.05.06 shown for several players above is really an event that started in late April.
Post-game updates always cause a drop in SD (which may be very small). It is noteworthy on the list to follow that Keith Aiton's SD of 53 is virtually the same as Robert Fulford's 54, despite the fact that Rob has played 2402 games in the system compared to the 1469 of Keith.
Ranking list 6 Aug 2006
In game-by-game updating, a larger SD gives a larger adjustment for the player (see the numerical illustrations in subsection 1.4). This has the effect that newcomers (starting as they do with the high SD=350) reach relatively stable data (smaller SD) fairly quickly. A smaller SD automatically gives smaller adjustments. So the SD effectively provides a variable personal step-size for adjustments.
BR makes an automatic upward adjustment to the Standard Deviation of a player after an absence from tournament play. The size of that adjustment increases with the number of days of absence in a nonlinear manner illustrated in the table to follow. It shows a few typical increments arising in practice. For example, it shows that 20 inactive days would bring to an SD = 60 an Increment of 2.5 and to an SD = 120 an Increment of 1.3.
Note that the rate of increase gets smaller as the number of inactive days increases. The rate of increase is largely regulated by a parameter Tau, whose current value of 75 was experimentally determined. It is the value that gives the optimum predictive efficiency (further discussed in subsection 1.6).
While readers in general need not be concerned with the mathematical details of the updating algorithm (given in 1.7), the effects produced are worthy of their attention. The adjustment sizes in post-game adjustments are influenced by the disparity between the opponents (as one would expect) but also influenced quite strongly by the SD of the player being updated.
The following illustrative examples (taken from real cases) of Rating Data of winner and loser before and after a game will give an idea of how it works. WMn = Winner's Mean, LMn = Loser's Mean, etc.
In both examples (a) and (b) the Disparity (indicated by the difference WMn – LMn) is very small, but the SDs of the four players are conspicuously different. Rule of thumb: the larger the SD the larger the size of the Adjustment.
In each of examples (c), (d) and (e) the winner has much the same SD, but faced opposition ranging from strong in (c) to very weak in (e). Unsurprisingly, the adjustment sizes dropped accordingly.
The size difference in adjustments of winner and loser indicates that BR is not a zero-sum system. Nevertheless, on the whole things balance out so that there is no apparent inflation or deflation. Actually, the average Mean of all players in the database does go down. This is understandable in the light of hundreds of beginners who stop playing after a mere 10 or so games in the system. These players lose most of their games at a time when they have a large SD, so their Mean decreases substantially without the opponent gaining all the points so lost. These players never get to a stage where they gain substantially from other opponents. The result is a net loss of points to the system which is unimportant because it does not relate to the active players. The average Mean of active players remains pretty constant.
A function f on the real line is called a probability density function if f(x) ≥ 0 for all x and
∫_R f(x) dx = 1.
Here and elsewhere, an integral with subscript R means a definite integral over the whole real line. For such functions f, the Mean and Standard Deviation (SD) are defined by the following integrals:
Mean = ∫_R x f(x) dx,
SD = sqrt( ∫_R (x - Mean)² f(x) dx ).
Computation of these values proceeds in general by numerical integration. These days this has become quite easy, through the availability of good programming software, effective methods and powerful personal computers.
A Probability Density Function f is called normal if it can be expressed in terms of two parameters μ and σ, with σ > 0, as follows:
f(x) = NPD(μ, σ, x) = exp(-(x - μ)² / (2σ²)) / (σ * sqrt(2π)).
Such functions are popularly known as “Bell Curves”.
In case of a normal f, its Mean and SD can be computed via differential calculus. Such computation brings to light that the Mean = μ and the SD = σ. It follows that a Bell Curve is uniquely determined by its Mean and SD and these two numbers are again uniquely determined by the Bell Curve. BR assigns a Bell Curve, i.e. the Rating Curve, to every player by just assigning a Mean and an SD.
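As an illustrative check (not part of BR itself), the sketch below evaluates the standard integral definitions of Mean and SD by a simple trapezoidal rule and recovers Mean = μ and SD = σ for a Bell Curve. The helper names, the integration range of ±10σ and the step count are assumptions made for this example. Dividing by the computed area is an extra safeguard against truncation error and is not part of the definitions.

```python
import math

def npd(mu, sigma, x):
    """Normal probability density with parameters mu and sigma, evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(g, lo, hi, steps=20000):
    """Composite trapezoidal rule for g over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, steps):
        total += g(lo + i * h)
    return total * h

def mean_and_sd(f, lo, hi):
    """Mean and SD of a density f, restricted to [lo, hi]."""
    area = integrate(f, lo, hi)  # should be very close to 1
    mean = integrate(lambda x: x * f(x), lo, hi) / area
    var = integrate(lambda x: (x - mean) ** 2 * f(x), lo, hi) / area
    return mean, math.sqrt(var)

mu, sigma = 2347.0, 61.0
mean, sd = mean_and_sd(lambda x: npd(mu, sigma, x), mu - 10 * sigma, mu + 10 * sigma)
print(round(mean, 3), round(sd, 3))   # 2347.0 61.0
```

The same `mean_and_sd` routine works for densities that are not normal, which is what the post-game update requires.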
If N days have elapsed since a player's last event, his SD is adjusted at the start of the present tournament as follows:
New SD = sqrt(σ² + τ²*N/365), where σ is his SD before the adjustment.
In the application of the above formula, N is replaced by 365 whenever it exceeds 365 and New SD is not allowed to exceed the initial SD value of 350 given to new players.
The expression sqrt(σ² + ν²) used above (with ν² = τ²*N/365) arises in the theory of normal probability measures. Every probability density function represents a probability measure in a standard way. In particular, NPD(μ, σ, ·), NPD(0, ν, ·) and NPD(μ, sqrt(σ² + ν²), ·) (see 1.5 for notation) represent probability measures and if they are independent, then the third one is the convolution product of the first two.
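The temporal adjustment can be sketched in a few lines. The sketch assumes the update New SD = sqrt(σ² + τ²*N/365), as suggested by the expression sqrt(σ² + ν²) with ν² = τ²*N/365, together with the two caps stated in the text (N at most 365, New SD at most 350) and τ = 75. The function and parameter names are illustrative.

```python
import math

def temporal_update(sd, inactive_days, tau=75.0, day_cap=365, sd_cap=350.0):
    """SD after an absence of inactive_days, under the assumed formula."""
    n = min(inactive_days, day_cap)               # N is replaced by 365 when larger
    new_sd = math.sqrt(sd ** 2 + tau ** 2 * n / 365.0)
    return min(new_sd, sd_cap)                    # never above the initial SD

# Increments after 20 inactive days, as quoted in subsection 1.3
print(round(temporal_update(60, 20) - 60, 1))     # 2.5
print(round(temporal_update(120, 20) - 120, 1))   # 1.3
```

The two printed increments agree with the examples given for SD = 60 and SD = 120, and they show why a small SD receives the larger increment: the same ν² is added under the square root to a smaller σ².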
Temporal adjustment is new to croquet and there is no existing data that could serve as a guide for what the value of the parameter τ should be. So I proceeded to determine by experiment a value that would optimize the Predictive Efficiency of the system. The latter can be expressed as the Percentage of Correct Predictions (PCP) for the test games of a given test period. (If a system ranks Player A above Player B, then this is interpreted as the prediction that A will beat B in the next game they play).
As test games I used those in which both players had at least 30 games tallied in the system. While predictions could be based on information updated right up to the game to be played, that information is normally not available to Event Organizers or to the public. What is generally available is the ranking data with which players enter an event. So I used this more practically relevant Event Entry Data as basis for predictions. For some players this data may be several weeks old. This does not matter because it is the same for each of the PCP values to be compared. Indeed, the PCP provides a comparative yardstick rather than an absolute one.
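The PCP calculation described above can be sketched as follows. The record layout (winner's Event Entry Mean, winner's games tallied, loser's Event Entry Mean, loser's games tallied) and the sample numbers are hypothetical, and the sketch assumes that players are ranked by Mean and that a game between equally rated players counts as an incorrect prediction.

```python
def percentage_correct_predictions(games, min_games=30):
    """Return (test-game count, correct predictions, PCP in percent)."""
    test_count = 0
    correct = 0
    for w_mean, w_games, l_mean, l_games in games:
        if w_games < min_games or l_games < min_games:
            continue                      # not a test game
        test_count += 1
        if w_mean > l_mean:               # the higher ranked player won
            correct += 1
    pcp = 100.0 * correct / test_count if test_count else float("nan")
    return test_count, correct, pcp

# Hypothetical records: (winner Mean, winner games, loser Mean, loser games)
sample = [
    (2400, 120, 2300, 80),   # test game, predicted correctly
    (2250, 45, 2310, 200),   # test game, upset
    (2500, 300, 2450, 29),   # loser has fewer than 30 games: skipped
    (2380, 60, 2200, 31),    # test game, predicted correctly
    (2100, 35, 2150, 40),    # test game, upset
]
print(percentage_correct_predictions(sample))   # (4, 2, 50.0)
```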
The table to follow gives for each year the Test-game count (Tgmcnt), the Correct Predictions count CP(x) and Percentage of Correct Predictions PCP(x) for each listed parameter choice x.
A few words about this table are in order. Year 2006 covers only up to 6 August (the available data at the time when these numerical studies were done). The PCP method for comparison of choices may seem crude at first glance, but the consistency with which it manages to distinguish between the various parameter choices reassures us about its usefulness as a discriminant. There may be no more than a subtle difference between 75 and 65 as parameter values, but over a few thousand test games, the PCP gives a clear indication of which is preferable. It is noteworthy that the PCP differs from year to year. This can be attributed to the average disparity between opponents, which happens to differ from year to year. (The absolute PCP value is in fact more indicative of disparity than anything else). For this reason we are using the average over the years after 2000 as our "bottom line" PCP values - indicative of typical disparities encountered in our sport in recent years.
If player A performs precisely at level XA on the familiar croquet performance scale and player B performs precisely at level XB, then the Win Probability of A over B is given by the formula
CWP(XA, XB) = 1 / (1 + exp(β*(XB - XA))),
where the constant β = ln(10)/500 just regulates the scale of the system. This classical formula (needed in the next subsection) has been known and widely used for decades. The expression on the right can be written equivalently as 1/(1 + 10^((XB - XA)/500)).
The Classical Win Probability formula is appropriate for use where players are assumed to perform at precise levels. That situation arises in particular in hypothetical calculations. When this formula is applied to real players, it gives a result with some usefulness, but without the accuracy one would wish for.
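For readers who want to experiment, here is a minimal sketch of the Classical Win Probability, written in the form 1/(1 + exp(β*(XB - XA))) with β = ln(10)/500, which is equivalent to the power-of-ten form quoted in the text. The function names are illustrative.

```python
import math

BETA = math.log(10) / 500.0   # scale constant of the system

def cwp(xa, xb):
    """Classical Win Probability of a player at level xa over one at level xb."""
    return 1.0 / (1.0 + math.exp(BETA * (xb - xa)))

def cwp_base10(xa, xb):
    """The same probability written with a power of ten."""
    return 1.0 / (1.0 + 10.0 ** ((xb - xa) / 500.0))

print(cwp(2400, 2400))             # 0.5
print(round(cwp(2500, 2000), 4))   # 0.9091
```

A difference of 500 points thus corresponds to odds of 10 to 1, and the two probabilities `cwp(a, b)` and `cwp(b, a)` always add up to 1.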
Allow me to digress for a moment, just for fun. If we put x black tokens and y white tokens in a bag and randomly draw out one token, then the probability that the drawn token is black is given by the expression x /(x+y). In fact, this is a nice way to illustrate probability. Note that "x /(x+y)" is essentially a restatement in different terms of the expression that defines the CWP function. Indeed, by performing the order preserving reversible transformation
x = exp(β*a), y = exp(β*b)
and then dividing both the numerator and denominator of the resulting fraction by exp(β*a) and applying algebraic properties of the exponential function, we get
x/(x + y) = 1 / (1 + exp(β*(b - a))) = CWP(a, b).
So it seems that the croquet gods have allocated a number of tokens to every player to determine the performance level of that player. Before a game they place the tokens of the two players in a bag and randomly draw one. That decides the outcome … :-)
BR adjusts the Rating Data of a player after every game. The adjustment is based on that celebrated statistical procedure known as Bayes' Rule. It is remarkable that the research of Reverend Thomas Bayes (1702 – 1761) remained largely unnoticed for more than a hundred years after his death before evolving into a powerful school of thought in modern statistical theory. The cited rule gives a procedure for revising an Existing Belief in the light of new information provided by a Sample Event. In our situation the Existing Belief is expressed by the Rating Curve to be updated. The Sample Event is the result of a game against another player with known Rating Curve. The revised belief is a new probability density function expressed in terms of the relevant conditional probabilities. This elegant algorithm involves no man-made parameter.
Consider players A and B who have respectively the Rating Curves (see subsection 1.5)
NPD(μA, σA, x) and NPD(μB, σB, y).
In the event that A beats B, the probability of this Sample Event is given by the Bayesian Win Probability formula:
So according to Bayes' Rule, NPD(μA, σA, x) becomes updated to the new function UW given by
The function UW so obtained is again a Probability Density function, which need not be normal, but will usually be close to normal. Its Mean and SD are adopted as the updated Rating Data for player A. This new Mean is larger than μA. The update of NPD(μB, σB, y) (loser) is obtained in a corresponding way. Namely, first the new Probability Density function UL, given by
is obtained and then, via its Mean and SD, we get the loser's new data. Its Mean will be smaller than μB. The updated SD is smaller for both winner and loser. Sometimes it drops by a very small amount.
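The update can be imitated numerically. The sketch below is an interpretation, not the author's program. It assumes that the likelihood of the result, for a player performing exactly at level x, is the Classical Win Probability averaged over the opponent's Rating Curve, and that the normalising constant is the Bayesian Win Probability of the observed result. The grid width of ±8 SD, the step count and all names are choices made for this example.

```python
import math

BETA = math.log(10) / 500.0

def npd(mu, sigma, x):
    """Normal probability density with Mean mu and SD sigma, evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def cwp(xa, xb):
    """Classical Win Probability of level xa over level xb."""
    return 1.0 / (1.0 + math.exp(BETA * (xb - xa)))

def grid(mu, sigma, half_width=8.0, steps=400):
    """Equally spaced points covering mu +/- half_width * sigma, and the spacing."""
    lo = mu - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    return [lo + i * h for i in range(steps + 1)], h

def trapezoid(values, h):
    """Trapezoidal rule for equally spaced samples."""
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))

def bayes_update(player, opponent, player_won):
    """Return (new Mean, new SD, probability of the observed result)."""
    mu_p, sd_p = player
    mu_o, sd_o = opponent
    xs, hx = grid(mu_p, sd_p)
    ys, hy = grid(mu_o, sd_o)
    opp_density = [npd(mu_o, sd_o, y) for y in ys]
    unnormalised = []
    for x in xs:
        # Likelihood of the result if the player performs exactly at level x
        if player_won:
            terms = [d * cwp(x, y) for d, y in zip(opp_density, ys)]
        else:
            terms = [d * cwp(y, x) for d, y in zip(opp_density, ys)]
        unnormalised.append(npd(mu_p, sd_p, x) * trapezoid(terms, hy))
    evidence = trapezoid(unnormalised, hx)        # probability of the Sample Event
    posterior = [u / evidence for u in unnormalised]
    mean = trapezoid([x * p for x, p in zip(xs, posterior)], hx)
    var = trapezoid([(x - mean) ** 2 * p for x, p in zip(xs, posterior)], hx)
    return mean, math.sqrt(var), evidence

# Hypothetical Rating Data: A = (2300, 100) beats B = (2350, 60)
winner = bayes_update((2300.0, 100.0), (2350.0, 60.0), player_won=True)
loser = bayes_update((2350.0, 60.0), (2300.0, 100.0), player_won=False)
print(winner)
print(loser)
```

Under these assumptions the sketch reproduces the qualitative behaviour described in the text: the winner's Mean rises, the loser's Mean falls, both SDs shrink, and the player with the larger SD receives the larger adjustment.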
Applications of Bayes' Rule are illustrated in many books. See for example Donald L. Harnett and James L. Murphy, Statistical Analysis for Business and Economics (Third Edition), Addison-Wesley (1985). Or just Google "Bayes' Rule". Elementary illustrations typically involve finite sums rather than integrals.
The average of the player’s Recent Performance ratings naturally comes to mind as a possibility. The problem with this approach is that game by game ratings are not independent. The rating after game 51 strongly influences that player's rating after game 52, a little less strongly his rating after game 53, and so on. Every rating influences all subsequent ratings to some degree. This turns an ordinary “average” effectively into some kind of weighted average – whether so intended or not - with weighting that is at best only vaguely known. A deliberate weighted average becomes by default a differently weighted average whose effective weighting is again vaguely known at best. So it is hard to know what the rating eventually represents, except that it causes games played near the end of the period to be more influential than games near the start.
Another possibility is Swiz, the iterative system proposed some time ago on the Nottingham Board by David Maugham. It can be outlined as follows. Start by giving all qualifying players the same rating (Index). Using a fixed Step-size, all players have their Index adjusted by the average adjustment arising in CGS for the games of the period with reference to the Index each player had at the beginning of the period. So the players have different ratings after the first iteration and those new ratings are then used throughout the second iteration. This process is repeated until, for each player, the absolute difference between ratings coming from successive iterations is below a pre-assigned tolerance. This approach has attractive features. It uses only the games of the period and the order in which the games occur is immaterial. So there is none of the unintended weighting inherent in an averaging approach. However, what causes a frown is the uncertainty of convergence. To understand the uncertainty, consider first the series 1 + 1/2 + 1/3 + …+ 1/n + …. The first few terms decrease in size fairly rapidly, but the rate of decrease slows down so much that the series fails to converge – a mathematically proven fact. The sum to 5569 terms is 9.20 and the sum to 80 thousand terms is only slightly more, namely 11.87. Now, as regards Swiz, it turns out that the absolute successive differences of the Index of the top player also have a very slow rate of decrease. After 80 thousand iterations their rate of decrease is much the same as the rate of decrease of the terms of the mentioned non-convergent series. This raises doubt about the convergence of Swiz. Without convergence one does not know when to stop iterating.
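The two partial sums quoted for the series 1 + 1/2 + 1/3 + … are easy to verify. The short sketch below does only that; it is not an implementation of Swiz.

```python
def harmonic_partial_sum(n):
    """Sum of 1/k for k = 1 .. n."""
    return sum(1.0 / k for k in range(1, n + 1))

print(round(harmonic_partial_sum(5569), 2))    # 9.2
print(round(harmonic_partial_sum(80000), 2))   # 11.87
```

More than 74 thousand additional terms raise the sum by less than 2.7, yet the sum grows without bound. Small successive differences alone therefore say nothing about convergence.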
It is not obvious that a satisfactory rating algorithm exists which is based on the player's games during the period only and independent of the order in which these games were played. After numerous experiments and numerous failures, I found the one herewith introduced: the Period Performance Grade (PPG).
This rating is based on sums of weighted wins and losses. For this purpose a game is weighted by the Loser's Win Probability. To see that this is reasonable, note first that not all wins are equally important. There are wins in which the losing player had very little hope of winning – say 5% probability. Even a large number of such wins tells us very little about the performance of the winner. Wins in which the loser had 30% win probability are much more significant. They reflect with credit on the winner. If the loser had 70% win probability it means the actual winner went into the game as underdog. A number of such wins has big impact on the winner's standing. Corresponding statements hold for the loser of games of each kind. These examples suggest that the higher the Loser's Win Probability, the greater the impact on the performance estimate of the two players. This kind of weighting is used also in CGS post-game updating, where the Index adjustment equals (Step-size) * (Loser's Win Probability). This is not to suggest that the new rating here introduced is going to arise via adjustments to an existing rating. On the contrary, we proceed from the view that the player to be rated starts with a clean slate and the only knowledge at our disposal comes in the form of Rating Data of the opponents together with the game results.
For a player P with both wins and losses in the period, we proceed as follows. Give P a trial rating T and assume P to perform precisely at the constant level T throughout the period. (For a hypothetical rating it makes sense to assume this precision). By using T and the known opponent data (Mean and SD) for a game we can compute the Loser's Win Probability. Call this the T-Weight of that game. By using these T-Weights, we can determine whether a trial rating T is too low or too high, as follows.
By adding together the T-Weights of all games won by player P and subtracting the T-Weights of all games lost, we arrive at the Net T-sum for player P over the period. A trial rating T which deliberately underrates P causes, in every game won by P, the estimated Loser's Win Probability to be higher than it ought to be. So the Sum of T-Weights of games won is then also higher than it ought to be. In a corresponding way, the Sum of T-Weights of games lost will be lower than it ought to be, so less is subtracted from the Sum of T-Weights for wins than ought to be subtracted. All told, a deliberately underrated player is found to have a positive Net T-sum, while a deliberately overrated player is found to have a negative Net T-sum.
We define the PPG of player P to be that unique T such that the Net T-sum = 0. It can be shown by mathematical analysis that such T exists and is unique. It can be computed by a standard numerical procedure (Bisection Method). It is only as accurate as the Win Probabilities used in its calculation. These in turn depend on the accuracy of the Recent Performance system from which PPG is derived.
The PPG of a player can be thought of as that hypothetical constant performance level which requires no upward or downward adjustment in the light of the game results of the period.
See Appendix 4.3 for sample PPG ranking lists.
A period of 12 months until the end of April and again until the end of October is arguably the most natural choice for an assessment period. An end-October listing would come more or less at the end of one full season in the northern hemisphere. On the April listing the northern hemisphere players will have generally been inactive for a few months, but the comparison with their recently active southern hemisphere rivals will be of interest. Six months later the roles would be reversed.
Publication every 6 months only would bring relief from the volatility that is inherent in any accurate rating for Recent Performance. Publication every 3 months or even every month is possible, but would involve less natural periods and may result in information overload. Recent Performance ranking lists would presumably continue to appear on a monthly basis.
Any kind of period based ranking runs the risk of producing a misleading rating for a player who had a small number of games in the period. Untypical results are then more likely to occur. To counter this risk, restrictions need to be introduced.
Let us say a game has Moderate Disparity if its Weight (see 2.2) is at least 0.25. There needs to be enough Moderate Disparity wins as well as losses to safeguard against ratings that are out of line. The following activity requirement seems reasonable as a starting point:
Some extraordinary real cases came to light in our trials. One fairly high ranking player had 33 games during a 24 month period. He had 31 wins, but no Moderate Disparity Wins at all and only two Moderate Disparity losses. Another player with 70 games in 24 months had only 2 MD wins and only 4 MD losses. Both of these players would readily have qualified if "games played in period" were the only criterion. The Moderate Disparity qualification quite appropriately filtered out both.
It will be seen in the sample ranking lists provided that the number of players on the PPG list is quite substantial, so the above requirement does not seem too restrictive – at least not for a ranking list supplementary to the populous Recent Performance list.
No objective yardstick exists for finding the best cut-off points for PPG qualification, nor for the period length or frequency of publication. The choices mentioned here should be regarded as a reasonable starting point from which consensus could develop.
The games played during the period can be regarded as a mega-event – a virtual world championship. All appropriately active players are automatic participants. It is based on many more games per player than are played in a real championship. A temporary surge in form will generally not help a player as much as it could in a real championship. Nor will the impact of one extraordinary shot. For these reasons, the top player on a PPG ranking list has good reason to be regarded as the best player in the world for that period.
Separate regional PPG listings would be of considerable interest. The April ranking would provide a virtual Player of the Season contest for Australia and likewise there would be one for New Zealand; the October ranking would do this for the UK and for North America.
Occasionally, a longer assessment period could be used e.g. to produce a Player of the Decade.
Let P be the player to be rated, with enough Moderate Disparity wins and losses to qualify for the Period Ranking list. Let T be a trial rating for P. The Weight of a game played by P is defined to be the Loser's Win Probability. Its calculation is based on T and the Rating Data of the opponent via the formula to be detailed below. Put
GW(j,T) = Weight of the j-th game won by P in the period
GL(k,T) = Weight of the k-th game lost by P in the period
Netsum(T) = Σj GW(j,T) - Σk GL(k,T),
where the first sum is over all games won and the second over all games lost. The PPG sought is the unique trial rating T such that Netsum(T) = 0.
The above definitions give a quick overview of how the PPG is arrived at, but we still need to explain the Weight calculation and establish existence and uniqueness. As regards the Weight calculation, we are dealing here with a situation in which P has a hypothetical performance level T deemed precisely known, while each opponent has a performance level expressed by real Rating Data. Neither the Classical Win Probability (subsection 1.7) nor the Bayesian Win Probability formula (subsection 1.8) applies to the present situation. So we have to derive a new formula for use here.
One can regard a precisely known performance level T as the limit case, as n → ∞, of a sequence of Rating Data (T,S(n)), where the Standard Deviations S(n) converge to 0. Of course, any Standard Deviation appearing in the NPD formula needs to be > 0 to avoid division by 0. So the sequence of Rating Curves NPD(T,S(n),x) cannot have a limit in the space of Rating Curves. However, each Probability Density function can be interpreted as a probability measure in a standard way. The limit of the above sequence does exist in the larger space of probability measures. In fact, it is the Dirac measure δ at the point T, i.e. the "point mass" whose mass is concentrated at T. This comes as no surprise if one considers that the curves NPD(T,S(n),x) have peaks that get higher and sharper above the mean value T as n → ∞. The limit conjures up the image of something like a spike of infinite height and infinitesimally small width, with unit area.
It is known from measure theory that integration of a smooth function f with respect to the Dirac measure δ at the point T simply evaluates f at T:
∫ f(x) dδ(x) = f(T)
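The limiting behavior described here can be checked numerically. The following is a minimal sketch, not part of the BR software: it integrates a smooth test function against the normal density NPD(T,S,x) for shrinking values of S and compares the result with the value of the function at T. The names npd, integrate_against_npd and smooth_f, the logistic test function and the sample level 2347 are all illustrative assumptions.

```python
import math

def npd(mean, sd, x):
    """Normal probability density with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def integrate_against_npd(f, mean, sd, steps=4000, width=8.0):
    """Midpoint-rule approximation of the integral of f(x) * NPD(mean, sd, x)
    over the interval [mean - width*sd, mean + width*sd]."""
    lo = mean - width * sd
    h = 2.0 * width * sd / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += f(x) * npd(mean, sd, x)
    return total * h

def smooth_f(x):
    # Hypothetical smooth test function (a logistic curve); not from the article.
    return 1.0 / (1.0 + math.exp(-(x - 2300.0) / 200.0))

T = 2347.0  # sample performance level, chosen only for illustration
for sd in (200.0, 50.0, 10.0, 1.0):
    # As sd shrinks, the integral should approach smooth_f(T).
    print(sd, integrate_against_npd(smooth_f, T, sd), smooth_f(T))
```

As the standard deviation shrinks, the printed integral approaches the value of the test function at T, which is the point-mass behavior that the Dirac measure captures in the limit.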
Recall from subsection 1.8 that the win probability of player A with data (μA, σA) over player B with data (μB, σB) is given by
We are now equipped to deal with the case of player P (with performance level deemed = T) and an opponent player B (with Rating Data = (μB, σB)). By substituting (T ,S(n)) for (μA, σA) in the above formula and using the mentioned Dirac measure property, we obtain
Win Probability of P over B
Similarly we obtain
These formulas are used for calculation of the Weights GW(j,T) and GL(k,T) mentioned above.
Finally, let us show that the PPG must exist and be unique. Note first that the function Netsum is continuous, being a sum of compositions of standard continuous functions. The Weight GW(j,T), i.e. the Loser's Win Probability in a game won by P, strictly decreases when we increase T and can be made as close to 0 as we wish by taking T large enough. On the other hand, the Weight GL(k,T) to be subtracted, i.e. the Loser's Win Probability in a game lost by P, strictly increases when we increase T and can be made as close to 1 as we wish by taking T large enough. Every term of Netsum(T) therefore strictly decreases as T increases, so Netsum is a strictly decreasing function. Moreover, since a qualifying player has at least one loss, the sum of all the GW(j,T) can be made less than ½ and the sum of all the GL(k,T) greater than ½ by taking T large enough, so Netsum(T) < 0 for large enough T. A similar argument shows that Netsum(T) > 0 for small enough T. The Intermediate Value Theorem then guarantees existence of a value T such that Netsum(T) = 0, and by the strict monotonicity of the function there is only one such value.
The PPG can be found numerically via the Bisection Method to a predetermined accuracy. Once we have values Ta and Tb such that Netsum(Ta) > 0 and Netsum(Tb) < 0, it typically takes 5 or 6 further trials to reach a value T such that the absolute value of Netsum(T) is below the adopted Tolerance of 0.01. Calculation of all PPG on a typical ranking list takes only slightly more than a minute.
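The bisection search just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the BR implementation. In particular, loser_win_probability is a hypothetical logistic stand-in for the Weight formula, each opponent is reduced to a single performance level (the opponent's Standard Deviation is ignored), and the bracket 0 to 5000 and the sample opponent levels are invented for the example. Only the Netsum definition, the sign conditions on the bracket and the Tolerance of 0.01 come from the text.

```python
def loser_win_probability(winner_level, loser_level, scale=500.0):
    """Hypothetical stand-in for the article's Weight formula (NOT the BR formula):
    a logistic win probability for the loser, given both performance levels."""
    return 1.0 / (1.0 + 10.0 ** ((winner_level - loser_level) / scale))

def netsum(t, won_against, lost_to, weight=loser_win_probability):
    """Netsum(T): sum of Weights of games won minus sum of Weights of games lost."""
    gw = sum(weight(t, opp) for opp in won_against)  # P won, so the opponent is the loser
    gl = sum(weight(opp, t) for opp in lost_to)      # P lost, so P is the loser
    return gw - gl

def find_ppg(won_against, lost_to, lo=0.0, hi=5000.0, tol=0.01, max_iter=200):
    """Bisection for a trial rating T with |Netsum(T)| < tol.
    Requires Netsum(lo) > 0 > Netsum(hi), i.e. at least one win and one loss."""
    if not (netsum(lo, won_against, lost_to) > 0 > netsum(hi, won_against, lost_to)):
        raise ValueError("bracket does not straddle the root")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        value = netsum(mid, won_against, lost_to)
        if abs(value) < tol:
            return mid
        if value > 0:
            lo = mid  # Netsum is decreasing, so the root lies above mid
        else:
            hi = mid  # the root lies below mid
    return 0.5 * (lo + hi)

# Hypothetical opponent levels for the games P won and lost in the period.
won_against = [2200.0, 2300.0, 2400.0]
lost_to = [2350.0]
ppg = find_ppg(won_against, lost_to)
print(round(ppg, 1), netsum(ppg, won_against, lost_to))
```

Passing the weight function as a parameter keeps the search independent of the particular Win Probability formula, so the stand-in could be swapped for the formulas of this subsection without changing the bisection loop.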
CGS uses two ratings: Index and Grade. What are their roles?
The Index is a rating for Recent Performance, generally regarded as being excessively volatile, even if one recognizes that this kind of rating needs to be fairly volatile to do its job. In practice the Index seems to provide the raw data from which the Grade is computed, rather than to be a rating to be used in its own right for ranking purposes. (For an explanation of how CGS works, see http://www.oxfordcroquet.com/tech/nel-wr/index.asp)
As explained in (reference to CGS), the Grade is effectively a weighted average of all previous Indexes. Since these Indexes are not confined to a period, it is not a rating for performance over a period, even though it is retrospective to some extent. Being effectively a weighted average of Indexes, it suffers from the disadvantage mentioned in subsection 2.1: it is hard to fathom what it really measures. Being a smoothed version of the Index, it can also be regarded as a rating for Recent Performance. As such it does a reasonably good job. However, its lag effect produces well-known anomalous behavior, namely that a lost game may cause it to increase and a won game may cause it to decrease. This is not in keeping with what a rating for Recent Performance is expected to do.
In summary, there are conspicuous qualitative improvements in the service offered by the BR package:
When two different ranking systems are applied to the same set of game results, they will inevitably rank the players in a different order. The obvious question then arises: which system gives the more appropriate ranking?
When two ratings are both for Recent Performance, a comparison as to their predictive efficiency is relevant. One merely needs to find for each system its percentage of correct predictions (PCP) under the same conditions over a suitable period. It was already detailed in subsection 1.6 how the "bottom line" PCP is obtained. The following table compares these PCP values for the ratings here under consideration. The Test period is from 1 Jan 2001 to 6 August 2006.
Tgmcnt = number of Test games in this period.
These PCP values speak for themselves. Accuracy of CGS ratings is naturally influenced by the various parameter choices made, but there is also the inherent loss of accuracy that comes with the application of the Classical Win Probability to real player data. No degree of precision in the calculations can overcome that handicap. A real player simply does not play precisely to any given performance level. In this respect BR, with its rating in terms of a Bell Curve, has a built-in advantage.
When ratings are used by Team Selection committees there needs to be an awareness that the predictions become less and less accurate the further they are applied into the future. I did an experiment in which the BR Mean, as it was 6 months earlier, was consistently used for all predictions. It yielded a bottom line PCP of 67.60 compared to the 68.98 obtained above (with the Mean after the most recent event of the player). This gives a numerical idea of how the predictive accuracy diminishes when applied into the future.
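The idea of a percentage of correct predictions can be illustrated with a small sketch. The exact procedure is the one detailed in subsection 1.6. The version below is only an assumed simplification in which a prediction counts as correct when the winner held the higher rating before the game and games between equally rated players are skipped. The function name and the sample ratings are hypothetical.

```python
def percentage_correct_predictions(games):
    """games: iterable of (winner_rating, loser_rating) pairs, using the ratings
    held before each game. A prediction counts as correct when the winner was
    the higher-rated player; equal ratings are skipped (an assumption)."""
    correct = 0
    counted = 0
    for winner_rating, loser_rating in games:
        if winner_rating == loser_rating:
            continue  # no definite prediction for this game
        counted += 1
        if winner_rating > loser_rating:
            correct += 1
    if counted == 0:
        raise ValueError("no games with a definite prediction")
    return 100.0 * correct / counted

# Hypothetical pre-game ratings: (winner, loser) for four games.
sample_games = [(2400, 2300), (2200, 2350), (2500, 2100), (2300, 2300)]
print(percentage_correct_predictions(sample_games))
```

Running the same function once with current ratings and once with ratings as they stood some months earlier would give a comparison of the kind reported in the experiment above.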
How do BR ranking lists differ from those of CGS? To the extent that they are pursuing the same goal one should expect considerable similarity. To the extent that they employ very different updating algorithms one should expect considerable difference. Appendix 4.1 shows monthly Top 10 comparative listings over the period Sep 2005 through July 2006. They do indeed reveal both similarity and difference. Here, as a first glimpse, we show compressed Top 5 comparative listings over the 5-month period ending July 2006. The BR listings are on the left, the CGS listings on the right.
The above link leads to discussions about this ranking system which took place on the 'Nottingham List'.
All rights reserved © 2006