A New And Better Rating System?

by Steve Cox

Having said in a recent letter to Richard Sharp that I knew he'd had no end of advice over the years on how to improve the ratings system, I decided to look it all up, and it makes interesting reading. For example, there've been professional statisticians who have worked out systems based on how different a player's record of wins, draws and eliminations is from what could have been achieved by chance, or who have described systems for determining which values of the constants in the ratings formula make the ratings most useful for predicting the results of games, as well as, of course, endless discussion on the relative merits of coming second to a winner compared with sharing in an n-way draw, or the effects of inflationary drift (i.e. the fact that game ratings tend to increase with time, so that newcomers can make faster progress up the table than their predecessors were able to).

What emerges most strongly is that there is no perfect system, and not even any agreement on what features the perfect system would have if it existed. It's not even clear whether or not ratings are a good thing. To take the last question first, I think they are a good thing. In fact, if the ratings didn't exist, I wouldn't play regular Dip postally (I'd stick to variants), just as I wouldn't play it at cons if the games weren't part of a tournament. The single biggest boost that could be given to variants would be if there was a ratings system for the players of them that had the same high profile that Richard's ratings do. Unfortunately, one of the main reasons why Richard's system has gained the acceptance that it has, despite its manifest flaws, is that everyone can see how much effort he has put into it, and continues to put into it, and the prospect of having to match this puts people off.

In the light of the material I have unearthed, I think it's very important, if you plan to devise a new system, that everyone is clear what the philosophy behind it is, and what its objectives are. In other words, you must state explicitly whether you believe that second behind a winner is worse than any draw, and if so, how much worse; you must say whether a draw against strong opposition is better than a win against novices and dropouts, whether the player on one centre in a 17-16-1 draw deserves as much credit as the one on 17 centres, and whether a two-way draw is better than a three-way. And you must come clean about whether you wish to encourage certain styles of play and discourage others, and whether your ratings are intended to indicate the current level of skill of a player rather than his average level throughout his career, or even to be able to predict the likely results of new games.

Richard has said more than once (and I agree with him) that he believes that the most important factor that any system should take into account when awarding points to the players in a game is what the quality of the opposition was. That said, there's a problem with Richard's system that no-one seems to have spotted, which is that it awards points to a player not just on the basis of how good the opposition was, but on how good the player himself is as well! Thus if Rob Chapman beats a collection of novices, he gets more points from it than if another novice beats them. This problem can be overcome quite simply, it seems to me, by making the game rating for each player be the sum of the other six players' ratings, not the sum of all the players' ratings.
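The difference between the two kinds of game rating can be shown in a few lines. This is only a sketch: the function names and the sample ratings (one established player on 250 among six novices on 100) are made up for illustration and come from neither system's actual tables.

```python
def game_rating_all(ratings):
    """Game rating as the sum of all seven players' ratings."""
    return sum(ratings)


def game_rating_others(ratings, player):
    """Game rating as seen by one player: the sum of the other six ratings."""
    return sum(ratings) - ratings[player]


# Hypothetical field: index 0 is an established player, the rest are novices.
field = [250, 100, 100, 100, 100, 100, 100]

# Summing all seven ratings gives every player the same game rating,
# and the strong player's own rating is part of it.
same_for_everyone = game_rating_all(field)

# Summing only the opposition gives each player a rating that reflects
# who he actually had to beat.
seen_by_strong_player = game_rating_others(field, 0)
seen_by_a_novice = game_rating_others(field, 1)

print(same_for_everyone, seen_by_strong_player, seen_by_a_novice)
```

With the second function, the strong player's game rating is lower than a novice's in the same game, because his opposition was weaker than theirs.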

Richard's system also has the problem that a player's recent record carries no more weight than his historical one. It also penalises experienced players, who find it harder to increase their rating because each new result is swamped by all the old ones.

These problems can be overcome by using the following formula to calculate a player's rating, R:

R = (w1P1 + w2P2 + w3P3 + ... + wNPN) / (w1 + w2 + w3 + ... + wN)

where P1 is the points awarded in the current game, P2 is the points awarded in the previous game, etc., and the w's are constants, called 'weighting factors' (in case you don't recognise it, this is the formula for a 'weighted average').

We can choose a maximum value for N in this formula so that only a player's last 10 games count, say (although this won't make much difference if w1 is large compared with wN). For the purpose of these notes, I will assume that N=11, with w1, w2, w3,...w11 = 2, 1.9, 1.8,...1; this makes the divisor 16.5. (Actually, by making some of the weightings less than 1, we could arrange it so that the divisor was N, just as if it were a true average. However, this is equivalent to multiplying the top and bottom of the rating fraction by a constant, so it makes no difference to the result.) The actual choice of the weightings could be determined by answering such questions as 'how many recent draws does it take to cancel out a past elimination, all else being equal?', or 'if a player's last n results were eliminations but before that he had 11-n wins, what is the value of n for which that player's rating should be the same as that of a player who has played only one game, which he won, and what values of the weightings ensure that this is the case (i.e. that the ratings are the same)?' What fun!
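The example weightings, their divisor, and the point about rescaling can be checked with a short sketch. The function name and the sample points list are illustrative assumptions; only the weightings 2, 1.9, ..., 1 come from the text.

```python
def weighted_rating(points, weights):
    """Weighted average of points; points[0] is the most recent game.

    Only as many weights are used as there are results to count.
    """
    n = min(len(points), len(weights))
    used = weights[:n]
    return sum(w * p for w, p in zip(used, points[:n])) / sum(used)


# w1..w11 = 2, 1.9, 1.8, ..., 1
WEIGHTS = [round(2.0 - 0.1 * i, 1) for i in range(11)]
DIVISOR = sum(WEIGHTS)  # 16.5, up to floating-point rounding

# Rescale so that the weightings sum to N = 11, as for a true average.
SCALED = [w * len(WEIGHTS) / DIVISOR for w in WEIGHTS]

# Hypothetical results, most recent first (not taken from any real player).
sample_points = [150, 80, 120, 100]

original = weighted_rating(sample_points, WEIGHTS)
rescaled = weighted_rating(sample_points, SCALED)
print(original, rescaled)  # the two agree, up to rounding
```

Multiplying every weighting by the same constant changes the numerator and the denominator by the same factor, so the rating is unchanged.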

To get an idea of how the formula works, suppose that a certain novice player (as usual, novices start with 100 points from a notional first game) gets 200 points from his actual first game (against 6 other novices), which he wins, but only 50 points from his next one (against 6 different novices). His rating would be

R = ( 2x50 + 1.9x200 + 1.8x100 ) / (2+1.9+1.8) = 115.8

However, a second novice who plays exactly the same games, with the same results, but who plays them in the opposite order, would end up with a rating of

R = ( 2x200 + 1.9x50 + 1.8x100 ) / 5.7 = 118.4
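The two calculations above can be reproduced with a minimal sketch. The helper name is an assumption; the points and weightings are the ones used in the example.

```python
def weighted_rating(points, weights):
    """Weighted average of points; points[0] is the most recent game."""
    n = min(len(points), len(weights))
    used = weights[:n]
    return sum(w * p for w, p in zip(used, points[:n])) / sum(used)


WEIGHTS = [2.0, 1.9, 1.8]  # w1, w2, w3 from the example weightings

# Most recent game first; the final 100 is the notional first game.
first_novice = weighted_rating([50, 200, 100], WEIGHTS)   # won, then lost
second_novice = weighted_rating([200, 50, 100], WEIGHTS)  # lost, then won

print(round(first_novice, 1), round(second_novice, 1))
```

The only difference between the two calls is the order of the 50 and the 200, so the gap between the ratings comes entirely from the weightings.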

Note that Richard's system would also give the two players different ratings, but for a different reason. Under his system, the game rating for each novice's second game would be affected by his rating at that time, which is different for each one, so the percentage of the game rating that he was awarded at the end would represent a different number of points. Under my system, the game rating for each game is the same for each novice, because it is independent of his own rating. It is the greater weight given to the score for the second game that leads to the different ratings.

Out of interest, I worked out what ratings Richard's system would give the two novices. This is not as easy to do as you might think, because it is no longer the case that if the games have identical results, then the players get the same points from them. For the first novice, the result was 116.7, and for the second 113.3. In other words, the first novice has the higher rating, whereas under my system it's the second one! Now, one could argue that the order in which the games were played shouldn't matter (and perhaps it shouldn't for a novice). However, that would only justify making the ratings the same. I can think of no justification, other than mathematics, for giving the novice who lost his most recent game a higher rating than the one who won his...

Well, maybe I can - perhaps the others ganged up on the first novice when they realised his rating was higher than theirs in his second game, so his first game was more representative of his actual ability! The trouble with arguing this way, though, is that it assumes one can deduce the ethical quality of a game from the final result, which one cannot. For example, suppose a game ends in an 18-16 win. Was this because the two players made a pact whereby the second helped the first to win, or was it because, at 17 all and after a bloody race to win by both players, the second one took a gamble for the last centre, that didn't come off? Obviously, we can't tell.

It is important, I think, that we do not allow similar interpretations to bias the design of other features of the rating system. For example, many people argue that the player on one centre in an 18-15-1 win deserves less credit than the one on 15 because he didn't come as close to winning. But we could just as easily argue that the one on 15 deserves less credit because he had a better chance to stop the win, but failed. Since there's no way of knowing what actually happened, all we can do is give both of them equal credit (not no credit, though - we decided earlier that we were going to reward survivors pour encourager les autres).

A possible objection to my system is that it would result in larger swings in people's ratings than occur at present. Well, we can make the swings as large or as small as we like, by adjusting the weighting constants and the number of games counted. In the limit, if the weightings all have the same value, the swings will be the same as those in Richard's system.

Furthermore, larger swings might actually be a good thing, since they would give more people hope that some day they might be number one in the ratings, even if it wasn't for very long. Currently, if the best Dip player ever born were just starting his career, it would be many years before his rating came to reflect his ability. This is because of Richard's singular 'experienced player bonus', which allows a player, simply because he has played a lot of games, to ignore recent defeats in favour of past victories. Mine is a more socialist approach!

Well, it wasn't my intention when I started writing this to propose a new ratings system, but that's what I seem to have done. Its main features are:

i) The game rating for a player ignores his own rating - it is a measure of the quality of the opposition

ii) Recently played games count for more than historical ones

iii) Very old results are ignored, however good they might have been

iv) All losing survivors score the same, as do all survivors in a draw (a draw being any game in which there is no outright winner), i.e. DIAS applies

v) Centre counts have no effect on points awarded

vi) Dropouts are allowed to ignore the games from which they dropped out, if they subsequently complete enough games with honour

The following is a tentative definition of the system. The letters A to L denote constants yet to be decided; comments in square brackets would not be included in any final version.

1. Everyone starts with 100 points from a notional first game.

2. The game rating for a completed game is different for each player, being the sum of the ratings of all the other players in the game.

3. At the end of the game, each player receives a number of points that is a percentage of the game rating as he sees it (his 'score'), multiplied by an 'anti-deflation factor'. The anti-deflation factor is given by the sum of the players' ratings before adjustment divided by the sum of their scores for the newly completed game [see below for explanation]. Centre counts are irrelevant.

4. An outright winner, or a player who is conceded a win, scores A% of the game rating. The other n survivors score B/n % each (this will represent a different number for each of them, of course). Eliminees split the remaining (100-A-B)% in the ratio 1:2:3:...:(6-n), those eliminated first getting the fewest points. If two players were eliminated in the same season, the split becomes, for example, 1:2:2:3.

5. If there is no outright winner, the n survivors score 100(n+1)/(7n) % of the game rating each. Eliminees split any remaining percentage equally. [This slightly complicated formula is required because all the available points must be split, to avoid deflation, but without giving the eliminee, if there is only one, the same points as the survivors.]

6. The new rating, R, for each player is determined from the following formula (the 'ratings formula'):

R = (w1P1 + w2P2 + w3P3 + ... + wNPN) / (w1 + w2 + w3 + ... + wN)

where P1 is the points awarded in the current game, P2 is the points awarded in the previous game, and so on. The w's are constants, known as 'weightings'. Their values are E, F, G...H [E>F>G...>H]

N is J, or the number of games played (including the notional one), whichever is the smaller. [It might be better if novices were deemed to have played N notional games, scoring 100 points in each one; this would help to reduce the ratings of 'one game wonders'.]

7. If a player drops out of a game, he scores no points for it. The weighting for that game for him is fixed at K [K>wN (K=w1?)], however many games he plays subsequently, and the value of N in the ratings formula becomes L [L>J] for him, as long as any games from which he dropped out count towards his rating.

8. For the purpose of calculating game ratings for the other players in a game with a dropout, a special rating is calculated for the dropout for that game. This uses the normal formula for non-dropouts, the dropout's actual rating being entered as his score for the games from which he dropped out (note that since the normal ratings formula counts a smaller number of results than the formula for dropouts, the dropout's special rating might not include any games from which he dropped out). [This helps to eliminate the problem that people are reluctant to play against reformed dropouts because of their low ratings.]
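Items 4 and 5 can be sketched as two small functions that return percentages of the game rating. This is an illustration only: the function names are invented, the constants A and B are still undecided, so they are passed in as parameters, and the tie rule for players eliminated in the same season (1:2:2:3) is left out for brevity.

```python
def win_percentages(a, b, n_survivors):
    """Item 4: percentages when there is an outright winner.

    a, b        -- the undecided constants A and B
    n_survivors -- number of survivors other than the winner (0 to 6)
    Eliminee shares are listed from first eliminated to last eliminated.
    """
    if not 0 <= n_survivors <= 6:
        raise ValueError("n_survivors must be between 0 and 6")
    n_eliminees = 6 - n_survivors
    survivor_share = b / n_survivors if n_survivors else 0.0
    remaining = 100 - a - b
    ratio_total = n_eliminees * (n_eliminees + 1) / 2  # 1 + 2 + ... + (6 - n)
    eliminee_shares = [remaining * k / ratio_total
                       for k in range(1, n_eliminees + 1)]
    return {"winner": a, "survivor": survivor_share,
            "eliminees": eliminee_shares}


def draw_percentages(n_survivors):
    """Item 5: percentages when there is no outright winner."""
    if not 1 <= n_survivors <= 7:
        raise ValueError("n_survivors must be between 1 and 7")
    survivor_share = 100 * (n_survivors + 1) / (7 * n_survivors)
    n_eliminees = 7 - n_survivors
    remaining = 100 - n_survivors * survivor_share
    eliminee_share = remaining / n_eliminees if n_eliminees else 0.0
    return {"survivor": survivor_share, "eliminee": eliminee_share}


# A = 50 and B = 20 are placeholder values, chosen only to show the arithmetic.
win_example = win_percentages(a=50, b=20, n_survivors=2)
draw_example = draw_percentages(n_survivors=3)
print(win_example)
print(draw_example)
```

Each player's points would then be his percentage applied to the game rating as he sees it, before the anti-deflation factor of item 3.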

The anti-deflation factor requires explaining. In the current system, the sum of the ratings of all the players, divided by the number of players, remains more or less constant over time (more or less, because some players ignore games that others do not). It doesn't seem that way, because players who leave the hobby with ratings below 100 contribute the points they lost to the total in circulation amongst the active players, but if we count everyone who's ever been rated, the observation holds.

In my system, it's not obvious that this will be the case, though. For example, suppose an outright winner scored 100% of the game rating (i.e. A=100 in item 4 above), and no-one else got anything. 7 novices now play the first ever game, which one of them wins. The winner's rating, ignoring the anti-deflation factor, would become

Rwinner = (2x600 + 1.9x100)/(2+1.9) = 356.4

whilst each of the losers would go down to

Rloser = (2x0 + 1.9x100)/(2+1.9) = 48.7

So the sum of the ratings is about 648.7, not 700.

This happens because the game rating is not the sum of the player ratings. We can't just fiddle the percentages so that they add up to more than 100% (this is obvious if you imagine that the winner in the above game started with a rating other than 100 - it makes no difference to the new ratings) so my suggestion is to multiply each new rating, R, by Σr/ΣR, where r denotes the original ratings. This ensures that the sum of the new ratings is the same as the sum of the old ones. What's more, it does it in a way that allows us to play around with the constants in my system - even making them add up to more than 100%, or making the denominator in the ratings formula different from the sum of the weightings - without having to worry about 'inflationary drift' (or at any rate, no more than we do at present).
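The seven-novice example and the proposed correction can be worked through in a short sketch. It follows the description in this paragraph, scaling each new rating by the sum of the old ratings divided by the sum of the new ones; the variable names are illustrative.

```python
W1, W2 = 2.0, 1.9          # weightings for the current game and the notional game
old_ratings = [100.0] * 7  # seven novices, index 0 is the eventual winner

# Each player sees a game rating of 600 (the other six novices).
# With A = 100 the winner scores all 600 points and the losers score 0.
winner_raw = (W1 * 600 + W2 * 100) / (W1 + W2)  # about 356.4
loser_raw = (W1 * 0 + W2 * 100) / (W1 + W2)     # about 48.7

raw_ratings = [winner_raw] + [loser_raw] * 6

# Anti-deflation factor: sum of old ratings over sum of unadjusted new ratings.
factor = sum(old_ratings) / sum(raw_ratings)
adjusted = [r * factor for r in raw_ratings]

print(round(sum(raw_ratings), 1), round(sum(adjusted), 1))
```

After the adjustment the ratings again total 700, while the ratio between the winner's rating and a loser's rating is unchanged.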

A couple more thoughts. It might be suggested that it would be a good idea to make the weightings of historical games a function of how long ago they were played. Thus games played last year would have a weighting of 1.9, those played two years ago would have a weighting of 1.8, and so on. Unfortunately, this would have the effect that a player's rating would change every year even if he didn't play a game - in fact it might go up! This is clearly not on.
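The drift described here is easy to demonstrate. In this sketch the weighting is taken to be 2 minus 0.1 for each year of age, which matches the 1.9 and 1.8 quoted above; the player and his two results are hypothetical.

```python
def age_weight(age_years):
    """Weighting by age: 1.9 for last year, 1.8 for two years ago, and so on."""
    return 2.0 - 0.1 * age_years


def rating_by_age(results, years_from_now=0):
    """results is a list of (points, age_in_years_today) pairs."""
    weights = [age_weight(age + years_from_now) for _, age in results]
    total = sum(w * points for w, (points, _) in zip(weights, results))
    return total / sum(weights)


# Hypothetical player: 200 points last year, 50 points the year before.
results = [(200, 1), (50, 2)]

rating_today = rating_by_age(results)
rating_next_year = rating_by_age(results, years_from_now=1)  # no new games

print(rating_today, rating_next_year)
```

Because the ratio between the two weightings shifts as both games age, the rating creeps upwards after a year in which the player has done nothing at all.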

Another idea I had was this. It seems a pity to make veteran players with dozens of games to their credit throw away their early games. Instead, why not take all of those games and work out a rating from them, as if they had just been played, setting N sufficiently large to include all of them (but ignoring dropouts)? We could then pretend that the resulting rating was the points scored by the player in the oldest game that is counted for his actual rating. I have yet to work out whether this would be better or worse for the player than ignoring these games altogether, and whether, if it's better, the effect is any different from including all of the games in the normal rating calculation (I don't think this can be answered, because the weightings aren't defined above wN), but perhaps we could say that this modification is made only if it results in an improvement to the player's rating?

I gather from what you’ve said to me that you think any new ratings system should be applied retrospectively. I'm not sure why you think the data would all have to be typed in again though. Is RS still using whatever ancient, non-IBM compatible machine he was years ago for the stats? If not, he should be able to supply the data in CSV (comma-separated values) format, which can easily be read into Excel. If you had to type it all in again, it would take 33 hours even at one entry a minute, and since you've got to enter Boardman number, players' names, opening moves, Winter '01 centre counts, dates of eliminations, final centre counts, and agreed result for each game, I think one a minute is a bit optimistic (and you'd probably also want zine name, GM name and standby details as well, perhaps even end date, although some of this could perhaps be derived from the Boardman number).

Of course, this could all be done whilst the details of the new system are still being finalised, and you could also farm it out in packets to anyone with a PC - they wouldn't even need Excel.

One of my projects whilst I'm 'between jobs' was supposed to be to learn Visual Basic programming - I even spent £37 on a text book for the purpose - and writing a front end for an Excel based ratings system would be an ideal exercise on which to test my new skills. Unfortunately, I've spent every day so far tied up with PBM stuff, except when I've been learning German so as to be able to read games rules. Any such program is therefore some way off, I'm afraid.
