A New And Better Rating System?
by Steve Cox
Having said in a recent letter to Richard Sharp that I knew he'd had no end of advice
over the years on how to improve the ratings system, I decided to look it all up, and it
makes interesting reading. For example, there've been professional statisticians who have
worked out systems based on how different a player's record of wins, draws and
eliminations is from what could have been achieved by chance, or who have described
systems for determining which values of the constants in the ratings formula make the
ratings most useful for predicting the results of games, as well as, of course, endless
discussion on the relative merits of coming second to a winner compared with sharing in an
n-way draw, or the effects of inflationary drift (i.e. the fact that game ratings tend to
increase with time, so that newcomers can make faster progress up the table than their
predecessors were able to).
What emerges most strongly is that there is no perfect system, and not even any
agreement on what features the perfect system would have if it existed. It's not even
clear whether or not ratings are a good thing. To take the last question first, I think
they are a good thing. In fact, if the ratings didn't exist, I wouldn't play regular Dip
postally (I'd stick to variants), just as I wouldn't play it at cons if the games weren't
part of a tournament. The single biggest boost that could be given to variants would be if
there was a ratings system for the players of them that had the same high profile that
Richard's ratings do. Unfortunately, one of the main reasons why Richard's system has
gained the acceptance that it has, despite its manifest flaws, is that everyone can see
how much effort he has put into it, and continues to put into it, and the prospect of
having to match this puts people off.
In the light of the material I have unearthed, I think it's very important, if you plan
to devise a new system, that everyone is clear what the philosophy behind it is, and what
its objectives are. In other words, you must state explicitly whether you believe that
second behind a winner is worse than any draw, and if so, how much worse; you must say
whether a draw against strong opposition is better than a win against novices and
dropouts, whether the player on one centre in a 17-16-1 draw deserves as much credit as
the one on 17 centres, and whether a two-way draw is better than a three-way. And you must
come clean about whether you wish to encourage certain styles of play and discourage
others, and whether your ratings are intended to indicate the current level of skill of a
player rather than his average level throughout his career, or even to be able to predict
the likely results of new games.
Richard has said more than once (and I agree with him) that he believes that the most
important factor that any system should take into account when awarding points to the
players in a game is what the quality of the opposition was. That said, there's a problem
with Richard's system that no-one seems to have spotted, which is that it awards points to
a player not just on the basis of how good the opposition was, but on how good the player
himself is as well! Thus if Rob Chapman beats a collection of novices, he gets more points
from it than if another novice beats them. This problem can be overcome quite simply, it
seems to me, by making the game rating for each player be the sum of the other six
players' ratings, not the sum of all the players' ratings.
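As a minimal sketch of the difference, the two definitions of game rating can be compared directly. The function names and the figure of 250 for a strong player are invented for illustration; nothing here is taken from Richard's actual tables.

```python
def game_rating_all(ratings):
    """Game rating as the sum of all seven players' ratings."""
    return sum(ratings)


def game_rating_others(ratings, player):
    """Proposed game rating for one player: the sum of the other six ratings."""
    return sum(r for i, r in enumerate(ratings) if i != player)


# Hypothetical games: a strong player (index 0) against six novices,
# and a novice (index 0) against six other novices.
expert_game = [250, 100, 100, 100, 100, 100, 100]
novice_game = [100, 100, 100, 100, 100, 100, 100]

# Sum of all ratings: the winner's own rating inflates the pot.
print(game_rating_all(expert_game), game_rating_all(novice_game))              # 850 700

# Sum of the other six: both winners faced the same opposition.
print(game_rating_others(expert_game, 0), game_rating_others(novice_game, 0))  # 600 600
```

With the second definition, any fixed percentage of the game rating is worth the same number of points to either winner.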
Richard's system also has the problem that a player's recent record has no more
significance than his historical one. It also penalises experienced players, in that
it is harder for them to increase their ratings because each new result is swamped
by all the old ones.
These problems can be overcome by using the following formula to calculate a player's
rating, R:
R = (w1P1 + w2P2 + w3P3 + ... + wNPN) / (w1 + w2 + w3 + ... + wN)
where P1 is the points awarded in the current game, P2 is the points awarded in the
previous game, etc., and the w's are constants, called 'weighting factors' (in case you
don't recognise it, this is the formula for a 'weighted average').
We can choose a maximum value for N in this formula so that only a player's last 10
games count, say (although this won't make much difference if w1 is large compared with
wN). For the purpose of these notes, I will assume that N=11, with w1, w2, w3,...w11 = 2,
1.9, 1.8,...1; this makes the divisor 16.5. (Actually, by making some of the weightings
less than 1, we could arrange it so that the divisor was N, just as if it were a true
average. However, this is equivalent to multiplying the top and bottom of the rating
fraction by a constant, so it makes no difference to the result.) The actual choice of the
weightings could be determined by answering such questions as 'how many recent draws does
it take to cancel out a past elimination, all else being equal?', or 'if a player's last n
results were eliminations but before that he had 11-n wins, what is the value of n for
which that player's rating should be the same as that of a player who has played only one
game, which he won, and what values of the weightings ensure that this is the case (i.e.
that the ratings are the same)?' What fun!
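The weightings assumed in these notes can be sketched in a few lines. The function names are hypothetical, and the scores used to check the rescaling remark are made up; the sketch only confirms the divisor of 16.5 and that multiplying every weighting by the same constant leaves the rating unchanged.

```python
def make_weights(n=11, newest=2.0, step=0.1):
    """Weightings for the n most recent games, newest first: 2.0, 1.9, ..., 1.0."""
    return [round(newest - step * i, 10) for i in range(n)]


def weighted_rating(points, weights):
    """Weighted average of the scores (newest first), over the games that count."""
    used = weights[:len(points)]
    counted = points[:len(used)]
    return sum(w * p for w, p in zip(used, counted)) / sum(used)


weights = make_weights()
print(round(sum(weights), 10))  # 16.5, the divisor when all 11 games count

# Rescale the weightings so that they sum to N = 11 instead of 16.5.
scaled = [w * 11 / 16.5 for w in weights]
points = [50, 200, 100]  # made-up scores, newest first
print(weighted_rating(points, weights), weighted_rating(points, scaled))
```

The two ratings printed on the last line agree to within rounding error, which is the point made in the parenthesis above.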
To get an idea of how the formula works, suppose that a certain novice player (as
usual, novices start with 100 points from a notional first game) gets 200 points from his
actual first game (against 6 other novices), which he wins, but only 50 points from his
next one (against 6 different novices). His rating would be
R = ( 2x50 + 1.9x200 + 1.8x100 ) / (2+1.9+1.8) = 115.8
However, a second novice who plays exactly the same games, with the same results, but
who plays them in the opposite order, would end up with a rating of
R = ( 2x200 + 1.9x50 + 1.8x100 ) / 5.7 = 118.4
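The two calculations can be checked with a short sketch. It assumes only the three weightings used above (2, 1.9 and 1.8, newest game first); the helper name is invented.

```python
WEIGHTS = [2.0, 1.9, 1.8]  # newest game first, as in the worked example


def rating(points_newest_first):
    """Weighted average of the scores, most recent game first."""
    pairs = list(zip(WEIGHTS, points_newest_first))
    return sum(w * p for w, p in pairs) / sum(w for w, _ in pairs)


# The final 100 in each list is the notional first game.
first_novice = rating([50, 200, 100])   # scored 200, then 50
second_novice = rating([200, 50, 100])  # scored 50, then 200

print(round(first_novice, 1), round(second_novice, 1))  # 115.8 118.4
```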
Note that Richard's system would also give the two players different ratings, but for a
different reason. Under his system, the game rating for each novice's second game would be
affected by his rating at that time, which is different for each one, so the percentage of
the game rating that he was awarded at the end would represent a different number of
points. Under my system, the game rating for each game is the same for each novice,
because it is independent of his own rating. It is the greater weight given to the score
for the second game that leads to the different ratings.
Out of interest, I worked out what ratings Richard's system would give the two novices.
This is not as easy to do as you might think, because it is no longer the case that if the
games have identical results, then the players get the same points from them. For the
first novice, the result was 116.7, and for the second 113.3. In other words, the first
novice has the higher rating, whereas under my system it's the second one! Now, one could
argue that the order in which the games were played shouldn't matter (and perhaps it
shouldn't for a novice). However, that would only justify making the ratings the same. I
can think of no justification, other than mathematics, for giving the novice who lost his
most recent game a higher rating than the one who won his...
Well, maybe I can - perhaps the others ganged up on the first novice when they realised
his rating was higher than theirs in his second game, so his first game was more
representative of his actual ability! The trouble with arguing this way, though, is that
it assumes one can deduce the ethical quality of a game from the final result, which one
cannot. For example, suppose a game ends in an 18-16 win. Was this because the two players
made a pact whereby the second helped the first to win, or was it because, at 17 all and
after a bloody race to win by both players, the second one took a gamble for the last
centre, that didn't come off? Obviously, we can't tell.
It is important, I think, that we do not allow similar interpretations to bias the
design of other features of the rating system. For example, many people argue that the
player on one centre in an 18-15-1 win deserves less credit than the one on 15 because he
didn't come as close to winning. But we could just as easily argue that the one on 15
deserves less credit because he had a better chance to stop the win, but failed. Since
there's no way of knowing what actually happened, all we can do is give both of them equal
credit (not no credit, though - we decided earlier that we were going to reward survivors
pour encourager les autres).
A possible objection to my system is that it would result in larger swings in people's
ratings than occur at present. Well, we can make the swings as large or as small as we
like, by adjusting the weighting constants and the number of games counted. In the limit,
if the weightings all have the same value, the swings will be the same as those in
Richard's system.
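A rough illustration of how the choice of constants controls the swing, using an invented record of ten scores of 100 followed by one score of 300. The three sets of weightings are assumptions chosen for contrast: the declining set used in these notes, an equal set (which reduces the formula to a plain average of the counted games), and a version that counts only the last five games.

```python
def rating(points, weights):
    """Weighted average of the most recent scores (newest first)."""
    counted = points[:len(weights)]
    used = weights[:len(counted)]
    return sum(w * p for w, p in zip(used, counted)) / sum(used)


history = [100] * 10     # invented record: ten earlier scores of 100
after = [300] + history  # one good new result, newest first

declining = [2.0 - 0.1 * i for i in range(11)]  # 2, 1.9, ..., 1
equal = [1.0] * 11                              # every counted game weighs the same
short = declining[:5]                           # only the last five games count

swings = {}
for name, w in [("declining", declining), ("equal", equal), ("short", short)]:
    swings[name] = rating(after, w) - rating(history, w)
    print(name, round(swings[name], 1))
# prints: declining 24.2, equal 18.2, short 44.4
```

The same single result moves the rating by about 18, 24 or 44 points depending purely on the constants chosen.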
Furthermore, larger swings might actually be a good thing, since they would give more
people hope that some day they might be number one in the ratings, even if it wasn't for
very long. Currently, if the best Dip player ever born were just starting his career, it
would be many years before his rating came to reflect his ability. This is because of
Richard's singular 'experienced player bonus', which allows a player, simply because he
has played a lot of games, to ignore recent defeats in favour of past victories. Mine is a
more socialist approach!
Well, it wasn't my intention when I started writing this to propose a new ratings
system, but that's what I seem to have done. Its main features are:
i) The game rating for a player ignores his own rating - it is a measure of the quality
of the opposition
ii) Recently played games count for more than historical ones
iii) Very old results are ignored, however good they might have been
iv) All losing survivors score the same, as do all survivors in a draw (a draw being
any game in which there is no outright winner), i.e. DIAS applies
v) Centre counts have no effect on points awarded
vi) Dropouts are allowed to ignore the games from which they dropped out, if they
subsequently complete enough games with honour
The following is a tentative definition of the system. The letters A to L denote
constants yet to be decided; comments in square brackets would not be included in any
final version.
1. Everyone starts with 100 points from a notional first game.
2. The game rating for a completed game is different for each player, being the sum of
the ratings of all the other players in the game.
3. At the end of the game, each player receives a number of points that is a percentage
of the game rating as he sees it (his 'score'), multiplied by an 'anti-deflation factor'.
The anti-deflation factor is given by the sum of the players' ratings before adjustment
divided by the sum of their scores for the newly completed game [see below for
explanation]. Centre counts are irrelevant.
4. An outright winner, or a player who is conceded a win, scores A% of the game rating.
The other n survivors score B/n % each (this will represent a different number for each of
them, of course). Eliminees split the remaining (100-A-B)% in the ratio 1:2:3:...:(6-n),
those eliminated first getting the fewest points. If two players were eliminated in the
same season, the split becomes, for example, 1:2:2:3.
5. If there is no outright winner, the n survivors score 100(n+1)/7n % of the game
rating each. Eliminees split any remaining percentage equally. [This slightly complicated
formula is required because all the available points must be split, to avoid deflation,
but without giving the eliminee, if there is only one, the same points as the survivors.]
6. The new rating, R, for each player is determined from the following formula (the
'ratings formula'):
R = (w1P1 + w2P2 + w3P3 + ... + wNPN) / (w1 + w2 + w3 + ... + wN)
where P1 is the points awarded in the current game, P2 is the points awarded in the
previous game, and so on. The w's are constants, known as 'weightings'. Their values are
E, F, G...H [E>F>G...>H]
N is J, or the number of games played (including the notional one), whichever is the
smaller. [It might be better if novices were deemed to have played N notional games,
scoring 100 points in each one; this would help to reduce the ratings of 'one game
wonders'.]
7. If a player drops out of a game, he scores no points for it. The weighting for that
game for him is fixed at K [K>wN (K=w1?)], however many games he plays subsequently,
and the value of N in the ratings formula becomes L [L>J] for him, as long as any games
from which he dropped out count towards his rating.
8. For the purpose of calculating game ratings for the other players in a game with a
dropout, a special rating is calculated for the dropout for that game. This uses the
normal formula for non-dropouts, the dropout's actual rating being entered as his score
for the games from which he dropped out (note that since the normal ratings formula counts
a smaller number of results than the formula for dropouts, the dropout's special rating
might not include any games from which he dropped out). [This helps to eliminate the
problem that people are reluctant to play against reformed dropouts because of their low
ratings.]
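The percentage splits in items 4 and 5 can be sketched as follows. This is only an illustration under stated assumptions: the function names are invented, the values A=60 and B=20 are placeholders (the constants are still undecided), and the case of a win with no other survivors is left out because the definition does not say what happens to B% there.

```python
def win_shares(n, elimination_seasons, a, b):
    """Item 4: percentage shares when there is an outright winner.

    n                   -- number of other survivors (1 to 6)
    elimination_seasons -- one sortable entry per eliminee (earlier = smaller)
    a, b                -- the undecided constants A and B, as percentages
    Returns (winner share, share per survivor, list of eliminee shares).
    """
    if not 1 <= n <= 6 or len(elimination_seasons) != 6 - n:
        raise ValueError("expected n other survivors and 6 - n eliminees")
    order = sorted(set(elimination_seasons))
    # Eliminees out in the same season share a rank, giving e.g. 1:2:2:3.
    ranks = [order.index(s) + 1 for s in elimination_seasons]
    remaining = 100.0 - a - b
    eliminee = [remaining * r / sum(ranks) for r in ranks]
    return a, b / n, eliminee


def draw_shares(n):
    """Item 5: (share per survivor, share per eliminee) in an n-way draw."""
    if not 2 <= n <= 7:
        raise ValueError("a draw needs between 2 and 7 survivors")
    survivor = 100.0 * (n + 1) / (7 * n)
    eliminees = 7 - n
    remaining = 100.0 - n * survivor
    return survivor, (remaining / eliminees if eliminees else 0.0)


# Placeholder constants A=60, B=20; two survivors, four eliminees,
# two of whom went out in the same season.
print(win_shares(2, [1903, 1905, 1905, 1908], a=60, b=20))
# (60, 10.0, [2.5, 5.0, 5.0, 7.5])

survivor, eliminee = draw_shares(3)
print(round(survivor, 2), round(eliminee, 2))  # 19.05 10.71
```

Each share is a percentage of the game rating as that particular player sees it, so equal percentages do not in general mean equal points. Note that in a six-way draw the formula leaves nothing for the single eliminee, and in a seven-way draw the shares as written total slightly more than 100%.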
The anti-deflation factor requires explaining. In the current system, the sum of the
ratings of all the players, divided by the number of players, remains more or less
constant over time (more or less, because some players ignore games that others do not).
It doesn't seem that way, because players who leave the hobby with ratings below 100
contribute the points they lost to the total in circulation amongst the active players,
but if we count everyone who's ever been rated, the observation holds.
In my system, it's not obvious that this will be the case, though. For example, suppose
an outright winner scored 100% of the game rating (i.e. A=100 in item 4 above), and no-one
else got anything. 7 novices now play the first ever game, which one of them wins. The
winner's rating, ignoring the anti-deflation factor, would become
Rwinner = (2x600 + 1.9x100)/(2+1.9) = 356.4
whilst each of the losers would go down to
Rloser = (2x0 + 1.9x100)/(2+1.9) = 48.7
So the sum of the ratings is about 648.7, not 700.
This happens because the game rating is not the sum of the player ratings. We can't
just fiddle the percentages so that they add up to more than 100% (this is obvious if you
imagine that the winner in the above game started with a rating other than 100 - it makes
no difference to the new ratings) so my suggestion is to multiply each new rating, R, by
Σr/ΣR, where r denotes the original ratings and Σ a sum over the players in the game. This
ensures that the sum of the new ratings
is the same as the sum of the old ones. What's more, it does it in a way that allows us to
play around with the constants in my system - even making them add up to more than 100%,
or making the denominator in the ratings formula different from the sum of the weightings
- without having to worry about 'inflationary drift' (or at any rate, no more than we do
at present).
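The seven-novice example, and the effect of the correction, can be reproduced with a short sketch. It follows the version of the correction described in this explanation (each new rating is multiplied by the sum of the old ratings divided by the sum of the new ones) rather than the wording of item 3, and the helper name is invented.

```python
WEIGHTS = [2.0, 1.9]  # newest game first; only two games count so far


def rating(points_newest_first):
    """Weighted average of the scores, most recent game first."""
    pairs = list(zip(WEIGHTS, points_newest_first))
    return sum(w * p for w, p in pairs) / sum(w for w, _ in pairs)


old = [100.0] * 7                   # seven novices, all on 100
game_rating = sum(old[1:])          # 600 for every player, since all are on 100
scores = [game_rating] + [0.0] * 6  # A = 100: the winner takes everything

raw = [rating([s, 100.0]) for s in scores]
factor = sum(old) / sum(raw)        # sum of old ratings / sum of new ratings
adjusted = [r * factor for r in raw]

print(round(raw[0], 1), round(raw[1], 1))            # 356.4 48.7
print(round(sum(raw), 1))                            # 648.7
print(round(adjusted[0], 1), round(adjusted[1], 1))  # 384.6 52.6
print(round(sum(adjusted), 1))                       # 700.0
```

Summing the unrounded ratings gives 648.7; after scaling, the total is back to 700 while the ratio between winner and losers is unchanged.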
A couple more thoughts. It might be suggested that it would be a good idea to make the
weightings of historical games a function of how long ago they were played. Thus games
played last year would have a weighting of 1.9, those played two years ago would have a
weighting of 1.8, and so on. Unfortunately, this would have the effect that a player's
rating would change every year even if he didn't play a game - in fact it might go up!
This is clearly not on.
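A small sketch shows the effect. It assumes the age-based weighting is 2.0 minus 0.1 per year of age (so 1.9 for last year and 1.8 for the year before, as above; the 2.0 for the current year is an extrapolation), and the two results are invented.

```python
def weight_for_age(years_ago):
    """Hypothetical age-based weighting: 1.9 for last year, 1.8 for two years ago, etc."""
    return 2.0 - 0.1 * years_ago


def rating_after(idle_years, results):
    """Rating after a further idle_years without playing.

    results -- (years_ago, points) pairs as of today
    """
    weighted = [(weight_for_age(age + idle_years), p) for age, p in results]
    return sum(w * p for w, p in weighted) / sum(w for w, _ in weighted)


# Invented record: a poor game five years ago and a good one last year.
results = [(5, 50), (1, 200)]

for idle in (0, 1, 2):
    print(idle, round(rating_after(idle, results), 1))
# prints: 0 133.8, then 1 134.4, then 2 135.0
```

The rating creeps upwards because the older, poorer result loses weight proportionally faster than the recent one, even though no game has been played.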
Another idea I had was this. It seems a pity to make veteran players with dozens of
games to their credit throw away their early games. Instead, why not take all of those
games and work out a rating from them, as if they had just been played, setting N
sufficiently large to include all of them (but ignoring dropouts). We could then pretend
that the resulting rating was the points scored by the player in the oldest game that is
counted for his actual rating. I have yet to work out whether this would be better or
worse for the player than ignoring these games altogether, and whether, if it's better,
the effect is any different from including all of the games in the normal rating
calculation (I don't think this can be answered, because the weightings aren't defined
above wN), but perhaps we could say that this modification is made only if it results in
an improvement to the player's rating?
I gather from what you've said to me that you think any new ratings system should
be applied retrospectively. I'm not sure why you think the data would all have to be typed
in again though. Is RS still using whatever ancient, non-IBM compatible machine he was
using years ago for the stats? If not, he should be able to supply the data in CSV
(comma-separated values) format, which can easily be read into Excel. If you had to type it all
in again, it would take 33 hours even at one entry a minute, and since you've got to enter
Boardman number, players' names, opening moves, Winter '01 centre counts, dates of
eliminations, final centre counts, and agreed result for each game, I think one a minute
is a bit optimistic (and you'd probably also want zine name, GM name and standby details,
as well - perhaps even end date (but perhaps some of this could be derived from the
Boardman number?)).
Of course, this could all be done whilst the details of the new system are still being
finalised, and you could also farm it out in packets to anyone with a PC - they wouldn't
even need Excel.
One of my projects whilst I'm 'between jobs' was supposed to be to learn Visual Basic
programming - I even spent £37 on a text book for the purpose - and writing a front end
for an Excel based ratings system would be an ideal exercise on which to test my new
skills. Unfortunately, I've spent every day so far tied up with PBM stuff, except when
I've been learning German so as to be able to read games rules. Any such program is
therefore some way off, I'm afraid.