Rookie 3.0 Chess Playing Program

About

Current and historic server ratings

News

2007-03-24 -- Ratings back to normal

Strange rating recovery problems
Today I finally had time to check on the rating recovery progress. Over the last month, I was expecting the Rookie(C) ratings to climb back to just above the blik(C) ratings. However, for some reason both stayed stuck at a level about 50 points too low for weeks in a row. Last week I finally found out that this was because the database wasn't getting any more updates fed into it, so I was looking at old data all the time.

Moronic Linux
The root cause turned out to be a crazy Linux update on solar: /bin/mail moved to /usr/bin/mail, breaking the code that submits completed games. This type of totally unnecessary change is exactly what makes Linux an unreliable platform compared to the BSDs or Solaris. I lost the logs of the games from February 22 to March 18, or about 5,000 games. On March 18, /bin/mail was symlinked to the new mail program. This is a workaround until I have time to recompile the binaries.
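One way to make a game-submission script less sensitive to this kind of relocation is to look the program up at run time instead of hard-coding its path. The following is a minimal Python sketch of that idea, not Rookie's actual code: the helper name find_program and the list of fallback paths are assumptions made for illustration.

```python
import os
import shutil


def find_program(name, fallback_paths=()):
    """Return the full path of an executable, or None if it cannot be found."""
    # First search the directories listed in the PATH environment variable.
    found = shutil.which(name)
    if found:
        return found
    # Then try a few well-known locations (hypothetical list, adjust as needed).
    for path in fallback_paths:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None


# Resolve the mailer once at start-up; None means no mailer is installed.
mail_path = find_program("mail", ("/usr/bin/mail", "/bin/mail"))
```

A caller would check mail_path for None at start-up and complain loudly, rather than silently dropping completed games.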

No full recovery
Today I found time to restore some of the damage. I imported the missing games from the game e-mails that FICS and ICC send out, so at least I have the PGN data back. Anyway, here are the corrected results. On ICC, Rookie(C) has fully recovered to its expected position with respect to blik(C). On FICS, this isn't the case. Maybe this is due to some players preferring one opponent over the other. It does mean that even extended server play is not a reliable method to measure strength differences. That is a very disappointing conclusion.

Rookie3: Recovery of rating drop

Developments
I have been busy moving for my work again, so not much work has been done on Rookie. The eval tuning experiment has failed. I will now focus on book learning and tuning, as that must be the other easy way to improve Rookie. Last month I rewrote the book file handling code to make it more flexible to use. I have some other plans for that code as well, which I won't disclose now.


History

2007-02-03 -- Restored original evaluation vector

Today I reverted to the original Rookie 2.0 evaluation vector. The graph and numbers speak a thousand words. I still have some experiments ongoing in search of a working self-tuning method. Since I'm a little busy preparing my move to Taipei, I probably won't work on Rookie until the Chinese New Year holidays.

Results as graph:
Rookie3: Effect of evaluation tuning experiment
Results as table:
                          Rating stats (last 1000)
Account    Server         Average  Sigma  Accuracy  Delta rating
Rookie(C)  chessclub.com     2130     80       2.5          -392
blik(C)    chessclub.com     2499     65       2.0           +32
Rookie(C)  freechess.org     1983     24       0.8          -238
blik(C)    freechess.org     2190     57       1.8            +7

2007-01-15 -- Server results of new evaluation vector: not good

Taking the plunge
The second big iteration of the self-tuning run has almost finished. It has been running for about a month to get there. I decided to deploy the new vector yesterday, because it seems to make some interesting changes to Rookie's perception of material, weak pawns and piece placement. The algorithm is also no longer finding many improvements, so it looks more or less 'finished'. It is time to see the result of all that effort in the real world...

Disaster strikes
After copying the new vector to the Internet versions of the engine, Rookie(C) got some games quite quickly. However, the first couple of games already indicated that something was wrong, very wrong. Some players commented that this version has gone totally crazy. It indeed looks like the computer is drunk. When you observe the games, the erratic behavior becomes clear almost immediately: Rookie(C) whispers +2 pawns up or more in equal positions, gives away material, refuses to capture easy pawns and so on... It is really horrible. No need to explain that the ratings took a steep dive right away. I didn't dare to check out how much it sank precisely, but so far it looks like Rookie(C) lost 100 to 200 points...

Lessons learnt
So, the first big 'improvement' towards Rookie 3.0 has turned out terribly. Not a good start! What can we learn from that?

Follow-up
After the first two games I already needed to resist the urge to undo the change right away. But in the name of Science and Engineering, I will let Rookie suffer a little bit longer and record what happens. For the next couple of days, Rookie(C) will continue to play with the crazy evaluator settings. There are two reasons and one excuse for that: First, I want to collect a representative collection of bad games, so I can extract positions from them for learning. Second, I want to measure the rating drop accurately. Many games are required for that. Last, it is a bit of a hassle to take Rookie offline, modify everything back, double-check it all, start it again and monitor whether all is running normally. So I want to postpone that to the weekend.

I promise that the current version will go offline as soon as I can't bear it anymore: probably just a few days from now. Players can enjoy beating the sitting duck until that time...


2007-01-13 -- Null measurement done

Performance
Sufficient games have been played to consolidate the initial ratings. Rating statistics since deployment on 2006-12-12 are in the graph and table below. On both servers, Rookie(C) outperforms blik(C). The difference is about 47 rating points. This difference must be due to the larger transposition tables and the 5-men endgame database. It is not clear which factor contributes the most. (It is also possible, but not likely, that the difference is caused by potentially different opponent demographics.)

Results as graph:
Rookie3: Null measurement rating plot
Note: The graph shows a daily moving average with a window of 1000 games. The games started on 2006-12-12. The moving average is plotted from the moment that the 1000-game mark was reached.
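The windowed average behind such a plot can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name moving_average is hypothetical, the input is assumed to be the rating after each game in chronological order, and it emits one value per game rather than one per day.

```python
from collections import deque


def moving_average(ratings, window=1000):
    """Return the mean of the last `window` ratings, starting once the window is full."""
    recent = deque()   # ratings currently inside the window
    total = 0          # running sum of the ratings in `recent`
    averages = []
    for rating in ratings:
        recent.append(rating)
        total += rating
        if len(recent) > window:
            total -= recent.popleft()  # drop the oldest game from the window
        if len(recent) == window:
            averages.append(total / window)
    return averages
```

With fewer than `window` games the result is empty, which matches plotting only from the moment the 1000-game mark is reached.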

Results as table:
                                       Rating stats (last 1000)
Account    Server         Blitz games  Average  Sigma  Accuracy
Rookie(C)  chessclub.com         1998     2522     55       1.7
blik(C)    chessclub.com         2198     2467     55       1.7
Rookie(C)  freechess.org         1282     2221     39       1.3
blik(C)    freechess.org         1604     2183     61       1.9

Reliability
The transition to solar seems to be mostly OK. There are two glitches, though, that affect both blik(C) and Rookie(C): sometimes the engine simply hangs... It must be a Linux thing, as it never happened on the Sun/Solaris, FreeBSD and Mac OS X systems before. The frequency is low, so I will live with it for now. Maybe it is related to the way Rookie handles signals. There are two scenarios:

2006-12-12 -- Rookie3 project kick-off

New machine playing online
Today I set up solar.bitpit.net[marcelk.bitpit.net] to play games online. The blik(C) player moved from piggy.bitpit.net to solar. Solar's processors (2.0GHz Intel Xeon Core-Duo 5130) are about 40% faster than piggy's (2.3GHz PowerPC-G5)! I expect to see Rookie finally break the 1Mnps barrier regularly now in the endgame. The Rookie(C) accounts started playing with a slightly better setup than blik(C): larger transposition tables and the 5-men endgame database. For the rest, there are no differences between blik(C) and Rookie(C). I will use the coming weeks to test this setup and get a good indication of the current playing strengths.

Evaluator auto-tuning started
As a second milestone, the old scripts for tuning the evaluator are running again for the first time in almost 6 years. Solar has 4 CPU cores, so now I have enough computing power available to run this kind of long-term calibration without impacting online play too much. I have extracted a test set of 47,265 positions from old blik(C) games. (This is the total number that passed the sanity filters. Initial runs suggested that the targeted number of 10,000 positions may not be enough for calibration purposes.) The scripts use a hill-climbing algorithm on 205 of Rookie's adjustable evaluation parameters. This should be enough to test the feasibility of this auto-tuning method. Not all parameters are covered, but it is a good start. It will take a couple of weeks to iterate over the set a number of times: long enough to do the rating null measurement and test the setup in the meantime.
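For readers unfamiliar with the technique, a generic coordinate-wise hill climb over a parameter vector can be sketched as follows. This is not the actual tuning script: the function names, the step size, the number of passes and the toy scoring function are all assumptions for illustration. In a real run, score would measure how well the evaluator does on the test positions.

```python
def hill_climb(params, score, step=1, max_passes=10):
    """Nudge one parameter at a time and keep every change that improves the score."""
    best = list(params)
    best_score = score(best)
    for _ in range(max_passes):          # one pass = one iteration over all parameters
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):  # try raising, then lowering, parameter i
                candidate = list(best)
                candidate[i] += delta
                candidate_score = score(candidate)
                if candidate_score > best_score:
                    best, best_score = candidate, candidate_score
                    improved = True
                    break
        if not improved:                 # local optimum: no single step helps anymore
            break
    return best, best_score


def toy_score(params, target=(3, -2, 1)):
    """Hypothetical stand-in objective: higher is better, maximal at `target`."""
    return -sum((p - t) ** 2 for p, t in zip(params, target))


tuned, tuned_score = hill_climb([0, 0, 0], toy_score)
```

A hill climb like this stops at the first local optimum it reaches, and it optimizes only what the scoring function measures.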


To do


Last update: 2007-03-24 (marcelk) Ratings back to normal.