The GNUbg Training Program

This page details my effort to further improve GNU backgammon neural nets (NN).

My involvement with the GNUBG project began in October 1999. I started playing on FIBS a short time before that and wanted to view my matches in a browser. I could not find an open source program to do the job, and started to write my own. Naturally I wanted to annotate the output, and was pleased to discover GNUbg, version 0.0 at that time. The only problem was that GNUbg had no cube handling code. I thought I could improve my backgammon skills by writing it, and after several false tries got it right - and not surprisingly did not improve my skills even one iota. What I did learn was that GNUbg playing ability left a lot to be desired, and was far behind Jellyfish 3.5. I have been working on GNUbg neural nets since that time.

First step - a benchmark for comparing two nets

A benchmark does not directly improves the NN, but certainly helps developing a better one. The objective is to measure NN performance in the real currency of backgammon - good move selection and good cube decisions. Obviously it is nice when NN evaluations match ``reality'', but I prefer a net which makes better decisions over more accurate evaluations. Given such a net, if you want the ``real'' probabilities you perform a rollout, which is what you end up doing anyway in those situations.

Humans also rarely make probability estimates - they make decisions.

Here is how I choose to do it,

The Benchmark

Ideally, all possible moves are rolled, but that is not feasible given our current computing power. The compromise is to roll only the top N[1] moves, where the top N moves are selected by a 2-ply evaluation[2]. The downside is that sometimes the best move is missed altogether (assuming the rollout would have evaluated it as such). I hope this will not hurt the practical value of the benchmark, and perhaps fully roll a small subset of the positions to get an estimate on this source of error. Rollouts are done for money, since this is what the net tries to approximate.

As to cube decisions, I started with cubefull rollouts, but this turned out to be a bad idea. You can read the details in the comments below, especially in Change of cube decisions check.

A good way to collect benchmark positions is by looking at human - bot matches, and if those are not available from bot self play. I think it is important to have a real-life mix of positions. It is easy to get frustrated with a NN performance on a specific position (or type of positions), and as a result over-train the NN for those. Any candidate NN should perform well for a real-life set.

Disadvantages and objections

Rollouts introduce noise
This is definitely the biggest drawback. One way to minimize the effect is to use the exact same dice sequences for all moves of the position (A trick independently invented many times). That way, big swings in the first few turns of each rollout are shared.

Rollouts are biased toward the NN used to run the rollout
There is little that can be done. It is the best there is at this stage. When a better NN is developed (based on the benchmark), the process may be repeated with the new NN.

Rollouts are not truncated, but proceed until a race position is reached (or bearoff when rolling races). Races are evaluated to a very small error using my One Sided Race (OSR) evaluator. This way no evaluation of the NN performing the rollout is used, only it's playing capabilities.

-5:-5 is not enough (David Montgomery)
Even if you were to concentrate on cubefull numbers, the choice of -5:-5 seems sub-optimal. If the intent is to focus on money-like scores, I would go to a much longer length. If the intent is to ensure that decisions are learned for match play, then a variety of scores might be appropriate.
I agree. I would have used money cube decisions, only I never got around to implementing them. We could use -25:-25, which approximates money to a sufficient degree (but would be slower).

The main purpose of including cube decisions is to cover evaluations of positions with none or one moves, which are not covered by the moves benchmark. For that reason I don't think it makes much of a difference in practice.

7-away 7-away is more like money (Douglas Zare)
7-away 7-away is more like money play than 5-away 5-away or 9-away 9-away. In fact, 7-away 7-away closely resembles table-stakes backgammon with a limit of 8 points. The exact correspondence depends on the MET, of course.
I will change the default in the future.
Change of cube decisions check.
When considering the above and following some self reflection, I realized that using cubefull rollouts is indeed wrong. The problem is that the 0ply doubling method vs. cubefull rollout introduces most of the differences and not errors in evaluationa, which is what I want to measure.

A better way is to check the difference between two cube actions, both made using the 0ply doubling method, one with the net 0ply evaluation, the other with the rollout evaluation.

This should give a better measure of what we are looking for - the effect of absolute evaluation errors on cube decisions[3]. Ideally the measure would be independent of the particular doubling algorithm used - but then I can't see how to measure the MWC between different cube actions without using some model.

The compromise I currently use is to take the equity difference between rollout and evaluation, but only when the cube decision is different than when using rollout. I also realized that doing so for just one score takes into account only position around the take point of that score, so I started using 31 scores ranging from (-2,-3) to (-25,-25), which I found covers the range between 40% to 100% win percentage. An additional refinement is to search for an interior point between the rollout and the evaluation which,

And use the difference between that point and the evaluation. i.e., how far is the evaluation from the point where it would make the right decision.

The doubling algorithm used affects in deciding if an error was made, and the equity difference is typically larger than the actual equity lost, but the main disadvantage is that this number has no meaning and can only be used to compare different nets.

Establishing a benchmark - the rollout program

I wrote a C++ program to run the rollouts. The code is based on fibs2html libraries.

Unix/Linux users should downloaded the code, and Windows users can downloaded the self extracting .exe (thanks to Řystein Johansen).

I am experimenting with GNU pgp. Here is the signature of the code. This is my public key.

You can read more about sagnubg formats and options here.

How to install

How to run

After the successful installation,

Windows user may want to look at Oystein explains how to use sagnu for windows.

If program run is interrupted for any reason[4] you can simply resume using `sagnubg -w . --resume data-file result-file'.

That --resume is important, otherwise the new run will overwrite all previous data.

Creating the benchmark

Creating a benchmark requires lots of computing power, and is made possible only with the generous help of the rollout team. I thank you all, especially the prolific Ian Shaw.

At the moment (January 2003) work on contact benchmark has started. Status of rollouts in progress is here.

Rolling of crashed and race benchmark is complete for now. Contributors are listed here.

The Neural Net

Creating a Neural Net (NN) involves making a number of decisions,

All those features interact in a complex, non-linear way. Seeking the best combination in this multidimensional space is hard. I am content to seek improvement along one dimension while keeping the rest fixed.

All GNUbg NN's have one hidden layer and are fully connected with sigmodial activation function. Those are semi standard choices inherited from Gary Wong GNUbg-0.0, and will probably never change.

Size of hidden layers

All main nets have the same number of hidden nodes (128)[5]. A higher number may improve the net modeling ability, but increases cost of evaluating, training and storing. The best value for each net should be investigated more fully in the future, and may increase as computing power continues to grow.

Hand Crafted Inputs

There are several interesting and important generic questions, none with a satisfactory answer,

All nets encode the raw board following Tesauro using 4 inputs per point (per side), a total of 200 inputs. The idea is to hardwire the basic concepts of a blot, anchor and builder for the net. Additional inputs are added according to net type.

The Training Method

For the chosen topology, the basic training step is the standard backpropagation step (Or perhaps here). This hardly determines the training method since the choice of alpha (learning rate) and training data can make a huge difference.

GNUbg-0.0 Training

GNUbg-0.0 was trained using TD(1)[6]. Self playing to obtain positions, it used a 1ply evaluation as the training data, and decreased alpha as the number of training positions grew. The quality of the net was medium, rated around 1650 on FIBS (0ply).

TD is a good bootstrapping method but suffers from two drawbacks. First, decreasing alpha means data points are treated differently according to their (random) location in the training sequence. In fact, after some time alpha gets so small there is practicality no change. Second, 1ply evaluation are not sufficient at the first stage of learning, before the net has a reasonable grasp of the value of primes, or any other long term feature.

Supervised Training

A switch to supervised training (ST) addresses the first drawback. A set of positions is chosen, and the net is trained in epoch's (A pass through all of the training data with a fixed alpha) over and over, trying to find the best fit for the data set as a whole.

It was quite an effort to make supervised training work. I ended up with a method which works for me, but it is still unclear why it does.

I use ridiculous values of alpha in the range 30 to 1, coupled with randomizing the order of data in each epoch.

  1. set alpha to alphastart
  2. select a random order for the data
  3. perform one epoch
  4. err = NN error
  5. if err is smaller by p% than errprevious, return to stage 3
  6. decrease alpha. if alpha < alphamin, set it to alphastart
  7. if err < errprevious, return to stage 3
  8. return to stage 2

So, while there is a p% improvement, another pass is performed with the same settings. When there is a small improvement, alpha is clamped down some more. When error increases, the data is re-ordered. Typical values are alphastart = 20, p = 0.5% and alphamin = 1 or 0.5.

Obviously a big alpha's lets us escape from small local minima, but it is not clear to me why it converges at all.

Use of Rollouts to Obtain Training Data

To address the second drawback, rollouts were used to obtain training data. Both random positions and positions known to be mis-evaluated by GNUbg made the data set.

As a result, GNUbg improved quite a lot. It's rating climbed to the low 1800 on FIBS (0ply), and it started making expert moves, like hitting behind a 6 prime to catch a second checker.

From Rollouts to 2ply

One disadvantage of using rollouts is the time needed to generate them, but this is not the only concern. After using ST for a while I noticed that sometimes adding data points degraded the net instead of improving it, and started wondering if there was a systematic way to keep the data set redundant-free, effective, and (if possible) relatively small.

The space of all backgammon positions is huge[7], so you can't expect the NN to give the exact value for every position. The NN does what every human does - encapsulate a set of concepts which enables it to make good move selection and cube decisions. Given that, how is it possible to help the net develop those concepts?

One way to control redundancy is to take an incremental approach. Instead of adding random positions, add only positions which the net errs on, and at regular intervals stop to train on the augmented data set.

This is swell assuming someone can tell the net which is the right move. Fortunately, due to the nature of backgammon we have such a player at our disposal - the same net at 2ply!. If 2ply plays a better move than 0ply, two positions are added to the training set - the position after the 0ply move (the lesser move) and the position after the 2ply move (the better one) - so that the net can develop concepts to select the better moves. An added benefit of using 2ply is that they are much faster than rollouts.

This approach was used to train both contact and crashed nets, and resulted in 0ply reaching around 1930 on FIBS.

Poor Cube (to say nothing of 1ply)

Training on move selection only had an adverse effect - the absolute error could get very high since only the relative ordering mattered. This showed up in several ways,

After realizing this I began adding positions whose 0ply cube action was different than when using 2ply evaluations. At first this was done only at -7,-7 away, but later I realized this checks only positions near the doubling point of that particular score, and added other scores, such as (-2,-3), (-4,-7) and (-25,-25), which covers a wide range.

The Race Net

Neil Kaz has raised the issue of the accuracy of GNU race net. On the one hand, race situation seem much simpler than contact. On the other hand, selecting a move is difficult since (typically) there are many moves available, all sharing the same pip count, and the top moves are often separated by a very small amount.

While the crashed net was my first target, I switched to working on the race net. There is no particular problem with the race net quality - it outperforms the other nets (contact and crashed) by a large margin. I thought it worthwhile because the techniques developed are general and will be used for other nets, and it is easier to develop on races since all operations are faster (net is smaller), and the availability of a reasonable independent evaluator (the One Sided Race, or OSR).

However, I admit being quite surprised in achieving 50% reduction in race error rate, without changing inputs or net size. It might be that further improvement is possible.

Analysis of race benchmark

A benchmark of around 100K positions was generated, taken from both actual play and self play. The results give a measure of the current race net error and can be used to compare performance of other race evaluators as well.

Precisely there were 111008 positions, of which 13697 turned out to be non interesting, i.e. positions where all moves have equal equity. That left 97311 interesting positions.

Move selection error as given by race rollouts
0ply 1ply 2ply OSR(1296)
Avg err per move 0.00101 0.00078 0.00030 0.000515
% of err moves 22.8% 20.4% 13.6% 14.9%
Notes: In order to get a notion about the cube action measure, I dug out the original (version 0.0) race net code and computed it's error as well.

105573 Positions were checked. The first column is the number of positions that had an error at some score, the last is the error averaged over all positions and scores.

net # errors %errors Average error
current 6573 6.2% 0.000095
0.0 12821 12.1% 0.004881
The 0.0 0ply move error is 0.00342 (30828 of 97311, or 31.7%). So, the current net move error is 1/3 of the original, and it's cube error indicator is 51 times smaller.

While searching for a better race net the issue of poor 1ply performance came up again. I found out a large chunk of the error was due to a small number of positions with huge blunders. For example,


Move number 1: Red to play 32






Pipcounts: Red = 74, White = 3

The 1ply evaluations are an average of positions, each non interesting from a play point of view since all moves have the same equity, so that they never enter the training data. One has to explicitly look for those kinds of positions and add them to improve 1ply performance. Alas, they do degrade the 0ply evaluations, so one has to make a compromise.

Benchmark update using the large 12 point database

The benchmark has been updated using the large 12pt one sided race database. All positions in this range have been re-evaluated using the database. This gives an exact error rate for those positions. A further improvement (not done yet) is to re-roll the other positions with truncation at the database, which will further reduce the rollout noise, which I believe is still making the absolute error rates higher than they really are.

Error rates using the new benchmark

race nets benchmark results
net ply error rate n errors %error  cube action measure
old 0ply 0.0010825 21704/96620 22.5% 9.538e-5
new 0ply 0.0005888 6847/96620 17.44% 3.720e-5

New net performance against 12 point database
ply error rate n errors %error
0ply 0.0002539 3231/34165 9.46%
1ply 0.0003226 3418/34165 10.00%
2ply 0.0000209 1255/34165 3.67%

New net performance against all databases range at 0 ply
range error rate n errors %error
7 pt 0.0000651 157/3994 3.93%
8 pt 0.0001258 434/9463 4.59%
9 pt 0.0001763 925/14847 6.23%
10 pt 0.0002207 1673/21304 7.85%
11 pt 0.0002437 2456/27879 8.81%
12 pt 0.0002539 3231/34165 9.46%
other 0.0007720 13616/62455 21.80%
total 0.0005888 6847/96620 17.44%

2ply Vs. OSR racing

Here is a closer examination of 10 random positions where either 2ply or OSR differ from benchmark rollout.







Method Move 1296 rollout original error12960 rollout error
20/14 13/10 100+9.71 0.0 100+11.40.0
2ply 20/11 100+10.15 0.0044 100+11.6 0.0019
osr(1296) 13/7 8/5 100+10.3 0.0059 100+12.0 0.0065
osr(12960) 20/14 8/5 100+10.2 0.0046 100+11.96 0.0057

Verdict: Result stands, but 2ply error is smaller.







Method Move 1296 rollout original erroreval to bearoff true error
13/8 7/6 100+66.7 0.0 100+68.8
2ply
osr(1296)
10/5 7/6 100+66.9 0.0023 100+68.90.0009

Verdict: Result stands, but true error is smaller.







Method Move 1296 rollout original erroreval to bearoff true error
21/18 12/10 100+66.6 0.0100+65.63 0.0
2ply
osr(1296)
21/16 100+67.0 0.0042100+65.45 0.0018

Verdict: Result stands, but true error is smaller.







Method Move 1296 rollout original error12960 rollout error
3/off 3/1 0+49.75 0.00+46.85 0.00097
2ply
osr(1296)
3/off 2/off 0+49.12 0.00550+46.95 0.0

Verdict: Rollout obviously wrong.



Move number 1: Red to play 32






Method Move 1296 rollout original error12960 rollout error
19/16 13/11 100+61.91 0.0100+63.58 0.0
2ply 19/16 13/11
osr(1296) 13/10 7/5 100+62.78 0.0087100+64.73 0.0086

Verdict: Result stands. OSR fails on this position.







Method Move 1296 rollout original erroreval to bearoff true error
9/5 100+44.54 0.0100+44.19 0.0102
2ply 16/12 100+45.01 0.0047100+43.16 0.0
osr(1296) 9/6 16/15 100+44.74 0.0020100+43.38 0.0022

Verdict: Rollout very wrong. 2ply gets the answer.







Method Move 1296 rollout original erroreval to bearoff true error
14/7 100+28.76 0.0100+29.03 0.0
2ply 14/7
osr(1296) 14/10 12/9 100+29.49 0.0072100+29.51 0.0047

Verdict: Result stands. but OSR error is smaller.







Method Move 1296 rollout original error12960 rollout error
13/10 7/6 91.46+0 0.090.88+0 0.0
2ply 13/12 7/4 91.65+0 0.0036 91.05+0 0.0036
osr(1296) 13/10 7/6

Verdict: Result stands. 2ply wrong, OSR right.







Method Move 1296 rollout original error12960 rollout error
10/5/ 8/5 66.07 0.066.41 0.0016
2ply
osr(1296)
10/7 10/5 66.27 0.0040 66.33 0.0

Verdict: Rollout wrong.







Method Move 1296 rollout original error12960 rollout error
7/2 14.59 0.0 14.46 0.0
2ply 7/2
osr(1296) 7/5 7/4 14.79 0.0040 14.60 0.0028

Verdict: Result stands.

Conclusion

Based on this very small sample, I see no reason to doubt the validity of the benchmark result that 2ply is better than OSR. There is not one case that the rollout ruled against the OSR that is reversed by a longer rollout or exact evaluation.

In addition it seems that the actual error rate might be significantly smaller, due to rollout noise.

The Crashed Net

The crashed net evolved following my attempt to investigate GNUbg poor performance in backgame situations. Some backgame positions reach either a hopeless race, or a late hit, in which case checker(s) needs to be contained in order to win or save the gammon. GNUbg played many of those ridiculously, making even rollouts useless.

Crashed benchmark positions

A set of 100K crashed positions was collected from games played by gnu bots on FIBS.

Analysis of crashed benchmark

I started by examining 3 Neural Nets,

n16 The current net.
n11 The net used in GNUbg before the introduction of the crashed net or any specific training on crashed positions.
n150Kr-2p A net obtained by collecting 150K random crashed positions from self play, evaluating each position by a 2-ply (using the current net), and training on this data set.
Please forgive me for the dreary net names. I lack the imagination to compete with Linux distributions.

Here is how did the three nets fare -

Net Average Move Error
n16 0.011004
n150Kr-2p 0.010584
n11 0.011872
It is surprising how small the differences are. This clearly illustrates the difficulty in judging merits of different nets. Perhaps a solitary number can't describe the situation adequately? I sought to divide the crashed class into meaningful sub-classes. This is what I tried,

category description
roll-prime Side on move has a 6 prime.
trapped Side on move is trapped behind a 6 prime.
contain Opponent is ahead in the race, and should be contained. Side on move has at least 6 checkers in front of one of opponents checkers.
escape Opposite of contain. Side is ahead in the race and will win if he escapes.
scramble Side on move is well ahead in the race, and wants to break contact without getting hit.
must-hit Opposite of scramble. Opponent is ahead in the race, and side (lacking the numbers to contain) needs to hit in order to stay alive.
other anything else.

And those are the errors per category,

category %data n16 n150Kr-2p n11
roll-prime 14.22% 0.00590 0.00736 0.00755
trapped 3.92% 0.01055 0.00785 0.00768
contain 10.52% 0.01771 0.02319 0.02139
escape 11.23% 0.01502 0.01084 0.01140
scramble 26.78% 0.00704 0.00475 0.00825
must-hit 28.84% 0.01332 0.01337 0.01494
other 4.50% 0.01039 0.00980 0.00986

My interpretation of the numbers is that n16 is indeed better in containment situations, but at the expense of other situations. Especially troubling is it's poor ``escape'' performance, which means part of it's better rollout results might be due to misplaying the escaping side.

In conclusion, no matter how I tried to twist it, the existing crashed net was not a big improvement, maybe no improvement at all. It really shows how important it is to use an independent, real life benchmark, especially to overcome personal biases.

References


1. Currently N = 5

2. Actually, as a further optimization, only the top 20 moves (as selected by 0 ply) are evaluated by the 2-ply

3. Move selection checks only the relative accuracy

4. If you run windoz, Maybe you got the familiar blue screen of death, and if you run Linux, perhaps your motherboard got fried from being up for sooo long

5. I am using very small, 5 hidden nodes filtering nets

6. Or is it TD(0), I always get confused

7. Even if we take only ''positions that can be reached from reasonable play'', assuming you know what reasonable is.

8. which was using the race network. This did not work at all, since the race net was not trained for those positions at all!