July 17, 2005. Today I finished building the final slice of the 10-piece endgame database for English checkers. Ironically, this final 10-piece slice is not only the most useful, since it is the set of positions containing 5 checkers vs. 5 checkers (i.e. no kings), but also the smallest. At 4 billion positions it is smaller than almost all of the 9-piece slices, and even smaller than a majority of the 8-piece slices. It is also the fastest slice to build, taking only a few hours of computer time, as compared to the largest slice, which occupied 4 computers working in parallel for 2 months. If only this all-checker slice could be built first and put to use right away while the other, more king-heavy slices were being built... But backward chaining works in the other direction. The win/loss/draw values for the most advanced positions have to be calculated first, and then the values for positions that play into those can be computed. It's a bit like building a pyramid. You cannot put that final little pointy stone at the top until you've placed every row of stones beneath it in succession, starting with the base.
I started building the database on August 2, 2004. That's 11-1/2 months of calendar time, but it was also 4.2 years of computer time. When I started I had only a rough idea of how much computer time it would take. In 2003 I had built the 9-piece database in 6 computer-months. The 10-piece db has 4.3 times as many positions, and if it could be built at the same rate it would take slightly longer than 2 computer-years. I knew it would not be that easy, because I had observed a progressive slowdown as each larger db was built -- the 9-piece positions built more slowly than the 8-piece, and so on. My goal was to build it in about a year, with an upper limit to my patience of about 15 months. Prior to starting the 10-piece build I had 2 computers that I had been using for various checkers projects, including the 9-piece build. One was an aging 2.4GHz P4, and the other an Athlon XP2800+, both with 2gb of ram. To each of these I added a pair of 120gb hard drives, and in July 2004 I built a third computer from parts, a 3.2GHz P4 also with 2gb ram, and I was ready to start. But after the first month of building it was apparent that 3 computers were not going to be enough, so I ordered the parts to build yet another computer, this time an Athlon XP2600+, and soon had 4 of them working on the build.
These 4 computers all sit underneath a desk in my home office, which is a small 8x13 foot room at one end of my house. Also in this room are my wife's computer, a small television, a couple of lamps, and various computer accessories like the scanner, printer, dsl modem, and network router. There are two displays for the 4 machines under my desk: a KVM switch multiplexes my keyboard, video monitor, and mouse among several of the computers, and a second monitor was set up so that my youngest daughter could use one of the build machines for her computing needs. I ran the database build program at low priority on all 4 machines, and it usually did not interfere with running other programs except to make them run a bit more slowly. With all these heat generators in a small room, you can imagine that it gets warm in there, and it certainly does! For the coldest months of winter this is rather nice, but in summer it is a serious problem to keep the room from becoming oppressively hot. Even in the winter, if the door to the room is accidentally left closed the temperature climbs above 90F.
My youngest daughter is starting college this fall, and since that coincides with the time that the 10-piece database would be finished, I had expected that she would take one of these computers with her to use at college. But she has decided to go to a school that requires all the students to have a laptop computer, so unfortunately she cannot use one of these machines, and instead an expensive laptop had to be purchased :-(
I ran the build cluster using the 4 computers through the fall of 2004 and into winter. Occasionally I would look over at my wife's machine and think how nice it would be to run the build program on it as well; it seemed such a shame to have it sitting there idle most of the time. Unfortunately it was an older machine with a relatively slow processor and not enough ram or disk space to be useful to me. But 6 months into the 10-piece build I thought I was beginning to see a slowing trend, and the price of computer parts had dropped some, so I decided it was time to get that machine doing some useful work. I ordered a new cpu, motherboard, and 250gb disk drive for it, and on a weekend while my wife was out for the day I turned her old machine into a modern workhorse! I 'stole' a 1gb memory module from one of the other build machines and used it in my wife's new motherboard. With the build scheme I am using, most of the 10-piece database can be built efficiently using only 1gb of ram, but at least one machine needs to have 2gb or the largest database subdivisions will build much more slowly. I didn't tell my wife that she had a new machine under the hood. I simply explained that she could leave my program running in a minimized window and "it shouldn't trouble you", and indeed it did not. The only hint that something was different came a few weeks later when she went to use the flash memory card reader to view pictures from the digital camera. She had written some notes on how to do this, and her instructions said to use Windows Explorer and click on the F: drive. But when I rebuilt her machine I had replaced two small hard drives with one much larger one, and the card reader had moved from F: to E:, so that tripped her up.
The 10-piece build usually ran smoothly and rarely needed any assistance from me. Each machine ran an instance of the program, and they communicated and shared results over a 100mbit Ethernet LAN. Once in a while I would stop the program on one of the machines to run something else, install a Windows update, etc. Usually the programs ran for weeks at a time without interruption. About 9 or 10 times during the past year I had AC power line interruptions that lasted long enough to reset all the computers. I did not have a UPS, so when this happened I typically lost about half a day of computing, although it happened once while I was away on vacation and on that occasion I lost 3 or 4 days. I experienced two hardware failures during the build. The 3.2GHz P4 developed a strange virtual memory problem after a few months. I guessed that it was the motherboard, replaced it, and never had that trouble afterwards. The other problem was difficult to track down. In January I was in the home office one evening and heard a single loud beep. It sounded like one of those error beeps you might get from the BIOS when the processor overheats. I checked each of the cpu temperatures and all were reasonable, and there were no more beeps that night, so there wasn't much that could be done except let things continue to run and hope for the best. A few days later the same thing happened, and soon I was hearing this beep once or twice each day, but it was too infrequent for me to discern which computer it was coming from, and nothing appeared to be wrong with any of them. Eventually one of the build machines reported an exception, and then I focused my attention on it and discovered that the 12 volt output of the power supply was not regulating very well; the BIOS was apparently beeping whenever the voltage dropped below its acceptable threshold. I replaced the power supply and had no further problems with it.
Verification is an important part of building endgame databases. When something takes years to compute, the probability that an error will occur sometime during the process is not insignificant. Of course one or two errors in trillions of positions is not so bad, but the danger is that an error can propagate to all the positions that play into it. This can effectively make all computations worthless after the first error. It would be really depressing to compute the entire 10-piece database and then discover that most of it was garbage! Nearly everyone I know who has computed endgame databases for checkers can relate some experience with random errors. The best way to verify the data is to have another independently built database to compare it against. But if you are the first to compute a database then you would have to compute it twice, and this doubles the effort. A somewhat quicker method is to visit each position in the database and verify that its value is consistent with the values of the positions it immediately plays into. It is still possible to have errors when all the data is self-consistent by this test, but the probability is small. This type of verification adds about 40% to the total build time. For the 10-piece English checkers database, I was fortunate that Jonathan Schaeffer had built it before me, so I was able to verify my data against his. Jonathan was happy to have an independent confirmation of his data, and I was happy to be able to build the data without needing the two additional computers it would have taken to include self-consistency checks and still finish the build in one year. After building each slice, I emailed Jonathan the win/loss/draw counts, which he compared to the counts extracted from his data. For the first 6 months we were in perfect agreement, but after sending the counts for the slice with 2 black men, 3 black kings, 2 white men, and 3 white kings, I received this message from Jonathan:
"Bad news. We differ. You have one more win -- we have one more draw."
This was disturbing news, but as we worked to narrow down the location of the error, it was a relief to know that this was very likely an isolated error in a single position and had not propagated to any of the positions that played into it. If it had, and the error was in my data, it would have meant about 1 month of my work was worthless; but if the error was in Jonathan's data, then he would have had to recompute half of his database and redo all the subsequent work he had done towards solving the 3-move openings. We narrowed the discrepancy down to a single leading-rank subdivision of the slice. I recomputed this subdivision, a 3-day task, and this second time the counts agreed with Jonathan's. From my log files I was able to trace the error to a particular machine. The counts in the log had been correct after the build, but an error had occurred during the compression pass. This was my first experience of a random error in any checkers database building, but Jonathan was familiar with these, having experienced not only similar errors during compression, but also errors during file transfers on a LAN, and random bit errors in ram. Before this incident I was beginning to wonder if verification was really necessary, but now I have no doubts. As a test of the self-consistency method, I ran a self-consistency check on the data containing the error, and it correctly detected it.
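To make the self-consistency idea concrete, here is a minimal sketch of such a check in C++, assuming win/loss/draw values stored from the perspective of the side to move. The Position type and the lookup() and successors() hooks are illustrative placeholders, not KingsRow's actual interfaces.

    #include <vector>

    // Placeholder board representation; the real program's differs.
    struct Position { /* bitboards, side to move, etc. */ };

    enum Value { WIN, LOSS, DRAW };

    // Assumed hooks into the database and the move generator.
    Value lookup(const Position& p);                       // stored value, side-to-move relative
    std::vector<Position> successors(const Position& p);   // positions p plays into

    // Check that the stored value of p agrees with the stored values of the
    // positions it immediately plays into.  Because values are relative to
    // the side to move, a successor scored LOSS is a win for the player
    // moving at p.
    bool is_consistent(const Position& p)
    {
        std::vector<Position> succ = successors(p);
        if (succ.empty())
            return lookup(p) == LOSS;       // no legal moves loses in checkers

        bool any_loss = false, any_draw = false;
        for (const Position& s : succ) {
            Value v = lookup(s);
            if (v == LOSS) any_loss = true;
            else if (v == DRAW) any_draw = true;
        }
        if (any_loss) return lookup(p) == WIN;
        if (any_draw) return lookup(p) == DRAW;
        return lookup(p) == LOSS;           // every reply wins for the opponent
    }

Running this test over every position in a subdivision will usually expose a corrupted value, since a bad entry almost always disagrees with the values of the positions around it.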
Designing a program to build checkers databases for up to 8 pieces is a task that is readily managed on today's 32-bit PCs, which allow a single program to use up to 2gb of ram (3gb in some versions of Windows). The basic scheme was described in a paper by the team that worked on the Chinook program, and involves subdividing the database into subsets based on piece counts and the ranks of the most advanced black and white checkers. With a maximum of 8 pieces these subsets can fit entirely in ram as they are built, allowing fast random access and quick build times. The 8-piece database takes about 2 weeks of computing using just one of my build machines. However, things get more difficult when you try to go beyond 8 pieces. If the same subdivisions by leading rank are used, then the largest 9 and 10 piece subdivisions are much too large to fit in the available ram. Complex disk caching schemes must then be used, or else accessing the data on disk becomes horribly slow. With these larger data sets, caching is less effective at all levels -- the cpu's caching of dram, disk caching within the operating system, as well as the database caching in the build program.
Another speed hit in going beyond 8 pieces is that the number of positions in these subdivisions is larger than can be represented in a 32-bit integer. Compilers allow 64-bit integer variables to be defined, but the basic arithmetic operations on these are not intrinsic to 32-bit computers and are implemented through calls to library functions. This makes the process of iterating through all the positions in a subdivision run more slowly. When I built the 9-piece database, I changed the definition of a subdivision to be based on piece counts and the squares of the most advanced black and white checkers. This had the effect of making smaller subdivisions. The typical slice has at least one black man and one white man, so instead of 49 combinations of black and white leading ranks (7 black rows x 7 white rows), there are nearly 784 combinations of leading squares (28 black leading squares x 28 white leading squares, less the combinations where they both land on the same square; in cases where either color has more than one man, the 28 is reduced further because the leading square cannot be the first square). With this scheme all subdivisions fit within 2gb ram buffers, and they all have fewer than 2^32 positions, allowing 32-bit integers to be used for the position indexing. On the downside there are a lot more subdivisions with this method, and many of them are trivially small. There are 71,168 subdivisions of 2 through 9 pieces using leading-square indexing, as compared to 5,364 subdivisions using leading-rank indexing. Each subdivision is stored on disk as 4 files for the win/loss/draw data -- a data file and an index file for each color. The moves-to-conversion data is another 4 files per subdivision. When the 9-piece database was completed there were approximately half a million files in the set. The NTFS file system in Windows XP seems to cope with this number of files without any trouble, but Windows Explorer is a different story. Any attempt to browse the directory tree of this database results in Explorer going completely comatose for 3 to 4 minutes!
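As a rough illustration of where the "nearly 784" figure comes from, the small C++ program below counts the leading-square combinations for a slice with exactly one black man and one white man. The square ranges (1-28 for one color's men, 5-32 for the other's, in the standard 1-32 numbering) are my assumption about the layout; the point is simply 28 x 28 choices minus the pairs that would put both men on the same square.

    #include <cstdio>

    // Counts leading-square combinations for a slice with one black man and
    // one white man.  Men never stand on their own promotion rank, which
    // leaves 28 legal squares for each color (assumed here to be 1-28 and
    // 5-32 in the standard numbering).
    int main()
    {
        int combinations = 0;
        for (int b = 1; b <= 28; ++b)           // black man's square
            for (int w = 5; w <= 32; ++w)       // white man's square
                if (b != w)                     // two men cannot share a square
                    ++combinations;
        std::printf("%d leading-square combinations\n", combinations);   // prints 760
        return 0;
    }

With more than one man per side the count drops a little further, since the leading square can no longer be the rearmost one, which is why the figure is "nearly" 784 rather than exactly that.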
While subdividing by leading square worked well to build the 9-piece database, its advantages are lost when trying to go to 10 pieces. Many of the 10-piece subdivisions require buffers much larger than 2gb and have more positions than fit in 32-bit integers. I needed to look for ways to subdivide the slices even further. I experimented with several ideas, but the 10-piece slices are so huge that no simple scheme could make them small enough. My solution was to layer the subdivisions in several tiers. The first tier subdivides by piece counts and leading ranks, the original scheme used by the Chinook team. The second tier further subdivides by the configuration of the checkers on the leading rank. As an example, if there are 2 black checkers in a slice, then there are 10 possible configurations of these checkers on the leading rank (4 configurations with 1 checker plus 6 configurations with both checkers). The combination of these two tiers divides things into many more pieces than the leading-square scheme that I used to build the 9-piece database. A slice with 2 black men has 70 combinations (7 ranks times 10 configurations per rank), and if white also has 2 men then there are 4900 (70 x 70) subdivisions in the slice. This is an improvement over the 784 combinations with leading-square indexing, but it is still not enough: many 10-piece subdivisions are still too big! I added a third tier which subdivides by the penultimate rank of checkers, i.e. the rank of the most advanced checker that is not on the leading rank. This divides most of the largest subdivisions into 36 smaller pieces.
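The second-tier counts are just binomial coefficients over the 4 dark squares of a rank; here is a quick illustration (not the build program's code) that reproduces the 10, 70, and 4900 figures above.

    #include <cstdio>

    // Number of ways to place k checkers on the 4 squares of a rank.
    static int choose4(int k)
    {
        int c = 1;
        for (int i = 1; i <= k; ++i)
            c = c * (4 - i + 1) / i;
        return c;
    }

    int main()
    {
        // A slice with 2 black men: the leading rank holds 1 or 2 of them.
        int configs = choose4(1) + choose4(2);          // 4 + 6 = 10
        int per_side = 7 * configs;                     // 7 possible leading ranks
        std::printf("%d configs per rank, %d per side, %d subdivisions\n",
                    configs, per_side, per_side * per_side);   // 10, 70, 4900
        return 0;
    }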
With these 3 tiers just about everything can be built with 2gb of ram and 32-bit indices, but there are still a couple of problems. The biggest problem is that there are almost 3 million subdivisions, and approximately 20 million files in the completed database. That is just too many; the build program would run out of file handles and choke on all those files. Also, the subdivisions for the smaller subsets with 8 or fewer pieces are much too small to be used efficiently. The solution is to use the second and third tiers selectively, only when needed to make the subdivisions smaller. I set a threshold of 900mb as the criterion for deciding whether additional tiers are used. If a subdivision is smaller than 900mb after the first tier (division by piece counts and leading ranks), then no further subdivision is done. If it is larger, then the second tier is added, and if it is still larger the third tier is added as well. This makes the number of subdivisions a manageable 64,369, which is slightly less than the number for 9 pieces using leading-square subdivisions.
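The tier-selection rule reduces to a few lines. This is a sketch of the decision only, with illustrative names; the size estimates are assumed to come from elsewhere in the build program.

    #include <cstdint>

    enum Tiers {
        TIER1_ONLY,     // piece counts + leading ranks
        TIERS_1_2,      // + configuration of checkers on the leading rank
        TIERS_1_2_3     // + penultimate rank of checkers
    };

    // Apply the 900mb rule: add tiers only until the subdivision is small
    // enough to build in ram.  The two arguments are assumed estimates of
    // the subdivision's buffer size under each scheme.
    Tiers choose_tiers(uint64_t size_after_tier1, uint64_t size_after_tier2)
    {
        const uint64_t threshold = 900ull * 1024 * 1024;    // 900mb
        if (size_after_tier1 < threshold)
            return TIER1_ONLY;
        if (size_after_tier2 < threshold)
            return TIERS_1_2;
        return TIERS_1_2_3;
    }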
A remaining problem was the 10-piece all-kings slice. It requires a ram buffer of almost 4gb, and the number of positions is greater than 2^32, so it needs 64-bit integers. Since this slice has no men, it cannot be made smaller by the additional tiers. My build program can build subdivisions that do not fit completely in ram, but at a much slower speed. In order to build this slice within the confines of a 2gb buffer, I used the symmetry along the double-corner diagonal of the checkerboard to cut the storage requirements in half. This idea was suggested to me by Martin Fierz and it worked well. The idea is that for every position of all kings, there is a mirror position with the same win/loss/draw value, the mirror being obtained by flipping the kings to the opposite side of an imaginary diagonal line that runs between the two double corners. Since all positions have the same values as their mirror images, only one of each mirror-image pair needs to be stored. A similar symmetry exists about the single-corner diagonal, and the storage requirements could be cut in half yet again if it were also used, but the programming to take advantage of it is more effort and I did not need the additional size reduction, so I did not use it. Using this mirror symmetry, the 10-piece all-kings slice just fit in 2gb of memory, but the number of positions still required 64-bit integers. I temporarily recompiled the build program to use 64-bit indices and built this slice. After it was built, I subdivided it into several pieces, each smaller than 2^32 positions, so that lookups on this slice during the rest of the 10-piece build could be done using 32-bit indices.
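Here is a sketch of how such a mirror can be applied to an all-kings position, using bitboards and a 0-31 square numbering chosen just for this example; KingsRow's actual board numbering and index function surely differ in detail, so the coordinate mapping here is only an assumption.

    #include <algorithm>
    #include <cstdint>
    #include <utility>

    // Convert between a 0-31 dark-square index and (row, column) on an 8x8
    // board, 4 dark squares per row.  This layout puts the two double
    // corners at the ends of the anti-diagonal.
    static void sq_to_rc(int s, int& r, int& c)
    {
        r = s / 4;
        c = 2 * (s % 4) + (r % 2 == 0 ? 1 : 0);   // dark squares only
    }

    static int rc_to_sq(int r, int c)
    {
        return 4 * r + (c - (r % 2 == 0 ? 1 : 0)) / 2;
    }

    // Reflect a square across the diagonal joining the two double corners.
    static int mirror_square(int s)
    {
        int r, c;
        sq_to_rc(s, r, c);
        return rc_to_sq(7 - c, 7 - r);
    }

    // Reflect a whole bitboard of kings.
    static uint32_t mirror_kings(uint32_t kings)
    {
        uint32_t m = 0;
        for (int s = 0; s < 32; ++s)
            if (kings & (1u << s))
                m |= 1u << mirror_square(s);
        return m;
    }

    // A position and its mirror share the same win/loss/draw value when no
    // men are present, so only the "smaller" of each pair needs storing.
    static std::pair<uint32_t, uint32_t>
    canonical(uint32_t black_kings, uint32_t white_kings)
    {
        std::pair<uint32_t, uint32_t> p(black_kings, white_kings);
        std::pair<uint32_t, uint32_t> m(mirror_kings(black_kings),
                                        mirror_kings(white_kings));
        return std::min(p, m);
    }

Storing only the canonical member of each mirror pair is what cuts the buffer requirement roughly in half.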
One of the nice properties of the second-tier subdivisions (division by configuration of checkers on the leading rank) is that all subdivisions with the same number of checkers on the leading rank and in the same first tier can be built in parallel. When this is combined with the other possibilities for parallel building of subdivisions that arise from the first-tier properties (these are well described in papers written by the Chinook team), the opportunities for parallel builds are excellent. All of my build machines were occupied nearly 100% of the time. At any given time there were on average about 50 subdivisions that could be built (meaning that all of the subdivisions they depend on had been completed), and during the middle months of the build period this number was at times above 100. It only dropped below 20 for a few weeks at the very beginning of the 10-piece build and again during the last week. I believe that by using about 20 of the dual-core processors that have recently become available, the 10-piece database could be built in 5 to 6 weeks by running 2 instances of the build program on each processor.
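For readers curious how the machines know what is buildable, the bookkeeping amounts to tracking, for each subdivision, how many of the subdivisions it looks up are still unfinished. The sketch below is my own illustration of that idea, not the actual program, which coordinated the work across machines over the LAN.

    #include <cstddef>
    #include <deque>
    #include <vector>

    // A subdivision becomes ready to build once every subdivision its
    // positions play into has been completed.
    struct Node {
        int unfinished_deps = 0;               // successor subdivisions not yet built
        std::vector<std::size_t> dependents;   // subdivisions waiting on this one
    };

    struct Scheduler {
        std::vector<Node> nodes;
        std::deque<std::size_t> ready;         // buildable right now

        void init()
        {
            for (std::size_t i = 0; i < nodes.size(); ++i)
                if (nodes[i].unfinished_deps == 0)
                    ready.push_back(i);
        }

        // Called when a machine finishes building subdivision i.
        void mark_done(std::size_t i)
        {
            for (std::size_t d : nodes[i].dependents)
                if (--nodes[d].unfinished_deps == 0)
                    ready.push_back(d);
        }
    };

The size of the ready queue corresponds to the "about 50 subdivisions that could be built" mentioned above.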
These 64,369 subdivisions were a necessary complexity to build the database, but there is no need to keep this complex format when the database is used with KingsRow. When I built the 9-piece database, after each slice was built I re-indexed it so that each slice became a single subdivision. This simplifies the function that converts a position to an integer index, and no large tables are needed to prevent impossible positions where both a black man and a white man are mapped onto the same square (a difficulty with many other indexing schemes). While this format was ok, there was an aspect that I thought could be improved. The 9-piece subdivisions were very large files that required 64-bit indices because of the large number of positions. When the database is used for engine lookups, all of the indexing data is read into ram buffers before any searches are performed. This indexing data for the 9-piece database consumed almost 300mb of ram, and for the 10-piece database it would consume nearly 2gb. This is a huge overhead. On a 3gb machine only 1gb would be left for database cache buffers. If the indices could be made 32 bits then this overhead would be cut in half. As I was thinking about ways to do this, I received a suggestion from Ed Trice which was simple and effective. His idea is to subdivide each slice into fixed-size subdivisions of 2^32 positions. To find which subdivision of a slice a position is in, the 64-bit slice index is computed for the position, and this number is divided by 2^32. The integer quotient identifies which subdivision contains the position, and the 32-bit remainder from the division becomes the index of the position within that subdivision. My only change to this was to use 2^31 as the size of the subdivisions so that I didn't have to test for overflows in some 'for' loops in the code. After each slice was built I re-indexed it on the newer P4 that has Intel's hyperthreading feature. The re-indexing operation runs between 12 and 15 times faster than building, so this one machine was able to keep up with the flow of newly built slices from all 5 build machines. With both the build program and the re-index program running at the same time on the P4, each ran more slowly than it would have without the load from the other, but not at half speed; more like 3/4 speed. The hyperthreading was useful at these times.
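The split is just a quotient and a remainder. Here is a small sketch using the 2^31 size described above, with illustrative names.

    #include <cstdint>

    struct DbLocation {
        uint32_t subdivision;   // which fixed-size piece of the slice
        uint32_t offset;        // index of the position within that piece
    };

    // Split a 64-bit slice index into a subdivision number and a 31-bit
    // offset.  Keeping the pieces at 2^31 positions means the offset always
    // fits comfortably in a signed 32-bit 'for' loop counter.
    DbLocation locate(uint64_t slice_index)
    {
        DbLocation loc;
        loc.subdivision = static_cast<uint32_t>(slice_index >> 31);
        loc.offset      = static_cast<uint32_t>(slice_index & ((1u << 31) - 1));
        return loc;
    }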
I would like to acknowledge several people who helped me in this project. Jonathan Schaeffer verified my data as it was being built, which was a big help because it saved me from a lot of additional computing time. He also invented some of the key algorithms that are used to build checkers databases. His ideas to compress the data with runlength encoding and to remove capture and threatened-capture positions to improve the compression are the main reasons that these databases can be accessed in realtime by a checkers engine during a search. Martin Fierz gave me a jump start in building endgame databases when he sent me the code he used to build his 8-piece database. He made a nice improvement to Schaeffer's original compression scheme by storing only 4 positions in a byte instead of 5 and using the extra codes for runlength tokens. He helped me work through some problems when I was designing a program to build the 9-piece database, and his idea to use mirror symmetry to cut the size of the 10-piece all-kings slice allowed me to build it in only 2 days instead of several weeks. Ed Trice gave me the idea for the scheme I used in the runtime searchable version of the database to save memory using 32-bit indices.