Scrabble® [1]Back to Contents Prev: [2]Infrastructure Architecture Next: [3]Move Generation Algorithm _________________________________________________________________ Lexicon Representation The problem of representing a lexicon of words crops up in numerous applications; two major examples are spelling checkers and compiler lexical analyzers. The lexicon in a Scrabble game serves two purposes: 1. Validation of words played by human players. 2. A source for words playable by computer players. A Naive Approach Assuming a text file listing the words in the lexicon, perhaps the most straightforward approach is simply an alphabetically sorted array of strings. While this permits rapid validation of words (via a binary search for example), it does not particularly facilitate the finding of playable words given a game situation. Furthermore, the memory footprint of such a representation is significant (a 150000 word lexicon of 2-15 letter words occupies 1.6 MB), while not holding the entire lexicon in memory would result in constant I/O overhead. Finite Automata Approaches The use of finite automata to represent word lexicons is a well-established technique. Conceptually, the simplest automata for this purpose takes the form of a trie. Tries A trie is a tree-based data structure that can be used to store data items composed from a finite alphabet, such as ASCII strings ([4]Fredkin discusses tries in detail). Edges in the tree represent symbols from the alphabet; data items that begin with the same sequence of symbols share the same path in the tree. Nodes at end of a data item's path are marked as terminal. An example will help illustrate; consider a trie representing the words BEE, BEER, BET, PEE, PEER and PET: Example of a trie Word validation using a trie is straightforward; simply start at the root and attempt to find a path spelling out the word that ends in a terminal node. If such a path exists, the word is in the lexicon, else it is not. But how does a trie help in finding playable words? Finding a playable word in Scrabble essentially involves searching the lexicon, subject to constraints imposed by available letters on the player's rack and existing letters on the board. A trie is suited to this kind of search, because at each node (representing a point "within" a word), the available edges represent the letters that can possibly lead to a valid word. The constraints of available rack letters and existing board letters are used to prune the search of the tree as each letter is followed. For example, consider the problem of building words that have the prefix AD (which is already on on the board). Starting at the root the trie, we follow the path A,D. Now, we consider the letters on our rack, and use them to determine which edges to continue taking. If we reach a terminal node at any point, then we have found a playable word. The most glaring problem with a trie representation is that it has a very large memory footprint, since nodes exist on a per letter basis! The only redundancy that is exploited is redundancy in word prefixes. Automata theory suggests that the size of the data structure can be reduced by converting it to a graph and eliminating equivalent states. This process as applied to a word lexicon trie produces what [5]Appel and Jacobson refer to as a directed acyclic word graph (DAWG). Directed Acyclic Word Graphs (DAWGs) If we consider the example trie given above and merge equivalent nodes, we obtain a DAWG that looks like this: Example of a DAWG A DAWG provides the same benefits as a trie, but at a dramatically lower space cost when applied to a lexicon with a high degree of redundancy (such as an English lexicon suitable for Scrabble play), since all such redundancy is eliminated. We will see that if the data types used to implement the DAWG are chosen carefully, the size of a DAWG can be less than 1/3rd the size of its lexicon represented as an ASCII file! DAWG Implementation One can probably already imagine that generating a DAWG for a lexicon of, say, 100000 words, can be a fairly time-consuming process, especially in a language with such poor run-time performance as Java. There is no particularly compelling reason to dynamically generate the DAWG for each game, or indeed for each execution of the server. It was decided that the DAWG should be generated once, saved to a file and that file loaded by the server on startup. Not only does this minimize server startup time, but also permits the DAWG creation code to be written in a higher-performance language such as C++. The DAWG creation code performs the following steps: 1. Reads ASCII files containing the words in the lexicon. 2. Creates a trie containing all those words. 3. Applies a graph minimization algorithm to the trie to obtain a minimal DAWG. 4. Persists the minimal DAWG to a binary file in a format that makes it easy to recreate in memory on the server The trie creation process is straightforward; as each new word is read, existing edges are followed in the trie and new edges are created as necessary. The minimization process is more complicated; however, the problem is well-understood, and a number of finite automata minimization algorithms are published in the literature. I chose to implement one due [6]Aho, Sethi and Ullman, as described in section 3.9 of their book Compilers: Principles, Techniques, and Tools. Very briefly, the algorithm works by successively partitioning states into groups, such that at each iteration, states within the same group have not been found distinguishable from each other by some input (and are hence still considered equivalent). The persistence format for the DAWG is as an array of states, and each state is represented as an array of edges (hence the DAWG can be considered as just a large array of edges). Edges have a letter label, a destination index (which indexes to the first edge of their destination state), a flag indicating whether the state is terminal, and a flag indicating whether this edge is the last edge of the state. For a graph of approximately 80000 words, each edge can be encoded as a 32-bit word, leading to an almost 3:1 space savings over the lexicon in ASCII form! The Java server application simply reads the binary file and recreates the array in memory. The [7]C++ source code for the DAWG creation program is available. Please note this is very in-efficient code; the priority was to understand the algorithm and to produce a working implementation as quickly as possible. The [8]word lists used as the lexicon source are available, and were derived in large part from the [9]Official Scrabble Player's Dictionary. _________________________________________________________________ [10]Back to Contents Prev: [11]Infrastructure Architecture Next: [12]Move Generation Algorithm References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12.