Scrabble® [1]Back to Contents Prev: [2]Infrastructure Architecture
   Next: [3]Move Generation Algorithm
     _________________________________________________________________
   
                            Lexicon Representation
                                       
   The problem of representing a lexicon of words crops up in numerous
   applications; two major examples are spelling checkers and compiler
   lexical analyzers. The lexicon in a Scrabble game serves two purposes:
   
    1. Validation of words played by human players.
    2. A source for words playable by computer players.
       
A Naive Approach

   Assuming a text file listing the words in the lexicon, perhaps the
   most straightforward approach is simply an alphabetically sorted array
   of strings. While this permits rapid validation of words (via a binary
   search for example), it does not particularly facilitate the finding
   of playable words given a game situation. Furthermore, the memory
   footprint of such a representation is significant (a 150000 word
   lexicon of 2-15 letter words occupies 1.6 MB), while not holding the
   entire lexicon in memory would result in constant I/O overhead.
   
Finite Automata Approaches

   The use of finite automata to represent word lexicons is a
   well-established technique. Conceptually, the simplest automata for
   this purpose takes the form of a trie.
   
  Tries
  
   A trie is a tree-based data structure that can be used to store data
   items composed from a finite alphabet, such as ASCII strings
   ([4]Fredkin discusses tries in detail). Edges in the tree represent
   symbols from the alphabet; data items that begin with the same
   sequence of symbols share the same path in the tree. Nodes at end of a
   data item's path are marked as terminal. An example will help
   illustrate; consider a trie representing the words BEE, BEER, BET,
   PEE, PEER and PET:
   
   Example of a trie
   
   Word validation using a trie is straightforward; simply start at the
   root and attempt to find a path spelling out the word that ends in a
   terminal node. If such a path exists, the word is in the lexicon, else
   it is not. But how does a trie help in finding playable words?
   
   Finding a playable word in Scrabble essentially involves searching the
   lexicon, subject to constraints imposed by available letters on the
   player's rack and existing letters on the board. A trie is suited to
   this kind of search, because at each node (representing a point
   "within" a word), the available edges represent the letters that can
   possibly lead to a valid word. The constraints of available rack
   letters and existing board letters are used to prune the search of the
   tree as each letter is followed. For example, consider the problem of
   building words that have the prefix AD (which is already on on the
   board). Starting at the root the trie, we follow the path A,D. Now, we
   consider the letters on our rack, and use them to determine which
   edges to continue taking. If we reach a terminal node at any point,
   then we have found a playable word.
   
   The most glaring problem with a trie representation is that it has a
   very large memory footprint, since nodes exist on a per letter basis!
   The only redundancy that is exploited is redundancy in word prefixes.
   Automata theory suggests that the size of the data structure can be
   reduced by converting it to a graph and eliminating equivalent states.
   This process as applied to a word lexicon trie produces what [5]Appel
   and Jacobson refer to as a directed acyclic word graph (DAWG).
   
  Directed Acyclic Word Graphs (DAWGs)
  
   If we consider the example trie given above and merge equivalent
   nodes, we obtain a DAWG that looks like this:
   
   Example of a DAWG
   
   A DAWG provides the same benefits as a trie, but at a dramatically
   lower space cost when applied to a lexicon with a high degree of
   redundancy (such as an English lexicon suitable for Scrabble play),
   since all such redundancy is eliminated. We will see that if the data
   types used to implement the DAWG are chosen carefully, the size of a
   DAWG can be less than 1/3rd the size of its lexicon represented as an
   ASCII file!
   
DAWG Implementation

   One can probably already imagine that generating a DAWG for a lexicon
   of, say, 100000 words, can be a fairly time-consuming process,
   especially in a language with such poor run-time performance as Java.
   There is no particularly compelling reason to dynamically generate the
   DAWG for each game, or indeed for each execution of the server. It was
   decided that the DAWG should be generated once, saved to a file and
   that file loaded by the server on startup. Not only does this minimize
   server startup time, but also permits the DAWG creation code to be
   written in a higher-performance language such as C++.
   
   The DAWG creation code performs the following steps:
   
    1. Reads ASCII files containing the words in the lexicon.
    2. Creates a trie containing all those words.
    3. Applies a graph minimization algorithm to the trie to obtain a
       minimal DAWG.
    4. Persists the minimal DAWG to a binary file in a format that makes
       it easy to recreate in memory on the server
       
   The trie creation process is straightforward; as each new word is
   read, existing edges are followed in the trie and new edges are
   created as necessary. The minimization process is more complicated;
   however, the problem is well-understood, and a number of finite
   automata minimization algorithms are published in the literature. I
   chose to implement one due [6]Aho, Sethi and Ullman, as described in
   section 3.9 of their book Compilers: Principles, Techniques, and
   Tools. Very briefly, the algorithm works by successively partitioning
   states into groups, such that at each iteration, states within the
   same group have not been found distinguishable from each other by some
   input (and are hence still considered equivalent).
   
   The persistence format for the DAWG is as an array of states, and each
   state is represented as an array of edges (hence the DAWG can be
   considered as just a large array of edges). Edges have a letter label,
   a destination index (which indexes to the first edge of their
   destination state), a flag indicating whether the state is terminal,
   and a flag indicating whether this edge is the last edge of the state.
   For a graph of approximately 80000 words, each edge can be encoded as
   a 32-bit word, leading to an almost 3:1 space savings over the lexicon
   in ASCII form! The Java server application simply reads the binary
   file and recreates the array in memory.
   
   The [7]C++ source code for the DAWG creation program is available.
   Please note this is very in-efficient code; the priority was to
   understand the algorithm and to produce a working implementation as
   quickly as possible.
   
   The [8]word lists used as the lexicon source are available, and were
   derived in large part from the [9]Official Scrabble Player's
   Dictionary.
     _________________________________________________________________
   
   [10]Back to Contents Prev: [11]Infrastructure Architecture Next:
   [12]Move Generation Algorithm

References

   1. http://www.ece.uwaterloo.ca/~k2tam/Scrabble/contents.html
   2. http://www.ece.uwaterloo.ca/~k2tam/Scrabble/arch.html
   3. http://www.ece.uwaterloo.ca/~k2tam/Scrabble/movegen.html
   4. http://www.ece.uwaterloo.ca/~k2tam/Scrabble/refs.html#Fredkin
   5. http://www.ece.uwaterloo.ca/~k2tam/Scrabble/refs.html#Appel
   6. http://www.ece.uwaterloo.ca/~k2tam/Scrabble/refs.html#Aho
   7. http://www.ece.uwaterloo.ca/~k2tam/Scrabble/gen.cpp
   8. http://www.ece.uwaterloo.ca/~k2tam/Scrabble/words.zip
   9. http://www.ece.uwaterloo.ca/~k2tam/Scrabble/refs.html#OSPD
  10. http://www.ece.uwaterloo.ca/~k2tam/Scrabble/contents.html
  11. http://www.ece.uwaterloo.ca/~k2tam/Scrabble/arch.html
  12. http://www.ece.uwaterloo.ca/~k2tam/Scrabble/movegen.html