From gtoal@admin.vt.com Fri Jun 4 19:41:00 1999
Received: from admin.vt.com (root@admin.vt.com [204.117.188.10])
by gtoal.com (8.8.5/8.8.5) with ESMTP id TAA05938
for ; Fri, 4 Jun 1999 19:40:56 -0500 (CDT)
Received: (from root@localhost) by admin.vt.com (8.8.5/8.7.3) id RAA29502; Fri, 4 Jun 1999 17:17:06 -0500 (CDT)
Date: Fri, 4 Jun 1999 17:17:06 -0500 (CDT)
From: Graham Toal
Message-Id: <199906042217.RAA29502@admin.vt.com>
To: word-game-programmers@mit.edu
Subject: How to find a word in a dawg from a numeric word index efficiently
Status: R
Dear Word Gamers and Dawg Hackers,
Another BFO (Blinding Flash of the Obvious) struck me today...
Eric House wrote a program recently which allows you to scroll
forwards and backwards through a dawg (a vocabulary browser for
the palm pilot - shares the same dawg as his scrabble) and he
was mentioning the difficulty that is caused by not being able
to look up a word in a dawg by numeric index.
I'll switch applications just to make it easier to define the
problem and explain a possible solution:
Let's say you want to send an arbitrary document to someone, and
they have a copy of the same all-encompassing word list that you
have, then you can write a code such that a=1, aa=2, aah=3, aahli=4
etc for all the words in your word list, then just send the code
(ie the ordinal index of the word in that list). However with a
dawg, there is no easy index and to expand those codes back to a list
of words, you'd have to walk the dawg again for every word.
Well, I realised that this is very similar to the problem of
transmitting a code set in a regular compression program, such
as lzw, and the solutions there apply here: you encode each
word as a variable-length bitstream which denotes the choices
you took going down the trie to find the end of the word.
You can either do this in fixed bit fields (eg 5 bits, 4 bits,
3 bits, 3 bits, 1 bit in the example below) *or* ... (today's Blinding
Flash) encode it arithmetically!
For those of you not already familiar with Arithmetic Encoding,
I'll work a full example here on a small data set that I can do by hand:
(The only prior knowlege I assume here is a working knowlege of the DAWG
data structure.)
Let's take an arbitrary word 'super', which in some word dawg is
stored as follows:
The first letter s is branch #5 of 26 choices from the root node
The next letter u is branch #11 of 14 choices from the 's' node
The next letter p is branch #2 of 7 choices from the 'u' node
The next letter e is branch #3 of 5 choices from the 'p' node
The last letter r is branch #0 of 2 choices from the 'e' node
(aside: branch numbers start at 0, so branch #0 would be the first choice
of the two above)
For reasons that will be obvious when we decode this, we build the
value in reverse starting with 'r' and working back over 'repus'.
This is trivially implemented by returning from the recursion of
checking the word in the dawg, from the end node back. Check the
word on the way down, do the arithmetic coding as you return back
up.
So the word /super/ (aka \repus\) would be arithmetically encoded as follows:
Initial Total: 0
first data item is 'r': link #0 in a set of 2
multiply by 2 (or don't, the initial one is a no-op)
Now, add 0 Total: 0
Next, data for 'e' is link #3 in a set of 5
multiply by 5 Total: 0
add 3 Total: 3
Next, data for 'p' is link #2 in a set of 7
multiply by 7 Total: 21
add 2 Total: 23
Next, data for 'u' is link #11 in a set of 14
multiply by 14 Total: 322
add 11 Total: 333
Next, data for 's' is link #5 in a set of 26
multiply by 26 Total: 8658
add 5 Total: 8663
(If anyone new to this coding thinks that this will rapidly generate
numbers outside the range of a 32 bit integer, you're right - that's
why there are so many 'bignum' and arbitrary precision bitfield
packages out there. They're useful here as well as in crypto; though for
many (most?) wordlists it will actually be possible to code every word in
a 32 bit integer or less.)
So, we now have a word encoded as '8663' (which I'm guesstimating
would need about 15 bits of storage)
To decode this word, we do the same in reverse, using the same dawg
to guide us.
We know the first branch has exactly 26 possibilities, so we divide
8663 by 26 and take the remainder - which is 333, remainder 5
Total: 333
We go down branch number 5 (which is labelled as 's'), and see that at
this point in the trie we now have 14 branches.
divide by 14, giving 23 remainder 11
So this is the 11th letter in this node which is 'u'
Total: 23
We now traverse branch number 11 and find a node with 7 possible links
divide 23 by 7 giving 3 remainder 2
This letter was therefore choice #2 of 7, ie 'p'
Total: 3
This node has 5 links, so we divide by 5 giving 0 remainder 3
So this letter is choice #3 of 5 ('e') and we go to the
next node, which has 2 choices.
Total: 0
0 divided by 2 is 0 remainder 0, so our last letter is choice #0, aka 'r'.
Consequently we have extracted the word 'super' from index 8663.
(I leave it as an exercise to the reader to correct the above
algorithm so that words can be extracted which are also prefixes
of other words, such as a word list which contains both 'super'
and 'superman' Clue: there's an important "+1" needed somewhere)
I throw this out in case it is of any benefit to our readers. Clearly
you would not use this coding for such a trivial use as sending a text
document (there are easier ways!) but you may well find it of use in
some future word game yet to be written. I can see it being very useful
for storing in ram a long list of possibilities generated by some
traversal of a dawg. Let me know if you code this up and/or put it
to any real use. I'll archive a copy of this article on my web site
in case it's ever needed.
regards
Graham