This is my old alt.sources posting from some years ago.  Never versions
of some of the code here can be found in ../spell  (In particular the
newer version of 'print' includes a table-driven fuzzy matching
procedure that is intended to become the basis of the next generation
of phonetic spelling correction; see also the ../text2speech directory)

Over the years several people have used this code in projects and
products; I believe it even ended up in "Mastercook" (and I'm still
waiting for my complimentary copy, guys! :-) )  Anyone else who
has used this who hasn't already told me, please let me know.

(Also I think the current release in ../spell slightly redefined
the file format, and I'd like people to move to it.  It's a small
tweak just to add a magic number and won't affect your code at all)

Graham Toal <gtoal@gtoal.com>

---

This is a suite of programs for working with words.  I am releasing
it to alt.sources primarily to invite people to test it on as many
platforms as possible (mail me back any bug reports please!) so that
the final code will be completely portable.

Unpack with unshar or sh < part1; sh < part2; sh < part3

Utilities for spelling checking and correction are supplied, but
the ultimate aim is to support all sorts of word-manipulation activities
such as a writers workbench and assorted word games (See PS).

This news posting is in three parts; part1: this file + headers;
part2: utility modules in .c files to be #include'd; part3: main
programs; just CC these and they should work -- no fancy makefiles
or link commands.

The code here doesn't have much of a user interface; I'm rather hoping
that people who pick it up will hook it in to their favourite operating
system as smoothly as they can.  Perhaps someone with the time to
spare (who am I kidding :-( ) might integrate it with ispell for instance.

I've experimented with various interfaces already; I can accept TeX
input and write output suitable for correcting text in either emacs
or uEmacs.  I'm not going to release these though, until I'm happy
with the internal code.  Incidentally, the code will be totally
public domain - meaning that I've no objection to its being used
in commercial projects.  HOWEVER I would appreciate if you didn't do so
until it is officially released (probably through comp.sources.misc)
because I would prefer to define portable file-formats and magic numbers
first.

OK, enough babble; here's how it all works:

1) acquire a list of words for yourself; I have one but at 690K it is
   too big to post. (actually even these sources come to around 120K
   so I hope they make it through OK.  I don't think I have a shar
   that splits, so I've divided into three shars myself.)
   The word list should be sorted by ascii order.

2) cc dawg.c
   I've deliberately made main programs into one source file, by
   including *.c for utility modules; I know this is bad style but
   it helps make compiling and porting easier (no worries about
   long external names for instance, or non-case-sensitive linkers)

3) dawg yourdict dict
   'yourdict' is your wordlist; but the second parameter should
   be 'dict' at least while testing.  This will generate dict.dwg
   which is a compacted and *FAST* data structure for word access.

   Building a dawg[See next section for meaning of name] from a word
   list is a real memory pig; so I've written code which will let you
   generate the dawg in chunks (traded off against a small loss of
   compression density).  (Interestingly enough it isn't a time-pig;
   using a hash-table for node merging gives almost linear time - thanks
   to Richard Hooker for that one).

      If you don't have enough memory (say 2Mb?) you'll get a run-time
   message suggesting that you recompile with cc -DSMALL_MEMORY dawg.c
   (OK, so one day I'll make it select this mode itself at run-time)

   IBM PC's and Mac's will get this mode by default. [See FOOTNOTE]

    If you want to check that the data structure generated is ok,
    3a) cc pdawg.c
    3b) pdawg dict.dwg
        This should decompile the data and print out the
        original word list

4) cc dwgcheck.c
   Compile a Mickey Mouse spelling checker

5) dwgcheck a few wurdz on the comand line ?
   ... and test it.

6) cc tell.c
   Now try the 'advanced' stuff! - soundex correction... (I hates it :-()

7) tell me sume wrongli speld werds
   and test it...

   If you're getting into this stuff, here's another incidental
   hack you might want to try...
    7a)  cc proot.c
    7b)  proot root
         print out all words of form 'root*'
         One day I'll hook this stuff into regexp; but not today :-)

----------

Enough examples for now.  If that all worked you are doing well!
If it didn't, and you have the time, please find out what went wrong;
If you can fix it, mail me the fix please, else mail me a log of
the symptoms, such as compiler error messages.

By the way, I know that the spelling correction uses a tacky algorithm;
don't worry about it -- I'm working on some red hot stuff which will
put soundex to shame (which isn't hard :) )  [Late note; it now works
(apparently well) but I need a phonetic dictionary before it is useful :-(
Anyone got a phonetic dictionary?  Alvey people out there?]

You might be interested in the data structures used; the main one is
a dawg - a 'Directed Acyclic Word Graph' -- it's essentially a
Directed Acyclic Graph implementing an FSM which describes the set of
all words in your lexicon.  The name DAWG comes from Appel's in his paper
on Scrabble (although none of the code or algorithms do).

The second most important data structure is superimposed on this
one; it is a packing algorithm due to Liang (of TeX hyphenation
fame) which allows you to convert an implicitly linked trie
into an indexed trie *without* taking any more space.  Liang is
a real smart cookie & it is a shame this technique of his is
not better known... (it should be up there among binary tree and
linked lists!)

OK, more examples:

8)  cc pack.c
    compile the packing code

9)  pack  dict.dwg dict.pck
    this takes rather a long time; you'll get messages saying '1000 nodes'
    '2000 nodes' etc roughly 1 every 20 seconds +/- 50%.  There shouldn't
    be more than a couple of screenfuls of these :-)  I'll speed it up one
    day.


    If you want to check that this data structure generated is ok too,
    9a) cc ppack.c
    9b) ppack dict.pck
        Just as you did for (3)

10) cc pckcheck.c
    Just like dwgcheck but using the other type of data.

11) pckcheck sume moar possibly incorrectly speld woards
    boring, huh?

12) Tha Tha Thats all folks!

---------------

Unless I've made some major cockup at the last minute, you should
actually have a reasonable chance of getting these working on your
machine, whatever it is. (Dec 20's and EBCDIC machines *possibly*
excepted - attempts welcome!)

As I said, please mail me your bugfixes, problems, suggestions etc.
By the way, the standard of coding isn't exceptionally high; this
implementation was always intended to be a prototype.  A rewrite
of dawg.c and pack.c is definitely high on the priority list...
The actual application programs aren't too bad, but the interfaces
to the various utility routines are liable to change on an hour-by-hour
basis! (I've been trying to make them more consistent -- you should have
seen the last lot ;-) ).

However, the algorithms are all complete now; anyone who has ever needed
to convert a tree to a dag might be interested in the linear-time solution
in dawg.c - most other solutions I've seen have been NlogN or N^2.

If you're interested, there's a file dictlib.h which is a proposed interface
for the next iteration; comments are invited.

Share & Enjoy,

Graham Toal <gtoal@uk.ac.ed>

<FOOTNOTE>
   I'm experimenting in this code with ways of finding out at compile
time various parameters needed for portability.  There's a file
grope.h which does this.  If things all work first time, great;
if not, you may have to change various #define's.  The best way
to do that is to add code to grope.h which identifies your system,
and only make changes of the form:

#ifdef SYS_MYSYSTEM
#undef something
#define something (a different value)
#endif /* my system */

There are instructions for customising in grope.h itself.  My overall
aim is to avoid special makefiles and special command line options.
(A bit ambitious I know, but nice idea if it works...)
(Steve Pemberton's config.c might be useful to you too if you are
hacking grope.h.  It was reposted on usenet recently.)

</FOOTNOTE>

<PS>

Future hacks:
   1) Anagrams
   2) Crosswords
   3) Scrabble
   4) Anything where a text key can be used to lookup another
      piece of text.  This is implemented with the 'dawg_locate_prefix'
      routine; it is effectively a cheap substitute for btrees etc.
      (and better than a hash table because you can *do* things with it!)

      4a) Reverse phone book      1234567=Fred Bloggs
      4b) house style enforcer    stupid=infelicitous
      4c) uk/us spelling advisor  sidewalk=pavement
      4d) shorthand/macro expander   f.y.i.=for your information
      etc etc... (Most of wwb in fact)

   5) Anything you can suggest! (or *implement* :-) )

</PS>

<MEMO>

Archive-name: dawgutils/part[1-3].shar
Reply-to: Graham Toal <gtoal%uk.ac.ed@nsfnet-relay.ac.uk>

1) Upload fixed version of dyntrie.c - remember bug with signed chars
   being used as an index [comment rule in dawg.c?]

2) known design flaw: can't handle strings with 0 in them

3) known bug: dyntrie.c has problems if you enter a string
   which *starts* with 255.  This is due to sending 'end_of_node'
   bit rather hackily.  Can be fixed.

4) Test version to be posted to alt.sources by running on machine
   with signed chars (eg MSDOS)

5) Remove hacky Malloc debugging macros in utils.c -- there might be
   a problem if compiled on MSDOS but not under MICROSOFT C

6) Check that LIB_STRINGS is *not* defined when UX63 is set.

7) tweak the constants in pack.c to speed it up a lot

8) hack the pack.c heuristics into dyntrie.c to speed it up too.

</MEMO>