This is my old alt.sources posting from some years ago. Never versions of some of the code here can be found in ../spell (In particular the newer version of 'print' includes a table-driven fuzzy matching procedure that is intended to become the basis of the next generation of phonetic spelling correction; see also the ../text2speech directory) Over the years several people have used this code in projects and products; I believe it even ended up in "Mastercook" (and I'm still waiting for my complimentary copy, guys! :-) ) Anyone else who has used this who hasn't already told me, please let me know. (Also I think the current release in ../spell slightly redefined the file format, and I'd like people to move to it. It's a small tweak just to add a magic number and won't affect your code at all) Graham Toal --- This is a suite of programs for working with words. I am releasing it to alt.sources primarily to invite people to test it on as many platforms as possible (mail me back any bug reports please!) so that the final code will be completely portable. Unpack with unshar or sh < part1; sh < part2; sh < part3 Utilities for spelling checking and correction are supplied, but the ultimate aim is to support all sorts of word-manipulation activities such as a writers workbench and assorted word games (See PS). This news posting is in three parts; part1: this file + headers; part2: utility modules in .c files to be #include'd; part3: main programs; just CC these and they should work -- no fancy makefiles or link commands. The code here doesn't have much of a user interface; I'm rather hoping that people who pick it up will hook it in to their favourite operating system as smoothly as they can. Perhaps someone with the time to spare (who am I kidding :-( ) might integrate it with ispell for instance. I've experimented with various interfaces already; I can accept TeX input and write output suitable for correcting text in either emacs or uEmacs. I'm not going to release these though, until I'm happy with the internal code. Incidentally, the code will be totally public domain - meaning that I've no objection to its being used in commercial projects. HOWEVER I would appreciate if you didn't do so until it is officially released (probably through comp.sources.misc) because I would prefer to define portable file-formats and magic numbers first. OK, enough babble; here's how it all works: 1) acquire a list of words for yourself; I have one but at 690K it is too big to post. (actually even these sources come to around 120K so I hope they make it through OK. I don't think I have a shar that splits, so I've divided into three shars myself.) The word list should be sorted by ascii order. 2) cc dawg.c I've deliberately made main programs into one source file, by including *.c for utility modules; I know this is bad style but it helps make compiling and porting easier (no worries about long external names for instance, or non-case-sensitive linkers) 3) dawg yourdict dict 'yourdict' is your wordlist; but the second parameter should be 'dict' at least while testing. This will generate dict.dwg which is a compacted and *FAST* data structure for word access. Building a dawg[See next section for meaning of name] from a word list is a real memory pig; so I've written code which will let you generate the dawg in chunks (traded off against a small loss of compression density). (Interestingly enough it isn't a time-pig; using a hash-table for node merging gives almost linear time - thanks to Richard Hooker for that one). If you don't have enough memory (say 2Mb?) you'll get a run-time message suggesting that you recompile with cc -DSMALL_MEMORY dawg.c (OK, so one day I'll make it select this mode itself at run-time) IBM PC's and Mac's will get this mode by default. [See FOOTNOTE] If you want to check that the data structure generated is ok, 3a) cc pdawg.c 3b) pdawg dict.dwg This should decompile the data and print out the original word list 4) cc dwgcheck.c Compile a Mickey Mouse spelling checker 5) dwgcheck a few wurdz on the comand line ? ... and test it. 6) cc tell.c Now try the 'advanced' stuff! - soundex correction... (I hates it :-() 7) tell me sume wrongli speld werds and test it... If you're getting into this stuff, here's another incidental hack you might want to try... 7a) cc proot.c 7b) proot root print out all words of form 'root*' One day I'll hook this stuff into regexp; but not today :-) ---------- Enough examples for now. If that all worked you are doing well! If it didn't, and you have the time, please find out what went wrong; If you can fix it, mail me the fix please, else mail me a log of the symptoms, such as compiler error messages. By the way, I know that the spelling correction uses a tacky algorithm; don't worry about it -- I'm working on some red hot stuff which will put soundex to shame (which isn't hard :) ) [Late note; it now works (apparently well) but I need a phonetic dictionary before it is useful :-( Anyone got a phonetic dictionary? Alvey people out there?] You might be interested in the data structures used; the main one is a dawg - a 'Directed Acyclic Word Graph' -- it's essentially a Directed Acyclic Graph implementing an FSM which describes the set of all words in your lexicon. The name DAWG comes from Appel's in his paper on Scrabble (although none of the code or algorithms do). The second most important data structure is superimposed on this one; it is a packing algorithm due to Liang (of TeX hyphenation fame) which allows you to convert an implicitly linked trie into an indexed trie *without* taking any more space. Liang is a real smart cookie & it is a shame this technique of his is not better known... (it should be up there among binary tree and linked lists!) OK, more examples: 8) cc pack.c compile the packing code 9) pack dict.dwg dict.pck this takes rather a long time; you'll get messages saying '1000 nodes' '2000 nodes' etc roughly 1 every 20 seconds +/- 50%. There shouldn't be more than a couple of screenfuls of these :-) I'll speed it up one day. If you want to check that this data structure generated is ok too, 9a) cc ppack.c 9b) ppack dict.pck Just as you did for (3) 10) cc pckcheck.c Just like dwgcheck but using the other type of data. 11) pckcheck sume moar possibly incorrectly speld woards boring, huh? 12) Tha Tha Thats all folks! --------------- Unless I've made some major cockup at the last minute, you should actually have a reasonable chance of getting these working on your machine, whatever it is. (Dec 20's and EBCDIC machines *possibly* excepted - attempts welcome!) As I said, please mail me your bugfixes, problems, suggestions etc. By the way, the standard of coding isn't exceptionally high; this implementation was always intended to be a prototype. A rewrite of dawg.c and pack.c is definitely high on the priority list... The actual application programs aren't too bad, but the interfaces to the various utility routines are liable to change on an hour-by-hour basis! (I've been trying to make them more consistent -- you should have seen the last lot ;-) ). However, the algorithms are all complete now; anyone who has ever needed to convert a tree to a dag might be interested in the linear-time solution in dawg.c - most other solutions I've seen have been NlogN or N^2. If you're interested, there's a file dictlib.h which is a proposed interface for the next iteration; comments are invited. Share & Enjoy, Graham Toal I'm experimenting in this code with ways of finding out at compile time various parameters needed for portability. There's a file grope.h which does this. If things all work first time, great; if not, you may have to change various #define's. The best way to do that is to add code to grope.h which identifies your system, and only make changes of the form: #ifdef SYS_MYSYSTEM #undef something #define something (a different value) #endif /* my system */ There are instructions for customising in grope.h itself. My overall aim is to avoid special makefiles and special command line options. (A bit ambitious I know, but nice idea if it works...) (Steve Pemberton's config.c might be useful to you too if you are hacking grope.h. It was reposted on usenet recently.) Future hacks: 1) Anagrams 2) Crosswords 3) Scrabble 4) Anything where a text key can be used to lookup another piece of text. This is implemented with the 'dawg_locate_prefix' routine; it is effectively a cheap substitute for btrees etc. (and better than a hash table because you can *do* things with it!) 4a) Reverse phone book 1234567=Fred Bloggs 4b) house style enforcer stupid=infelicitous 4c) uk/us spelling advisor sidewalk=pavement 4d) shorthand/macro expander f.y.i.=for your information etc etc... (Most of wwb in fact) 5) Anything you can suggest! (or *implement* :-) ) Archive-name: dawgutils/part[1-3].shar Reply-to: Graham Toal 1) Upload fixed version of dyntrie.c - remember bug with signed chars being used as an index [comment rule in dawg.c?] 2) known design flaw: can't handle strings with 0 in them 3) known bug: dyntrie.c has problems if you enter a string which *starts* with 255. This is due to sending 'end_of_node' bit rather hackily. Can be fixed. 4) Test version to be posted to alt.sources by running on machine with signed chars (eg MSDOS) 5) Remove hacky Malloc debugging macros in utils.c -- there might be a problem if compiled on MSDOS but not under MICROSOFT C 6) Check that LIB_STRINGS is *not* defined when UX63 is set. 7) tweak the constants in pack.c to speed it up a lot 8) hack the pack.c heuristics into dyntrie.c to speed it up too.