This letter was written to the BSD project when this code was handed over with the intention of replacing the AT&T 'spell' program. Some of the stuff here is a bit dated, but I'll leave the file here anyway in case the explanations help anyone. As you probably know, BSD adopted 'ispell' as the spelling checker of choice and no-one seemed to miss the old style interface, so the code below was never put into production.

Graham Toal

-----

Spell utilities release 0.1

Dear Keith,

here's a first iteration of my spell code; your chap will probably want to make it more unix-like -- especially in the parameter handling. You could add getopt or parseargs handling if you want.

I've structured the source as a directory containing the program sources, with a subdirectory SpellLib containing all the files which are built into the library object, 'splib'. I'll probably add a 'tests' directory later. There's no unix makefile yet; I'll include the Risc Os one as a guide for now.

The library modules all have a .h file of the same name which is generated automatically by 'mkptypes'. The library header file itself ("splib.h") pulls these files in. I hope include-file semantics on bsd are the same as my system -- if you include file "/somewhere/include/fred.h", and fred.h includes "jim.h", you'll pull in "/somewhere/include/jim.h". If not, we'll have to rethink include files a little. I'm also assuming your compilers support prototypes nowadays. (See comments later in this file if not)

I've changed the dawg file format only by 8 bytes -- I now store a magic number at the top, and a 0L in the first word; previously I stored the number of edges in the first word and overwrote it with a 0L once I read the data in. Doing it this way means the file can be connected in read-only memory with mmap, and shared among multiple users.
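As a rough illustration of that layout, here is a minimal sketch of how a loader might locate the dawg inside a read-only file image. It is not the code from this release: the function name dawg_locate, the 32-bit word size, and the magic value are all assumptions made for the example (the letter does not give the actual magic number), and the image is assumed to be in the machine's native byte order already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical magic value; the real one is not given in the letter. */
#define DAWG_MAGIC 0x44415747u

/* Assumed layout, in 32-bit words: [magic] [edges] [0] [data ...] */
enum { DAWG_HEADER_BYTES = 8 };

/*
 * Given a read-only image of a dawg file, return a pointer to the dawg
 * itself (8 bytes into the image) and store the edge count in *edges.
 * Returns NULL if the image is too short, the magic number is wrong,
 * or the first word of the dawg is not zero.
 * The image is never written to, so it can live in shared read-only memory.
 */
const uint32_t *dawg_locate(const void *image, size_t len, uint32_t *edges)
{
    const unsigned char *p = image;
    uint32_t magic, count, first;

    assert(image != NULL && edges != NULL);

    if (len < DAWG_HEADER_BYTES + sizeof first)
        return NULL;

    memcpy(&magic, p, sizeof magic);
    memcpy(&count, p + 4, sizeof count);
    memcpy(&first, p + DAWG_HEADER_BYTES, sizeof first);

    if (magic != DAWG_MAGIC || first != 0)
        return NULL;

    *edges = count;
    /* Assumes the image itself is word-aligned (true for malloc or mmap). */
    return (const uint32_t *)(p + DAWG_HEADER_BYTES);
}
```

Because nothing is poked into the image after loading, the same mapping could be shared by several processes.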
(Oh - note that the address of the dawg to be filled in by dawg_init is 8 bytes into the file, because of the magic number and the edge count)

(before:  [edges, poked to 0] [data]
          ^ dawg start

 now:     [magic] [edges] [0] [data])
                          ^ dawg start

At the moment, I'm just loading the dawg in with fread; if you want to use mmap, remember to check for byte sex first; if the data isn't already in the appropriate sex, re-connect it in copy-on-write mode and flip the order in situ. I strongly recommend you allow the dawg files to be of either sex so that they can be as efficient as possible for any arbitrary machine. I've included a simple sex-change program which I used while testing to make sure my sex-change code worked. You might want to expand that into a proper utility.

I've made the hash-table dynamically sized, using a reasonable heuristic *for simple dictionaries*. If this code is ever applied to non-dictionaries (eg as a database lookup method with entries of the form "keyword=data value"), where there is less sharing of tails etc, you'll have to add an extra run-time parameter to set the hash-table size explicitly. At the moment, I've added a "-l" parameter which simply bumps up the calculated size by a constant fudge-factor. A "-h " option as well might be useful.

The program takes an infile and outfile parameter, which have to include the appropriate file extensions if those are being used. (I recommend .dwg but you don't have to stick with that) I did this because I run on a non-unix platform, and it was tidier than lots of ifdef's for different systems' extensions.

However, since your version of the code is for 4bsd only, feel free to remove as much of the remaining portability code as you feel like. The only thing I'd like you to keep is the dawg file format, so that programs on other systems can read them. (Even though I'd prefer people to pass dictionaries around as text, I know it won't happen.)
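One way to keep the file format readable on machines of either byte sex is to use the magic number itself as the test. The following is a minimal sketch of that idea, not the sex-change code shipped with this release: the names swap32 and dawg_fix_sex, the 32-bit word size, and the magic value are assumptions for illustration. It expects a writable buffer (for instance an fread copy or a copy-on-write mapping) and flips every word in place when the magic number reads back-to-front.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic value; it must not read the same when byte-reversed. */
#define DAWG_MAGIC 0x44415747u

/* Reverse the four bytes of one 32-bit word. */
uint32_t swap32(uint32_t w)
{
    return (w >> 24)
         | ((w >> 8) & 0x0000ff00u)
         | ((w << 8) & 0x00ff0000u)
         | (w << 24);
}

/*
 * Put a writable dawg image into native byte order.
 * Returns  0 if the image was already native (nothing is changed),
 *          1 if every word was flipped in place,
 *         -1 if the first word is not a dawg magic number in either order.
 */
int dawg_fix_sex(uint32_t *words, size_t nwords)
{
    size_t i;

    assert(words != NULL);

    if (nwords == 0)
        return -1;
    if (words[0] == DAWG_MAGIC)
        return 0;
    if (swap32(words[0]) != DAWG_MAGIC)
        return -1;

    for (i = 0; i < nwords; i++)
        words[i] = swap32(words[i]);
    return 1;
}
```

A file already in native order is left untouched, so in the common case it can stay in shared read-only memory; only a foreign-sex file needs a private writable copy.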
I've decided against storing lots of info in the dawg file (like character set used, name of dictionary supplier etc.) -- this can go either in the text file or in an ancillary file with a different extension. (.dtx ?) If you allow this extra info to go in the text dictionary file, you will of course have to alter dawg.c so that it ignores it.

The default dictionary has to be set to something sensible; probably "/usr/dict/words.dwg" or whatever you use nowadays. At the moment it is a local "dict.dwg".

If you want to turn the code into a filter, you'll definitely have to add a "-n " parameter, since you won't know how big the file will be in order to estimate it.

I've included a trivial checking program (texcheck); this is *only* there as a confidence test for porting. It was my first hack using flex. DO NOT use it in any other way. It's crap! :-) [It checks tex source files, as you may have guessed]

I've completely (I hope) removed all the Packed-trie versions. The benefits of the faster format were outweighed by the complexity of having two sources for everything.

Feel free (== I insist!) to change the names of the library procedures. They weren't chosen with any thought.

Note I've done away with the rather tacky included C files; however the library structuring leaves a little to be desired. I'll work on it. Or you can if you prefer. (For example: there's a prime finder in the library; I think it should be in dawg.c)

I've formatted it with the default parameters of GNU Indent, and removed all the ANSI prototypes/K&R headers from all procedures. If you can't assume ANSI prototypes, I'd prefer if you distributed the source in ANSI style with an ansi->pcc converter, rather than reverse engineering it back to K&R. However the choice is yours; I don't insist.

If you want to take over the code completely now, that's OK by me. Or equally I'll hack to order if you prefer. If you do a lot of work on it, please send me back copies so I can try to keep in synch.
I'm assuming it's your project since you are closer to the system than I am, so as of this release you have the master source file. Please be explicit anytime you hand over the lock to me. (as in 'we have not changed it since your release 1.2', or 'take these source files and add feature '...', then hand them back to us as release 1.3')

My system of numbering is

  First digit:  major release (eg new data structures, incompatible changes)
  Second digit: minor release (internal changes, small added features)
  Third digit:  local edit number (you won't see those)

I use RCS here, but the version numbers I am talking about refer to whole systems; the RCS numbers for individual files may be different. This shar file is release 0.1 of spell.

What I *haven't* included are the many versions of checking procedures -- with/without case sensitivity etc., fuzzy matching, correction et al. I know unix 'spell' doesn't use these. Let me know what the replacement needs and I'll incorporate them properly.

A sensible extension to spell would be to allow checking in multiple dicts, and suppression of certain words from dicts. This can be done with the parts you have. However, for an interactive checker, you want to remember new words as they are found. I haven't included that code (for dynamic additions to dawgs) yet as it is far from efficient at the moment. Again, if you're not doing such a checker, let me know & I won't spend as much time on improving that code as I would have. I have a pretty good idea of what is needed as I've already hacked a copy of ispell to use my stuff.

I believe the code is fairly solid. I've made minimal changes for this release.

best regards
  graham