This letter was written to the BSD project when this code was handed
over with the intention of replacing the AT&T 'spell' program.  Some
of the stuff here is a bit dated, but I'll leave the file here anyway
in case the explanations help anyone.  As you probably know, BSD
adopted 'ispell' as the spelling checker of choice and no-one seemed
to miss the old style interface, so the code below was never put into
production.

Graham Toal <gtoaL@gtoal.com>


-----
Spell utilities release 0.1


Dear Keith,

   here's a first iteration of my spell code; your chap will probably want
to make it more unix-like -- especially in the parameter handling.  You
could add getopt or parseargs handling if you want.

I've structured the source as a directory containing the program sources,
with a subdirectory SpellLib containing all the files which are built into
the library object, 'splib'.  I'll probably add a 'tests' directory later.
There's no unix makefile yet; I'll include the RISC OS one as a guide for
now.

The library modules all have a .h file of the same name which is generated
automatically by 'mkptypes'.  The library header file itself ("splib.h")
pulls these files in.  I hope include-file semantics on BSD are the same as
on my system -- if you include file "/somewhere/include/fred.h", and fred.h
includes "jim.h", you'll pull in "/somewhere/include/jim.h". If not, we'll
have to rethink include files a little.

I'm also assuming your compilers support prototypes nowadays.  (See comments
later in this file if not)

I've changed the dawg file format only by 8 bytes -- I now store a magic
number at the top, and a 0L in the first word; previously I stored the
number of edges in the first word and overwrote it with a 0L once I read the
data in.  Doing it this way means the file can be connected in read-only
memory with mmap, and shared among multiple users.  (Oh - note that the
address of the dawg to be filled in by dawg_init is 8 bytes into the file,
because of the magic number and the edge count)

(before:   [edges, poked to 0] [data]
           ^ dawg start
 now:      [magic] [edges] [0] [data])
                           ^ dawg start
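
To make the new layout concrete, here is a minimal sketch of locating the
dawg inside a file image that is already in memory (read with fread or
mapped with mmap).  It assumes each header field is one 32-bit word, which
is what the "8 bytes" above implies.  The magic value and every name
beginning with sketch_ are made up for illustration; they are not the
library's own.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic value; the real one lives in the library sources. */
#define SKETCH_DAWG_MAGIC 0x01020304u

/*
 * Assumed layout, in 32-bit words:  [magic] [edges] [0] [data...]
 * Returns a pointer to the dawg proper (the stored 0 word), or NULL if
 * the image does not look like a dawg file in this machine's byte order.
 */
const uint32_t *sketch_dawg_start(const uint32_t *image, size_t image_words,
                                  uint32_t *nedges_out)
{
    if (image == NULL || image_words < 3)
        return NULL;            /* too short for magic, edges and the 0 */
    if (image[0] != SKETCH_DAWG_MAGIC)
        return NULL;            /* not a dawg, or the other byte order */
    if (image[2] != 0)
        return NULL;            /* word 0 of the dawg is stored as 0 */
    if (nedges_out != NULL)
        *nedges_out = image[1]; /* edge count stays readable in the file */
    return image + 2;           /* 2 words = 8 bytes into the file */
}
```

Because nothing is written back, the same image can be mapped read-only
and shared.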

At the moment, I'm just loading the dawg in with fread; if you want to use
mmap, remember to check for byte sex first; if the data isn't already in the
appropriate sex, re-connect it in copy-on-write mode and flip the order in
situ.
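
A rough sketch of that check-and-flip step follows.  It assumes the whole
file is a sequence of 32-bit words and that the magic number reads
differently when its bytes are reversed; the magic value and the sketch_
names are placeholders, not the ones in the sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic; it must not be the same when byte-reversed. */
#define SKETCH_DAWG_MAGIC 0x01020304u

/* Reverse the four bytes of one 32-bit word. */
static uint32_t sketch_swap32(uint32_t w)
{
    return (w >> 24) | ((w >> 8) & 0x0000ff00u)
         | ((w << 8) & 0x00ff0000u) | (w << 24);
}

/*
 * Returns 0 if the image is already in native order, 1 if it was in the
 * opposite order and has been flipped in place, -1 if the first word is
 * not the magic number in either order.
 */
int sketch_fix_byte_order(uint32_t *image, size_t image_words)
{
    size_t i;

    if (image == NULL || image_words == 0)
        return -1;
    if (image[0] == SKETCH_DAWG_MAGIC)
        return 0;
    if (sketch_swap32(image[0]) != SKETCH_DAWG_MAGIC)
        return -1;
    for (i = 0; i < image_words; i++)   /* writes to every word */
        image[i] = sketch_swap32(image[i]);
    return 1;
}
```

With mmap, a MAP_PRIVATE mapping with PROT_READ | PROT_WRITE gives the
copy-on-write behaviour needed for the flip, so the file on disk is left
untouched.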

I strongly recommend you allow the dawg files to be of either sex so that
they can be as efficient as possible for any arbitrary machine.  I've
included a simple sex-change program which I used while testing to make sure
my sex-change code worked.  You might want to expand that into a proper
utility.

I've made the hash-table dynamically sized, using a reasonable heuristic
*for simple dictionaries*.  If this code is ever applied to non-dictionaries
(eg as a database lookup method with entries of the form "keyword=data
value"), where there is less sharing of tails etc, you'll have to add an
extra run-time parameter to set the hash-table size explicitly.

At the moment, I've added a "-l" parameter which simply bumps up the
calculated size by a constant fudge-factor.  A "-h <N>" option as well might
be useful.
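
For illustration only, a sizing routine of this general shape might look
like the sketch below.  The actual heuristic is not spelled out in this
letter, so the constants (bytes of input per slot, the minimum, the "-l"
multiplier) are invented, and rounding up to a prime is an assumption
based on the prime finder that lives in the library.

```c
#include <stddef.h>

/* All three constants are hypothetical, chosen only for the example. */
#define SKETCH_BYTES_PER_SLOT 4UL     /* input bytes per hash slot */
#define SKETCH_MIN_SLOTS      1024UL  /* floor for tiny word lists */
#define SKETCH_LARGE_FACTOR   4UL     /* fudge factor applied by -l */

/* Trial division; d <= n / d avoids overflow in the loop test. */
static int sketch_is_prime(unsigned long n)
{
    unsigned long d;

    if (n < 2)
        return 0;
    if (n % 2 == 0)
        return n == 2;
    for (d = 3; d <= n / d; d += 2)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Smallest prime >= n (not guarded against n near ULONG_MAX). */
unsigned long sketch_next_prime(unsigned long n)
{
    while (!sketch_is_prime(n))
        n++;
    return n;
}

/* Estimate a table size from the size of the text dictionary. */
unsigned long sketch_hash_size(unsigned long input_bytes, int large)
{
    unsigned long slots = input_bytes / SKETCH_BYTES_PER_SLOT;

    if (slots < SKETCH_MIN_SLOTS)
        slots = SKETCH_MIN_SLOTS;
    if (large)
        slots *= SKETCH_LARGE_FACTOR;
    return sketch_next_prime(slots);
}
```

An explicit size option would simply bypass sketch_hash_size and pass the
user's number (perhaps still rounded up to a prime) straight through.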

The program takes an infile and outfile parameter, which have to include
the appropriate file extensions if those are being used. (I recommend .dwg
but you don't have to stick with that) I did this because I run on a
non-unix platform, and it was tidier than lots of ifdef's for different
systems' extensions. However, since your version of the code is for 4bsd
only, feel free to remove as much of the remaining portability code as you
feel like.  The only thing I'd like you to keep is the dawg file format, so
that programs on other systems can read them. (Even though I'd prefer people
to pass dictionaries around as text, I know it won't happen.)

I've decided against storing lots of info in the dawg file (like character
set used, name of dictionary supplier etc.) -- this can go either in the
text file or in an ancillary file with a different extension (.dtx?).  If
you allow this extra info to go in the text dictionary file, you will of
course have to alter dawg.c so that it ignores it.

The default dictionary has to be set to something sensible; probably
"/usr/dict/words.dwg" or whatever you use nowadays. At the moment it is a
local "dict.dwg".

If you want to turn the code into a filter, you'll definitely have to add a
"-n <hash size>" parameter, since you won't know how big the file will be in
order to estimate it.
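
If you go the getopt route, the parsing could be as small as the sketch
below.  It handles the "-l" and "-n <hash size>" options mentioned in this
letter plus the infile and outfile operands; the struct, the function name
and the error convention are all invented for the example, and getopt
itself is assumed to come from a POSIX <unistd.h>.

```c
#define _POSIX_C_SOURCE 200809L  /* expose getopt under strict ISO modes */
#include <stdlib.h>
#include <unistd.h>              /* getopt, optarg, optind */

/* Hypothetical option record; not part of the library. */
struct sketch_opts {
    int large;                /* -l given: bump the estimated size */
    unsigned long hash_size;  /* -n value, or 0 to estimate it */
    const char *infile;       /* NULL if absent (e.g. filter mode) */
    const char *outfile;      /* NULL if absent */
};

/* Returns 0 on success, -1 on a usage error. */
int sketch_parse_args(int argc, char *argv[], struct sketch_opts *o)
{
    int c;
    char *end;

    o->large = 0;
    o->hash_size = 0;
    o->infile = NULL;
    o->outfile = NULL;
    optind = 1;               /* restart scanning on repeated calls */

    while ((c = getopt(argc, argv, "ln:")) != -1) {
        switch (c) {
        case 'l':
            o->large = 1;
            break;
        case 'n':
            o->hash_size = strtoul(optarg, &end, 10);
            if (end == optarg || *end != '\0' || o->hash_size == 0)
                return -1;    /* not a positive decimal number */
            break;
        default:
            return -1;        /* unknown option or missing argument */
        }
    }
    if (optind < argc)
        o->infile = argv[optind++];
    if (optind < argc)
        o->outfile = argv[optind++];
    return optind == argc ? 0 : -1;   /* reject extra operands */
}
```

A filter would leave both operands out and insist that hash_size is
non-zero, since there is no file size to estimate from.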


I've included a trivial checking program (texcheck); this is *only* there as
a confidence test for porting.  It was my first hack using flex.  DO NOT use
it in any other way. It's crap! :-)  [It checks TeX source files, as you may
have guessed]

I've completely (I hope) removed all the Packed-trie versions.  The benefits
of the faster format were outweighed by the complexity of having two
sources for everything.

Feel free (== I insist!) to change the names of the library procedures. They
weren't chosen with any thought.  Note I've done away with the rather tacky
included C files; however the library structuring leaves a little to be
desired.  I'll work on it.  Or you can if you prefer. (For example: there's
a prime finder in the library; I think it should be in dawg.c)

I've formatted it with the default parameters of GNU Indent, and removed all
the ANSI prototypes/K&R headers from all procedures.  If you can't assume
ANSI prototypes, I'd prefer if you distributed the source in ANSI style with
an ansi->pcc converter, rather than reverse engineering it back to K&R.
However the choice is yours; I don't insist.

If you want to take over the code completely now, that's OK by me.  Or
equally I'll hack to order if you prefer.  If you do a lot of work on it,
please send me back copies so I can try to keep in synch.  I'm assuming it's
your project since you are closer to the system than I am, so as of this
release you have the master source file.  Please be explicit anytime you
hand over the lock to me (as in 'we have not changed it since your release
1.2', or 'take these source files and add feature ..., then hand them back
to us as release 1.3').

My system of numbering is

First digit:   major release (eg new data structures, incompatible changes)
Second digit:  minor release (internal changes, small added features)
Third digit:   local edit number (you won't see those)

I use RCS here, but the version numbers I am talking about refer to whole
systems; the RCS numbers for individual files may be different.

This shar file is release 0.1 of spell.


What I *haven't* included are the many versions of checking procedures --
with/without case sensitivity etc., fuzzy matching, correction et al.  I
know unix 'spell' doesn't use these.  Let me know what the replacement needs
and I'll incorporate them properly.

A sensible extension to spell would be to allow checking in multiple dicts,
and suppression of certain words from dicts.  This can be done with the
parts you have.  However, for an interactive checker, you want to remember
new words as they are found.  I haven't included that code (for dynamic
additions to dawgs) yet as it is far from efficient at the moment.  Again,
if you're not doing such a checker, let me know & I won't spend as much time
on improving that code as I would have.
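
To show what I mean by multiple dicts plus suppression, here is a small
sketch of the control flow only.  The lookup is passed in as a function
pointer because the real dawg lookup routine is not named in this letter;
the toy list-based lookup and all the sketch_ names are stand-ins made up
for the example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup signature: nonzero if 'word' is in 'dict'. */
typedef int (*sketch_lookup_fn)(const void *dict, const char *word);

struct sketch_dict {
    const void *handle;         /* e.g. a loaded dawg */
    sketch_lookup_fn contains;  /* how to search it */
};

/*
 * A word is accepted if it is NOT in the suppression dictionary (which
 * may be NULL) and it IS in at least one of the ordinary dictionaries.
 */
int sketch_word_ok(const struct sketch_dict *dicts, size_t ndicts,
                   const struct sketch_dict *suppress, const char *word)
{
    size_t i;

    if (suppress != NULL && suppress->contains(suppress->handle, word))
        return 0;
    for (i = 0; i < ndicts; i++)
        if (dicts[i].contains(dicts[i].handle, word))
            return 1;
    return 0;
}

/* Toy lookup over a NULL-terminated string array, for the demo only. */
int sketch_list_contains(const void *dict, const char *word)
{
    const char *const *p = dict;

    for (; *p != NULL; p++)
        if (strcmp(*p, word) == 0)
            return 1;
    return 0;
}
```

In real use the contains member would point at whatever routine checks a
word against a loaded dawg, and the suppression list would be just another
small dawg.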

I have a pretty good idea of what is needed as I've already hacked a copy of
ispell to use my stuff.

I believe the code is fairly solid.  I've made minimal changes for this
release.

best regards
   graham