%
% Translation rules for Dutch text-to-speech.
%  This is yet another hack job: a 30-minute hack to
%  see whether the idea is sound.  Surprisingly, it is.
%  The whole program can be implemented with only 3 procedures:
%  1) TeX's hyphenation algorithm
%  2) Find a substring from a string and replace it with another string
%  3) A hacky little procedure to find cvcv and replace with cVcv - any
%     single vowel being hardened when followed by a single consonant.
%
% This approach could well be good enough (simplified) for a cheap but
% more accurate soundex for spell checking.  Of course, instead
% of scanning an array of strings linearly, we would use a trie
% and parse stuff properly and in parallel. (Say, does that work
% with priorities? Hmm...)
%
% The hyphenation algorithm attempts to split a word into syllables as these
% represent the natural break points in the written word.
%
% This text-to-speech algorithm relies on the fact that syllables
% approximate to morphemes in many cases.  The morphemes would then be
% looked up in a morpheme dictionary.  In fact, in most cases in Dutch
% we don't need to do this, as the simple spelling rules suffice.
% In English, however, we would use a larger morpheme -> phonetic table.
%
%  The algorithm is simply a string reduction.  Crude, but it seems to
% work.  The only subtlety is the use of TeX's hyphenation as a pre-pass
% in order to split accidentally formed compound graphemes.
%
%  This might not work in English, but in Dutch it's a dream! - thanks
% to the major spelling reform at the turn of the century.
%
% Characters enclosed in [...] represent a single phoneme
% which is different from the English letter of that name.
%

% PHONEME RULES
% General rules followed by more specific rules followed by exceptions
% for each case; reverse scanning order.
%
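% A small illustration of the ordering, assuming "reverse scanning
% order" means that rules nearer the end of this file are tried first
% (which fits the tidy-up rules at the top being called the last rules):
%   ie   -> [IE]     general rule, tried after the two below
%   ieu  -> [IEU]    more specific, tried before ie
%   ieuw -> [IEU]    most specific of the three, tried first
% On that assumption a word such as 'nieuw' is matched by the ieuw rule
% before the shorter ie and ieu patterns can break it up.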

% "." means either end of a word (depending on context)
% "-" means either end of a syllable BUT DEFINITELY NOT the end of a word.
% The hyphenation pre-pass takes an input word (e.g. 'hyphenation') and
% converts that to:   .hy- -phen- -a- -tion.
% The "." on the end is sometimes useful for distinguishing start/end of
% words.
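%
% For instance, two rules taken from further down this file lean on
% these markers:
%   en.  -> @.     matches 'en' only at the end of a word
%   .ge- -> .X@    matches 'ge' only as the first syllable of a word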

% Multi-syllable translations are allowed, e.g.
% -ation. -> -a-tion.
% -ition. -> -i-tion.  [need better examples]
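%
% A Dutch example from the prefix rules in this file, where the
% left-hand side spans two syllables of the pre-pass output:
%   .op-ge- -> .opX@-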

% As a side-effect of this, whole or part words which have been wrongly
% hyphenated by the automatic algorithm can be corrected here, as
% opposed to whole word replacement at an earlier stage.  Some small
% compaction can be gained by only twiddling the bit of the word which
% was wrong.

% Direct look-up from whole word to phoneme string is also allowable
% here, but as a special case - because the hyphens have to be taken
% out before the word is looked up (otherwise the person generating
% the pronunciation would have to know the hyphenation of the word as
% produced by the first pass, which is a little tedious).
%

% Last rules: tidy up the word for presentation to the user:
. -> 

y -> I

aa -> a
bb -> [B]
cc -> c
dd -> [D]
ee -> e
ff -> [F]
gg -> g
hh -> h
ii -> i
jj -> j
kk -> [K]
ll -> [L]
mm -> [M]
nn -> [N]
oo -> o
pp -> [P]
qq -> q
rr -> [R]
ss -> [S]
tt -> [T]
uu -> u
vv -> [V]
ww -> w
xx -> x
yy -> y
zz -> [Z]

e. -> @.

- -> 

% Non-terminal ijk
ijk -> @k

% Normal pronunciation.
j -> Y
x -> ks

% NO soft 'c's in Dutch. Lovely language.
c -> k

% These three are to correct split diphthongs - might be wrong...
s-j -> -(SH)
t-j -> -(TSH)
n-k -> (NG)-k

% ABN Dutch pronounces these close to the German way.  Southern Dutch
% retains the English pronunciation.
v -> [V]
w -> [W]

% Next internal rule takes any case of cvcv and translates to c[VV]cv
%  -- done as a bit of C code because the patterns, if explicitly
%  listed, would be much too long, and this dumb program doesn't
%  know about wild-cards. (The [VV] means an open vowel)

$RULE1 -> ????
% (The rhs is ignored; I can't be bothered parsing it :-) )
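%
% A rough illustration of the intent, ignoring the syllable markers
% (how $RULE1 treats them is not documented here) and leaving the
% opened vowel in the generic [VV] notation:
%   bomen   b-o-m-e is cvcv, so the o is opened:  b[VV]men
%   bommen  b-o-m-m is not cvcv, so the o is left alone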

% Note a bug: smekkie (kiss :-) ) appears to lose its kk -> k
% before this is called, with the result that the first E is opened.
% This is wrong :-(


% Dutch g is pronounced the same as the 'ch' grapheme.
% However, I can't yet distinguish between soft & hard g's
% except possibly at the start of a word.
g -> X

% ng is *always* NG - it seems even when they happen together
% by accident.  (But not if at join of subwords in a compound???)
n-g -> (NG)

% Single one of these at end of word is lengthened
a. -> [AA].
o. -> [OO].
u. -> [UU].

% And one of these at end is hardened
g. -> X.
d. -> T.
b. -> P.

ik. -> @k.

% The strange Dutch habit of dropping the n in words ending in -en.
en. -> @.

ch -> X

% This comes from German, so NOT isX
isch. -> i[SH].

th -> T
ng -> (NG)
nk -> (NG)k

% Same as sh
sj -> [SH]
tj -> [TSH]

% Some of these are pure vowels, some are diphthongs.
% I'll assume phoneme -> sound code knows them as a unit!
% Safe assumption since I'll probably have to write it myself :-(

uw -> [UW]
aa -> [AA]
oo -> [OO]
eu -> [EU]
uu -> [UU]
oe -> [OE]

ou -> [OU]
au -> [OU]

ee -> [EE]

% These depend on which route the word took to get into the language
% -- Saxon or German
ij -> [EI]
ei -> [EI]

% Different at end of word
ijk. -> UK

ie -> [IE]
ui -> [UI]
eeu -> [EEU]
ieu -> [IEU]

oei -> [OEI]
ooi -> [OOI]
aai -> [AAI]

ieuw -> [IEU]
eeuw -> [EEU]

tie -> [TS][IE]

% Foreign muck :-)  In fact, often spelled 'kw' and believed to
% be native words by some Dutch (ho ho ho) - same with many French
% loan words - krant = courant etc.
qu -> kw

% uij as in Cuijk is just like ui
uij -> ui

% This should only be applied to prefix 'ge's... i.e. only the FIRST
% 'ge'  in  gegeven:
.ge- -> .X@
%-ge- -> -X[EE]-

% Actually, we want rules for splitting of prefixes; they should
% be dealt with separately.

% Also, we should split compound words independently of TeX's
% hyphenation, and treat the two parts as separate words.

% Both the above could be performed with the help of a dictionary!
% - strip off the 'ge-' etc, and see whether what is left is
% still a word :-)

% And compound words are those which fail the spell-test, but
% can be parsed into subwords that pass it.  Clearly any large word
% may have > 1 permissible parse, but we should accept the version
% with the fewest subwords found.

% Bug-fixes:
% Inhibit accidental nk -> (NG)k translation
nc -> nK

% Syllables:
-ti- -> -s[IE]-
hyp -> h[IE]p

% Prefixes:
.in-g -> .in-X
.inge- -> .inX@-
.opge- -> .opX@-
.in-ge- -> .inX@-
.op-ge- -> .opX@-

wr -> vr

% Tidy up artifacts of first pass...
-- -> -

% NEED * for end of file.  Imp compiler/library is a bit dodgy that way...
*