%
% Translation rules for Dutch text-to-speech.
% This is yet another hack job!!!! A 30-minute hack to
% see whether the idea is sound. Surprisingly, it is.
% The whole program can be implemented with only 3 procedures:
%  1) TeX's hyphenation algorithm
%  2) Find a substring in a string and replace it with another string
%  3) A hacky little procedure to find cvcv and replace it with cVcv - any
%     single vowel being hardened when followed by a single consonant.
%
% This approach could well be good enough (simplified) for a cheap but
% more accurate soundex for spelling checking. Of course, instead
% of scanning an array of strings linearly, we would use a trie
% and parse stuff properly and in parallel. (Say, does that work
% with priorities? Hmm...)
%
% The hyphenation algorithm attempts to split a word into syllables as these
% represent the natural break points in the written word.
%
% This text-to-speech algorithm relies on the fact that syllables
% approximate to morphemes in many cases. The morphemes would then be
% looked up in a morpheme dictionary. In fact, in most cases in Dutch
% we don't need to do this, as the simple spelling rules suffice.
% In English, however, we would use a larger morpheme -> phonetic table.
%
% The algorithm is simply a string reduction. Crude, but it seems to
% work. The only subtlety is the use of TeX's hyphenation as a pre-pass
% in order to split accidentally formed compound graphemes.
%
% This might not work in English, but in Dutch it's a dream! - thanks
% to the major spelling reform at the turn of the century.
%
% Characters enclosed between [...]'s represent a single phoneme
% which is different from the English letter of that name.
%
% PHONEME RULES
% General rules followed by more specific rules followed by exceptions
% for each case; reverse scanning order.
%
% "." means either end of a word (depending on context)
% "-" means either end of a syllable BUT DEFINITELY NOT the end of a word.
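% The string reduction and the reverse scanning order described above can
% be sketched roughly as follows. This is an illustration in Python, not the
% original program. Assumptions: one "lhs -> rhs" rule per line, "%" starts a
% comment, "*" ends the file, and each rule fires once, with the last rule in
% the file tried first. The names parse_rules, reduce_word and SAMPLE_RULES
% are made up for the example, and the rule set is a toy subset.

```python
# Minimal sketch of the rule reduction (not the original program).
# Assumptions: one "lhs -> rhs" rule per line, "%" starts a comment,
# "*" ends the file, and rules fire once each in reverse file order.

SAMPLE_RULES = """\
% tidy-up rules come first in the file, so they run last
. ->
- ->
c -> k
ch -> X
*
"""


def parse_rules(text):
    """Return a list of (lhs, rhs) pairs in file order."""
    rules = []
    for raw in text.splitlines():
        line = raw.split("%", 1)[0].strip()  # drop comments
        if line == "*":                      # end-of-file marker
            break
        if "->" not in line:
            continue
        lhs, rhs = (part.strip() for part in line.split("->", 1))
        if lhs and not lhs.startswith("$"):  # skip internal rules such as $RULE1
            rules.append((lhs, rhs))
    return rules


def reduce_word(word, rules):
    """Apply every rule once, taking the last rule in the file first."""
    for lhs, rhs in reversed(rules):
        word = word.replace(lhs, rhs)
    return word


if __name__ == "__main__":
    # ".la-chen." stands for the hyphenated form of "lachen".
    print(reduce_word(".la-chen.", parse_rules(SAMPLE_RULES)))
```

% Because "ch -> X" sits below "c -> k" in the file, it is tried first, so
% the specific rule wins over the general one before the tidy-up rules strip
% the "." and "-" markers.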
% The hyphenation pre-pass takes an input word (e.g. 'hyphenation') and
% converts that to:  .hy- -phen- -a- -tion.
% The "." on the end is sometimes useful for distinguishing start/end of
% words.
% Multi-syllable translations are allowed, e.g.
%     -ation. -> -a-tion.
%     -ition. -> -i-tion.     [need better examples]
% As a side-effect of this, whole or part words which have been wrongly
% hyphenated by the automatic algorithm can be corrected here, as
% opposed to whole word replacement at an earlier stage. Some small
% compaction can be gained by only twiddling the bit of the word which
% was wrong.
% Direct look-up from whole word to phoneme string is also allowable
% here, but as a special case - because the hyphens have to be taken
% out before the word is looked up (otherwise the person generating
% the pronunciation would have to know the hyphenation of the word as
% produced by the first pass, which is a little tedious).
% Last rules: tidy up word for presentation to user:
. ->
y -> I
aa -> a
bb -> [B]
cc -> c
dd -> [D]
ee -> e
ff -> [F]
gg -> g
hh -> h
ii -> i
jj -> j
kk -> [K]
ll -> [L]
mm -> [M]
nn -> [N]
oo -> o
pp -> [P]
qq -> q
rr -> [R]
ss -> [S]
tt -> [T]
% NB: yy is listed twice; the first one was possibly meant to be uu -> u
yy -> y
vv -> [V]
ww -> w
xx -> x
yy -> y
zz -> [Z]
e. -> @.
- ->
% Non-terminal ijk
ijk -> @k
% Normal pronunciation.
j -> Y
x -> ks
% NO soft 'c's in Dutch. Lovely language.
c -> k
% These three are to correct split diphthongs - might be wrong...
s-j -> -(SH)
t-j -> -(TSH)
n-k -> (NG)-k
% ABN Dutch pronounces these close to the German way. Southern Dutch
% retains the English pronunciation.
v -> [V]
w -> [W]
% Next internal rule takes any case of cvcv and translates to c[VV]cv
% -- done as a bit of C code because the patterns, if explicitly
% listed, would be much too long, and this dumb program doesn't
% know about wild-cards. (The [VV] means an open vowel)
$RULE1 -> ????
% (rhs is ignored. I can't be bothered parsing :-)
% Note a bug: smekkie (kiss :-) ) appears to lose its kk -> k
% before this is called, with the result that the first E is opened.
% This is wrong :-(
% Dutch g is synonymous with their 'ch' grapheme.
% However I can't yet distinguish between soft & hard g's
% except possibly at start of word.
g -> X
% ng is *always* NG - it seems even when they happen together
% by accident. (But not if at join of subwords in a compound???)
n-g -> (NG)
% Single one of these at end of word is lengthened
a. -> [AA].
o. -> [OO].
u. -> [UU].
% And one of these at end is hardened
g. -> X.
d. -> T.
b. -> P.
ik. -> @k.
% The strange Dutch habit of dropping the n in words ending in -en.
en. -> @.
ch -> X
% This comes from German, so NOT isX
isch. -> i[SH].
th -> T
ng -> (NG)
nk -> (NG)k
% Same as sh
sj -> [SH]
tj -> [TSH]
% Some of these are pure vowels, some are diphthongs.
% I'll assume phoneme -> sound code knows them as a unit!
% Safe assumption since I'll probably have to write it myself :-(
uw -> [UW]
aa -> [AA]
oo -> [OO]
eu -> [EU]
uu -> [UU]
oe -> [OE]
ou -> [OU]
au -> [OU]
ee -> [EE]
% These depend on which route the word took to get into the language
% -- Saxon or German
ij -> [EI]
ei -> [EI]
% Different at end of word
ijk. -> UK
ie -> [IE]
ui -> [UI]
eeu -> [EEU]
ieu -> [IEU]
oei -> [OEI]
ooi -> [OOI]
aai -> [AAI]
ieuw -> [IEU]
eeuw -> [EEU]
tie -> [TS][IE]
% Foreign muck :-) In fact, often spelled 'kw' and believed to
% be native words by some Dutch (ho ho ho) - same with many French
% loan words - krant = courant etc.
qu -> kw
% uij as in Cuijk is just like ui
uij -> ui
% This should only be applied to prefix 'ge's, i.e. only the FIRST
% 'ge' in gegeven:
.ge- -> .X@
%-ge- -> -X[EE]-
% Actually, we want rules for splitting of prefixes; they should
% be dealt with separately.
% Also, we should split compound words independently of TeX's
% hyphenation, and treat the two parts as separate words.
% Both the above could be performed with the help of a dictionary!
% - strip off the 'ge-' etc, and see whether what is left is
%   still a word :-)
% And compound words are those which fail the spell-test as a whole, but
% can be parsed into subwords which pass it. Clearly any large word
% may have > 1 permissible parse, but we should accept the version
% with the fewest subwords found.
% Bug-fixes:
% Inhibit accidental nk -> (NG)k translation
nc -> nK
% Syllables:
-ti- -> -s[IE]-
hyp -> h[IE]p
% Prefixes:
.in-g -> .in-X
.inge- -> .inX@-
.opge- -> .opX@-
.in-ge- -> .inX@-
.op-ge- -> .opX@-
wr -> vr
% Tidy up artifacts of first pass...
-- -> -
.goe-de-na-vond. -> .goed-en-av-ond.
% NEED * for end of file. Imp compiler/library is a bit dodgy that way...
*
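% The dictionary-based compound splitting suggested in the notes above
% (accept the parse with the fewest subwords) might look something like
% this. It is only a sketch in Python, not part of the existing program:
% split_compound and LEXICON are made-up names, the word list is a toy
% stand-in for a real spelling dictionary, and Dutch linking letters such
% as -s- or -en- are not handled.

```python
# Sketch: split a compound into the fewest dictionary words.
# Assumption: "lexicon" is a set of known words (a toy list here).

LEXICON = {"spel", "ling", "spelling", "con", "trole", "controle"}


def split_compound(word, lexicon):
    """Return the parse of word with the fewest subwords, or None."""
    n = len(word)
    # best[i] holds the shortest list of subwords covering word[:i].
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue  # word[:start] cannot be parsed at all
            part = word[start:end]
            if part in lexicon:
                candidate = best[start] + [part]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


if __name__ == "__main__":
    print(split_compound("spellingcontrole", LEXICON))
    print(split_compound("xyz", LEXICON))
```

% A word that is already in the dictionary comes back as a single part, so
% the same routine doubles as the spell-test. Ties between parses of equal
% length are broken arbitrarily here; a real version would need a better
% policy.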