Substitution Cipher: Algorithm


The basic algorithm of the substitution cipher breaker can be characterized as hillclimbing with several random starting points. Our "landscape" consists of all permutations (substitutions) of the 27 characters allowed in the cipher-text.

We will distinguish three phases:

Phase 1: Picking a Starting Point

Originally we picked an arbitrary permutation of the alphabet. However, it proved to be better to start with a permutation that is "close" to the optimal permutation according to a single letter (monogram) frequency analysis.

By close we mean that the permutation can be obtained by a few transpositions (swaps) of neighbouring elements from this optimal permutation.

Phase 2: Climbing the Hill using Trigram Scoring

Giving a permutation to start with and a scoring scheme based on trigrams that assigns values to permutations. We now try to improve the permutation. That is we try to modify the permutation slightly so that the score of the modified permutation is higher than the score of the original one. We do so until we cannot improve the permutation anymore. As "slight modifications" we only consider transpositions.

Phase 3: Climbing the Hill using Dictionary Scoring

This phase tries to fine tune the permutation that was the result of phase 2. It also employs hillclimbing but uses a different dictionary based scoring scheme.

Phase 1 to 3 are repeated several times. The permutations that had the highest scores are saved.

Somtimes the dictionary optimization (phase 3) does more harm than good especially if there are many words in the text that are not in the dictionary. We therefore save not only the permutatation that had the highest dictionary score but also the one that had the highest trigram score.


Last Update: 97/06/20