breaking a cryptogram

on p. 206 of coulmas 1989 book the writing systems of the world, there is an image of the following cryptogram:

cryptogram

apparently, the author had given this encrypted text to some of his grad students, and gave them one week to study it and to find out as much about it as they could. surprisingly, one of the students managed to decrypt it completely in one week, without even knowing that the language of the encrypted text was english.

so, being intrigued by this riddle, i decided to give it a try myself. and i did eventually come up with the correct solution, though it took me a great effort and countless hours… during the process, i named and indexed every symbol of the script, did research on vowel and syllable frequencies, compiled a bunch of statistics, discussed my preliminary findings with various people, and even wrote a html/javascript tool to faciliate the testing of hypotheses :P the trouble was definitely worth it, though, since the feeling you get when the mess of letters before your eyes finally starts to make sense is very rewarding :)

in the following, i’ll explain what approaches i tried, what the largest obstacles were, and how i finally reached the solution. don’t read beyond this point if you want to give it a try yourself!

a few observations could be made right away from simply looking at the cryptogram:

  • there are more symbols than in the roman alphabet, but too few for a syllabary
  • we never find the same symbol twice in a row
  • there is a considerable number of single symbols which occur as words (i.e., between spaces)
  • there are no hyphenation marks, meaning there is either no hyphenation, or it is not indicated
  • because of the layout with a title and an indented first line of the paragraph, the text is clearly to be read left-to-right and top-to-bottom
  • one symbol that was evident was the one after the apostrophe in the title, which clearly had to be /s/

from these observations, i could draw one conclusion with certainty: because of the absence of doubled symbols, it couldn’t be a simple encoding of written english. rather, it was likely to be an encryption based on spoken english.

besided that, i faced a number of problems:

  • the amount of symbols (48) was certainly too low for a syllabary (since english has complex syllable codas, we’d expect at least 150), but surprisingly large for the amount of phonemes of english (usually assumed to be around 30). guessing that long vowels might be encoded separately, and that phonemes like the glottal stop /ʔ/ and possibly affricates like /tʃ/ might have their own symbols, it seemed possible to reach a number of phonemes as high as about 40. but 48 just seemed too high.
  • the high number of symbols, together with the high amount of 6 different symbols occurring alone, made me think that at least some of the symbols must represent a syllable rather than a phoneme. the only english word i could think of that consists of only one phoneme is the indefinite article a. it might be possible to reach more when counting diphthongs (personal pronoun i) and interjections (oh, ah etc.), but at least the latter seemed an unlikely assumption. this paradox was in fact a main obstacle for breaking the cryptogram, since i frequently discarded my attempts of assigning sounds to symbols when it resulted in some of the single symbols being assigned values like t or n, which are not words of the english language by any account.
  • because of the insufficient quality of the printed image, i was unsure about the reading of some signs. especially in the title, some symbol shapes were not properly printed, leading to some problems (how do we decide if small differences in appearence are significant? are we dealing with graph variants, or separate graphemes?), and the punctuation marks were overall hard to judge (comma or period?). if you check the list of symbols below, you will even notice that i listed one symbol (“ascending-spike”) which in fact turned out to be not a separate letter of the alphabet, but identical with the symbol “ascending-bent”, modified by a printing stain.

to make any progress, i decided that i needed to name and index all the symbols, so that i could make a transcript of the cryptogram and start compiling some statistics. here’s the list with names:

symbol names

using standard unix commands like sed, uniq and sort, as well as some regular expression magic, it was quite easy to compile statistics of 1) symbol frequencies 2) word frequencies 3) frequency list of symbols occurring in the beginning of words 4) frequency list of symbols occurring in the end of words.

but what could these statistics be compared to? i clearly needed some data about spoken english. luckily, i found a large transcript of a british english text on the internet, and i used it to compile similar statistics to the ones mentioned above, so that they could be compared easily.

thanks to that, i came up with some guesses of possible symbol – sound correlations. but how could i verify/falsify them easily? it seemed too tiresome to use a pencil and an eraser every time. that’s why i wrote a small HTML/javascript application that would substitute letters automatically and quickly. you can see it here (it’s actually still a bit buggy, but it was good enough for what i needed it).

one of the early ideas i had was that the sign i called “q” might have the value /ð/. the reason for that was that it was very frequent in anlaut, but never occured in auslaut, which matched the distribution of /ð/ in english.

but this – again – let to the problem that /ð/ is no word of the english language (not even in fast talk). this made me reconsider this guess repeatedly, even though it turned out to be correct in the end.

the breakthrough was then made possible by two observations. first, the second to last word in the title consisted of two symbols, and it was in the position right before a name. therefore, it was likely to be either an article or a preposition. i also spotted the same sequence in a frequent three-symbol word (being identical to symbol two and three in that word). so i checked the list of most frequent words in english to see if i could find a word consisting of three phonemes, where the phonemes two and three together would be identical to an article or a preposition. and i did in fact find one (actually, more than one, but this one seemed most promising): /ðæt/ and /æt/. since this fitted also with my guess that the “q” symbol might be /ð/, i felt that this was a good path to explore. with that, i was in fact already on the right track.

second, i noticed two sequences where /ðæt/ was followed by another word of length three, and they both started with /ð-/. these were likely to be one of /ðæt ðei/, /ðæt ðer/, /ðæt ðis/ or /ðæt ði:z/, and it turned out that the latter two fitted well. with that, i saw the hypothesis confirmed that long vowels were represented by separate symbols. in addition, i discovered that “it iz” had appeared in other parts of the cryptogram, giving me some confidence that i was going in the right direction.

next, i concentrated on a frequent two-symbol word ending in long /i:/, and i came to the conclusion that a /w/ would fit best for the first symbol. the sequence /w?t/ was then likely to be /wɒt/.

in the meantime, it had finally dawned on me how to read the single-letter words: it had to be the case that the most frequent functional words were given only by their characteristic consonant. therefore, /t/ was /to/, /ð/ was /ðə/, and one of the remaing ones had to be /n/ = /ænd/.

with these letters given, i managed to guess /wi: kænɒt/, occurring thrice in the first line, and this couldn’t possibly be a coincidence. so from there on forward, it wasn’t too hard anymore to guess the remaining symbols, though i admittedly didn’t get some of the symbols which represent a syllable (there were in fact a few of those in the alphabet, just as i had suspected based on the total number of symbols).

later, after decrypting the entire text, i found out that this script is in fact real. it’s known under the name of the shavian alphabet (check that link for the values of all the signs), and it was created at the occasion of a contest to invent an improved orthography for english (though it never became popular, for obvious reasons).

cryptogram
[the colored areas were crucial for the deciphering: blue – /æt/ and /ðæt/, green – /ðæt ði:z/ and /ðæt ðis/, orange – /wi:/]

the text turned out to be a part of lincoln’s gettysburg address. in full:

From Lincoln’s speech at Gettysburg

But, in a larger sense, we cannot dedicate…we cannot
consecrate…we cannot hallow…this ground. The brave men,
living and dead, who struggled here, have consecrated it
far above our poor power to add or detract. The world
will little note nor long remember what we say here, but
it can never forget what they did here. It is for us, the
living, rather, to be dedicated here to the unfinished
work which they who fought here have thus far so nobly
advanced. It is rather for us to be here dedicated to the
great task remaining before us…that from these honored
dead we take increased devotion to that cause for which
they gave the last full measure of devotion; that we here
highly resolve that these dead shall not have died in vain;
that this nation, under God, shall have a new birth of
freedom; and that government of the people, by the people,
for the people, shall not perish from the earth. (courtesy wikisource)

references
coulmas, florian: the writing systems of the world. oxford 1989.