breaking a cryptogram

on p. 206 of coulmas’ 1989 book the writing systems of the world, there is an image of the following cryptogram:


apparently, the author gave this encrypted text to some of his grad students and allowed them one week to study it and to find out as much about it as they could. surprisingly, one of the students managed to decrypt it completely within that week, without even knowing that the language of the encrypted text was english.

so, being intrigued by this riddle, i decided to give it a try myself. and i did eventually come up with the correct solution, though it took me great effort and countless hours… during the process, i named and indexed every symbol of the script, did research on vowel and syllable frequencies, compiled a bunch of statistics, discussed my preliminary findings with various people, and even wrote an html/javascript tool to facilitate the testing of hypotheses :P the trouble was definitely worth it, though, since the feeling you get when the mess of letters before your eyes finally starts to make sense is very rewarding :)

in the following, i’ll explain what approaches i tried, what the largest obstacles were, and how i finally reached the solution. don’t read beyond this point if you want to give it a try yourself!

a few observations could be made right away from simply looking at the cryptogram:

  • there are more symbols than in the roman alphabet, but too few for a syllabary
  • we never find the same symbol twice in a row
  • there is a considerable number of single symbols which occur as words (i.e., between spaces)
  • there are no hyphenation marks, meaning there is either no hyphenation, or it is not indicated
  • because of the layout with a title and an indented first line of the paragraph, the text is clearly to be read left-to-right and top-to-bottom
  • one symbol that was evident was the one after the apostrophe in the title, which clearly had to be /s/

from these observations, i could draw one conclusion with certainty: because of the absence of doubled symbols, it couldn’t be a simple encoding of written english. rather, it was likely to be an encryption based on spoken english.
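the "no doubled symbols" check is easy to state as code. the following is a minimal python sketch, not something from the original analysis (which was done by eye). the transcript format (symbol names joined by dots, words separated by spaces) and all symbol names in the toy data are assumptions made up for illustration.

```python
def parse(transcript):
    """Split a transcript into words, each a list of symbol names.

    Assumed format: words separated by whitespace, symbol names
    within a word joined by dots, e.g. "q.a.t".
    """
    return [word.split(".") for word in transcript.split()]


def doubled_symbols(words):
    """Return (word_index, symbol) for every symbol that is
    immediately repeated inside a word."""
    hits = []
    for index, word in enumerate(words):
        for left, right in zip(word, word[1:]):
            if left == right:
                hits.append((index, left))
    return hits


# toy data: invented symbol names standing in for the cryptogram ...
cipher_like = parse("q.a.t k.o.m q")
# ... and ordinary english spelling, one letter per "symbol"
spelling = parse("l.i.t.t.l.e w.i.l.l")

cipher_hits = doubled_symbols(cipher_like)
spelling_hits = doubled_symbols(spelling)
```

ordinary spelling produces hits quickly (here for "little" and "will", both of which occur in the plaintext), while a transcript written sound by sound would be expected to produce few or none.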

besides that, i faced a number of problems:

  • the number of symbols (48) was certainly too low for a syllabary (since english has complex syllable codas, we’d expect at least 150), but surprisingly large for the number of phonemes of english (which i assumed to be around 30). guessing that long vowels might be encoded separately, and that phonemes like the glottal stop /ʔ/ and possibly affricates like /tʃ/ might have their own symbols, it seemed possible to reach a number of phonemes as high as about 40. but 48 just seemed too high.
  • the high number of symbols, together with the fact that as many as 6 different symbols occurred alone, made me think that at least some of the symbols must represent a syllable rather than a phoneme. the only english word i could think of that consists of only one phoneme is the indefinite article a. it might be possible to reach more when counting diphthongs (personal pronoun i) and interjections (oh, ah etc.), but at least the latter seemed an unlikely assumption. this paradox was in fact a main obstacle for breaking the cryptogram, since i frequently discarded my attempts at assigning sounds to symbols when they resulted in some of the single symbols being assigned values like t or n, which are not words of the english language by any account.
  • because of the insufficient quality of the printed image, i was unsure about the reading of some signs. especially in the title, some symbol shapes were not properly printed, leading to some problems (how do we decide if small differences in appearance are significant? are we dealing with graph variants, or separate graphemes?), and the punctuation marks were overall hard to judge (comma or period?). if you check the list of symbols below, you will even notice that i listed one symbol (“ascending-spike”) which in fact turned out not to be a separate letter of the alphabet, but identical with the symbol “ascending-bent”, modified by a printing stain.

to make any progress, i decided that i needed to name and index all the symbols, so that i could make a transcript of the cryptogram and start compiling some statistics. here’s the list with names:

[image: symbol names]

using standard unix commands like sed, uniq and sort, as well as some regular expression magic, it was quite easy to compile four kinds of statistics: 1) symbol frequencies, 2) word frequencies, 3) a frequency list of symbols occurring at the beginning of words, and 4) a frequency list of symbols occurring at the end of words.
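for readers who prefer a script to a shell pipeline, the four tables could be compiled roughly as follows. this is a python sketch and not the sed/sort/uniq commands actually used. the transcript format (symbol names joined by dots, words separated by spaces) and the symbol names in the toy transcript are assumptions for illustration only.

```python
from collections import Counter


def parse(transcript):
    """Split a transcript into words, each a list of symbol names
    (assumed format: "q.a.t q a.t", dots inside words, spaces between)."""
    return [word.split(".") for word in transcript.split()]


def frequency_tables(words):
    """Compile the four frequency tables for a parsed transcript."""
    # 1) how often each symbol occurs anywhere
    symbols = Counter(symbol for word in words for symbol in word)
    # 2) how often each whole word (sequence of symbols) occurs
    word_counts = Counter(tuple(word) for word in words)
    # 3) how often each symbol starts a word
    initial = Counter(word[0] for word in words)
    # 4) how often each symbol ends a word
    final = Counter(word[-1] for word in words)
    return symbols, word_counts, initial, final


# toy transcript with invented symbol names
words = parse("q.a.t q a.t q.i.s q")
symbols, word_counts, initial, final = frequency_tables(words)

# most_common() gives the sorted view that `sort | uniq -c | sort -rn`
# would give on the command line
top_symbols = symbols.most_common(3)
```

note that in this sketch a single-symbol word counts as both word-initial and word-final; whether that is desirable depends on the question being asked.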

but what could these statistics be compared to? i clearly needed some data about spoken english. luckily, i found a large transcript of a british english text on the internet, and i used it to compile similar statistics to the ones mentioned above, so that they could be compared easily.

thanks to that, i came up with some guesses about possible symbol-sound correlations. but how could i verify or falsify them easily? it seemed too tiresome to use a pencil and an eraser every time. that’s why i wrote a small html/javascript application that would substitute letters automatically and quickly. you can see it here (it’s actually still a bit buggy, but it was good enough for what i needed it for).
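the core of such a substitution tool is small. the sketch below is written in python rather than javascript and is not the original application; it only illustrates the idea of applying a partial table of guesses and leaving everything else visibly unknown. the transcript format and the symbol names are assumptions.

```python
def parse(transcript):
    """Split a transcript into words, each a list of symbol names
    (assumed format: dots inside words, spaces between words)."""
    return [word.split(".") for word in transcript.split()]


def substitute(words, guesses, unknown="?"):
    """Render the text with the current guesses applied.

    `guesses` maps symbol names to sounds; symbols without a guess
    are shown as `unknown`, so gaps stay visible.
    """
    rendered = []
    for word in words:
        rendered.append("".join(guesses.get(symbol, unknown) for symbol in word))
    return " ".join(rendered)


# toy transcript and a partial hypothesis (symbol names are invented)
words = parse("q.a.t w.o.t q")
guesses = {"q": "ð", "a": "æ", "t": "t"}

preview = substitute(words, guesses)
```

changing one entry in `guesses` and re-rendering takes the place of the pencil and eraser.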

one of the early ideas i had was that the sign i called “q” might have the value /ð/. the reason for that was that it was very frequent in anlaut (word-initially), but never occurred in auslaut (word-finally), which matched the distribution of /ð/ in english.

but this, again, led to the problem that /ð/ is not a word of the english language (not even in fast talk). this made me reconsider this guess repeatedly, even though it turned out to be correct in the end.
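the distributional argument for “q” can be phrased as a filter over the transcript: look for symbols that often begin a word but never end one. the python sketch below is only an illustration of that idea. it assumes that single-symbol words are left out of the count (otherwise a symbol that also occurs alone would count as word-final), and the threshold, the transcript format and the symbol names are all made up.

```python
from collections import Counter


def parse(transcript):
    """Split a transcript into words, each a list of symbol names
    (assumed format: dots inside words, spaces between words)."""
    return [word.split(".") for word in transcript.split()]


def initial_only(words, min_initial=2):
    """Symbols that start at least `min_initial` multi-symbol words
    and never end one."""
    longer = [word for word in words if len(word) > 1]
    initial = Counter(word[0] for word in longer)
    final = {word[-1] for word in longer}
    return sorted(
        symbol
        for symbol, count in initial.items()
        if count >= min_initial and symbol not in final
    )


# toy transcript with invented symbol names; "q" also occurs alone
words = parse("q.a.t q.i.s a.t q s.i.t")
candidates = initial_only(words)
```

the same filter run over a phonemic transcript of english would then show which sounds have a comparable distribution.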

the breakthrough was then made possible by two observations. first, the second to last word in the title consisted of two symbols, and it was in the position right before a name. therefore, it was likely to be either an article or a preposition. i also spotted the same sequence in a frequent three-symbol word (being identical to symbols two and three in that word). so i checked the list of the most frequent words in english to see if i could find a word consisting of three phonemes, where phonemes two and three together would be identical to an article or a preposition. and i did in fact find one (actually, more than one, but this one seemed most promising): /ðæt/ and /æt/. since this also fitted with my guess that the “q” symbol might be /ð/, i felt that this was a good path to explore. with that, i was in fact already on the right track.
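the pattern behind this first observation (a two-symbol word that reappears as the tail of a three-symbol word) can also be searched for mechanically. the following python sketch shows one way to do it; it is an illustration, not the procedure actually followed, and the transcript format and symbol names are assumptions.

```python
from collections import Counter


def parse(transcript):
    """Split a transcript into words, each a list of symbol names
    (assumed format: dots inside words, spaces between words)."""
    return [word.split(".") for word in transcript.split()]


def tail_matches(words):
    """Find three-symbol words whose last two symbols also occur as a
    word of their own. Returns (long_word, short_word, count_of_long)
    tuples, most frequent long word first."""
    counts = Counter(tuple(word) for word in words)
    pairs = []
    for word, count in counts.items():
        tail = word[1:]
        if len(word) == 3 and tail in counts:
            pairs.append((word, tail, count))
    return sorted(pairs, key=lambda pair: -pair[2])


# toy transcript with invented symbol names
words = parse("q.a.t a.t q.a.t q.i.s i.n")
matches = tail_matches(words)
```

because the function only looks at sequences, it could in principle be run over a phonemic word list of english as well, to list pairs of the /ðæt/ and /æt/ kind.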

second, i noticed two sequences where /ðæt/ was followed by another word of length three, and they both started with /ð-/. these were likely to be one of /ðæt ðei/, /ðæt ðer/, /ðæt ðis/ or /ðæt ði:z/, and it turned out that the latter two fitted well. with that, i saw the hypothesis confirmed that long vowels were represented by separate symbols. in addition, i discovered that “it iz” had appeared in other parts of the cryptogram, giving me some confidence that i was going in the right direction.

next, i concentrated on a frequent two-symbol word ending in long /i:/, and i came to the conclusion that a /w/ would fit best for the first symbol. the sequence /w?t/ was then likely to be /wɒt/.

in the meantime, it had finally dawned on me how to read the single-letter words: it had to be the case that the most frequent function words were given only by their characteristic consonant. therefore, /t/ was /to/, /ð/ was /ðə/, and one of the remaining ones had to be /n/ = /ænd/.

with these letters given, i managed to guess /wi: kænɒt/, occurring thrice in the first line, and this couldn’t possibly be a coincidence. from there on, it wasn’t too hard to guess the remaining symbols, though i admittedly didn’t get some of the symbols which represent a syllable (there were in fact a few of those in the alphabet, just as i had suspected based on the total number of symbols).

later, after decrypting the entire text, i found out that this script is in fact real. it’s known under the name of the shavian alphabet (check that link for the values of all the signs), and it was created on the occasion of a contest to invent an improved orthography for english (though it never became popular, for obvious reasons).

[the colored areas were crucial for the deciphering: blue – /æt/ and /ðæt/, green – /ðæt ði:z/ and /ðæt ðis/, orange – /wi:/]

the text turned out to be a part of lincoln’s gettysburg address. in full:

From Lincoln’s speech at Gettysburg

But, in a larger sense, we cannot dedicate…we cannot
consecrate…we cannot hallow…this ground. The brave men,
living and dead, who struggled here, have consecrated it
far above our poor power to add or detract. The world
will little note nor long remember what we say here, but
it can never forget what they did here. It is for us, the
living, rather, to be dedicated here to the unfinished
work which they who fought here have thus far so nobly
advanced. It is rather for us to be here dedicated to the
great task remaining before us…that from these honored
dead we take increased devotion to that cause for which
they gave the last full measure of devotion; that we here
highly resolve that these dead shall not have died in vain;
that this nation, under God, shall have a new birth of
freedom; and that government of the people, by the people,
for the people, shall not perish from the earth. (courtesy wikisource)

coulmas, florian: the writing systems of the world. oxford 1989.

Further Thoughts on Computational Linguistics

I’ve been talking to various people about the ideas expressed in my last post on computational linguistics. It seems that most people recognize the problems I talked about, but do not draw equally pessimistic conclusions from them. A few interesting points have been brought up. One of them is a possible analogy to the relation between the mind and the brain, using the image of a computer: The brain is the hardware, and the mind is the software, which runs on it. If that analogy is accurate, then we could claim that the mind is basically something like the following algorithm:

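while (true) {
	// read: take in raw signals from the nerves
	rawData = nerves.readRawData();
	// process: recognize patterns, draw conclusions, decide on reactions
	data = patternRecognizer.analyze(rawData);
	store(data, memory);
	conclusions = logicUnit.drawConclusions(memory);
	store(conclusions, memory);
	reactions = judgingModule.calculateReactions(memory);
	// send: pass the resulting signals back to the nerves
	send(reactions.getSignals(), nerves);
}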

It’s just an infinite loop that cycles through three steps: Read data from nerves, process data, and send output to nerves. That doesn’t sound like it’s hard to implement, does it? But is that really all our minds do? Does the algorithm cover all aspects of what’s going on in our minds?

A second point that has been brought to my attention is that it might be possible to get some sort of semantics by using a computer to simulate not just a single speaking agent, but a complete world inside the computer, and to put some agents in it that talk about this simulated world. These agents could of course mean things in the simulated world, since the agents and the things all belong to the same domain, all inside the computer. Of course, they could still not talk about the real world, but that does not matter, because at that point, you are only concerned with the simulated world. So that would be the ideal way to experiment with language learning etc. in a “snow globe” environment. The idea apparently originated in AI research, where it had become obvious that it could often be more rewarding not to work on real-life robots, but to use simulated robots interacting with a simulated world.

So one of the conclusions I’ve come to so far is that computational linguists are currently somewhat off the track, because they attempt really fancy stuff like syntactical parsing etc., but they completely ignore semantics. In my view, computational linguists should approach the problems differently: They should start out with artificial intelligence and try to model basic requirements like cognition and (primitive) language learning. Semantics should be a central focus from the beginning. Once that is achieved (but only then!), they could proceed to model the more complex aspects of a language, like its morphological or syntactical structure.

Thanks to everyone who commented and showed willingness to discuss these issues with me!

Computational Linguistics and Semiotics

The longer I think about computational linguistics, what the discipline is, what it’s trying to achieve and how far it has come by now, the more clearly I see some fundamental problems in the field. And by that, I don’t mean problems that are currently not solvable, e.g. because research has not advanced far enough, but rather problems which are not solvable in principle.

My reasoning is as follows:

The most fundamental feature of language is the concept of signs (German: Zeichen). Signs are used to refer to things in the world, and that is what makes them usable for communication. They always consist of two parts: One is the appearance of the sign (its form), the other is its function to refer to something (its meaning). Following F. de Saussure, the man who first established the theory of (linguistic) signs as a scientific discipline (called semiotics), these two parts are often called signifiant and signifié, i.e. “that which is indicating” and “that which is indicated”. Language couldn’t exist without either one of these two features: Without a form, the signs couldn’t be passed on between humans, and without meaning, they would not be able to refer to anything, and would therefore be without value. Because of that, both of these things are essential for linguistic signs, and consequently for language in general.

It follows from this that anyone who tries to model or reverse engineer human language must be concerned with both aspects of a sign: Both the signifiant and the signifié must be dealt with. This does not pose a problem for the signifiant part of the sign – it can easily be represented graphically or acoustically. However, we are likely to run into a problem with the signifié part of the sign. To model this, the device we are trying to model the language on needs to be able to deal with the concept of reference. It needs to understand the fact that a sign points to something outside of the language, something in the real world.

The way I see it, this makes it necessary for the device on which we model the language to have a concept of the real world and to know about its own place in that world. That, however, is the same thing as saying it needs a consciousness.

So can we emulate consciousness on a computer? I think not. First, the philosophers are still hard at work to figure out what consciousness really is (a subdiscipline which in German is called Bewusstseinsphilosophie), and second, what can be achieved in the field of artificial intelligence that I am aware of is still far, far away from anything like self-consciousness in machines. So as far as I can see, there is currently no way that a computer can have a consciousness of itself in this world, and it does not look likely that this is going to be possible in the near future, if at all.

From the fact that a computer can not have a consciousness, it follows that a computer can not have a concept of reference. From there, it follows that a computer can not properly deal with linguistic signs. And if it can not deal with linguistic signs, it can not deal with language in general.

That’s it. Unless I made a mistake, I have just shown that computers cannot have real linguistic competence in principle. Please prove me wrong if you can!

I have two more things to say about this:

(1) It may be objected that computational linguists have in fact been concerned with the meaning of linguistic signs, i.e. with semantics. Indeed, any textbook on computational linguistics has some chapters on semantics. However, if you ask me, what the computational linguists call “semantics” is not actual semantics, but rather something else, which I find quite silly: They model some aspects of the real world inside a computer (calling these models, among other things, ontologies), and then they make the “linguistic signs” they work with point to something inside this virtual model of the world. But if you think about it, you’ll see that this is not real “semantics” at all, because the reference is not to the real world outside of the computer, but only to the model of the real world inside the computer. The only thing achieved by this is to add a level of indirection, because the linguistic signs are now referring to elements inside the virtual model, which are in turn pointing to things in the real world. But the problem was just shifted: Instead of the reference from signs to things in the real world, we now have to deal with the reference of elements of our virtual model to things in the real world. In other words, it is not possible to overcome the limitation that a computer can never mean anything in the real world, no matter how elaborate the models of the real world are which we create inside the computer.

(2) I am not saying that computational linguistics is useless. I see two ways in which it can be worthwhile: (A) Practical uses. Obviously, some aspects of language can very well be modeled on the computer, and they can be used to get actual work done. For example, we have tools today which can do part-of-speech tagging, syntactic parsing or even rough translations of texts into foreign languages, and they do some of those tasks rather well. But you have to realize that these tools are severely limited by the fact that they operate solely on the signifiant part of the linguistic signs, and that they lack something elementary for the proper treatment of language, namely meaning. Put differently, these tools operate on words, but they operate on dead words, and they can never hide the fact that they have no understanding of them. These tools are hacks rather than proper implementations of linguistic competence. However, I won’t question that this is enough for some purposes, some of the time. Those tools work up to some point, and people will be happy with them (and even pay you money for them), if it helps them to get their work done quicker. (B) Theoretical uses. I believe that computational linguistics can be an “auxiliary discipline” for traditional linguistics, because trying to reverse-engineer a complex thing such as language will undoubtedly teach you a lot about the way it works (and maybe just as much about the way it doesn’t work). From my personal experience, I know that I was not aware of much of the syntactical complexity of language before I saw how hard it is to write a parser. Keep in mind, however, that only some aspects of language can be modeled on a computer, and that a complete “implementation” cannot be the goal of such reverse engineering.

So even though we are already used to the concept of linguistically competent agents, thanks to Star Trek etc., and even though technology evangelists have been predicting it in the real world for years, and still do so today, I do not believe that we will have talking computers/robots in the near future. Actually, I believe that it is quite likely that we will never have talking computers. Computers are extremely powerful tools, and they make many unlikely things possible, but that does not mean that they make everything possible. They are, after all, just that: tools.

NOTE: original comments on this post have unfortunately been lost :(

programming languages and languages

i have just read the book ‘eine kleine geschichte der sprache’ by steven roger fischer (2nd edition 2004; english original from 1999). it is a nice little book that offers a very general overview, aimed at lay readers, of the origin of language, of linguistics, the most important language families, language typology etc. naturally, the author only scratches the surface of the individual topics, but that lies in the nature of such a book, which targets a broad, non-specialist audience. in the part on the germanic languages, which i can judge thanks to my training, there are a few inaccuracies and errors (e.g. old norse cannot be called the ‘original germanic language’, p. 132), but nothing too serious.

however, and this is the reason for my blog post, there is one point on which i do not agree with fischer at all: namely that he repeatedly mentions programming languages in the same breath as natural languages, and thereby disregards the fundamental differences between the two. to put it bluntly: programming languages are not languages at all. strictly speaking, they have no place in an introductory linguistics book about the history of language.

two points may suffice to make this clear:

  • various definitions of ‘language’ are conceivable, but in no case can one get around the basic observation that language is a medium for communication. this holds for spoken language, for written language, for the sign languages of the deaf, and, if you like, even for the pheromone communication of insects and other animals. for programming languages, however, it precisely does not hold. programming languages do not enable communication. or have you ever had a conversation with a computer in java? has it perhaps answered one of your questions with a few lines of python? of course not, because programming languages serve an entirely different purpose: they are constructs for describing complicated computations on a computer efficiently and clearly. so there can be no talk of communication between human and computer. with a machine, one can merely interact, but not communicate. communication requires a conversation partner of equal standing.
  • if linguistics has arrived at one fundamental insight concerning the nature of language, it is that languages change over the course of time. yet precisely this fundamental principle does not apply to programming languages, at least not in the same way as with natural languages, where language change takes place without the conscious intervention of a ‘normative force’. programming languages can of course be revised by their inventors, but this has nothing at all to do with language change.

in my view, calling programming languages ‘languages’ is merely a metaphor that came about because of a few superficial properties (e.g. that both have a syntax). it was only ever meant to say that c++, assembler, fortran and java are something similar to languages, but by no means that they actually are languages. statements to the effect that computers can “speak” with one another, that they “use” programming languages, and that this supposedly works much like communication between humans and animals (all of which can be read on p. 223), are misleading and betray a fundamental lack of understanding of what programming languages are and how they work.