Friday, July 1, 2011

Statistical Transmogrification

Machine translation has improved so much that we can now joke about it, pointing and laughing when it comes up with something particularly inane. I do mean "improved so much" - ten years ago the results on arbitrary text were so incomprehensible that they weren't even amusing. Statistical machine translation is a big data approach to the problem, looking for statistical correlations in the way humans translate between languages. It radically improved the results compared to trying to get the machine to understand grammar, at least when there is sufficient data available.

Which brings us, of course, to Star Trek and the Universal Translator. As a plot device, the translator is essential: you either portray alien species as inexplicably speaking human languages, or you employ a magical device to learn the alien language and translate. One of my favorite episodes revolves around the limitations of the device: Darmok. The translator works with individual words and phrases, but cannot translate metaphor and cultural references.

"Darmok and Jalad at Tanagra."
"Kiteo, his eyes closed."
"Shaka, when the walls fell."
"Zinda, his face black, his eyes red!"
"Mirab, with sails unfurled."
"The beast at Tanagra."

Ever notice how many cultural references we use in everyday conversation, as a shorthand to convey deeper meaning in a small number of words? I tried to keep track of them for a week. It's difficult even to take note of the ones you employ yourself: the mind doesn't categorize them as special; they're just another part of the language. They are mostly noticeable when someone else uses an unfamiliar reference that you really have to think about.

"Multiplying like a wet gremlin."
"It's my precious."
"Don't cross the streams."
"Use the carrot, not the stick."
"I drink your milkshake."
"He was the red-shirted ensign of that project."
"That is Kryptonite to her."
"He has a portrait up in the attic getting older and older."

An interesting thing about statistical translation is that it is even able to handle references like these, if they are sufficiently common. It's looking for correlation, not meaning. If humans can come up with a reasonable translation for a cultural reference, then the machine will as well.
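To make "correlation, not meaning" concrete, here is a toy sketch in Python. It is nowhere near a real statistical translation system: the aligned phrase pairs are a tiny made-up corpus, and the names `ALIGNED_PHRASES`, `build_phrase_table` and `translate` are invented for this illustration. It just counts how human translators rendered each phrase and picks the most frequent rendering.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-made "corpus" of human translations (English -> French).
# A real system would learn from millions of aligned sentences.
ALIGNED_PHRASES = [
    ("it's raining cats and dogs", "il pleut des cordes"),
    ("it's raining cats and dogs", "il pleut des cordes"),
    ("it's raining cats and dogs", "il pleut des chats et des chiens"),
    ("use the carrot, not the stick", "la carotte plutôt que le bâton"),
]


def build_phrase_table(pairs):
    """Count how often each source phrase was paired with each target phrase."""
    table = defaultdict(Counter)
    for source, target in pairs:
        table[source][target] += 1
    return table


def translate(phrase, table):
    """Return the most frequently observed translation, or None if unseen."""
    candidates = table.get(phrase)
    if not candidates:
        return None  # no data, no translation
    return candidates.most_common(1)[0][0]


phrase_table = build_phrase_table(ALIGNED_PHRASES)
print(translate("it's raining cats and dogs", phrase_table))
```

The code never knows what cats, dogs, or rain are. It only knows that humans usually wrote the idiomatic French phrase next to the English one, so that is what wins. A phrase that never appears in the table simply comes back as `None`.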

This doesn't help Picard, though: no data corpus to work with.