Building Babel: Lost in machine translation


Scientists have been trying to automatically translate languages for almost as long as computers have been in existence. So why is it so hard?

Earlier this year, the Malaysian Ministry of Defence unveiled its glossy new website, designed to show off its military prowess and high standards to the world. Unfortunately, nobody had bothered to check the English translations.

One section said that the Malaysian government had taken “drastic measures to increase the level of any national security threat” after the country's independence in 1957. Another page suggested women should not wear items that “poke out the eye”, an apparent translation of a rule that women should not wear revealing clothing.

Initially it was just sniggering Malaysians who passed the gaffes around on social media, but the chortles soon became global, prompting the Defence Minister to admit that the ministry had used the free online tool Google Translate. He subsequently ordered the new military site to be removed.

The episode was embarrassing for the Malaysian ministry, but it also provides an object lesson in the limitations of today’s machine translation technology, which despite billions of pounds of research and massive demand from businesses, politicians and the military, not to mention tourists, is still only stuttering along.

According to Phil Blunsom, a lecturer and machine translation researcher at the University of Oxford, the field has made a lot of progress. But a time when a computer can match the interpretive skills of a professional is “still a long way off”.

So why is it so hard to automatically translate texts?

Scientists and academics have been trying to automate translation for almost as long as computers have been in existence. In the 1940s and 1950s it was widely assumed that once the vocabulary and the rules of grammar of a language had been codified, automated translation would be easy, according to Dr Blunsom. But attempts to make computers learn languages in this way over the next forty years were largely unsuccessful, unless the range of words they were expected to translate was very limited.

"The main problem is that language is too complex," explains Philipp Koehn, a machine translation researcher at the University of Edinburgh School of Informatics. "Language is always ambiguous, so you can’t always use rules, and new vocabulary is always coming in, so you need someone to continually maintain those rules." What it boils down to is that there are simply too many possible rules for them all to be written down, and there are also too many exceptions to those rules, he adds.

Then in the 1980s, computer giant IBM carried out pioneering research into the use of words in sentences. Specifically, its researchers examined the relative frequency of different groups of three words occurring in a sentence. For example, they noted "going to go" occurs far more frequently than "going too go" or "going two go". So although the three phrases sound almost identical, the first is statistically most likely to be correct.
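The idea can be sketched in a few lines: count three-word sequences ("trigrams") in a body of text, then pick whichever sound-alike candidate occurs most often. The corpus below is a tiny invented stand-in for the huge text collections a real system would use.

```python
from collections import Counter

# Toy corpus; real systems count trigrams over millions of sentences.
corpus = [
    "i am going to go home now",
    "we are going to go out",
    "she is going to go later",
    "you are going too slowly",
]

def trigrams(sentence):
    words = sentence.split()
    return zip(words, words[1:], words[2:])

counts = Counter(t for s in corpus for t in trigrams(s))

# Of three phrases that sound almost identical, the one seen most
# often in real text is statistically most likely to be correct.
candidates = [("going", "to", "go"), ("going", "too", "go"), ("going", "two", "go")]
best = max(candidates, key=lambda t: counts[t])
print(best)  # → ('going', 'to', 'go')
```

The system needs no grammar rules at all here; raw frequency alone separates the correct spelling from its homophones.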

This apparently simple insight had huge repercussions, opening up a new statistical approach to translation.

"The vast majority of research into machine translation is now pursuing the statistical approach," says Dr Blunsom.

Online services such as Google Translate and Yahoo! Babel Fish both use statistical machine translation techniques – although Yahoo!'s system is best described as a hybrid approach that makes heavy use of rules, as well as statistics.

More than words

The statistical translation approach works by analysing parallel corpora – bodies of text that have already been translated from one language to another. Put simply, the translation system looks out for a word or phrase in one language that crops up whenever a particular word or phrase appears in the other language. If it spots "un chien noir" in French every time the phrase "a black dog" occurs in English, then it stores these two phrases together in a "phrase table".
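A minimal sketch of that idea, using an invented three-sentence English–French corpus: count how often each phrase on one side co-occurs with each phrase on the other, and the consistent pairs rise to the top of the table.

```python
from collections import Counter

# Tiny invented parallel corpus (English / French sentence pairs);
# real systems learn phrase tables from millions of such pairs.
parallel = [
    ("a black dog barked", "un chien noir a aboyé"),
    ("she saw a black dog", "elle a vu un chien noir"),
    ("the black cat slept", "le chat noir dormait"),
]

def phrases(sentence, n):
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Count how often each English three-word phrase co-occurs with
# each French three-word phrase in the same sentence pair.
cooccur = Counter()
for en, fr in parallel:
    for e in phrases(en, 3):
        for f in phrases(fr, 3):
            cooccur[(e, f)] += 1

# "a black dog" appears alongside "un chien noir" in both sentences
# that contain it, so the pair earns a high count in the phrase table.
print(cooccur[("a black dog", "un chien noir")])  # → 2
```

Real phrase tables also normalise these counts into probabilities and handle phrases of varying lengths, but the co-occurrence counting shown here is the core of it.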

The system also analyses large amounts of text in individual languages and makes a note of the frequency with which certain words or groups of words follow others. It uses this to build a "language model", which is employed to help make sure that sentences are put together with words in an order that is statistically likely to be correct, and also to help decide between different choices of words – especially when a source word can have more than one meaning.

"Essentially, we are translating using probabilities to find the best solution," says Dr Blunsom. "The computer doesn't understand the languages or know any grammar, but might use statistics to determine that 'dog the' is not as likely as 'the dog.' What we are doing is a larger-scale version of what was done with the Rosetta Stone."
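Dr Blunsom's "dog the" example can be made concrete with a toy language model: estimate the probability of one word following another from counts in a (here, invented) monolingual corpus.

```python
from collections import Counter

# Invented monolingual corpus; a real language model is trained
# on vast amounts of text in the target language.
corpus = "the dog ran . the dog barked . a dog slept . the cat ran ."
words = corpus.split()

unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))

def bigram_prob(w1, w2):
    # P(w2 | w1) estimated by simple relative frequency.
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

# "the dog" is a likely word order in English; "dog the" is not.
print(bigram_prob("the", "dog"))  # ≈ 0.67
print(bigram_prob("dog", "the"))  # 0.0
```

The model knows no grammar: it simply prefers word orders it has seen often, which is usually enough to rule out "dog the".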

To improve the ability of computers to make these kinds of decisions requires massive numbers of source texts.

"It depends on the languages concerned, but we usually need at least 30 million words or one million sentences," says Dr Blunsom. And fortunately for the machine translation researchers, huge parallel corpora are freely available in multiple language pairs, thanks to organisations like the European Union and United Nations, which use human translators to generate documents in multiple languages to distribute among their member states.

These kinds of texts allow systems to easily translate between common languages – say German and English, or Italian and French. But it becomes more difficult when the translation involves less widely spoken languages.

Step forward search giant Google, which is trying to solve this problem. Despite high-profile gaffes like those in Malaysia, its proprietary system is state of the art. It is able to hoover up vast amounts of information from the web and can currently translate between 63 commonly spoken languages.

But it is also hunting and harvesting parallel corpora from the web in less widespread languages – such as Tamil, Armenian and Basque – wherever it can find them.

“There are plenty of translations into many languages that all sorts of people have done and placed on the web. Some are high quality translations, and some are not, but we use both types because what's important is the sheer volume of data we get,” says Franz Och, one of the world's leading machine translation experts and the man who heads up Google's machine translation group.

“Mistranslations are not such a problem, as not everyone makes the same mistakes," he says. News reports of the same story in different languages are also valuable: although these are not parallel corpora, they usually contain enough common information to be of use to machine translation systems.

English: Premier lingua

The vast repository of information that Google is able to sort through means it can pull off another trick: translating between two languages where parallel corpora do not exist, say between Arabic and Welsh. In situations like this, it turns out that English is the key.

“A long time ago, it was thought that the best approach to translation was to go via some symbolic meaning representation language – an artificial interlingua that was an idealised way of representing meaning,” says Dr Och.

“I guess that this was wrong. English is turning out to be the great interlingua replacement.”

English works well because of its prevalence on the web. Machines can translate from Welsh into Arabic using English corpora as an intermediary step. This is less accurate than when languages can be translated directly, because errors multiply during each translation process. But with many language pairs, there is no alternative but to use this method, according to Dr Och.
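The pivot step itself is straightforward to sketch. The dictionaries below are hypothetical one-word stand-ins for full translation models, but they show why errors compound: a mistake in either leg corrupts the final result.

```python
# Hypothetical toy dictionaries: no Welsh–Arabic table exists directly,
# but both languages can be paired with English.
welsh_to_english = {"ci": "dog", "du": "black"}
english_to_arabic = {"dog": "kalb", "black": "aswad"}  # romanised Arabic

def pivot_translate(word, src_to_en, en_to_tgt):
    """Translate via English as the intermediary step."""
    en = src_to_en.get(word)
    return en_to_tgt.get(en) if en else None

print(pivot_translate("ci", welsh_to_english, english_to_arabic))  # → kalb
```

Any word missing from either table simply fails to translate, and a wrong English intermediary would silently propagate into the target language – the error-multiplication Dr Och describes.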

However, parallel corpora and pivoting through English only get you so far. Language is so complex that models need to take other factors into account. In some languages, words are gender specific – the word for a male cousin differs from the word for a female cousin – so a good system has to look forwards or backwards in an English text for a reference to "he" or "she" to determine the appropriate form of the word when it translates it, which is not a simple process.

Things also tend to go wrong when you go beyond simple sentences to more complex ones. "If you use a metaphor or any poetic language, then things get much more difficult," says Dr Blunsom. "If you use a pun that the system has not seen before, then it will just translate it literally."

The vagaries and origins of different languages also mean that some things cannot be expressed – a concept known as untranslatability. Throw in some neologisms or the use of portmanteau, and subtle meanings are totally lost.

Google vortex

Google’s techniques also bring in another more obscure problem – the "Google time loop". This happens when the search firm finds what appears to be new parallel corpora, but is in fact a piece of text that has been translated using Google's own service. "We try to detect if a translation that we find is one of our own, because it can cause problems," Dr Och explains. "It means that we would get stuck in the past – it can reintroduce old mistakes, as finding them again appears to reaffirm that they are correct." As a result, the search firm has had to develop its own bespoke range of algorithms just to try to detect its own translations, Dr Och reveals.
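Google has not published how its filter works, but a crude version of the idea can be sketched: before adding a candidate sentence pair to the training data, check whether the target side is simply what your own system would produce for the source side. Everything below – the lookup table standing in for the deployed system and the exact-match test – is an invented illustration, not Google's method.

```python
def own_translate(text):
    # Stand-in for the deployed translation system: an invented lookup.
    return {"un chien noir": "a black dog"}.get(text, text)

def is_self_translation(source, target):
    # Crude heuristic: if the candidate pair exactly matches our own
    # output, it was probably machine-generated and should be skipped,
    # or old mistakes get fed back in and appear "confirmed".
    return own_translate(source) == target

# A pair produced by the system itself is filtered out...
print(is_self_translation("un chien noir", "a black dog"))  # → True
# ...while a differently worded human translation is kept.
print(is_self_translation("un chien noir", "a dark-coated dog"))  # → False
```

A real filter would have to be far more robust – exact matching misses lightly edited machine output – but it shows why the loop is dangerous: the system's own errors come back looking like independent evidence.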

But as researchers grapple with these problems, every advance takes us one step closer to the idea of a universal translator. These mobile devices, a common feature of science fiction, allow speech to be translated on the fly, with no need to be in front of a computer. Perhaps the most famous depiction is the Babel Fish, a surreal invention of the late novelist Douglas Adams, and the inspiration for the name of Yahoo!’s translation engine. In his book The Hitchhiker’s Guide to the Galaxy, Adams described a small, yellow creature that, when inserted into the ear, translated alien languages instantly.

However, the reality of creating such a device presents even bigger challenges than just translating text.

"Spoken language is actually quite different from the written word, because sentences contain 'ums' and 'errs'," says Bill Byrne, a reader in Information Engineering at the University of Cambridge. "They may also have false starts, and reference things that were said earlier with phrases such as 'like I said.' Then of course there are some spoken languages that don't have a written form at all," he says, adding that researchers are only just beginning to figure out how to grapple with problems like these.

But several devices have been developed over the years, including one used by the US military in Iraq that stored around 2500 unique phrases and allowed soldiers to communicate – albeit in a basic fashion – with the local population.

But perhaps the nearest thing to a Babel Fish that is currently available is Google's Translate smartphone app. This builds on its web tool and can currently recognise speech in 17 languages.

To carry out a translation, a person speaks into the phone with the app running. The speech is recorded, and the clip is sent over the internet to Google's speech recognition servers, which process the sound and transcribe it into text. The text is then run through the company’s existing web tools to produce a translation, which is passed to a text-to-speech system that produces an audio file to be sent back over the internet to the phone. This all happens in a matter of seconds.
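The pipeline just described – speech recognition, then translation, then text-to-speech – can be sketched as three chained stages. The functions below are stand-ins (in the real app each stage runs on Google's servers), and the transcript, lookup table and audio label are invented for illustration.

```python
def recognise_speech(audio_clip):
    # Stand-in for the speech recognition servers: audio in, text out.
    return audio_clip["transcript"]

def translate_text(text, source, target):
    # Stand-in for the web translation tool: a hypothetical lookup table.
    table = {("hello", "en", "fr"): "bonjour"}
    return table.get((text, source, target), text)

def synthesise(text):
    # Stand-in for text-to-speech: text in, an 'audio file' out.
    return f"<audio:{text}>"

def speech_to_speech(audio_clip, source, target):
    text = recognise_speech(audio_clip)
    translated = translate_text(text, source, target)
    return synthesise(translated)

print(speech_to_speech({"transcript": "hello"}, "en", "fr"))  # → <audio:bonjour>
```

Chaining the stages also explains why errors accumulate: a misheard word in the first stage is faithfully translated and spoken aloud by the later ones.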

It is not perfect – background noise and regional accents can mean it gets muddled, and it often mistranslates ambiguous terms – but as it depends on examples to learn, Google says that the quality should improve.

However, there is one thing that currently holds this back from being a truly universal, mobile Babel Fish: it needs an internet connection to work, as even the most powerful smartphones cannot hold and process all of the necessary data to translate effectively.

"The language models we have developed are so big that they are stored on many different computers and involve many, many gigabytes of tables," explains Dr Och.

Even if the computing power of smartphones continues its onward march, it will be many years before the phone in your pocket can work as a Babel Fish by itself.

Dr Och, perhaps sensibly, will not be drawn on whether – or when – a true Babel Fish-like device may be built. However, he says, there may be unforeseen consequences if researchers succeed.

“The end effect could be that everyone gets to know, understand and like each other better, and I get the Nobel Prize. But don't forget that in Douglas Adams' book, people who used the Babel Fish ended up understanding how much they hated each other."
