Cisco Autotranslation?

Network World reports on Cisco's plans to deliver a telepresence rig with automatic translation:
Cisco will add real-time translation to its TelePresence high-definition conference technology next year, enabling people in several different countries to meet virtually and each hear the other participants' comments in their own languages.


It will include speech recognition in the speaker's native language, a translation engine, and text-to-speech technology to deliver the words in a synthesized voice on the other end. Users will also be able to display subtitles if they choose, he said. Both Asian and Western languages will be represented in the initial set, which will later be expanded.

I don't want to sound reflexively negative, but I'm pretty skeptical that this is going to work in any kind of practical way. As described above, it depends on three separate technologies, none of which works particularly well. Domain-specific speech recognition systems sort of quasi-work, though they're quite imperfect: United's IVR can barely recognize my frequent flier number. This is of course partly an artifact of bad phone quality (though it's not clear the remote mikes that these telepresence rigs use will be that much better), but it's much easier to build a domain-specific system than a generic system. My understanding is that generic speech recognition systems have pretty high error rates. Wikipedia claims 98-99% accuracy for generic, continuous speech systems, but only under "optimal condition", which includes training the system for the speaker.

This brings us to the topic of machine translation. You don't need to read up on this. Just try Google's machine translator. Even when it does a good job, it produces annoying, ungrammatical artifacts on the order of one every other sentence or so. And remember that this is with written text, which is actually fairly grammatical to start with. Spoken language contains all sorts of odd artifacts, pauses, etc. that don't make the translation any easier. Quasi-grammatical English passed through an error-prone recognition system and then a not-that-accurate translator does not sound like a recipe for accurate results.
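
As a rough back-of-the-envelope sketch of how these errors compound, the Python below multiplies the stages together. It assumes errors are independent from word to word and from stage to stage, which is a simplification. The 15-word sentence length is an arbitrary illustrative choice, the 0.5 translation figure is just "one artifact every other sentence" restated, and the function name is made up for this example.

```python
def clean_sentence_probability(word_accuracy: float,
                               words_per_sentence: int,
                               translation_clean_rate: float) -> float:
    """Chance that a sentence gets through recognition and translation
    with no error at all, assuming independent errors (a simplification)."""
    # Every word must be recognized correctly for the sentence to be clean.
    recognition_clean = word_accuracy ** words_per_sentence
    # The translator then has to produce an artifact-free sentence.
    return recognition_clean * translation_clean_rate


if __name__ == "__main__":
    # 0.99 and 0.98 are the "optimal condition" figures quoted above;
    # 0.90 is a hypothetical stand-in for a noisy conference room.
    for accuracy in (0.99, 0.98, 0.90):
        p = clean_sentence_probability(accuracy, 15, 0.5)
        print(f"word accuracy {accuracy:.2f}: {p:.0%} of sentences arrive clean")
```

Under these assumptions, even the best-case recognition figures leave well under half of all sentences arriving clean, and that is before text-to-speech gets involved.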

The final stage of the translation is text-to-speech, which introduces a whole new level of fun. Again, voice synthesis does work, but often sounds kind of odd, which is part of why systems often use pre-recorded voice rather than voice synthesis.

So, this may work at some technical level, but I have a hard time believing that listening to a robotic-sounding, ungrammatical, error-prone partial translation during a teleconference is going to be anything other than annoying.


I agree with your analysis. If the source text is clear, machine translation gives 'good enough' results. As you write, human speech contains ungrammatical artefacts. Additionally, human speech is not optimised for machine translation. Sometimes, it is ambiguous. For example, "Find the man with a dog" can mean either of the following:
* Use a dog to find the man.
* Find the man who has a dog.
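
To make the two readings concrete, here is a minimal sketch in Python. The `Reading` class and `readings` function are hypothetical names invented for illustration, not part of any real parser. All the code does is write out the two ways the phrase "with a dog" can attach, to show that one string of words maps to two different structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    attaches_to: str  # what the trailing phrase modifies
    bracketing: str   # the structure, written with nested brackets
    paraphrase: str   # plain-English gloss of this reading


def readings(verb: str, obj: str, pp: str) -> list[Reading]:
    """Return both attachments of a trailing prepositional phrase."""
    return [
        # The phrase modifies the verb: it says how the action is done.
        Reading("verb", f"[{verb} [{obj}] [{pp}]]",
                f"'{pp}' is how to {verb}"),
        # The phrase modifies the object: it says which object is meant.
        Reading("object", f"[{verb} [{obj} [{pp}]]]",
                f"'{pp}' describes {obj}"),
    ]


for r in readings("find", "the man", "with a dog"):
    print(f"{r.attaches_to:>6}: {r.bracketing}  ({r.paraphrase})")
```

Nothing in the words themselves says which bracketing is right. That choice has to come from context.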

With human interpretation, the context usually lets the interpreter know which meaning is correct, so the interpreter can supply the correct interpretation. Until a machine has an 'understanding' of the real world, automatic speech translation is not likely to work.

(For the record, I think that automatic translation of text has an excellent future, because source text can be optimised to give good results.)

I think that, of the 3 areas, the last has made the most progress in the last decade or so. The latest text-to-speech engines sound almost natural (though a little croaky sometimes). Giant tables of word/phrase -> phonemes have been accumulating over time for "exceptions", at least for US English. The result is that TTS (for US English anyway) is now surprisingly good, though in a generic system like this it'll be full of mangle.

This is supposedly for business use. I wonder what the Chinese translation of "I'm going to action you to operationalize that ask" would be.

Hard to imagine that the translation of "I'm going to action you to operationalize that ask" would be much better than "All your routers are belong to us", but in the name of science I decided not to just be cynical and to see what happened with some software I had lying around. Don't confuse this with anything my employer might ship to a customer because, well, the results speak for themselves.

The speech reco, when done with no background noise, came out as: "I'M GOING TO ACTION YOU TO RATIONALIZE THAT YEAH"

and the translation software seemed to produce

I' 去行动的M您合理化那呀

Now if someone could provide a human translation of that back to English, I would appreciate it.

The Spanish version was

And the French


