Networking: December 2008 Archives


December 9, 2008

Network World reports on Cisco's plans to deliver a telepresence rig with automatic translation:
Cisco will add real-time translation to its TelePresence high-definition conference technology next year, enabling people in several different countries to meet virtually and each hear the other participants' comments in their own languages.


It will include speech recognition in the speaker's native language, a translation engine, and text-to-speech technology to deliver the words in a synthesized voice on the other end. Users will also be able to display subtitles if they choose, he said. Both Asian and Western languages will be represented in the initial set, which will later be expanded.

I don't want to sound reflexively negative, but I'm pretty skeptical that this is going to work in any kind of practical way. As described above, it depends on three separate technologies, none of which works particularly well. Domain-specific speech recognition systems sort of quasi-work, though they're quite imperfect: United's IVR can barely recognize my frequent flier number. This is of course partly an artifact of bad phone quality (though it's not clear the remote mikes that these telepresence rigs use will be that much better), but it's much easier to build a domain-specific system than a generic one. My understanding is that generic speech recognition systems have pretty high error rates. Wikipedia claims 98-99% accuracy for generic, continuous speech systems under "optimal condition", which includes training the system for the speaker. Even taken at face value, that is roughly one misrecognized word in every 50 to 100.
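To put the Wikipedia figure in perspective, here is a rough back-of-the-envelope sketch of what per-word accuracy means for whole sentences. It assumes the 98-99% number is per-word accuracy, that errors are independent from word to word, and that a typical spoken sentence runs about 20 words; all three are simplifying assumptions of mine, not anything Cisco or Wikipedia states.

```python
def error_free_probability(word_accuracy: float, num_words: int) -> float:
    """Chance that every word in a sentence is recognized correctly,
    assuming each word is an independent trial (a simplification)."""
    return word_accuracy ** num_words


# 20 words per sentence is an illustrative guess, not a measured value.
SENTENCE_LENGTH = 20

for accuracy in (0.98, 0.99):
    clean = error_free_probability(accuracy, SENTENCE_LENGTH)
    print(f"{accuracy:.0%} word accuracy: about {clean:.0%} of "
          f"{SENTENCE_LENGTH}-word sentences come through with no errors")
```

Under those assumptions, 98% word accuracy leaves only about two thirds of sentences untouched, and that is the best case with a trained speaker.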

This brings us to the topic of machine translation. You don't need to read up on this. Just try Google's machine translator. Even when it does a good job, it produces annoying, ungrammatical artifacts on the order of one every other sentence or so. And remember that this is written text, which is actually fairly grammatical to start with. Spoken language contains all sorts of odd artifacts, pauses, etc. that don't make the translation any easier. Quasi-grammatical English passed through an error-prone recognition system and then a not-that-accurate translator does not sound like a recipe for accurate results.
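The compounding can be sketched with some crude arithmetic. The numbers here are illustrative assumptions: a recognizer with 98% per-word accuracy on 20-word sentences, and a translator that gets about half of all sentences through without an artifact (my "one every other sentence" estimate above). It also assumes the stages fail independently, which is optimistic, since a translator fed misrecognized text will presumably do worse, not the same.

```python
import math


def pipeline_clean_rate(stage_rates):
    """Fraction of sentences that survive every stage untouched,
    assuming the stages fail independently of each other."""
    return math.prod(stage_rates)


# Illustrative figures only, not measurements of any real system.
recognition_clean = 0.98 ** 20   # all 20 words recognized correctly
translation_clean = 0.5          # one artifact every other sentence

overall = pipeline_clean_rate([recognition_clean, translation_clean])
print(f"Sentences clean after recognition and translation: {overall:.0%}")
```

With these inputs only about a third of sentences arrive at the far end without some kind of error, and that is before text-to-speech gets involved.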

The final stage of the translation is text-to-speech, which introduces a whole new level of fun. Again, voice synthesis does work, but it often sounds kind of odd, which is part of why systems often use pre-recorded voice rather than voice synthesis.

So, this may work at some technical level, but I have a hard time believing that listening to a robotic-sounding, ungrammatical, error-prone partial translation during a teleconference is going to be anything other than annoying.