VoIP Phone-Call Transcription

VoIP Phone-Call Transcription is a research pilot for automatically transcribing PC-based Communicator voice calls using speech recognition. Transcriptor! generates both a live transcript during a call (for multi-tasking users) and a merged transcript sent by e-mail after the call (searchable, for later reference). Transcriptor! requires a headset, a powerful computer, and, for now, an American accent.

About Translator!

Translator! is a complete two-way speech-to-speech translation experience.

  • Automatic call transcripts generated live.
  • Translated transcript displayed live for read-along understanding.
  • Speech synthesized in callee’s native language.
  • Communication aid where alternatives don’t exist.
  • Not perfect but already useful for cooperative users solving a problem.


Automatic transcription of phone calls, tele-conferences, and meetings is the dream of many information workers. Transcripts allow for more effective communication and collaboration. Written call records, auto-generated with users' consent, unified with text communication, seamlessly threaded with e-mail, and searchable, allow users to easily recount phone conversations. A real-time transcript of an ongoing tele-conference helps non-native speakers and distracted, multi-tasking participants. Consumers benefit too, through free calls funded by transcript-based contextual ads. All of this could be a significant, game-changing enhancement to Microsoft's enterprise and consumer voice communication and collaboration offerings.

However, transcription of conversational speech remains one of speech technology's grand challenges, and the required accuracy of 90% or more is far from feasible for general phone calls. The case differs for PC-based VoIP calls, where several favourable factors come together: high-quality audio from headset or USB-handset microphones, access to uncompressed audio, ever-growing CPU power, and a known user ID that allows for personalized speech models. Given these factors, we believe transcription accuracy of 90% or more is achievable within 5 years.
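
The text above does not say how transcription accuracy is measured. A common convention in speech recognition, assumed here rather than taken from this project, is word accuracy defined as 1 minus the word error rate (WER), where WER is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch in Python (the function names are illustrative, not part of Transcriptor!):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # reference word missing (deletion)
                curr[j - 1] + 1,                  # extra hypothesis word (insertion)
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: accuracy = 1 - WER (can go negative with many insertions)."""
    return 1.0 - word_error_rate(reference, hypothesis)
```

Under this assumed definition, a 90% target means at most one word error (substitution, deletion, or insertion) per ten reference words on average.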

Now is the time to take the first steps towards automatic call transcription, starting with PC-based VoIP calls. Microsoft has a unique combination of related building blocks: real-time communication, speech recognition, and its IM user bases. Although this is ongoing research, we hope this project will be considered in the development of future roadmaps for related products, including Unified Communications, Office, Live Messenger, and speech recognition. This is an opportunity not only to impact individual products, but also to position Microsoft better to define the future of the communications industry.

