Multimodal Conversational User Interface

Most of us use computers to create text, understand numbers, view images, and send messages. There's only one problem with this marvelous machine. Our computer lives on a desktop, and though we command it with a keyboard and a mouse, it commands us with its immovable size. The office is its domain, and it's ill at ease where people are most comfortable: snacking in the kitchen, walking around a mall, hanging out at the local pub, and driving in our cars. Researchers in the Speech Technology group at Microsoft are working to allow the computer to travel through our living spaces as a handy electronic HAL pal that answers questions, arranges our calendars, and sends messages to our friends and family.

These systems will use continuous speech recognition and spoken language understanding, the natural communication devices we carry with us all the time. Eventually, this software will power the ultimate communication device: a Pocket PC that doubles as a web browser, e-mail terminal, and cellular telephone. Perfected, it could control an array of other machines in an Easy Living environment. MiPad was the first prototype we built, back in 2000, as a baby step in that direction.

The distributed nature of many of the components can allow services to follow users wherever they are. Kuansan Wang, a researcher on the spoken language understanding (SLU) component of the engine, says, "If you have a docking station in your car, you could dock your device while you are driving. You could make yourself available to all your friends. That's basically the concept of a global service that follows you around, regardless of whether you have a PC in front of you or not. That's the grand vision of the whole thing."

Wang says that such systems will enable Web services to be invited into a conversation between two users. He describes this scenario: "Let's say we are talking on Instant Messenger, and we're talking about where I should take you for lunch. You realize there is a web service out there that is a restaurant guide, and we can probably ask it to join our conversation. Now this web service is not human. Still, with our understanding engine we can type text or use speech to talk to this service. So we can ask it to find a restaurant that's a French restaurant, and not too far away from Microsoft. With a spoken or typed text dialog system we extend the horizon of the project."

This understanding engine won't just enable speech input; it will understand what you're asking. Wang's group is trying different approaches, not all of which rely on conventional computational linguistics. One of his approaches uses limited semantic domains. In the example of the restaurant guide, that particular Web service would be limited to understanding questions about restaurants. Ask it a sports trivia question and it's likely to give you the address for a sports bar that serves burgers and beer.
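
To make the idea of a limited semantic domain concrete, here is a minimal Python sketch. It is not the group's engine: the word lists, the slot names, and the `understand` function are all hypothetical, chosen only to illustrate how a service that knows nothing but restaurants might fit every request into a single restaurant-search frame.

```python
import re

# Hypothetical vocabulary for one limited domain: restaurants.
# These lists are illustrative assumptions, not taken from any real system.
CUISINES = {"french", "italian", "thai", "mexican"}
DISTANCE_PHRASES = {
    "not too far": "near",
    "nearby": "near",
    "close to": "near",
}


def understand(utterance):
    """Fill a restaurant-search frame from an utterance.

    The service has exactly one intent, so every request, in domain or
    not, is interpreted as a search for a restaurant.
    """
    text = utterance.lower()
    frame = {
        "intent": "find_restaurant",
        "cuisine": None,
        "distance": None,
        "landmark": None,
    }

    # Cuisine slot: the first word that matches the known cuisine list.
    for word in re.findall(r"[a-z]+", text):
        if word in CUISINES:
            frame["cuisine"] = word
            break

    # Distance slot: the first known distance phrase found in the text.
    for phrase, value in DISTANCE_PHRASES.items():
        if phrase in text:
            frame["distance"] = value
            break

    # Landmark slot: a capitalized word right after "from", "to", or "near".
    match = re.search(r"\b(?:from|to|near)\s+([A-Z]\w*)", utterance)
    if match:
        frame["landmark"] = match.group(1)

    return frame


if __name__ == "__main__":
    print(understand("Find a French restaurant not too far away from Microsoft"))
    print(understand("Who won the World Series in 1998?"))
```

In this sketch the lunch request fills the cuisine, distance, and landmark slots, while the sports question fills none of them yet is still treated as a restaurant search. That is the trade-off of a narrow domain: the service is easier to build, but it hears everything as a question about restaurants.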

The text-to-speech capability in such conversational systems can enable a PC to read any typed text in a staccato early-episode Star Trek voice. For instance, it will allow you to receive a voice message on your cell phone that your friend has typed on his laptop and sent to you by e-mail.

Xuedong "X.D." Huang, General Manager of the Speech.NET Group, and the other speech researchers want to do more than create a useful product. They want to revolutionize computing. "We cannot accept the fact that we cannot solve this hard problem, that we can't make a machine as good as people's brains," Huang says. He believes that the hard problems of speech technology can be solved, and that speech recognition can be perfected to become the center of an interface that will allow people to interact more naturally with their computers.