The talking computer HAL in the 1968 film "2001: A Space Odyssey" had an almost human voice, but it was the voice of an actor, not a computer. Getting a real computer to talk like HAL has proven to be one of the toughest problems posed by "2001."
Microsoft's contribution to this field is "Whistler" (Windows Highly Intelligent STochastic taLkER), a trainable text-to-speech engine that was released in 1998 as part of the SAPI 4.0 SDK, and later as part of Microsoft Phone, Microsoft Encarta, and the Windows 2000 and Windows XP operating systems. You type words on your keyboard, and the computer reads them back to you almost immediately. While it still has that distinct machine sound, it's a big improvement on the flat, robotic voices of the past, particularly when large voice inventories are used.
Many of the improvements in speech synthesis over the past years have come from creative use of technologies developed for speech recognition. The Whisper speech recognition engine isolates the sounds, called phonemes, that make up human speech. Counting each sound that each vowel can make, plus consonants by themselves and in combination, and the all-important placeholder, silence, there are about 40 phonemes for English. But each phoneme has a different sound depending on what comes before and after it: the "o" in "hold" is longer than the "o" in "hot." So it turns out that English speech consists of roughly 64,000 different phoneme variations, called phonemes in context, or allophones. The Whistler and Whisper engines use a simplified database of about 3,000 allophones, which were isolated by cutting digital waveform recordings of the human voice into sections. The sections were organized into databases for use by the speech recognition engine.
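The selection step described above amounts to a table lookup: a phoneme together with its left and right neighbors keys into an inventory of recorded segments, with a context-free fallback when no matching allophone exists. The sketch below illustrates the idea; the inventory, names, and fallback scheme are hypothetical simplifications, not Whistler's actual data structures.

```python
# Toy sketch of allophone selection (hypothetical data; a real
# inventory stores digitized waveform segments, not file names).
ALLOPHONE_INVENTORY = {
    # (left context, phoneme, right context) -> recorded segment
    ("h", "ow", "l"): "ow_long.wav",   # the longer "o" of "hold"
    ("h", "aa", "t"): "aa_short.wav",  # the shorter "o" of "hot"
}
PHONEME_FALLBACK = {
    "h": "h.wav", "ow": "ow.wav", "aa": "aa.wav",
    "l": "l.wav", "d": "d.wav", "t": "t.wav", "sil": "sil.wav",
}

def select_units(phonemes):
    """Pick one segment per phoneme, preferring context-matched allophones."""
    padded = ["sil"] + phonemes + ["sil"]  # silence frames the phrase
    units = []
    for i in range(1, len(padded) - 1):
        key = (padded[i - 1], padded[i], padded[i + 1])
        units.append(ALLOPHONE_INVENTORY.get(key, PHONEME_FALLBACK[padded[i]]))
    return units

# "hold": the "ow" gets its context-specific, longer variant
print(select_units(["h", "ow", "l", "d"]))
# -> ['h.wav', 'ow_long.wav', 'l.wav', 'd.wav']
```

The selected segments would then be concatenated in order to produce the output waveform; picking 3,000 allophones rather than all 64,000 keeps the inventory small at the cost of falling back to less context-specific units.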
Senior Researcher Alex Acero saw that "the tools were just lying there" to build a speech synthesis device. The researchers, who included Scott Meredith, Mike Plumpe and Xuedong Huang, paired those phoneme databases with a text analyzer to make Whistler, which combines the recorded sounds back into words and phrases.
Just as a shattered plate that's been glued back together again doesn't look quite right, a word or phrase that's been assembled from phonemes often sounds a little off pitch. The bigger the segment of sound, the more natural it sounds in the reconstruction, but using syllables or whole words as the building blocks would require a vast database. Because of product limitations, the versions of Whistler shipped in Windows could only include a voice inventory slightly above 1 MB. On that small inventory the machine sound is still noticeable; the laboratory versions, which use much larger voice inventories, sound considerably more natural.
The inflection in the speaker's voice is often the key to understanding the meaning of a spoken phrase. We learn inflections as children by imitating the speech patterns of our elders until they are ingrained as an accent. The nuance that a native speaker picks up from the tone of another's voice is difficult to impart to a non-native speaker, let alone a computer. Researchers had to add prosody, the pitch and duration of sounds that give them additional meaning, to make Whistler's voice sound more natural and pleasant. Singing speech synthesizers sound better because the prosody is already specified by the song.
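One simple way to picture what "adding prosody" means is to assign each phoneme a duration and a pitch target before synthesis. The sketch below uses two textbook effects, gradual pitch declination across a phrase and phrase-final lengthening; the model, the function name, and all numbers are illustrative assumptions, not Whistler's actual prosody rules.

```python
# Minimal prosody sketch (hypothetical model): each phoneme gets a
# duration in milliseconds and a pitch target in Hz.
def assign_prosody(phonemes, base_ms=80, start_hz=120.0, end_hz=90.0):
    n = len(phonemes)
    prosody = []
    for i, ph in enumerate(phonemes):
        # Declination: pitch falls linearly from start_hz to end_hz.
        pitch = start_hz + (end_hz - start_hz) * (i / max(n - 1, 1))
        # Phrase-final lengthening: the last phoneme gets 50% more time.
        duration = base_ms * (1.5 if i == n - 1 else 1.0)
        prosody.append((ph, round(duration), round(pitch, 1)))
    return prosody

print(assign_prosody(["h", "ow", "l", "d"]))
# -> [('h', 80, 120.0), ('ow', 80, 110.0), ('l', 80, 100.0), ('d', 120, 90.0)]
```

A real system would derive these targets from sentence structure and punctuation (a question, for instance, would end with rising rather than falling pitch), then stretch and pitch-shift the recorded segments to match.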
The difference between a person and a talking computer is that the person understands the ideas and emotions conveyed through speech, and the computer doesn't. This is part of the larger problem of artificial intelligence, which is what "2001" author Arthur C. Clarke imagined in HAL. Our ability to replicate our own minds in a machine is limited by our incomplete knowledge of how our own minds work. The ultimate goal for speech synthesis, as with all AI applications, is to pass the Turing Test: a blindfolded user shouldn't be able to tell whether he is talking to a human or a machine. Like the voice of HAL, that's a long way away. But Acero believes he knows how to get there: "I'm interested in using what I've learned in speech synthesis to modify speech recognition," he says. "Ultimately the right model might just be the same for both synthesis and recognition." After all, he notes, our brains perform these functions simultaneously.