As
applications such as spoken dialog systems, call center services, and
voice-enabled web and email services are introduced, increasing emphasis
is placed on generating natural-sounding speech. However, single-language TTS
(text-to-speech) is often not enough, because many applications need to handle
multiple languages. In our usability study of Mandarin TTS, the inability
to handle English words and phrases embedded in Chinese text deterred adoption
of the technology, since much Chinese content, especially IT-related articles
and emails, contains English words, phrases, or names. Some applications solve
this problem by switching between two TTS engines. The main drawback of this
approach is that the voices produced by the two engines sound different, and
users find such two-voice utterances annoying. Furthermore, switching between
two engines destroys the overall sentence intonation.
Mulan is the first truly bilingual TTS system that can switch between Mandarin
and English freely and smoothly without changing engines, and it preserves
sentence-level intonation for mixed-language text.
Built
on the No Prediction No Scaling prosodic strategy and the Prosodic Constraint
Oriented unit selection strategy, Mulan avoids the voice distortions found in many
other systems, such as the monotonous sound caused by the limited ability of
prosody prediction models, or the mechanical or buzzing sound caused by
pitch and time scaling algorithms. As a result, Mulan generates very
natural speech that sounds like the original voice talent and inherits
the voice talent's prosodic behavior.