Speech Technology (Asia)
Overview
Speech remains the most natural and easiest way to exchange ideas and thoughts. The challenge grows, however, when one party to the conversation is a machine such as a computer. At MSR Asia, the Speech Group works to make the "speech chain" smooth and robust when a machine is involved, developing spoken language technologies that enable human-computer voice interaction and enrich human-to-human voice communication. The group's current focus includes automatic speech recognition, to enable computers to facilitate access to data, help create content, and perform tasks; speech synthesis, to enable computers to speak with a human-sounding voice, respond and provide information, and read aloud; spoken-document retrieval and processing, to enrich communication between people, for example by converting voice mail into text; and signal processing, to improve the conditioning of signals and to change speech signal parameters such as pitch, speaking rate, and voice characteristics seamlessly. Extensions of statistical learning algorithms developed for speech to other pattern recognition applications, such as recognizing handwritten math equations and East Asian characters, are being pursued jointly with other groups.
A Glimpse at Several Core Innovations:
Trainable Text-to-Speech Synthesis
We developed a new, statistically trained, automatic text-to-speech (TTS) system. Unlike our previous, concatenation-based TTS, the new system has these distinctive features: 1) a universal maximum-likelihood criterion for both model training and speech generation; 2) a relatively small training database, needing only about 500 sentences to train a decent voice font; 3) a small-footprint hidden Markov model (HMM) of less than 2 megabytes; 4) flexible, easy modification of the spectrum, gain, speaking rate, pitch range, and other relevant parameters of the synthesized speech; 5) fast adaptation to a new speaker; and 6) more predictable synthesis when pronouncing named entities. With its easy training and compact size, the new HMM-based system is ideal for quick prototyping of a personalized TTS.
Finding Music: Query-By-Humming and Music Steering
What would the world be without music? With MSRA’s "Query-By-Humming" search technology, you can find your favorite songs by humming, singing, or whistling the melody when you do not know, or have forgotten, the song's title or artist. This is useful for downloading ringtones on a mobile phone, where typing an artist's name or song title is inconvenient and voice is the most natural means of input. We have teamed up with Windows Live Mobile China to build a prototype service: call the service, hum a tune, and get your favorite ringtones.
"Music steering" means interactive music playlist generation through music content analysis, music recommendation, and music filtering. With personal music collections of thousands of songs on portable devices such as the iPhone, Zune, and smartphones, selecting songs has become a challenge. Music steering provides a “smart shuffle” function: pick a starting (seed) song, and the system automatically builds a playlist of similar songs.
The playlist can be refined by voting songs up or down and by setting a “mood filter.” At the foundation is technology that analyzes music content and automatically detects musical attributes (tags) of each song, such as genre, instruments, tonality, and tempo/rhythm.
A Voice User Interface with Intelligent Correction
We developed an intelligent voice user interface for text input. It uses continuous speech as its main input mode and handwriting as its correction mechanism. Continuous speech is adopted for its fast text-input speed, and handwriting for easy pointing to and correction of speech-recognition errors. This natural interface is also intelligent: statistically, it can correct more errors than the user’s handwritten input indicates. We have verified on a speech database that rewriting the word graph produced by speech recognition yields a revised sentence in which more word errors are corrected than the user indicated.
Discriminative Training of HMMs
We train highly discriminative HMMs for recognizing speech, freehand input of math formulas, and East Asian characters. A unified framework and code base has been established to handle various discriminative criteria, such as maximum mutual information, minimum classification error, and minimum phone or radical error.
Enhancing Human-Human Communication: Speech Indexing
Human exchanges most often involve speech. However, when people keep records, they resort to notes, memos, meeting minutes, and other documents. Unfortunately, today’s technologies do not allow for the efficient use of recorded audio conversations. Enabling computers to be smart about speech and audio is a primary focus of the Speech Group. A core innovation that has come out of this work is a search engine that can index the words spoken in recorded conversations, whether they come from meetings, conference calls, voice mails, presentations, online lectures, or even video.
Microsoft OneNote 2007, a part of Microsoft Office, is the first Microsoft product to include our speech-indexing technology to allow users to search for keywords spoken in recorded meetings and phone calls.
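To make the idea of keyword search over recorded speech concrete, here is a minimal sketch of an inverted index built from recognizer output. It is an illustration only, not the group's implementation: it assumes a recognizer that emits one best word sequence with start times and confidence scores, and the names `Hit`, `build_index`, and `search`, the recording identifiers, and the sample data are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    recording: str     # hypothetical recording identifier
    start_sec: float   # time at which the word was spoken
    confidence: float  # recognizer confidence in [0, 1]


def build_index(transcripts):
    """Map each lowercased word to the places where it was recognized.

    `transcripts` maps a recording id to a list of
    (word, start_sec, confidence) tuples (an assumed recognizer format).
    """
    index = defaultdict(list)
    for recording, words in transcripts.items():
        for word, start_sec, confidence in words:
            index[word.lower()].append(Hit(recording, start_sec, confidence))
    return index


def search(index, keyword, min_confidence=0.5):
    """Return hits for `keyword` above the threshold, most confident first."""
    hits = [h for h in index.get(keyword.lower(), [])
            if h.confidence >= min_confidence]
    return sorted(hits, key=lambda h: h.confidence, reverse=True)


# Made-up recognizer output for two recordings (illustrative only).
transcripts = {
    "meeting-01": [("budget", 12.4, 0.91), ("review", 13.0, 0.80)],
    "voicemail-07": [("Budget", 3.2, 0.62), ("call", 4.1, 0.40)],
}
index = build_index(transcripts)
hits = search(index, "budget")
```

Each hit carries a recording id and a start time, so a player could jump straight to the moment the keyword was spoken. Indexing only the single best transcript misses words the recognizer got wrong or that are outside its vocabulary, which is the kind of limitation that lattice-based and vocabulary-independent search aims to address.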
Publications
ICASSP 2006
Weighted Likelihood Ratio (WLR) Hidden Markov Model for Noisy Speech Recognition. In Proceedings of ICASSP 2006.
Tone-Enhanced Generalized Character Posterior Probability (GCPP) for Cantonese LVCSR. In Proceedings of ICASSP 2006.
A Comparative Study of Discriminative Methods for Reranking LVCSR N-Best Hypotheses in Domain Adaptation and Generalization. In Proceedings of ICASSP 2006.
An Iterative Trajectory Regeneration Algorithm for Separating Mixed Speech Sources. In Proceedings of ICASSP 2006.
Improved Chinese Character Input by Merging Speech and Handwriting Recognition Hypotheses. In Proceedings of ICASSP 2006.
Auto-Segmentation Based Partitioning and Clustering Approach to Robust Endpointing. In Proceedings of ICASSP 2006.
Syllable Lattice Based Re-Scoring for Speaker Verification. In Proceedings of ICASSP 2006.
Measuring Target Cost in Unit Selection with KL-Divergence Between Context-Dependent HMMs. In Proceedings of ICASSP 2006.
Identifying Language Origin of Person Names with N-Grams of Different Units. In Proceedings of ICASSP 2006.
A Hierarchical Approach to Automatic Stress Detection in English Sentences. In Proceedings of ICASSP 2006.
Maximum Entropy Based Normalization of Word Posteriors for Phonetic and LVCSR Lattice Search. In Proceedings of ICASSP 2006.
ICASSP 2005
Fast Two-Stage Vocabulary-Independent Search in Spontaneous Speech
Hierarchical Correlation Compensation for Hidden Markov Models
Improved Covariance Modeling for Maximum Likelihood Multiple Subspace Transformations
Deriving High-Level Concepts Using Fuzzy-ID3 Decision Tree for Image Retrieval
Towards a Unified Framework for Content-Based Audio Analysis
Unsupervised Auditory Scene Categorization via Key Audio Effects and Information-Theoretic Co-Clustering
ICASSP 2004
Tone Recognition with Fractionized Models and Outlined Features. In Proceedings of ICASSP 2004.
Refining Segmental Boundaries for TTS Database Using Fine Contextual-Dependent Boundary Models. In Proceedings of ICASSP 2004.
Segmental Tonal Modeling for Phone Set Design in Mandarin LVCSR. In Proceedings of ICASSP 2004.
Tone Articulation Modeling for Mandarin Spontaneous Speech Recognition. In Proceedings of ICASSP 2004.
Vocabulary-Independent Search in Spontaneous Speech. In Proceedings of ICASSP 2004.
ICASSP 2003
Spectrogram-Based Formant Tracking via Particle Filters. Yu Shi, Eric Chang. In Proceedings of ICASSP 2003.
Comparison of Discriminative Training Methods for Speaker Verification. Chengyuan Ma, Eric Chang. In Proceedings of ICASSP 2003.
Microsoft Mulan - A Bilingual TTS System. Min Chu, Hu Peng, Yong Zhao, Zhengyu Niu, Eric Chang. In Proceedings of ICASSP 2003.
Jian-Lai Zhou, Frank Seide. In Proceedings of ICASSP 2003.
Frank Seide, Jian-Lai Zhou, Li Deng. In Proceedings of ICASSP 2003.
Interspeech 2006
A Multi-Space Distribution (MSD) Approach to Speech Recognition of Tonal Languages. Huanliang Wang, Yao Qian, Frank K. Soong, Jian-Lai Zhou, Jiqing Han. In Proceedings of INTERSPEECH 2006.
Generalization of the Minimum Classification Error (MCE) Training Based on Maximizing Generalized Posterior Probability (GPP). Qiang Fu, Antonio Moreno-Daniel, Biing-Hwang Juang, Jian-Lai Zhou, Frank K. Soong. In Proceedings of INTERSPEECH 2006.
Identify Language Origin of Personal Names with Normalized Appearance Number of Web Pages. Jiali You, Yining Chen, Min Chu, Yong Zhao, Jinlin Wang. In Proceedings of INTERSPEECH 2006.
Constructing Stylistic Synthesis Databases from Audio Books. Yong Zhao, Di Peng, Lijuan Wang, Min Chu, Yining Chen, Peng Yu, Jun Guo. In Proceedings of INTERSPEECH 2006.
Auto-Segmentation Based VAD for Robust ASR. Yu Shi, Frank K. Soong, Jian-Lai Zhou. In Proceedings of INTERSPEECH 2006.
Minimum Divergence Based Discriminative Training. Jun Du, Peng Liu, Frank K. Soong, Jian-Lai Zhou, Ren-Hua Wang. In Proceedings of INTERSPEECH 2006.
Interspeech 2005
Harmonic Filtering for Joint Estimation of Pitch and Voiced Source with Single-Microphone Input. S. W. Lee, Frank K. Soong, and P. C. Ching. In Proceedings of INTERSPEECH 2005.
Refining Phoneme Segmentations Using Speaker-Adaptive Context Dependent Boundary Models. Yong Zhao, Lijuan Wang, Min Chu, Frank K. Soong, and Zhigang Cao. In Proceedings of INTERSPEECH 2005.
Background Model Based Posterior Probability for Measuring Confidence. Peng Liu, Ye Tian, Jian-Lai Zhou, and Frank K. Soong. In Proceedings of INTERSPEECH 2005.
Phonetic Transcription Verification with Generalized Posterior Probability. Lijuan Wang, Yong Zhao, Min Chu, Frank K. Soong, Zhigang Cao. In Proceedings of INTERSPEECH 2005.
ICSLP 2004
A Hybrid Word / Phoneme-Based Approach for Improved Vocabulary-Independent Search in Spontaneous Speech. Peng Yu, Frank Seide. In Proceedings of ICSLP 2004.
Transformation and Combination of Hidden Markov Models for Speaker Selection Training. Chao Huang, Tao Chen, Eric Chang. In Proceedings of ICSLP 2004.
EUROSPEECH 2003
Voice Conversion with Smoothed GMM and MAP Adaptation. Yining Chen, Min Chu, Eric Chang, Jia Liu, and Runsheng Liu. In Proceedings of EUROSPEECH 2003.
Custom-Tailoring TTS Voice Font - Keeping the Naturalness When Reducing Database Size. Yong Zhao, Min Chu, Hu Peng, and Eric Chang. In Proceedings of EUROSPEECH 2003.
An Improved Model-Based Speaker Segmentation System. Peng Yu, Frank Seide, Chengyuan Ma, and Eric Chang. In Proceedings of EUROSPEECH 2003.
Searching the Audio Notebook: Keyword Search in Recorded Conversations. Peng Yu, Kaijiang Chen, Lie Lu, and Frank Seide. In Proceedings of EUROSPEECH 2003.
Journal
A flexible framework for key audio effects detection and auditory context inference
Automatic mood detection and tracking of music audio signals
Content analysis for audio classification and segmentation
Audio textures: theory and applications
Vocabulary-Independent Indexing of Spontaneous Speech
Context-dependent boundary modeling for automatic segmentation of TTS units. Lijuan Wang, Yong Zhao, Min Chu, Frank K. Soong, Jianlai Zhou, Zhigang Cao. IEICE Transactions on Information and Systems.