Speech

Overview

Speech remains the most natural and easiest way to exchange ideas and thoughts, but the challenge grows when one party in the conversation is a machine. At MSR Asia, the Speech Group works to keep the "speech chain" smooth and robust when a computer is involved, developing spoken language technologies that enable human-computer voice interaction and enrich human-to-human voice communication.

The group's current focus includes: automatic speech recognition, to enable computers to facilitate access to data, help create content, and perform tasks; speech synthesis, to enable computers to speak with a human-sounding voice, respond with information, and read text aloud; spoken-document retrieval and processing, to enrich communication between people, for example by converting voice mail into text; and signal processing, to improve the conditioning of signals and to change speech parameters such as pitch, speaking rate, and voice characteristics in a seamless way. Jointly with other groups, the statistical learning algorithms developed for speech are also being extended to other pattern recognition applications, such as handwritten math equations and East Asian character recognition.

A Glimpse at Several Core Innovations:

Trainable Text-to-Speech Synthesis

We developed a new, statistically trained, automatic text-to-speech (TTS) system. Unlike our previous, concatenation-based TTS, the new system has these distinctive features: 1) a universal, maximum-likelihood criterion for both model training and speech generation; 2) a relatively small training database, needing only about 500 sentences to train a decent voice font; 3) a small-footprint (less than 2 megabytes) hidden Markov model (HMM); 4) flexible, easy modification of the spectrum, gain, speaking rate, pitch range, and other relevant parameters of synthesized speech; 5) fast adaptation to a new speaker; and 6) more predictable synthesis for pronouncing named entities. With its easy training and compact size, the new HMM-based system is ideal for quick prototyping of a personalized TTS.
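The maximum-likelihood generation step common to HMM-based TTS can be sketched concretely: given per-frame Gaussian means and variances for static and delta features predicted by the model, the smooth static trajectory that maximizes the likelihood is the solution of a linear system. Below is a minimal, single-dimension Python sketch of this idea; the function name and the simplified, boundary-clamped delta window are our illustrative assumptions, not the product implementation.

    import numpy as np

    def ml_parameter_generation(means, variances):
        """Maximum-likelihood trajectory generation (one feature dimension).

        means, variances: (T, 2) arrays of per-frame Gaussian means and
        variances for the static and delta features predicted by the HMM.
        Returns the static trajectory c that maximizes the likelihood,
        i.e. the solution of (W' U^-1 W) c = W' U^-1 mu, where W stacks
        the static (identity) and delta (first-difference) windows.
        """
        T = means.shape[0]
        W = np.zeros((2 * T, T))
        W[:T, :] = np.eye(T)                   # static features: c itself
        for t in range(T):                     # delta_t ~ 0.5*(c[t+1] - c[t-1])
            W[T + t, max(t - 1, 0)] -= 0.5     # window clamped at boundaries
            W[T + t, min(t + 1, T - 1)] += 0.5
        mu = np.concatenate([means[:, 0], means[:, 1]])
        u_inv = 1.0 / np.concatenate([variances[:, 0], variances[:, 1]])
        A = W.T @ (u_inv[:, None] * W)         # W' U^-1 W
        b = W.T @ (u_inv * mu)                 # W' U^-1 mu
        return np.linalg.solve(A, b)           # smooth static trajectory

Because the delta constraints couple neighboring frames, the generated trajectory is smooth rather than a stepwise readout of the per-state means.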

Finding Music: Query-By-Humming and Music Steering

What would the world be without music? With MSRA's "Query-By-Humming" search technology, you can find a favorite song by humming, singing, or whistling its melody when you do not know or have forgotten the title or artist. This is especially useful for downloading ringtones on a mobile phone, where typing an artist's name or song title is inconvenient and voice is the most natural means of input. We have teamed up with Windows Live Mobile China to build a prototype service: call the service, hum a tune, and get your favorite ringtone.

"Music steering" means interactive playlist generation through music content analysis, music recommendation, and music filtering. With personal collections of thousands of songs on portable devices such as the iPhone, Zune, and smart phones, selecting songs has become a challenge. Music steering provides a "smart shuffle" function: pick a starting (seed) song, and the system automatically builds a playlist of similar songs, which can be refined by voting songs up or down and by setting a "mood filter." At the foundation is technology that analyzes music content and automatically detects musical attributes (tags) for each song, such as genre, instruments, tonality, and tempo/rhythm.
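To make the melody-matching idea concrete, one standard technique for comparing a hummed query against stored melodies is dynamic time warping (DTW) over pitch contours, which tolerates tempo variation and, with mean normalization, differences in key. The sketch below is a minimal illustration under those assumptions; the function names and the melody_db dictionary are hypothetical, and the deployed system need not work this way.

    import numpy as np

    def dtw_distance(query, reference):
        """Dynamic-time-warping distance between two pitch contours
        (sequences of semitone values), tolerating tempo differences
        between the hummed query and the stored melody."""
        q, r = len(query), len(reference)
        D = np.full((q + 1, r + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, q + 1):
            for j in range(1, r + 1):
                cost = abs(query[i - 1] - reference[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[q, r] / (q + r)               # length-normalized

    def search(query, melody_db):
        """Rank songs by DTW distance of the hummed contour to each melody.
        Contours are mean-normalized so the user may hum in any key."""
        query = np.asarray(query, dtype=float)
        query -= query.mean()
        scored = []
        for title, contour in melody_db.items():
            c = np.asarray(contour, dtype=float)
            scored.append((dtw_distance(query, c - c.mean()), title))
        return sorted(scored)

    # Hypothetical usage: contours are per-note pitches in semitones.
    melody_db = {"Song A": [60, 62, 64, 65], "Song B": [67, 65, 64, 62]}
    print(search([61, 63, 65, 66], melody_db))  # Song A ranks first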

A Voice User Interface with Intelligent Correction

We developed an intelligent voice user interface for text input. It employs continuous speech as its main input mode and handwriting as its correction mechanism: continuous speech gives fast text entry, while handwriting gives easy pointing at and correction of speech-recognition errors. The interface is also intelligent in a statistical sense: it can correct more errors than the user's handwritten input alone indicates. Experiments on a speech database verified that the word graph produced during recognition can be rescored with the handwritten evidence to generate a revised sentence containing fewer word errors than the user actually pointed out.
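The key effect, that one handwritten correction can repair neighboring errors, falls out of re-ranking whole hypotheses rather than substituting a single word. Below is a minimal sketch of that idea over an n-best list; the function name and score convention (higher is better) are illustrative assumptions, and the actual system operates on full word graphs rather than n-best lists.

    def correct_with_handwriting(nbest, position, written_word):
        """Re-rank speech-recognition hypotheses using one handwritten correction.

        nbest: list of (score, words) pairs, where higher score is better.
        position: index of the word the user corrected.
        written_word: the handwriting-recognition result for that word.
        """
        consistent = [(s, w) for s, w in nbest
                      if position < len(w) and w[position] == written_word]
        if consistent:
            # Promoting a whole consistent hypothesis can also fix
            # neighboring words that the user never touched.
            return max(consistent)[1]
        # Fall back: splice the written word into the top hypothesis.
        words = list(nbest[0][1])
        words[position] = written_word
        return words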

Discriminative Training of HMMs

We train highly discriminative HMMs for recognizing speech, freehand input of math formulas, and East Asian characters. A unified framework and code base has been established for handling various discriminative criteria, such as maximum mutual information (MMI), minimum classification error (MCE), and minimum phone or radical error.
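As one concrete instance of such a criterion, the MMI objective in its standard form from the literature (not necessarily the exact formulation in the group's code base) maximizes the posterior probability of the reference transcriptions:

    F_{\mathrm{MMI}}(\lambda) = \sum_{r=1}^{R} \log
        \frac{p_{\lambda}(O_r \mid W_r)^{\kappa}\, P(W_r)}
             {\sum_{W} p_{\lambda}(O_r \mid W)^{\kappa}\, P(W)}

Here O_r is the r-th training utterance, W_r its reference transcription, kappa an acoustic scaling factor, and the denominator sum over all competing hypotheses W is approximated in practice with a recognition lattice. Minimum radical error plays the role of minimum phone error for East Asian characters, scoring hypotheses by their sub-character structural units (radicals).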

Enhancing Human-to-Human Communication: Speech Indexing

Human exchanges most often involve speech, yet when people keep records they resort to notes, memos, meeting minutes, and other documents, because today's technologies do not allow efficient use of recorded audio conversations. Enabling computers to be smart about speech and audio is a primary focus of the Speech Group. A core innovation that has come out of this work is a search engine that indexes the words spoken in recorded conversations, whether from meetings, conference calls, voice mails, presentations, online lectures, or video. Microsoft OneNote 2007, part of Microsoft Office, is the first Microsoft product to include our speech-indexing technology, allowing users to search for keywords spoken in recorded meetings and phone calls.
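At its simplest, the idea is an inverted index over time-stamped recognizer output, so that a text query maps to playback positions. The sketch below illustrates only that word-level core; the shipped technology also copes with recognition errors and out-of-vocabulary queries (see the vocabulary-independent search papers below), and these function names are our own illustrative assumptions.

    from collections import defaultdict

    def build_index(recognized_words):
        """Inverted index over recognizer output for one recording.

        recognized_words: iterable of (word, start_seconds, confidence)
        triples. Returns word -> list of (start_seconds, confidence).
        """
        index = defaultdict(list)
        for word, start, conf in recognized_words:
            index[word.lower()].append((start, conf))
        return index

    def lookup(index, keyword, min_conf=0.5):
        """Times at which a keyword was spoken with enough confidence;
        a player can jump playback straight to these positions."""
        return [t for t, c in index.get(keyword.lower(), []) if c >= min_conf]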

People

Contacts:

Frank Soong, Speech Research Group Manager and Principal Researcher (alias: frankkps)
Frank Seide, Lead Researcher / Research Manager, Audio Information Management and Extraction (alias: fseide)

Group members:

Name              Alias     Title
Frank Soong       frankkps  Principal Researcher / Research Manager
Frank Seide       fseide    Lead Researcher / Research Manager
Chao Huang        chaoh     Researcher
Lei Ma            lema      Assistant Researcher
Lie Lu            llu       Associate Researcher
Lijuan Wang       lijuanw   Associate Researcher
Kit Thambiratnam  kit       Researcher
Min Chu           minchu    Researcher
Peng Yu           rogeryu   Researcher
Peng Liu          pengliu   Associate Researcher
Yao Qian          yaoqian   Associate Researcher
Yijian Wu         yijwu     Associate Researcher
Yining Chen       ynchen    Associate Researcher
Yong Zhao         yzhao     Associate Researcher
Yu Shi            yushi     Associate Researcher

Publications

ICASSP 2006

Weighted Likelihood Ratio (WLR) Hidden Markov Model for Noisy Speech Recognition
Chao Huang, Yingchun Huang, Frank K. Soong, Jian-Lai Zhou

In Proceedings of ICASSP 2006.

Tone-Enhanced Generalized Character Posterior Probability (GCPP) for Cantonese LVCSR
Yao Qian, Frank K. Soong, Tan Lee

In Proceedings of ICASSP 2006.

A Comparative Study of Discriminative Methods for Reranking LVCSR N-Best Hypotheses in Domain Adaptation and Generalization
Zhengyu Zhou, Jianfeng Gao, Frank K. Soong, Helen Meng

In Proceedings of ICASSP 2006.

An Iterative Trajectory Regeneration Algorithm for Separating Mixed Speech Sources
Siu Wa Lee, Frank K. Soong, P. C. Ching

In Proceedings of ICASSP 2006.

Improved Chinese Character Input by Merging Speech and Handwriting Recognition Hypotheses
Xi Zhou, Ye Tian, Jian-Lai Zhou, Frank K. Soong, Bei-qian Dai

In Proceedings of ICASSP 2006.

Auto-Segmentation Based Partitioning and Clustering Approach to Robust Endpointing
Yu Shi, Frank K. Soong, Jian-Lai Zhou

In Proceedings of ICASSP 2006.

Syllable Lattice Based Re-Scoring for Speaker Verification
Minho Jin, Frank K. Soong, Chang D. Yoo

In Proceedings of ICASSP 2006.

Measuring Target Cost in Unit Selection with KL-Divergence Between Context-Dependent HMMs
Yong Zhao, Peng Liu, Yusheng Li, Yining Chen, Min Chu

In Proceedings of ICASSP 2006.

Identifying Language Origin of Person Names with N-Grams of Different Units
Yining Chen, Jiali You, Min Chu, Yong Zhao, Jinlin Wang

In Proceedings of ICASSP 2006.

A Hierarchical Approach to Automatic Stress Detection in English Sentences
Min Lai, Yining Chen, Min Chu, Yong Zhao, Fangyu Hu

In Proceedings of ICASSP 2006.

Maximum Entropy Based Normalization of Word Posteriors for Phonetic and LVCSR Lattice Search
Peng Yu, Duo Zhang, Frank Seide

In Proceedings of ICASSP 2006.

ICASSP 2005

Fast Two-Stage Vocabulary-Independent Search In Spontaneous Speech
Peng Yu, Frank Seide, Microsoft Research Asia, China

Hierarchical Correlation Compensation For Hidden Markov Models
Hui Lin, Tsinghua University, China; Ye Tian, Jian-Lai Zhou, Microsoft Research Asia, China; Hui Jiang, York University, Canada

Improved Covariance Modeling For Maximum Likelihood Multiple Subspace Transformations
Xi Zhou, University of Science and Technology of China, China; Ye Tian, Jian-Lai Zhou, Microsoft Research Asia, China; Bei-qian Dai, University of Science and Technology of China, China

Deriving High-Level Concepts Using Fuzzy-ID3 Decision Tree for Image Retrieval
Ying Liu, Dengsheng Zhang, Guojun Lu, Monash University, Australia; Wei-Ying Ma, Microsoft Research Asia, China

Towards A Unified Framework for Content-based Audio Analysis
Lie Lu, Microsoft Research Asia, China; Rui Cai, Tsinghua University, China; Alan Hanjalic, Delft University of Technology, Netherlands

Unsupervised Auditory Scene Categorization via Key Audio Effects and Information-Theoretic Co-Clustering
Rui Cai, Tsinghua University, China; Lie Lu, Microsoft Research Asia, China; Lian-Hong Cai, Tsinghua University, China

ICASSP 2004

Tone Recognition with Fractionized Models and Outlined Features
Ye Tian, Jian-Lai Zhou, Min Chu, Eric Chang

In Proceedings of ICASSP 2004.

Refining Segmental Boundaries for TTS Database Using Fine Contextual-Dependent Boundary Models
Lijuan Wang, Yong Zhao, Min Chu, Jian-Lai Zhou, Zhigang Cao

In Proceedings of ICASSP 2004.

Segmental Tonal Modeling for Phone Set Design in Mandarin LVCSR
Chao Huang, Yu Shi, Jian-Lai Zhou, Min Chu, Terry Wang, Eric Chang

In Proceedings of ICASSP 2004.

Tone Articulation Modeling for Mandarin Spontaneous Speech Recognition
Jian-Lai Zhou, Ye Tian, Yu Shi, Chao Huang, Eric Chang

In Proceedings of ICASSP 2004.

Vocabulary-Independent Search in Spontaneous Speech
Frank Seide, Peng Yu, Chengyuan Ma, Eric Chang

In Proceedings of ICASSP 2004.

ICASSP 2003

Spectrogram-Based Formant Tracking via Particle Filters
Yu Shi, Eric Chang

In Proceedings of ICASSP 2003.

Comparison of Discriminative Training Methods for Speaker Verification
Chengyuan Ma, Eric Chang

In Proceedings of ICASSP 2003.

Microsoft Mulan - A Bilingual TTS System
Min Chu, Hu Peng, Yong Zhao, Zhengyu Niu, Eric Chang

In Proceedings of ICASSP 2003.

Coarticulation Modeling by Embedding a Target-Directed Hidden Trajectory Model into HMM - Model and Training
Jian-Lai Zhou, Frank Seide

In Proceedings of ICASSP 2003.

Coarticulation Modeling by Embedding a Target-Directed Hidden Trajectory Model into HMM - MAP Decoding and Evaluation
Frank Seide, Jian-Lai Zhou, Li Deng

In Proceedings of ICASSP 2003.

Interspeech 2006

A Multi-Space Distribution (MSD) Approach to Speech Recognition of Tonal Languages
Huanliang Wang, Yao Qian, Frank K. Soong, Jian-Lai Zhou, Jiqing Han

In Proceedings of INTERSPEECH 2006.

Generalization of the Minimum Classification Error (MCE) Training Based on Maximizing Generalized Posterior Probability (GPP)
Qiang Fu, Antonio Moreno-Daniel, Biing-Hwang Juang, Jian-Lai Zhou, Frank K. Soong

In Proceedings of INTERSPEECH 2006.

Identify Language Origin of Personal Names with Normalized Appearance Number of Web Pages
Jiali You, Yining Chen, Min Chu, Yong Zhao, Jinlin Wang

In Proceedings of INTERSPEECH 2006.

Constructing Stylistic Synthesis Databases from Audio Books
Yong Zhao, Di Peng, Lijuan Wang, Min Chu, Yining Chen, Peng Yu, Jun Guo

In Proceedings of INTERSPEECH 2006.

Auto-Segmentation Based VAD for Robust ASR
Yu Shi, Frank K. Soong, Jian-Lai Zhou

In Proceedings of INTERSPEECH 2006.

Minimum Divergence Based Discriminative Training
Jun Du, Peng Liu, Frank K. Soong, Jian-Lai Zhou, Ren-Hua Wang

In Proceedings of INTERSPEECH 2006.

Interspeech 2005

Harmonic Filtering for Joint Estimation of Pitch and Voiced Source with Single-Microphone Input
S. W. Lee, Frank K. Soong, and P. C. Ching

In Proceedings of INTERSPEECH 2005.

Refining Phoneme Segmentations Using Speaker-Adaptive Context-Dependent Boundary Models
Yong Zhao, Lijuan Wang, Min Chu, Frank K. Soong, and Zhigang Cao

In Proceedings of INTERSPEECH 2005.

Background Model Based Posterior Probability for Measuring Confidence
Peng Liu, Ye Tian, Jian-Lai Zhou, and Frank K. Soong

In Proceedings of INTERSPEECH 2005.

Phonetic Transcription Verification with Generalized Posterior Probability
Lijuan Wang, Yong Zhao, Min Chu, Frank K. Soong, Zhigang Cao

In Proceedings of INTERSPEECH 2005.

ICSLP 2004

A Hybrid Word/Phoneme-Based Approach for Improved Vocabulary-Independent Search in Spontaneous Speech
Peng Yu, Frank Seide

In Proceedings of ICSLP 2004.

Transformation and Combination of Hidden Markov Models for Speaker Selection Training
Chao Huang, Tao Chen, Eric Chang

In Proceedings of ICSLP 2004.

EUROSPEECH 2003

Voice Conversion with Smoothed GMM and MAP Adaptation
Yining Chen, Min Chu, Eric Chang, Jia Liu, and Runsheng Liu

In Proceedings of EUROSPEECH 2003.

Custom-Tailoring TTS Voice Font: Keeping the Naturalness When Reducing Database Size
Yong Zhao, Min Chu, Hu Peng, and Eric Chang

In Proceedings of EUROSPEECH 2003.

An Improved Model-Based Speaker Segmentation System
Peng Yu, Frank Seide, Chengyuan Ma, and Eric Chang

In Proceedings of EUROSPEECH 2003.

Searching the Audio Notebook: Keyword Search in Recorded Conversations
Peng Yu, Kaijiang Chen, Lie Lu, and Frank Seide

In Proceedings of EUROSPEECH 2003.

Journal Papers

A Flexible Framework for Key Audio Effects Detection and Auditory Context Inference
Rui Cai, Lie Lu, Alan Hanjalic, Hong-Jiang Zhang, Lian-Hong Cai
IEEE Transactions on Audio, Speech and Language Processing, Volume 14, Issue 3, May 2006, pp. 1026-1039

Summary: Key audio effects are those special effects that play critical roles in human perception of an auditory context in audiovisual materials. Based on key audio effects, high-level semantic inference can be carried out to facilitate various content-bas…

Automatic Mood Detection and Tracking of Music Audio Signals
Lie Lu, D. Liu, Hong-Jiang Zhang
IEEE Transactions on Audio, Speech and Language Processing, Volume 14, Issue 1, January 2006, pp. 5-18

Summary: Music mood describes the inherent emotional expression of a music clip. It is helpful in music understanding, music retrieval, and some other music-related applications. In this paper, a hierarchical framework is presented to automate the task of moo…

Content Analysis for Audio Classification and Segmentation
Lie Lu, Hong-Jiang Zhang, Hao Jiang
IEEE Transactions on Speech and Audio Processing, Volume 10, Issue 7, October 2002, pp. 504-516

Summary: We present our study of audio content analysis for classification and segmentation, in which an audio stream is segmented according to audio type or speaker identity. We propose a robust approach that is capable of classifying and segmenting an audio…

Audio Textures: Theory and Applications
Lie Lu, Liu Wenyin, Hong-Jiang Zhang
IEEE Transactions on Speech and Audio Processing, Volume 12, Issue 2, March 2004, pp. 156-167
Digital Object Identifier: 10.1109/TSA.2003.819947

Summary: In this paper, we introduce a new audio medium, called audio texture, as a means of synthesizing a long audio stream according to a given short example audio clip. The example clip is first analyzed to extract its basic building patterns. An audio stre…

Vocabulary-Independent Indexing of Spontaneous Speech
Peng Yu, Kaijiang Chen, Chengyuan Ma, Frank Seide
IEEE Transactions on Speech and Audio Processing, Volume 13, Issue 5, Part 1, September 2005, pp. 635-643

Summary: We present a system for vocabulary-independent indexing of spontaneous speech, i.e., neither do we know the vocabulary of a speech recording nor can we predict which query terms a user is going to search for. The technique can be applied to inf…

Context-Dependent Boundary Modeling for Automatic Segmentation of TTS Units
Lijuan Wang, Yong Zhao, Min Chu, Frank K. Soong, Jianlai Zhou, Zhigang Cao
IEICE Transactions on Information and Systems.