Zhijie Yan

Hello and Welcome!

This is the personal webpage of Zhi-Jie Yan (Chinese name: 鄢志杰). I'm a lead researcher in the speech group of Microsoft Research Asia. I joined MSR Asia in July 2008. Before that, I received my Ph.D. degree from the Department of EEIS, University of Science and Technology of China. As a graduate student, I worked with the iFlytek Speech Lab from 2003 to 2008. During that period, I visited MSR Asia as a speech group intern from June 2005 to January 2006, and I also visited the School of ECE, Georgia Tech, as a visiting scholar in 2007. I received the Microsoft Fellowship award in 2006 and the ICASSP student paper contest winner award in 2007. I am a senior member of IEEE.

E-mail: | Speech Group of MSR Asia | MSR Asia

Research Topics

My research interests include speech recognition, machine learning, and speech synthesis and processing. Currently I work mainly on automatic speech recognition and machine learning. My current research topics include acoustic modeling for speech recognition, training criteria and optimization methods for deep neural networks, and large-scale machine learning platforms for speech applications.

  • Acoustic Modeling for Speech Recognition

We are working on discriminative training with both GMM-HMM and DNN-HMM. In the GMM-HMM framework, a tied-state based training criterion is used to train context-expanded region-dependent linear transforms (CE-RDLTs), which improves recognition performance compared with state-of-the-art discriminative training methods. Combining this method with features derived from a deep neural network (DNN) yields a scalable approach to using DNN-GMM-HMM acoustic models for speech recognition and adaptation. Related papers can be found in ICASSP 2013 and InterSpeech 2013 (see selected publications below).
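To give a feel for the feature-transform idea (this is an illustrative sketch, not the actual discriminatively-trained CE-RDLT recipe; the function names, nearest-centroid region assignment, and toy dimensions are all assumptions), each frame can be assigned to an acoustic region, expanded with its neighboring frames, and mapped through that region's linear transform:

```python
import numpy as np

def expand_context(frames, left=1, right=1):
    """Stack each frame with its neighbors (context expansion)."""
    T, D = frames.shape
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(left + right + 1)])

def apply_rdlt(frames, centroids, transforms, left=1, right=1):
    """Assign each frame to a region (nearest centroid here, for
    illustration), then apply that region's linear transform to the
    context-expanded frame vector."""
    expanded = expand_context(frames, left, right)   # (T, D * (left+right+1))
    dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    regions = dists.argmin(axis=1)                   # region index per frame
    out = np.stack([transforms[r] @ x for r, x in zip(regions, expanded)])
    return out, regions
```

In a real system the transforms would be estimated under a discriminative criterion rather than specified by hand.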

We have also studied both training criteria and optimization methods for acoustic modeling in speech recognition. This work includes Irrelevant Variability Normalization (IVN) based training and an i-vector based approach to speech data clustering. Related papers can be found in ICASSP 2011/2012 and InterSpeech 2011 (see selected publications below).
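The clustering side can be sketched generically (this is not the exact algorithm from the papers): i-vectors are fixed-length utterance representations, which are commonly length-normalized and then clustered, here with plain k-means and a simple deterministic farthest-point initialization:

```python
import numpy as np

def length_normalize(ivectors):
    """Project i-vectors onto the unit sphere (a common preprocessing step)."""
    norms = np.linalg.norm(ivectors, axis=1, keepdims=True)
    return ivectors / np.maximum(norms, 1e-12)

def kmeans_cluster(ivectors, k, iters=20):
    """Plain k-means over length-normalized i-vectors."""
    x = length_normalize(ivectors)
    # Farthest-point initialization: start from x[0], then repeatedly
    # add the point farthest from all chosen centers.
    centers = [x[0]]
    while len(centers) < k:
        d = np.min([((x - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((x[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(0)
    return labels
```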

  • Deep Learning and Deep Neural Network

We've been working on both training criteria and optimization methods to scale out deep learning, especially DNN/CNN training on large-scale training data. The algorithm runs on an HPC GPU cluster and achieves promising results in terms of both accuracy and speed-up factor. Details of this work will be published soon.
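One common pattern behind this kind of scaling is synchronous data-parallel SGD with gradient averaging; a minimal sketch follows, using a least-squares objective as a stand-in for the DNN loss (the function names and the simulated shards are assumptions, not details of the unpublished work):

```python
import numpy as np

def worker_gradient(w, x, y):
    """Gradient of the per-shard mean squared error ||x @ w - y||^2 / n."""
    return 2 * x.T @ (x @ w - y) / len(x)

def sync_sgd_step(w, shards, lr=0.1):
    """One synchronous data-parallel step: each worker computes a gradient
    on its own data shard; the gradients are averaged (conceptually an
    all-reduce) before a single shared parameter update."""
    grads = [worker_gradient(w, x, y) for x, y in shards]
    return w - lr * np.mean(grads, axis=0)
```

With equal-size shards, the averaged gradient equals the full-batch gradient, so the parallel update matches the serial one exactly.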

  • Large-scale Machine Learning Platform Optimized for Speech

We have built a large-scale machine learning platform optimized for speech applications, especially acoustic model training. This platform is implemented on an HPC (High Performance Computing) cluster using MPI (Message Passing Interface). It handles the "big data" that is essential for building a state-of-the-art speech recognition service. Details of this project can be found in our IWSML 2012 paper entitled "Designing an MPI-Based Parallel and Distributed Machine Learning Platform on Large-Scale HPC Clusters."
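A core MPI primitive in such platforms is an all-reduce over gradients or statistics. As an illustration (a pure-Python simulation of the classic ring all-reduce, not the platform's actual code; a real system would call `MPI_Allreduce`), each of P workers splits its vector into P chunks, partial sums travel around the ring for P-1 steps, then the reduced chunks are circulated back:

```python
import numpy as np

def ring_allreduce(vectors):
    """Simulate a ring all-reduce among P workers, each holding one vector.
    Returns a list of P identical vectors, each the elementwise sum."""
    P = len(vectors)
    chunks = [np.array_split(v.astype(float), P) for v in vectors]
    # Reduce-scatter phase: in each step, worker p sends one chunk of
    # partial sums to worker (p + 1) % P. Snapshot sends first so the
    # "simultaneous" exchanges of one step don't interfere.
    for s in range(P - 1):
        sends = [(p, (p - s) % P, chunks[p][(p - s) % P].copy()) for p in range(P)]
        for p, c, data in sends:
            chunks[(p + 1) % P][c] += data
    # Now worker p owns the fully reduced chunk (p + 1) % P.
    # All-gather phase: circulate the reduced chunks around the ring.
    for s in range(P - 1):
        sends = [(p, (p + 1 - s) % P, chunks[p][(p + 1 - s) % P].copy()) for p in range(P)]
        for p, c, data in sends:
            chunks[(p + 1) % P][c] = data
    return [np.concatenate(c) for c in chunks]
```

Each worker sends and receives only about 2/P of the total data per step, which is why the ring pattern scales well on clusters.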

  • Rich Context Model-Based Speech Synthesis

We propose to use rich context models directly to model training speech in HMM-based TTS, and to generate speech in synthesis. Compared with conventional decision-tree tied models, rich context models are crisper in nature and carry richer segmental and supra-segmental information. As a result, the over-smoothing problem of the conventional approach is significantly alleviated, which enables the synthesis of high-quality speech.
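A deliberately tiny caricature of the over-smoothing effect (the context labels and parameter values below are made up for illustration): when a decision-tree leaf ties several contexts together, every context in the leaf is generated from one averaged parameter, whereas a rich context model keeps one model per full context.

```python
import numpy as np

# Hypothetical per-context spectral parameters (one mean per full context)
contexts = {"a-b+c": 1.0, "a-b+d": 3.0, "x-b+c": 5.0}

def tied_generate(models, query):
    """Decision-tree tying (caricature): all contexts in one leaf share a
    single averaged parameter, so every query gets the same smoothed value."""
    return float(np.mean(list(models.values())))

def rich_generate(models, query):
    """Rich context model: one model per full context; the matching model
    is used directly, so no cross-context averaging occurs."""
    return models[query]
```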

Rich context models can also be used to build an HMM-guided unit selection TTS system. Rich-context Unit Selection (RUS) has been transferred to Microsoft products to build high quality speech synthesis engines. Related papers can be found in InterSpeech 2009, ICASSP 2010 and InterSpeech 2010.
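Unit selection in general is a dynamic-programming search over candidate units that minimizes a target cost plus a concatenation cost; here is a generic sketch of that search (not the RUS implementation; the cost functions are supplied by the caller):

```python
def select_units(candidates, target_cost, concat_cost):
    """Viterbi-style unit selection.

    candidates[t] is the list of candidate units for target position t;
    target_cost(t, u) scores how well unit u fits position t, and
    concat_cost(a, b) scores the join between consecutive units a and b.
    Returns the minimum-cost unit sequence and its total cost."""
    # best[t][u] = (cheapest cost of any path ending in u at t, that path)
    best = [{u: (target_cost(0, u), [u]) for u in candidates[0]}]
    for t in range(1, len(candidates)):
        cur = {}
        for u in candidates[t]:
            prev_u, (prev_cost, path) = min(
                best[t - 1].items(),
                key=lambda kv: kv[1][0] + concat_cost(kv[0], u))
            cur[u] = (prev_cost + concat_cost(prev_u, u) + target_cost(t, u),
                      path + [u])
        best.append(cur)
    _, (cost, path) = min(best[-1].items(), key=lambda kv: kv[1][0])
    return path, cost
```

In an HMM-guided system, the HMM would typically supply the target cost (how well a unit matches the model's prediction), while the concatenation cost penalizes audible joins.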

Selected Publications