Gang Li


Hi, welcome to my MSR website. My name is Gang Li (李钢), and I joined the MSRA Speech Group in 2010. Since joining MSRA I have been working on large-scale pattern recognition and deep learning technologies, focusing mainly on large-scale conversational speech recognition and on audio information management and retrieval. Some of my experience follows:

Manage and engineer the Microsoft Research Asia parallel speech model training platform: I own the end-to-end process for different languages (English, Chinese, Spanish), from raw data collection, data preprocessing, HMM training on our platform, and DNN training pipelines through decoding and evaluation. I deliver Microsoft's state-of-the-art conversational acoustic model (AM), advancing the Microsoft speech recognition baseline scores for those languages.

Research on deep neural network technologies with big data: I work closely with Frank Seide and Dong Yu on deep neural network technologies, researching and validating training algorithms on large-scale benchmark data sets.

Drive and transfer core technologies to the cloud-based multimedia indexing service (MAVIS), which enables full-text search of the content inside audio or video files; this is a game changer, or killer feature, for a multimedia search engine. (project web:
Some research topics or technologies transferred or migrated to MAVIS: 
  - Research and implement file-based vocabulary adaptation: use a single file's content to collect files with similar content from the internet and extract high-value out-of-vocabulary (OOV) words, extending the background knowledge and improving recognition accuracy by a relative 10%;
  - Research and implement file-based automatic keyword extraction: use a single file's content, together with similar content from the internet, to extract high-value keywords for the current file, which gives a better user experience in the transcription preview;
  - Research and implement DNN-based voice activity detection (VAD): build a DNN-based voice/silence classifier that detects the voiced parts of the media, in order to better parallelize the decoding process and avoid unnecessary run time on silence; this reduces run time by 90% while keeping the same recognition accuracy as the previous version;
  - Implement text normalization for the audio indexing service: in English, the same words can be written in different forms; normalizing them to a uniform form gives better language model training and a better reading experience for users;
  - Research and implement automatic transcription generation: for a given media file and its rush transcription (an imprecise transcription without time information), automatically generate a transcription with time stamps, which is much faster than a human can do it;
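
To make the VAD item above more concrete, here is a minimal sketch of how per-frame voice/silence scores could be turned into speech segments that are then decoded in parallel. It is not the MAVIS implementation: the function name `speech_segments`, the 0.5 threshold, and the `min_silence_frames` parameter are all assumptions made for illustration, and the scores stand in for the output of a DNN classifier.

```python
from typing import List, Optional, Tuple


def speech_segments(
    speech_probs: List[float],
    threshold: float = 0.5,        # assumed decision threshold
    min_silence_frames: int = 3,   # assumed minimum pause that ends a segment
) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges judged to be speech.

    Silence gaps shorter than min_silence_frames are bridged, so a brief
    pause does not split an utterance into two segments.
    """
    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None  # first frame of the segment being built
    last_speech = -1             # index of the most recent speech frame
    for i, p in enumerate(speech_probs):
        if p >= threshold:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= min_silence_frames:
            # The pause is long enough: close the current segment.
            segments.append((start, last_speech + 1))
            start = None
    if start is not None:
        # The input ended while a segment was still open.
        segments.append((start, last_speech + 1))
    return segments


# Hypothetical per-frame speech probabilities from a classifier.
probs = [0.1, 0.9, 0.8, 0.2, 0.9, 0.1, 0.1, 0.1, 0.7, 0.1]
print(speech_segments(probs))
```

Each returned range can be handed to a separate decoder worker, and frames outside the ranges are skipped entirely, which is where the run-time saving on silence would come from.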

Contribute to Skype Translator, a product that combines automatic speech recognition, machine translation, and text-to-speech: it recognizes speech in the source language, translates it into another language, and speaks the result. It focuses on the conversation scenario and aims to break down the communication barrier between people who speak different languages. My focus is the speech recognition module: I build a CRF-based automatic punctuation generation model to improve the user reading experience; research and implement a process to extract training-ready data from internet-downloadable files with rush transcripts; and build speech recognition models and demos.