Natural Language Computing

The Natural Language Computing (NLC) Group focuses on text analysis, machine translation, information retrieval, question-answering and language gaming. Since it was founded in 1998, the group has worked with partners on many significant innovations, including MS-IME, Chinese couplets, Bing Dictionary, Bing Translator, Bing IME, Engkoo Question-Answer, SNS text mining, a new generation of search engine, the spoken translator, sign language translation, and riddle guessing and generation, among others.


The information era has brought us vast amounts of digitized text that is generated, propagated, exchanged, stored, and accessed through the Internet each day across the world. The accumulation of this data is making information acquisition increasingly difficult, with language becoming a critical obstacle to growth. To overcome these difficulties, the Natural Language Computing (NLC) Group is focusing its efforts on a variety of research topics, including multi-language text analysis, machine translation, cross-language information retrieval, text mining of large-scale web, social and enterprise data, question answering over the web, knowledge bases and social repositories, and applications of NLP technology to search engines, office software and cloud computing. Recently, we have made a series of explorations in applying deep learning to typical NLP tasks such as machine translation, sentiment analysis and question-answering, with the aim of rebuilding the methodological foundations of NLP.

Over the years, the group has made significant contributions to Microsoft products, including a Japanese and Chinese Input Method Editor (IME) for Office in 2001, an English writing assistant for Office in 2007, the Chinese couplet game for Windows Live, a Chinese word breaker, pinyin search and a search speller for the MSN search engine, text mining for SQL Server and SharePoint, and metadata extraction for MSN, as well as Bing Dictionary, Bing IME, sentiment analysis, and Light Question-Answering. Our research has been published at the most prestigious NLP conferences, including 50 papers at ACL, 17 papers at COLING, and 9 papers at SIGIR from 2000 to 2013. In recognition of these achievements, the group received the MSRA stamina award in 2006, the MSRA collaboration award (both group and individual) in 2012 and 2013, and the MSRA best demo award, social impact award, and technical transfer award in 2012. The Engkoo Dictionary, later rebranded as Bing Dictionary, is an important innovation in English learning and online dictionaries that integrates machine translation and speech synthesis. The result of a collaboration between this group, the MSRA Speech Group, and the MSRA IEG Group, it won many awards, including the Wall Street Journal's 2010 Asian Innovation Readers' Choice Award. The Chinese-English translation engine has been deployed in Bing Translator. Our translation system also supported the MS spoken translator, which Rick Rashid successfully demonstrated live at the 21st Century Computing Conference in Tianjin in 2012. The Question-Answering platform Engkoo Answers (aka project Light) has incubated a set of key technologies for web search, entity search and social search.

This group collaborates broadly with dozens of universities in China, Japan, Korea, Singapore, Taiwan and Hong Kong on topics ranging from machine translation, web mining, sentiment analysis and question-answering to SNS text mining, summarization and search. Among many successful collaborations, this group and Internet Graphics worked with the Institute of Computing Technology, Chinese Academy of Sciences, and Beijing Union University to develop a Sign Language Recognition system using Kinect, which has had a big impact. We also recruit research interns from over 20 universities worldwide to work with researchers on important topics. The group has actively contributed to the NLP research community. Notable contributions include working with Harbin Institute of Technology (since 2004) and the Chinese Information Processing Society (since 2013) to run a summer school on NLP and Internet Innovations since 2001, helping the China Computer Federation establish the NLPCC conference, and promoting MS joint labs with Harbin Institute of Technology and Tsinghua University.





Areas of Focus

Our research strategy is data-driven and based on statistical learning: we collect large-scale monolingual and bilingual corpora from the web and third parties, and use machine learning approaches to acquire linguistic and translation knowledge. This knowledge is then used to support our research projects. Below is an introduction to our main research areas.

Corpus Collection, Classification, and Annotation

This is a continuous effort to build a large text corpus as the infrastructure for statistical learning. Text can be acquired from various documents and from the Web. Text classification by topic and writing style is useful for the construction of a balanced corpus as well as various domain-specific corpora. Corpus annotation is a challenging task. It includes word segmentation, named entity identification, part-of-speech tagging, syntactic parsing, word sense tagging, and anaphora tagging. The different tagging tools can be used directly in a number of natural language applications. The different annotated corpora can serve as supervised training data for statistical language modeling for different purposes.
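To make the word segmentation step concrete, here is a minimal sketch of greedy forward maximum matching, a textbook dictionary-based baseline. It is not the group's segmenter; the function name `segment`, the toy lexicon, and the `max_len` parameter are assumptions made for illustration only.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; fall back to a single character otherwise.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; a real system would use a large dictionary
# or a statistical model trained on annotated corpora.
lexicon = {"自然", "语言", "自然语言", "处理"}
print(segment("自然语言处理", lexicon))  # ['自然语言', '处理']
```

A greedy matcher like this cannot resolve ambiguous boundaries or unknown words, which is one reason annotated corpora are useful as training data for statistical segmenters.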

Asian Language Natural Language Processing

Text Information Mining and Extraction (TIME) is a platform used to extract key information from a variety of documents, such as web pages, Word documents, and PowerPoint presentations, in different languages. The extracted information can be used to support information retrieval and search engines, machine translation, summarization, and question answering. This innovation covers a variety of technologies such as tokenization, named entity identification, semantic labeling or skeleton information extraction, key term extraction, and summarization.

Statistical Machine Translation

The focus of the Statistical Machine Translation project is on helping and guiding non-native English users, such as Chinese, Japanese and Korean speakers, to search, read and write English more fluently. To this end, the NLC Group has applied statistical machine translation to provide meaningful translation solutions at the word, phrase or collocation, and sentence levels. Supported by translation technologies, the group is conducting research into new applications for search engines, such as multilingual search. This application works at the word level for input queries and at the sentence level for translation of returned snippets.

The translation engine has been used in Engkoo Dictionary, later rebranded as Bing Dictionary, which has solidly supported the English search of Bing. The Chinese-English translation engine has been deployed in Bing Translator. Our translation engines have also been used to support the MS spoken translator and the Sign Language Recognition system using Kinect, both important collaborations with multiple research groups in MSR and external partners.

Information Retrieval

Our goal is to explore the use of natural language processing (NLP) technologies to improve the performance of classical information retrieval (IR), including indexing, query suggestion, spelling correction, and relevance ranking. We try these approaches in a vertical domain first and gradually extend them to open domains. We have explored the best indexing terms for Chinese, new approaches to query expansion, mining word association and similarity from a text corpus, methods for fusing retrieval results from different IR systems, base NP identification, and accurate query translation using statistical and example-based approaches. We participated in the cross-lingual tracks of TREC-9 and NTCIR-III and obtained the best results on cross-language information retrieval, focusing on query translation and on optimizing indexing for a Chinese IR system. We also participated in the Web track of TREC-10.

Based on the above-mentioned technologies, we have built a successful NLP-based search engine (lingo) that performs deep NLP analysis to build its index and allows complicated queries to search a database. This search engine was used in Engkoo Dictionary, later rebranded as Bing Dictionary, to search the huge collection of bilingual example sentences mined from the web. It was also used in our semantic tweet search (QuickView) in 2010.
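The choice of indexing terms for Chinese mentioned above can be illustrated with a small sketch. Because Chinese text has no spaces between words, one common option is to index overlapping character n-grams instead of (or alongside) segmented words. The code below is a minimal illustration of that idea under stated assumptions, not the lingo engine or the group's TREC system; the function names and the toy documents are hypothetical.

```python
from collections import defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < n:
        return ["".join(chars)] if chars else []
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def build_index(docs, n=2):
    """Build an inverted index mapping each n-gram to the set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in char_ngrams(text, n):
            index[term].add(doc_id)
    return index


def search(index, query, n=2):
    """Return ids of documents containing every n-gram of the query."""
    terms = char_ngrams(query, n)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical toy collection.
docs = {1: "机器翻译系统", 2: "信息检索系统"}
index = build_index(docs)
print(search(index, "翻译"))  # {1}
print(search(index, "系统"))  # {1, 2}
```

Bigram indexing needs no dictionary and is robust to unknown words, at the cost of a larger index and some spurious matches across word boundaries; word-based indexing makes the opposite trade-off.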

Question Answering

Question answering is a key technology being developed for the next generation search engine. Given a question, a search engine user hopes to get an exact answer rather than face a huge number of query results. NLC Group is creating question reformulation, paraphrasing, and various answer extraction techniques for factoid questions and non-factoid questions. Based on this work, the group also hopes to build domain specific chatbots with question answering technologies that mine text forums, web blogs, and other web resources.

Since 2011, we have been building a QA research platform called Light (now called Engkoo Answers), which is designed to provide fundamental tools and benchmarks to support the long-term, sustained development of research on the key elements of QA. These include question understanding; question paraphrasing; query rewriting and correction; query expansion; entity extraction from queries, documents, webpages and search snippets; answer extraction; ranking; confidence assignment for candidate answers; sentiment analysis; and opinionated summarization. We built web-QA, which uses web search results to support question-answering; KB-QA, which uses large-scale knowledge bases such as Freebase and MS knowledge bases; and social-QA, which uses large repositories of community QA pairs, tweets and forums. The platform can answer factoid questions, non-factoid questions such as definitional, yes/no and subjective questions, and even Jeopardy! quiz questions.
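One way to picture how candidate answers from several sources might be ranked together is the sketch below. It simply sums per-source confidence scores for answers that normalize to the same string and sorts the totals. This is an illustrative assumption, not the ranking method used in Light or Engkoo Answers; the source labels, scores, and the `merge_candidates` function are hypothetical.

```python
from collections import defaultdict


def merge_candidates(candidates_by_source):
    """Combine (answer, confidence) pairs from several QA sources.

    Answers are normalized by stripping whitespace and lowercasing;
    confidences for the same normalized answer are summed. Returns a list
    of (answer, total_confidence) sorted by descending confidence, with
    ties broken alphabetically.
    """
    scores = defaultdict(float)
    for candidates in candidates_by_source.values():
        for answer, confidence in candidates:
            scores[answer.strip().lower()] += confidence
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical candidates for a single factoid question.
candidates = {
    "web": [("Beijing", 0.6), ("Shanghai", 0.3)],
    "kb": [("beijing", 0.9)],
    "social": [("Shanghai ", 0.2)],
}
for answer, score in merge_candidates(candidates):
    print(answer, round(score, 2))
```

Summing scores rewards answers that several sources agree on; a real system would more likely learn source weights and calibrate confidences rather than add raw scores.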


Semantic Analysis and Search for Big Text Data (Project QuickView)

This project started in 2000. We would like to build semantic analysis and search for large volumes of text data, covering unstructured, structured and semi-structured data. This semantic analysis is a pipeline of text data processing, information extraction, search, summarization, question-answering and visualization. We would like to help search engines evolve from the current search-only function toward decision making and task completion. Perhaps more importantly, we would like to help enterprise users distill information and knowledge from enormous data sources via a cloud computing service, in order to support business intelligence, information access and document generation. We hope to develop unified and standard methods that cover different genres of data, starting with standard text, moving on to noisy tweets, and then to structured databases.

Currently we have developed a semantic analysis and search engine for tweets (QuickView) with the full functions of semantic analysis including tweet categorization, clustering, NER, semantic role labelling, sentiment analysis, opinion mining, keyword search and simple question-answering.


Language Gaming

Can you imagine a computer capable of generating Chinese couplets? The NLC Group has made this a reality for the first time in the world by creating Chinese Couplet Generation software as part of its language gaming project for the Internet and mobile games. The software works by accepting a sentence provided by a user and then extrapolating a couplet sentence. This technology can be used to further Chinese language learning by entertaining and engaging users.

We further extended this research to do classical Chinese poetry generation and riddle guessing and generation.


Selected Publications



  • Wei Wu, Hang Li, Jun Xu: Learning query and document similarities from click-through bipartite graph with metadata. WSDM 2013: 687-696
  • Dongdong Zhang, Shuangzhi Wu, Nan Yang, Mu Li: Punctuation Prediction with Transition-based Parsing. ACL, August 2013
  • Lei Cui, Dongdong Zhang, Shujie Liu, Mu Li, Ming Zhou: Collective Corpus Weighting and Phrase Scoring for SMT using Graph-based Random Walk. NLP-CC, November 2013
  • Xiaohua Liu, Ming Zhou: Two-stage NER for tweets with clustering. Inf. Process. Manage. 49(1): 264-273 (2013)
  • Xiaohua Liu, Furu Wei, Shaodian Zhang, Ming Zhou: Named entity recognition for tweets. ACM TIST 4(1): 3 (2013)
  • Jinhan Kim, Seung-won Hwang, Long Jiang, Young-In Song, Ming Zhou: Entity Translation Mining from Comparable Corpora: Combining Graph Mapping with Corpus Latent Features. IEEE Trans. Knowl. Data Eng. 25(8): 1787-1800 (2013)
  • Li Dong, Furu Wei, Yajuan Duan, Xiaohua Liu, Ming Zhou, Ke Xu: The Automated Acquisition of Suggestions from Tweets. AAAI 2013
  • Dehong Gao, Furu Wei, Wenjie Li, Xiaohua Liu, Ming Zhou: Co-Training Based Bilingual Sentiment Lexicon Learning. AAAI (Late-Breaking Developments) 2013
  • Zhengyan He, Shujie Liu, Mu Li, Ming Zhou, Longkai Zhang, Houfeng Wang: Learning Entity Representation for Entity Disambiguation. ACL (2) 2013: 30-34
  • Chenguang Wang, Nan Duan, Ming Zhou, Ming Zhang: Paraphrasing Adaptation for Web Search Ranking. ACL (2) 2013: 41-46
  • Nan Yang, Shujie Liu, Mu Li, Ming Zhou, Nenghai Yu: Word Alignment Modeling with Context Dependent Deep Neural Network. ACL (1) 2013: 166-175
  • Lei Cui, Dongdong Zhang, Shujie Liu, Mu Li, Ming Zhou: Bilingual Data Cleaning for SMT using Graph-based Random Walk. ACL (2) 2013: 340-345
  • Xiaohua Liu, Yitong Li, Haocheng Wu, Ming Zhou, Furu Wei, Yi Lu: Entity Linking for Tweets. ACL (1) 2013: 1304-1311
  • Yuki Arase, Ming Zhou: Machine Translation Detection from Monolingual Web-Text. ACL (1) 2013: 1597-1607
  • Keisuke Sakaguchi, Yuki Arase, Mamoru Komachi: Discriminative Approach to Fill-in-the-Blank Quiz Generation for Language Learners. ACL (2) 2013: 238-242
  • Xiujuan Chai, Guang Li, Xilin Chen, Ming Zhou, Guobin Wu, Hanjing Li: VisualComm: a tool to support communication between deaf and hearing persons with the Kinect. ASSETS 2013: 76
  • Hyun-Kyo Oh, Sang-Wook Kim, Sunju Park, Ming Zhou: Trustable aggregation of online ratings. CIKM 2013: 1233-1236
  • Zhengyan He, Shujie Liu, Yang Song, Mu Li, Ming Zhou, Houfeng Wang: Efficient Collective Entity Linking with Stacking. EMNLP 2013: 426-435
  • Lei Cui, Xilun Chen, Dongdong Zhang, Shujie Liu, Mu Li, Ming Zhou: Multi-Domain Adaptation for SMT Using Multi-Task Learning. EMNLP 2013: 1055-1065
  • Hong Sun, Nan Duan, Yajuan Duan, Ming Zhou: Answer Extraction from Passage Graph for Question Answering. IJCAI 2013


  • Jing He, Ming Zhou, Long Jiang: Generating Chinese Classical Poems with Statistical Machine Translation Models. AAAI 2012
  • Nan Yang, Mu Li, Dongdong Zhang, Nenghai Yu: A Ranking-based Approach to Word Reordering for Statistical Machine Translation. ACL (1) 2012: 912-920
  • Yang Feng, Dongdong Zhang, Mu Li, Qun Liu: Hierarchical Chunk-to-String Translation. ACL (1) 2012: 950-958
  • Yang Feng, Dongdong Zhang, Qun Liu. Prepositional Phrase Reordering for Hierarchical Phrase-Based Translation. Journal of Chinese Information Processing. 2012: 26(1).
  • Xiaohua Liu, Zhongyang Fu, Furu Wei, Ming Zhou: Collective Nominal Semantic Role Labeling for Tweets. AAAI 2012
  • Xiaohua Liu, Xiangyang Zhou, Zhongyang Fu, Furu Wei, Ming Zhou: Exacting Social Events for Tweets Using a Factor Graph. AAAI 2012
  • Xiaohua Liu, Furu Wei, Ming Zhou: QuickView: NLP-based Tweet Search. ACL (System Demonstrations) 2012: 13-18
  • Hong Sun, Ming Zhou: Joint Learning of a Dual SMT System for Paraphrase Generation. ACL (2) 2012: 38-42
  • Seung-Wook Lee, Dongdong Zhang, Mu Li, Ming Zhou, Hae-Chang Rim: Translation Model Size Reduction for Hierarchical Phrase-based Statistical Machine Translation. ACL (2) 2012: 291-295
  • Shujie Liu, Chi-Ho Li, Mu Li, Ming Zhou: Learning Translation Consensus with Structured Label Propagation. ACL (1) 2012: 302-310
  • Xiaohua Liu, Ming Zhou, Xiangyang Zhou, Zhongyang Fu, Furu Wei: Joint Inference of Named Entity Recognition and Normalization for Tweets. ACL (1) 2012: 526-535
  • Xinfan Meng, Furu Wei, Xiaohua Liu, Ming Zhou, Ge Xu, Houfeng Wang: Cross-Lingual Mixture Model for Sentiment Classification. ACL (1) 2012: 572-581
  • Yajuan Duan, Furu Wei, Ming Zhou, Heung-Yeung Shum: Graph-based collective classification for tweets. CIKM 2012: 2323-2326
  • Yajuan Duan, Zhimin Chen, Furu Wei, Ming Zhou, Heung-Yeung Shum: Twitter Topic Summarization by Ranking Tweets using Social Influence and Content Quality. COLING 2012: 763-780
  • Xinfan Meng, Furu Wei, Ge Xu, Longkai Zhang, Xiaohua Liu, Ming Zhou, Houfeng Wang: Lost in Translations? Building Sentiment Lexicons using Context Based Machine Translation. COLING (Posters) 2012: 829-838
  • Xiaohua Liu, Yitong Li, Furu Wei, Ming Zhou: Graph-Based Multi-Tweet Summarization using Social Signals. COLING 2012: 1699-1714
  • Nan Duan, Mu Li, Ming Zhou: Forced Derivation Tree based Model Training to Statistical Machine Translation. EMNLP-CoNLL 2012: 445-454
  • Shujie Liu, Chi-Ho Li, Mu Li, Ming Zhou: Re-training Monolingual Parser Bilingually for Syntactic SMT. EMNLP-CoNLL 2012: 854-862
  • Xinfan Meng, Furu Wei, Xiaohua Liu, Ming Zhou, Sujian Li, Houfeng Wang: Entity-centric topic-oriented opinion summarization in twitter. KDD 2012: 379-387
  • Mitsuo Yoshida, Yuki Arase: Exploiting Twitter for Spiking Query Classification. AIRS 2012: 138-149


  • Matthew R. Scott, Xiaohua Liu, Ming Zhou. Towards a Specialized Search Engine for Language Learners. Proceedings of the IEEE, Vol.99, No.9, pp.1462-1465, Sept. 2011
  • Lei Cui, Dongdong Zhang, Mu Li, Ming Zhou: Function Word Generation in Statistical Machine Translation Systems. Machine Translation Summit XIII, September 2011
  • Xiaohua Liu, Kuan Li, Ming Zhou, Zhongyang Xiong: Enhancing Semantic Role Labeling for Tweets Using Self-Training. AAAI 2011
  • Matthew R. Scott, Xiaohua Liu, Ming Zhou: Engkoo: Mining the Web for Language Learning. ACL (System Demonstrations) 2011: 44-49
  • Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, Tiejun Zhao: Target-dependent Twitter Sentiment Classification. ACL 2011: 151-160
  • Xiaohua Liu, Shaodian Zhang, Furu Wei, Ming Zhou: Recognizing Named Entities in Tweets. ACL 2011: 359-367
  • Nan Duan, Mu Li, Ming Zhou: Hypothesis Mixture Decoding for Statistical Machine Translation. ACL 2011: 1258-1267
  • Xiaohua Liu, Bo Han, Ming Zhou: Correcting Verb Selection Errors for ESL with the Perceptron. CICLing (2) 2011: 411-423
  • Xiaolong Wang, Furu Wei, Xiaohua Liu, Ming Zhou, Ming Zhang: Topic sentiment analysis in twitter: a graph-based hashtag sentiment classification approach. CIKM 2011: 1031-1040
  • Jinhan Kim, Long Jiang, Seung-won Hwang, Young-In Song, Ming Zhou: Mining entity translations from comparable corpora: a holistic graph mapping approach. CIKM 2011: 1295-1304
  • Xiaohua Liu, Kuan Li, Ming Zhou, Zhongyang Xiong: Collective Semantic Role Labeling for Tweets with Clustering. IJCAI 2011: 1832-1837
  • Chenhao Tan, Lillian Lee, Jie Tang, Long Jiang, Ming Zhou, Ping Li: User-level sentiment analysis incorporating social networks. KDD 2011: 1397-1405
  • Xiaohua Liu, Long Jiang, Furu Wei, Ming Zhou: QuickView: advanced search of tweets. SIGIR 2011: 1275-1276
  • Duo Ding, Xingping Jiang, Matthew R. Scott, Ming Zhou, Yong Yu: Tulsa: web search for writing assistance. SIGIR 2011: 1287-1288
  • Chenhao Tan, Lillian Lee, Jie Tang, Long Jiang, Ming Zhou, Ping Li: User-level sentiment analysis incorporating social networks. CoRR abs/1109.6018 (2011)


  • Xiaohua Liu, Ming Zhou: Evaluating the Quality of Web-Mined Bilingual Sentence Pairs. Int. J. of Asian Lang. Proc. 20(4): 171-179 (2010)
  • Wei Gao, Cheng Niu, Jian-Yun Nie, Ming Zhou, Kam-Fai Wong, Hsiao-Wuen Hon: Exploiting query logs for cross-lingual query suggestions. ACM Trans. Inf. Syst. 28(2) (2010)
  • Lei Cui, Dongdong Zhang, Mu Li, Ming Zhou, Tiejun Zhao: A Joint Rule Selection Model for Hierarchical Phrase-Based Translation. ACL (Short Papers) 2010: 6-11
  • Shujie Liu, Chi-Ho Li, Ming Zhou: Discriminative Pruning for Discriminative ITG Alignment. ACL 2010: 316-324
  • Lei Cui, Dongdong Zhang, Mu Li, Ming Zhou, Tiejun Zhao: Hybrid Decoding: Decoding with Partial Hypotheses Combination over Multiple SMT Systems. COLING (Posters) 2010: 214-222
  • Yajuan Duan, Long Jiang, Tao Qin, Ming Zhou, Heung-Yeung Shum: An Empirical Study on Learning to Rank of Tweets. COLING 2010: 295-303
  • Nan Duan, Hong Sun, Ming Zhou: Translation Model Generalization using Probability Averaging for Machine Translation. COLING 2010: 304-312
  • Nan Duan, Mu Li, Dongdong Zhang, Ming Zhou: Mixture Model-based Minimum Bayes Risk Decoding using Multiple Machine Translation Systems. COLING 2010: 313-321
  • Gum-Won Hong, Chi-Ho Li, Ming Zhou, Hae-Chang Rim: An Empirical Study on Web Mining of Parallel Data. COLING 2010: 474-482
  • Mu Li, Yinggong Zhao, Dongdong Zhang, Ming Zhou: Adaptive Development Data Selection for Log-linear Model in Statistical Machine Translation. COLING 2010: 662-670
  • Xiaohua Liu, Kuan Li, Bo Han, Ming Zhou, Long Jiang, Zhongyang Xiong, Changning Huang: Semantic Role Labeling for News Tweets. COLING 2010: 698-706
  • Xiaohua Liu, Kuan Li, Bo Han, Ming Zhou, Long Jiang, Daniel Tse, Zhongyang Xiong: Collective Semantic Role Labeling on Open News Corpus by Leveraging Redundancy. COLING (Posters) 2010: 725-729
  • Shujie Liu, Chi-Ho Li, Ming Zhou: Improved Discriminative ITG Alignment using Hierarchical Phrase Pairs and Semi-supervised Training. COLING (Posters) 2010: 730-738
  • Xiaohua Liu, Bo Han, Kuan Li, Stephan Hyeonjun Stiller, Ming Zhou: SRL-Based Verb Selection for ESL. EMNLP 2010: 1068-1076
  • Xiaohua Liu, Ming Zhou: Evaluating the Quality of Web-Mined Bilingual Sentences Using Multiple Linguistic Features. IALP 2010: 281-284


  • Mu Li, Nan Duan, Dongdong Zhang, Chi-Ho Li, Ming Zhou: Collaborative Decoding: Partial Hypothesis Re-ranking Using Translation Consensus between Decoders. ACL/IJCNLP 2009: 585-592
  • Long Jiang, Shiquan Yang, Ming Zhou, Xiaohua Liu, Qingsheng Zhu: Mining Bilingual Data from the Web with Adaptively Learnt Patterns. ACL/IJCNLP 2009: 870-878
  • Wei Gao, John Blitzer, Ming Zhou, Kam-Fai Wong: Exploiting Bilingual Information to Improve Web Search. ACL/IJCNLP 2009: 1075-1083
  • Wei Gao, Cheng Niu, Ming Zhou, Kam-Fai Wong: Joint Ranking for Multilingual Web Search. ECIR 2009: 114-125 (best paper)
  • Tong Xiao, Mu Li, Dongdong Zhang, Jingbo Zhu, Ming Zhou: Better Synchronous Binarization for Machine Translation. EMNLP 2009: 362-370
  • Nan Duan, Mu Li, Tong Xiao, Ming Zhou: The Feature Subspace Method for SMT System Combination. EMNLP 2009: 1096-1104
  • Ming Zhou, Long Jiang, Jing He: Generating Chinese Couplets and Quatrain Using a Statistical Approach. PACLIC 2009: 43-52


  • Dongdong Zhang, Mu Li, Nan Duan, Chi-Ho Li, Ming Zhou: Measure Word Generation for English-Chinese SMT Systems. ACL 2008: 89-96
  • Shiqi Zhao, Cheng Niu, Ming Zhou, Ting Liu, Sheng Li: Combining Multiple Resources to Improve SMT-based Paraphrasing Model. ACL 2008: 1021-1029
  • Wei Gao, John Blitzer, Ming Zhou: Using English information in non-English web search. CIKM-iNEWS 2008: 17-24
  • Long Jiang, Ming Zhou: Generating Chinese Couplets using a Statistical MT Approach. COLING 2008: 377-384
  • Ming Zhou, Bo Wang, Shujie Liu, Mu Li, Dongdong Zhang, Tiejun Zhao: Diagnostic Evaluation of Machine Translation Systems Using Automatically Constructed Linguistic Check-Points. COLING 2008: 1121-1128
  • Lei Shi, Ming Zhou: Improved Sentence Alignment on Parallel Web Pages Using a Stochastic Tree Alignment Model. EMNLP 2008: 505-513


  • Guihua Sun, Gao Cong, Xiaohua Liu, Chin-Yew Lin, Ming Zhou: Mining Sequential Patterns and Tree Patterns to Detect Erroneous Sentences. AAAI 2007: 925-930
  • Chi-Ho Li, Minghui Li, Dongdong Zhang, Mu Li, Ming Zhou, Yi Guan: A Probabilistic Approach to Syntax-based Reordering for Statistical Machine Translation. ACL 2007
  • Guihua Sun, Xiaohua Liu, Gao Cong, Ming Zhou, Zhongyang Xiong, John Lee, Chin-Yew Lin: Detecting Erroneous Sentences using Automatically Mined Sequential Patterns. ACL 2007
  • Qing Chen, Mu Li, Ming Zhou: Improving Query Spelling Correction Using Web Search Results. EMNLP-CoNLL 2007: 181-189
  • Jingjing Liu, Yunbo Cao, Chin-Yew Lin, Yalou Huang, Ming Zhou: Low-Quality Product Review Detection in Opinion Summarization. EMNLP-CoNLL 2007: 334-342
  • Dongdong Zhang, Mu Li, Chi-Ho Li, Ming Zhou: Phrase Reordering Model Integrating Syntactic Knowledge for SMT. EMNLP-CoNLL 2007: 533-540
  • Jizhou Huang, Ming Zhou, Dan Yang: Extracting Chatbot Knowledge from Online Discussion Forums. IJCAI 2007: 423-428
  • Long Jiang, Ming Zhou, Lee-Feng Chien, Cheng Niu: Named Entity Translation with Web Mining and Transliteration. IJCAI 2007: 1629-1634
  • Shiqi Zhao, Ming Zhou, Ting Liu: Learning Question Paraphrases for QA from Encarta Logs. IJCAI 2007: 1795-1801
  • John Lee, Ming Zhou, Xiaohua Liu: Detection of Non-Native Sentences Using Machine-Translated Training Data. HLT-NAACL (Short Papers) 2007: 93-96
  • Wei Gao, Cheng Niu, Jian-Yun Nie, Ming Zhou, Jian Hu, Kam-Fai Wong, Hsiao-Wuen Hon: Cross-lingual query suggestion using query logs of different languages. SIGIR 2007: 463-470


  • Jianfeng Gao, Jian-Yun Nie, Ming Zhou: Statistical query translation models for cross-language information retrieval. ACM Trans. Asian Lang. Inf. Process. 5(4): 323-359 (2006)
  • Yi Chen, Ming Zhou, Shilong Wang: Reranking Answers for Definitional QA Using Language Modeling. ACL 2006
  • Mu Li, Muhua Zhu, Yang Zhang, Ming Zhou: Exploring Distributional Similarity Based Models for Query Spelling Correction. ACL 2006
  • Lei Shi, Cheng Niu, Ming Zhou, Jianfeng Gao: A DOM Tree Alignment Model for Mining Parallel Data from the Web. ACL 2006
  • Yunhua Hu, Hang Li, Yunbo Cao, Li Teng, Dmitriy Meyerzon, Qinghua Zheng: Automatic extraction of titles from general documents using machine learning. Inf. Process. Manage. 42(5): 1276-1293 (2006)
  • Jun Xu, Yunbo Cao, Hang Li, Min Zhao, Yalou Huang: A Supervised Learning Approach to Search of Definitions. J. Comput. Sci. Technol. 21(3): 439-449 (2006)
  • Min Zhao, Hang Li, Adwait Ratnaparkhi, Hsiao-Wuen Hon, Jue Wang: Adapting Document Ranking to Users' Preferences Using Click-Through Data. AIRS 2006: 26-42
  • Guoping Hu, Jingjing Liu, Hang Li, Yunbo Cao, Jian-Yun Nie, Jianfeng Gao: A Supervised Learning Approach to Entity Search. AIRS 2006: 54-66
  • Jun Xu, Yunbo Cao, Hang Li, Yalou Huang: Cost-Sensitive Learning of SVM for Ranking. ECML 2006: 833-840
  • Shenghua Bao, Yunbo Cao, Bing Liu, Yong Yu, Hang Li: Mining Latent Associations of Objects Using a Typed Mixture Model--A Case Study on Expert/Expertise Mining. ICDM 2006: 803-807
  • Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, Hsiao-Wuen Hon: Adapting ranking SVM to document retrieval. SIGIR 2006: 186-193


  • Kun Yu, Gang Guan, Ming Zhou: Resume Information Extraction with Cascaded Hybrid Model. ACL 2005
  • Sung-Hyon Myaeng, Ming Zhou, Kam-Fai Wong, HongJiang Zhang (Eds.): Information Retrieval Technology, Asia Information Retrieval Symposium, AIRS 2004, Beijing, China, October 18-20, 2004, Revised Selected Papers. Lecture Notes in Computer Science 3411, Springer 2005, ISBN 3-540-25065-4
  • Jun Xu, Yunbo Cao, Hang Li, Min Zhao: Ranking definitions with supervised learning methods. WWW (Special interest tracks and posters) 2005: 811-819
  • Yunbo Cao, Jingjing Liu, Shenghua Bao, Hang Li: Research on Expert Search at Enterprise Track of TREC 2005. TREC 2005
  • Yunhua Hu, Guomao Xin, Ruihua Song, Guoping Hu, Shuming Shi, Yunbo Cao, Hang Li: Title extraction from bodies of HTML documents and its application to web page retrieval. SIGIR 2005: 250-257


  • Ya-Juan Lv, Ming Zhou: Collocation Translation Acquisition Using Monolingual Corpora. 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, Jul. 2004
  • Wei Wang, Ming Zhou: Improving Word Alignment Models using Structured Monolingual Corpora. EMNLP 2004: 198-205
  • Dong-Hui Feng, Ya-Juan Lv, Ming Zhou: A New Approach for English-Chinese Named Entity Alignment. 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, Jul. 2004
  • Jianfeng Gao, Jian-Yun Nie, Guangyuan Wu, Guihong Cao: Dependence language model for information retrieval. SIGIR 2004, Sheffield, UK, July 25-29, 2004
  • Jianfeng Gao, Andi Wu, Mu Li, Chang-Ning Huang, Hongqiao Li, Xinsong Xia, Haowei Qin: Adaptive Chinese word segmentation. 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, Jul. 2004
  • Jianfeng Gao, Hisami Suzuki: Capturing long distance dependency for language modeling: an empirical study. IJCNLP-04, Sanya City, Hainan Island, China, March 22-24, 2004
  • Hongqiao Li, Chang-Ning Huang, Jianfeng Gao, Xiaozhong Fan: The use of SVM for Chinese new word identification. IJCNLP-04, Sanya City, Hainan Island, China, March 22-24, 2004
  • Hang Li, Cong Li: Word Translation Disambiguation Using Bilingual Bootstrapping. Computational Linguistics 30(1): 1-22 (2004)
  • Qiang Yang, Charles X. Ling, Jianfeng Gao: Mining web logs for actionable knowledge. To appear as a book chapter.


  • Jianfeng Gao, Mu Li, Chang-Ning Huang: Improved Source-Channel Models for Chinese Word Segmentation. 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, July 7-12, 2003
  • Cong Li, Ji-Rong Wen, Hang Li: Text Classification Using Stochastic Keyword Generation. ICML 2003: 464-471
  • Yunbo Cao, Hang Li, Li Lian: Uncertainty Reduction in Collaborative Bootstrapping: Measure and Algorithm. ACL 2003: 327-334
  • Hang Li, Yunbo Cao, Cong Li: Using Bilingual Web Data to Mine and Rank Translations. IEEE Intelligent Systems 18(4): 54-59 (2003)
  • Hang Li, Kenji Yamanishi: Topic Analysis Using a Finite Mixture Model. Information Processing & Management 39(4): 521-541 (2003)
  • Hua Wu, Ming Zhou: Synonymous Collocation Extraction Using Translation Information. ACL 2003: 120-127
  • Dekang Lin, Shaojun Zhao, Lijuan Qin, Ming Zhou: Identifying Synonyms among Distributionally Similar Words. IJCAI 2003: 1492-1493


  • Qing Ma, Min Zhang, Masaki Murata, Ming Zhou, Hitoshi Isahara: Self-Organizing Chinese and Japanese Semantic Maps. COLING 2002
  • Jian Sun, Jianfeng Gao, Lei Zhang, Ming Zhou, Changning Huang: Chinese Named Entity Identification Using Class-based Language Model. COLING 2002
  • Wei Wang, Ming Zhou, Jin-Xia Huang, Changning Huang: Structure Alignment Using Bilingual Chunking. COLING 2002
  • Jian-Min Yao, Ming Zhou, Tiejun Zhao, Hao Yu, Sheng Li: An Automatic Evaluation Method for Localization Oriented Lexicalised EBMT System. COLING 2002
  • Jianfeng Gao, Ming Zhou, Jian-Yun Nie, Hongzhao He, Weijun Chen: Resolving query translation ambiguity using a decaying co-occurrence model and syntactic dependence relations. SIGIR 2002: 183-190
  • Jianfeng Gao, Joshua Goodman, Guihong Cao, Hang Li: Exploring Asymmetric Clustering for Statistical Language Modeling. ACL 2002: 183-190
  • Cong Li, Hang Li: Word Translation Disambiguation Using Bilingual Bootstrapping. ACL 2002: 343-351
  • Yunbo Cao, Hang Li: Base Noun Phrase Translation Using Web Data and the EM Algorithm. COLING 2002


  • Wei Wang, Jin-Xia Huang, Ming Zhou, Changning Huang: Finding Target Language Correspondence for Lexicalized EBMT System. NLPRS 2001: 455-460
  • Tom B. Y. Lai, Changning Huang, Ming Zhou, Jiangbo Miao, Tony K. C. Siu: Span-based Statistical Dependency Parsing of Chinese. NLPRS 2001: 677-684
  • Jianfeng Gao, Endong Xun, Ming Zhou, Changning Huang, Jian-Yun Nie, Jian Zhang: Improving Query Translation for Cross-Language Information Retrieval Using Statistical Models. SIGIR 2001: 96-104


  • Ting Liu, Ming Zhou, Jianfeng Gao, Endong Xun, Changning Huang: PENS: A Machine-aided English Writing System for Chinese Users. ACL 2000
  • Jianfeng Gao, Kai-Fu Lee: Distribution-Based Pruning of Backoff Language Models. ACL 2000
  • Endong Xun, Changning Huang, Ming Zhou: A Unified Statistical Model for the Identification of English BaseNP. ACL 2000
  • Lei Zhang, Ming Zhou, Changning Huang, Haihua Pan: Automatic Detecting/Correcting Errors in Chinese Text by an Approximate Word-Matching Algorithm. ACL 2000
  • Jian-Yun Nie, Jianfeng Gao, Jian Zhang, Ming Zhou: On the use of words and n-grams for Chinese information retrieval. IRAL 2000: 141-148
  • Jianfeng Gao, Jian-Yun Nie, Jian Zhang, Endong Xun, Yi Su, Ming Zhou, Changning Huang: TREC-9 CLIR Experiments at MSRCN. TREC 2000
  • Joshua Goodman, Jianfeng Gao: Language model size reduction by pruning and clustering. INTERSPEECH 2000: 110-113
  • Jianfeng Gao, Mingjing Li, Kai-Fu Lee: N-gram distribution based language model adaptation. INTERSPEECH 2000: 497-500