I am a Scientist (applied researcher) in the Bing Speech and Language Sciences Group at Microsoft in California. My research focuses on applying machine learning techniques to spoken language understanding (SLU) problems, and I have a background in automatic speech recognition (ASR). Recently I have become interested in applying neural networks (recurrent neural networks, deep belief nets) to many tasks in ASR and SLU.
I received my PhD in Electrical and Computer Engineering (ECE) from the Center for Language and Speech Processing (CLSP) and the Human Language Technology Center of Excellence (HLT-COE) at JHU in 2011, under the tutelage of Prof. Fred Jelinek and later Dr. Ken Church. I also hold Masters degrees from JHU in both ECE and Applied Mathematics and Statistics.
Phone: +1 (650) 693-3799
Honors and Awards
Microsoft Patent Awards (2011, 2012, 2013).
Award of student author grant, Int'l Conf. on Spoken Language Processing, 2008.
Reviewer: IEEE Transactions on Audio, Speech, and Language Processing; Speech Communication; IEEE ICASSP; Interspeech; EMNLP; ACL
- Anoop Deoras and Ruhi Sarikaya, Deep Belief Network based Semantic Taggers for Spoken Language Understanding, in ISCA Interspeech, ISCA, September 2013
- Gokhan Tur, Anoop Deoras, and Dilek Hakkani-Tur, Semantic Parsing Using Word Confusion Networks With Conditional Random Fields, Annual Conference of the International Speech Communication Association (Interspeech), September 2013
A challenge in large vocabulary spoken language understanding (SLU) is robustness to automatic speech recognition (ASR) errors. The state-of-the-art approaches for semantic parsing rely on discriminative sequence classification methods, such as conditional random fields (CRFs). Most dialog systems employ a cascaded approach where the best hypotheses from the ASR system are fed into the following SLU system. In our previous work, we have proposed the use of lattices towards joint recognition and parsing. In this paper, extending this idea, we propose to exploit word confusion networks (WCNs), compiled from ASR lattices, for both CRF modeling and decoding. WCNs provide a compact representation of multiple aligned ASR hypotheses, without compromising recognition accuracy. For slot filling, we show significant semantic parsing performance improvements using WCNs compared to ASR 1-best output, approximating the oracle path performance.
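The WCN representation described in the abstract above can be sketched as a simple data structure: a sequence of "bins" of aligned word alternatives with posterior probabilities. This is a minimal illustration only; the words and posteriors are invented for the example, not taken from the paper.

```python
# A toy word confusion network: each bin holds (word, posterior) pairs.
# Picking the highest-posterior word in every bin yields the consensus
# hypothesis. All values here are illustrative.
wcn = [
    [("play", 0.7), ("pray", 0.3)],
    [("some", 0.6), ("sum", 0.4)],
    [("jazz", 0.9), ("jars", 0.1)],
]

def consensus(wcn):
    """Return the word sequence formed by the best word in each bin."""
    return [max(bin_, key=lambda wp: wp[1])[0] for bin_ in wcn]

print(consensus(wcn))  # ['play', 'some', 'jazz']
```

A CRF tagger operating on a WCN can score features drawn from all the alternatives in a bin rather than only the 1-best word, which is what makes the representation attractive for robust slot filling.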
- Anoop Deoras, Gokhan Tur, Ruhi Sarikaya, and Dilek Hakkani-Tur, Joint Discriminative Decoding of Word and Semantic Tags for Spoken Language Understanding, in IEEE Transactions on Audio, Speech, and Language Processing, IEEE, 2013
Most Spoken Language Understanding (SLU) systems today employ a cascade approach, where the best hypothesis from the Automatic Speech Recognizer (ASR) is fed into understanding modules such as slot sequence classifiers and intent detectors. The output of these modules is then fed into downstream components such as an interpreter and/or knowledge broker. These statistical models are usually trained individually to optimize the error rate of their respective output. In such approaches, errors from one module irreversibly propagate into other modules, causing a serious degradation in the overall performance of the SLU system. It is thus desirable to jointly optimize all the statistical models together. As a first step towards this, in this paper we propose a joint decoding framework in which we predict the optimal word sequence as well as slot sequence (semantic tag sequence) jointly, given the input acoustic stream. Furthermore, the improved recognition output is then used for an utterance classification task; specifically, we focus on intent detection. On an SLU task, we show a 1.5% absolute reduction (7.6% relative reduction) in word error rate (WER) and a 1.2% absolute improvement in F measure for slot prediction, compared to a very strong cascade baseline comprising a state-of-the-art large vocabulary ASR system followed by a conditional random field (CRF) based slot sequence tagger. Similarly, for intent detection, we show a 1.2% absolute reduction (12% relative reduction) in classification error rate.
- Anoop Deoras, Ruhi Sarikaya, Gokhan Tur, and Dilek Hakkani-Tur, Joint Decoding for Speech Recognition and Semantic Tagging, Annual Conference of the International Speech Communication Association (Interspeech), September 2012
Most conversational understanding (CU) systems today employ a cascade approach, where the best hypothesis from the automatic speech recognizer (ASR) is fed into the spoken language understanding (SLU) module, whose best hypothesis is then fed into other systems such as the interpreter or dialog manager. In such approaches, errors from one statistical module irreversibly propagate into another module, causing a serious degradation in the overall performance of the conversational understanding system. It is thus desirable to jointly optimize all the statistical modules together. As a first step towards this, in this paper we propose a joint decoding framework in which we predict the optimal word sequence as well as slot (semantic tag) sequence jointly, given the input acoustic stream. On Microsoft's CU system, we show a 1.3% absolute reduction in word error rate (WER) and a 1.2% absolute improvement in F measure for slot prediction, compared to a very strong cascade baseline comprising the state-of-the-art recognizer followed by a slot sequence tagger.
- Anoop Deoras, Tomas Mikolov, Stefan Kombrink, and Ken Church, Approximate Inference: A Sampling Based Modeling Technique to Capture Complex Dependencies in a Language Model, in Elsevier Speech Communication, Elsevier, August 2012
In this paper, we present strategies to incorporate long context information directly during the first pass decoding and also for the second pass lattice re-scoring in speech recognition systems. Long-span language models that capture complex syntactic and/or semantic information are seldom used in the first pass of large vocabulary continuous speech recognition systems due to the prohibitive increase in the size of the sentence-hypotheses search space. Typically, n-gram language models are used in the first pass to produce N-best lists, which are then re-scored using long-span models. Such a pipeline produces biased first pass output, resulting in sub-optimal performance during re-scoring. In this paper we show that computationally tractable variational approximations of the long-span and complex language models are a better choice than the standard n-gram model for the first pass decoding and also for lattice re-scoring.
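The sampling-based approximation idea summarized above can be sketched in a few lines: draw a large corpus of text from the long-span model, then estimate a tractable n-gram model from the samples for use in first-pass decoding. The sampler itself is elided here; `train_bigram` is a minimal illustration of the estimation step (unsmoothed, which a real system would not use).

```python
# Sketch: estimate a tractable bigram model from sentences that, in the
# paper's setting, would be sampled from a long-span model (e.g. an RNN LM).
# The sampler is assumed; only the n-gram estimation step is shown.
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Estimate unsmoothed bigram probabilities P(word | previous word)."""
    counts = defaultdict(Counter)
    for sent in sentences:
        # Pad with sentence-boundary symbols.
        for prev, word in zip(["<s>"] + sent, sent + ["</s>"]):
            counts[prev][word] += 1
    return {
        hist: {w: c / sum(ctr.values()) for w, c in ctr.items()}
        for hist, ctr in counts.items()
    }

model = train_bigram([["a", "b"], ["a", "c"]])
```

In practice the sampled corpus would be large and the resulting n-gram model smoothed, so that it faithfully mimics the long-span model while remaining usable inside a standard first-pass decoder.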
- Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukar Burget, and Jan Honza Cernocky, RNNLM - Recurrent Neural Network Language Modeling Toolkit, IEEE Automatic Speech Recognition and Understanding Workshop, December 2011
We present a freely available open-source toolkit for training recurrent neural network based language models. It can be easily used to improve existing speech recognition and machine translation systems. Also, it can be used as a baseline for future research of advanced language modeling techniques. In the paper, we discuss optimal parameter selection and different modes of functionality. The toolkit, example scripts and basic setups are freely available at http://rnnlm.sourceforge.net/.
- Tomas Mikolov, Anoop Deoras, Dan Povey, Lukar Burget, and Jan Honza Cernocky, Strategies for Training Large Scale Neural Network Language Models, IEEE Automatic Speech Recognition and Understanding Workshop, December 2011
We describe how to effectively train neural network based language models on large data sets. Fast convergence during training and better overall performance are observed when the training data are sorted by their relevance. We introduce a hash-based implementation of a maximum entropy model that can be trained as part of the neural network model. This leads to a significant reduction in computational complexity. We achieved around 10% relative reduction of word error rate on an English Broadcast News speech recognition task, against a large 4-gram model trained on 400M tokens.
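The hash-based maximum entropy component mentioned above rests on the hashing trick: each n-gram feature is mapped into a fixed-size parameter table via a hash, trading occasional collisions for bounded memory. The sketch below shows only that indexing idea, with an arbitrary multiplier constant; it is not the toolkit's actual implementation.

```python
# Illustrative hashing-trick index for n-gram features: map an n-gram
# (tuple of words) to a slot in a fixed-size parameter table. Collisions
# are tolerated in exchange for constant memory. The constant 1000003 is
# an arbitrary choice for this sketch.
def ngram_feature_index(ngram, table_size=2**20):
    h = 0
    for word in ngram:
        h = (h * 1000003 + hash(word)) % table_size
    return h

idx = ngram_feature_index(("the", "cat"))
```

A maximum entropy weight table of size `table_size` can then be updated at `idx` during training, regardless of how many distinct n-grams the corpus contains.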
- Tomas Mikolov, Anoop Deoras, Stefan Kombrink, Lukas Burget, and Jan Honza Cernocky, Empirical Evaluation and Combination of Advanced Language Modeling Techniques, in Interspeech, ISCA, August 2011
We present results obtained with several advanced language modeling techniques, including a class based model, cache model, maximum entropy model, structured language model, random forest language model and several types of neural network based language models. We show results obtained after combining all these models by using linear interpolation. We conclude that for both small and moderately sized tasks, we obtain new state-of-the-art results with a combination of models that is significantly better than the performance of any individual model. Obtained perplexity reductions against a Good-Turing trigram baseline are over 50%, and against a modified Kneser-Ney smoothed 5-gram over 40%.
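The linear interpolation used for model combination in this line of work assigns each component model a mixture weight, typically tuned by EM on held-out data. A minimal sketch of that weight-tuning loop, assuming each model's probability for the observed word at each held-out position has already been computed:

```python
# Minimal EM loop for tuning linear-interpolation weights over K language
# models. `model_probs` holds, for each held-out word position, the list of
# per-model probabilities assigned to the observed word.
def em_interpolation_weights(model_probs, iters=50):
    k = len(model_probs[0])
    w = [1.0 / k] * k  # start from uniform weights
    for _ in range(iters):
        counts = [0.0] * k
        for probs in model_probs:
            mix = sum(wi * pi for wi, pi in zip(w, probs))
            # E-step: posterior responsibility of each model at this position.
            for i in range(k):
                counts[i] += w[i] * probs[i] / mix
        total = sum(counts)
        # M-step: re-normalize responsibilities into new weights.
        w = [c / total for c in counts]
    return w
```

Each EM iteration is guaranteed not to decrease the held-out likelihood of the mixture, so a model that consistently assigns higher probability to the data ends up with the larger weight.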
- Anoop Deoras, Tomas Mikolov, and Kenneth Church, A Fast Re-scoring Strategy to Capture Long-Distance Dependencies, Empirical Methods in Natural Language Processing (EMNLP), July 2011
A re-scoring strategy is proposed that makes it feasible to capture more long-distance dependencies in natural language. Two-pass strategies have become popular in a number of recognition tasks such as ASR (automatic speech recognition), MT (machine translation) and OCR (optical character recognition). The first pass typically applies a weak language model (n-grams) to a lattice and the second pass applies a stronger language model to N-best lists. The stronger language model is intended to capture more long-distance dependencies. The proposed method uses an RNN-LM (recurrent neural network language model), which is a long-span LM, to rescore word lattices in the second pass. A hill climbing method (iterative decoding) is proposed to search over islands of confusability in the word lattice. An evaluation based on Broadcast News shows speedups of 20 over basic N-best re-scoring, and word error rate reduction of 8% (relative) on a highly competitive setup.
- Anoop Deoras, Search and Decoding Strategies for Complex Lexical Modeling in LVCSR, Ph.D. Thesis, Johns Hopkins University, June 2011
The language model (LM) in most state-of-the-art large vocabulary continuous speech recognition (LVCSR) systems is still the n-gram. A major reason for using such simple LMs, besides the ease of estimating them from text, is computational complexity. It is also true, however, that long-span LMs, be they due to a higher n-gram order, or because they take syntactic, semantic, discourse and other long-distance dependencies into account, are much more accurate than low-order n-grams. The standard practice is to carry out a first pass of decoding using, say, a 3-gram LM to generate a lattice, and to rescore only the hypotheses in the lattice with a higher order LM. But even the search space defined by a lattice is intractable for many long-span LMs. In such cases, only the N-best full-utterance hypotheses from the lattice are extracted for evaluation. However, the N-best lists so produced tend to be “biased” towards the model producing them, making the re-scoring sub-optimal, especially if the re-scoring model is complementary to the initial n-gram model. For this reason, we seek ways to incorporate information from long-span LMs by searching in a more unbiased search space. In this thesis, we first present strategies to combine many complex long- and short-span language models to form a much superior unified model of language. We then show how this unified model of language can be incorporated for re-scoring dense word graphs, using a novel search technique, thus alleviating the necessity of sub-optimal N-best list rescoring. We also present an approach based on the idea of variational inference, by virtue of which long-span models are efficiently approximated by tractable but faithful models, allowing for the incorporation of long-distance information directly into the first-pass decoding.
We have validated the methods proposed in this thesis on many standard and competitive speech recognition tasks, sometimes outperforming state-of-the-art results. We hope that these methods will be useful for research with long-span language models not only in speech recognition but also in other areas of natural language processing, such as machine translation, where decoding is likewise typically limited to n-gram language models.
- Anoop Deoras, Tomas Mikolov, Stefan Kombrink, Martin Karafiat, and Sanjeev Khudanpur, Variational Approximation of Long-Span Language Models for LVCSR, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2011
Long-span language models that capture syntax and semantics are seldom used in the first pass of large vocabulary continuous speech recognition systems due to the prohibitive search space of sentence hypotheses. Instead, an N-best list of hypotheses is created using tractable n-gram models, and rescored using the long-span models. It is shown in this paper that computationally tractable variational approximations of the long-span models are a better choice than standard n-gram models for first pass decoding. They not only result in a better first pass output, but also produce a lattice with a lower oracle word error rate, and rescoring the N-best list from such lattices with the long-span models requires a smaller N to attain the same accuracy. Empirical results on the WSJ, MIT Lectures, NIST 2007 Meeting Recognition and NIST 2001 Conversational Telephone Recognition data sets are presented to support these claims.
- Anoop Deoras, Denis Filimonov, Mary Harper, and Fred Jelinek, Model Combination for Speech Recognition Using Empirical Bayes Risk Minimization, IEEE Spoken Language Technology Workshop, December 2010
In this paper, we explore the model combination problem for rescoring Automatic Speech Recognition (ASR) hypotheses. We use minimum Empirical Bayes Risk as the optimization criterion and Deterministic Annealing techniques to search through the non-convex parameter space. Our experiments on the DARPA WSJ task using several different language models showed that our approach consistently outperforms the standard methods of model combination that optimize using 1-best hypothesis error.
- Jurgen Fritsch, Anoop Deoras, and Detlef Koll, Decoding-Time Prediction of Non-Verbalized Tokens, March 2010
- Anoop Deoras, Fred Jelinek, and Yi Su, Language Model Adaptation Using Random Forests, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2010
In this paper we investigate random forest based language model adaptation. Large amounts of out-of-domain data are used to grow the decision trees while very small amounts of in-domain data are used to prune them back, so that the structure of the trees is suitable for the desired domain while the probabilities in the tree nodes are reliably estimated. Extensive experiments are carried out and results are reported on a particular task of adapting a Broadcast News language model to the MIT computer science lecture domain. We show 0.80% and 0.60% absolute WER improvement over language model interpolation and count merging techniques, respectively.
- Anoop Deoras and Fred Jelinek, Iterative Decoding: A Novel Re-scoring Framework for Confusion Networks, IEEE Workshop on Automatic Speech Recognition and Understanding, December 2009
Recently there has been a lot of interest in confusion network re-scoring using sophisticated and complex knowledge sources. Traditionally, re-scoring has been carried out by the N-best list method or by the lattice or confusion network dynamic programming method. Although the dynamic programming method is optimal, it allows for the incorporation of only Markov knowledge sources. N-best lists, on the other hand, can incorporate sentence level knowledge sources, but with increasing N, the re-scoring becomes computationally very intensive. In this paper, we present an elegant framework for confusion network re-scoring called 'Iterative Decoding'. In it, integration of multiple and complex knowledge sources is not only easier but it also allows for much faster re-scoring as compared to the N-best list method. Experiments with language model re-scoring show that for comparable performance (in terms of word error rate (WER)) of Iterative Decoding and N-best list re-scoring, the search effort required by our method is 22 times less than that of the N-best list method.
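The core loop of the iterative decoding idea described above can be sketched as coordinate-wise hill climbing over a confusion network: hold every position of the current hypothesis fixed except one, score each alternative at that position with a sentence-level model, keep the best, and sweep until nothing changes. The scoring function below is a toy stand-in for a real sentence-level knowledge source.

```python
# Hill-climbing sketch over a confusion network. `bins` is a list of
# candidate-word lists (one per network position); `score` is any
# sentence-level scoring function (toy here; an LM in the paper's setting).
def iterative_decode(bins, score):
    hyp = [b[0] for b in bins]  # start from an arbitrary initial hypothesis
    changed = True
    while changed:
        changed = False
        for i, b in enumerate(bins):
            # Re-score the full sentence for every alternative at position i.
            best = max(b, key=lambda w: score(hyp[:i] + [w] + hyp[i + 1:]))
            if best != hyp[i]:
                hyp[i] = best
                changed = True
    return hyp
```

Because each update can only increase the sentence score and the search space is finite, the sweep terminates at a local optimum; unlike N-best re-scoring, every evaluation uses a complete sentence, so arbitrary sentence-level knowledge sources can be plugged in.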
- Anoop Deoras and Jurgen Fritsch, Decoding-Time Prediction of Non-Verbalized Punctuation, in ISCA Interspeech, ISCA, September 2008
This paper presents novel methods that integrate lexical prediction of non-verbalized punctuation with Viterbi decoding for Large Vocabulary Conversational Speech Recognition (LVCSR) in a single pass. We describe two different approaches: one based on a modified finite state machine representation of language models, and one based on an extension of an LVCSR decoder. We discuss advantages over traditional punctuation prediction approaches based on post-processing of recognition hypotheses, including experimental evaluation of the proposed approach using a state-of-the-art LVCSR decoder. Experiments were performed on a medical documentation corpus, and results demonstrate that the proposed methods yield improved punctuation prediction accuracy while at the same time reducing system complexity and memory requirements.
- Sachin Ghanekar and Anoop Deoras, Method and System for Automatic Gain Control of a Speech Signal, September 2007