Automatic learning of speech recognition grammars from example sentences to ease the development of spoken language systems.
Researcher Ye-Yi Wang wants to have more time for vacation, so he is teaching his computer to do some work for him.
Wang has been working on Spoken Language Understanding for the MiPad project since he joined Microsoft Research. He has developed a robust parser and the understanding grammars for several projects. "Grammar development is painful and error-prone. It is time-consuming, tedious and it requires expertise in computational linguistics. Occupied with the work to speech-enable applications, I've never had enough time to use up my three-week vacation these years," says Wang.
According to Wang, many state-of-the-art conversational systems use semantic-based robust understanding. In this approach, computers "understand" speech by using a robust parser to normalize the output of a speech recognizer into a canonical representation. The parser does this with a handcrafted semantic grammar. The robust parser can be written once and reused across different tasks, but a new semantic grammar must be developed for every application domain, and that is where the difficulty lies. Because of this, speech-enabled applications are mostly developed in large human language technology labs as prototype research systems.
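As a rough illustration of what "normalizing into a canonical representation" means, the sketch below maps differently worded requests to the same slot/value frame while skipping words it does not know. It is only a toy: the calendar domain, the slot names, and the use of regular expressions in place of a real semantic grammar and robust parser are all assumptions made for this example, not a description of the MiPad system.

```python
import re

# Hypothetical hand-written "semantic grammar" for a toy calendar domain.
# Each slot is described by one regular expression (an assumption for brevity;
# a real system would use a context-free semantic grammar and a robust parser).
GRAMMAR = {
    "person": r"\bwith (?P<person>[a-z]+)\b",
    "day": r"\b(?P<day>monday|tuesday|wednesday|thursday|friday)\b",
    "time": r"\bat (?P<time>\d{1,2}(?::\d{2})? ?(?:am|pm))\b",
}


def understand(utterance: str) -> dict:
    """Normalize recognizer output into a flat slot/value frame.

    Words that match no pattern (fillers, politeness markers) are ignored,
    which is the "robust" part of this toy.
    """
    text = utterance.lower()
    frame = {}
    for slot, pattern in GRAMMAR.items():
        match = re.search(pattern, text)
        if match:
            frame[slot] = match.group(slot)
    return frame


first = understand("uh schedule a meeting with alex on friday at 2 pm please")
second = understand("friday at 2 pm with alex")
print(first == second)  # both wordings yield the same frame
```

Even in this toy, every pattern in `GRAMMAR` is specific to the calendar domain, which is the per-domain authoring cost the paragraph above describes.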
"Microsoft is a platform company. It is extremely important to provide developers with easy-to-use tools for our platforms, so that speech-enabled applications and web services can become mainstream," says Alex Acero, Wang's manager, who is also involved in the project.
They focus on developing technologies for smart tools that allow an average developer to speech-enable applications or web services. This differs from the work in automatic grammar inference, which tries to learn grammars automatically from a corpus of training sentences. Most research in grammar inference has focused on toy problems, and applying such approaches to learning grammar structure for natural language understanding has not produced satisfactory results. According to Wang, the limited success is due to the complexity of the problem and the typical sparseness of the training data relative to the complexity of the target grammar. There is no good generalization mechanism that correctly covers the large variety of language constructions unseen in the training data. "Instead of ambiguous automatic grammar inference, we adopt a very practical approach by integrating multiple sources of easy-to-get information," says Wang.
Several general technologies are currently being pursued to take advantage of these information sources, including:
- Automatic generation of template grammar from semantic schema: The semantic schema defines the entity relations of a specific domain. It serves as the specification for a language-enabled application. Their technology can automatically generate a Context Free Semantic Grammar template that inherits the semantic information specified in a semantic schema.
- Learning from semantic annotation: With the involvement of grammar developers and help from the robust parser, a small number of training sentences can be easily annotated with their canonical representations. From the annotations, Fast Learner can learn the language expressions for the components in the automatically generated semantic grammar template.
- Syntactic Constraints: Domain-specific language must comply with the syntactic constraints of a language. Some simple syntactic clues, for example, part-of-speech constraints, can be used to reduce the search space in grammar learning.
- Grammar Library: Some low-level semantic entities, such as date, time, duration, postal address, currency, numbers, percentage, etc., are not domain-specific. They are universal building blocks that can be written once and then shared by many applications.
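The four information sources listed above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' algorithm: the schema layout, the `Cmd`, `.Pre` and `.Post` placeholder names, the `lib:` prefix for library entities, and the "a preamble starts at the last preposition" heuristic (standing in for part-of-speech constraints) are all hypothetical choices made for this example.

```python
# Hypothetical semantic schema: one command whose slots have entity types.
SLOTS = {"attendee": "Person", "start": "Time"}
# Entities assumed to come from a shared, domain-independent grammar library.
LIBRARY = {"Time", "Date"}
# Toy part-of-speech clue: these words are treated as prepositions.
PREPOSITIONS = {"with", "at", "on", "for"}


def generate_template(command, slots, library):
    """Build CFG rules (lhs, rhs) whose wording is still unknown ("?")."""
    rules = [(command, ["Cmd"] + list(slots)), ("Cmd", ["?"])]
    for slot, entity in slots.items():
        filler = f"lib:{entity}" if entity in library else entity
        rules.append((slot, [f"{slot}.Pre", filler, f"{slot}.Post"]))
        rules.append((f"{slot}.Pre", ["?"]))
        rules.append((f"{slot}.Post", ["?"]))
    return rules


def learn_expressions(sentence, annotation):
    """Assign the words around annotated slot values to placeholders.

    `annotation` maps a slot name to the substring of `sentence` filling it.
    Within each gap, words from the last preposition onward become the next
    slot's preamble; earlier words go to the preceding placeholder.
    """
    learned = {}
    cursor, previous = 0, "Cmd"
    ordered = sorted(annotation.items(), key=lambda kv: sentence.index(kv[1]))
    for slot, value in ordered:
        start = sentence.index(value)
        gap = sentence[cursor:start].split()
        preps = [i for i, word in enumerate(gap) if word in PREPOSITIONS]
        cut = preps[-1] if preps else len(gap)
        if gap[:cut]:
            learned.setdefault(previous, []).append(" ".join(gap[:cut]))
        if gap[cut:]:
            learned.setdefault(f"{slot}.Pre", []).append(" ".join(gap[cut:]))
        cursor, previous = start + len(value), f"{slot}.Post"
    tail = sentence[cursor:].split()
    if tail:
        learned.setdefault(previous, []).append(" ".join(tail))
    return learned


def fill_template(rules, learned):
    """Replace "?" rules with learned phrases; unseen ones become empty."""
    filled = []
    for lhs, rhs in rules:
        if rhs == ["?"]:
            for phrase in learned.get(lhs, [""]):
                filled.append((lhs, phrase.split()))
        else:
            filled.append((lhs, rhs))
    return filled


template = generate_template("NewMeeting", SLOTS, LIBRARY)
learned = learn_expressions(
    "schedule a meeting with alex at two pm please",
    {"attendee": "alex", "start": "two pm"},
)
grammar = fill_template(template, learned)
for lhs, rhs in grammar:
    print(lhs, "->", " ".join(rhs) or "(empty)")
```

In this sketch the schema fixes the grammar's structure, so a single annotated sentence only has to supply wording for the placeholders. The preposition clue decides where the command phrase ends and a slot preamble begins, and `Time` is delegated to the library rather than learned.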
In their ASRU 2001 paper, "Grammar Learning for Spoken Language Understanding," they reported some exciting results. On MiPad data, the grammar generated with their technologies already outperformed the manually developed grammar: understanding error rates were consistently reduced by 40% to 60%.
"This is very promising, given the fact that many more powerful technologies have not been applied yet," says Wang. Acero agrees: "We believe that the learning of statistical grammar can further improve the performance, and there are still many things in our agenda to reduce interactions between the toolkit and grammar developers."
Based on these technologies, they have created SGStudio (Semantic Grammar Studio), which enables non-speech experts to develop semantic grammars for speech recognition and understanding.
- Tim Paek, Yun-Cheng Ju, and Christopher Meek, People Watcher: A Game for Eliciting Human-Transcribed Data for Automated Directory Assistance, International Speech Communication Association, 2007
- Ye-Yi Wang and Alex Acero, Rapid development of spoken language understanding grammars, in Speech Communication, vol. 48, no. 3-4, pp. 390-416, Elsevier, 2006
- Ye-Yi Wang and Alex Acero, Discriminative Models for Spoken Language Understanding, in the International Conference on Spoken Language Processing, International Speech Communication Association, Pittsburgh, PA, USA, 2006
- Ye-Yi Wang, John Lee, Milind Mahajan, and Alex Acero, Combining Statistical and Knowledge-Based Spoken Language Understanding in Conditional Models, in COLING/ACL06, Association for Computational Linguistics, Sydney, Australia, 2006
- Ye-Yi Wang, Li Deng, and Alex Acero, Spoken Language Understanding — An Introduction to the Statistical Framework, in IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 16-31, Institute of Electrical and Electronics Engineers, Inc., 2005
- Ye-Yi Wang and Alex Acero, SGStudio: Rapid Semantic Grammar Development for Spoken Language Understanding, in 9th European Conference on Speech Communication and Technology, International Speech Communication Association, Lisbon, Portugal, 2005
- Ye-Yi Wang, John Lee, Milind Mahajan, and Alex Acero, Statistical Spoken Language Understanding: from Generative Model to Conditional Model, in NIPS Workshop: Advances in Structured Learning for Text and Speech Processing, Whistler, BC, Canada, 2005
- Ye-Yi Wang and Yun-Cheng Ju, Creating Speech Recognition Grammars from Regular Expressions for Alphanumeric Concepts, in International Conference on Spoken Language Processing, International Speech Communication Association, Jeju, Korea, 2004
- Ye-Yi Wang and Alex Acero, Concept Acquisition in Example-Based Grammar Authoring, in IEEE International Conference on Acoustics, Speech, and Signal Processing, Institute of Electrical and Electronics Engineers, Inc., Hong Kong, China, 2003
- Ye-Yi Wang and Alex Acero, Combination of CFG and N-gram Modeling in Semantic Grammar Learning, in Eurospeech 2003, International Speech Communication Association, Geneva, Switzerland, 2003
- Ye-Yi Wang, Alex Acero, Ciprian Chelba, Brendan Frey, and Leon Wong, Combination of Statistical and Rule-Based Approaches for Spoken Language Understanding, in International Conference on Spoken Language Processing, International Speech Communication Association, Denver, Colorado, 2002
- Ye-Yi Wang and Alex Acero, Grammar Learning for Spoken Language Understanding, in IEEE Workshop on Automatic Speech Recognition and Understanding, Institute of Electrical and Electronics Engineers, Inc., Madonna di Campiglio, Italy, 2001



