NLPwin

Established: October 3, 2014

An introduction by Lucy Vanderwende*

* on behalf of everyone who contributed to the development of NLPwin

NLPwin is a software project at Microsoft Research that aims to provide Natural Language Processing tools for Windows (hence, NLPwin). The project was started in 1991, just as Microsoft inaugurated the Microsoft Research group; while active development of NLPwin continued through 2002, it is still being updated regularly, primarily in service of Machine Translation.

NLPwin was and is still being used in a number of Microsoft products, among which the Index Server (1992-3), Word Grammar Checker (parsing every sentence to logical form since 1996), the English Query feature for SQL Server (SQL Server 1998 – 2000), natural language query interface for Encarta (1999, 2000), Intellishrink (2000), and of course, Bing Translator (opens in new tab).

Since we knew that we were developing NLPwin in part to support a grammar checker, the NLPwin grammar is designed to be broad-coverage (i.e., not domain-specific) and robust, in particular, robust to grammar errors. While most grammars are learned from data annotated on the PennTreeBank (opens in new tab), it is interesting to consider that such grammars may not be able to parse ungrammatical or fragmented grammar, since those grammars have no training data for such input. The NLPwin grammar produces a parse for any input and if no spanning parse can be assigned, it creates a “fitted” parse, combining the largest constituents that it was able to construct.

The NLP rainbow: we envisioned that with ever more sophisticated analysis capabilities, it would be possible to create applications of a wide variety. As you can see below, the generation component was not well developed and we postulated NL applications for generation much as one hopes for a pot of gold at the end of the rainbow. Our first MT models transferred at the semantic level (papers through 2002), while today, our MT transfers primarily at the syntactic level, using a mixture of syntax-based and phrase-based models.

nlpwin_nlp_rainbow

Figure 1: The NLP rainbow (1991), our original vision for NLP components needed and applications possible.

The architecture follows a pipeline approach, as shown in Figure 2, where each component provides additional layers of analysis/annotation of the input data. We designed the system to be relatively knowledge-poor in the beginning, while making use of richer and richer data sources as the need for more semantic information increased; one of our goals of this architecture is to preserve ambiguity until we either needed to resolve that ambiguity or the data resources existed to allow the resolution. Thus, the syntactic analysis proceeds in two steps: the syntactic sketch (which today might be described as a packed forest) and the syntactic portrait, where we “unpack” the forest and construct a constituent level of analysis which is syntactic, but also semantically valid. The constituency tree continues to be refined even during Logical Form processing as more global information can be brought to bear.

nlpwin_nlp_components

Figure 2: The NLPwin components and a schematic of their output representation.

A few points are worth making about the parser (a term which loosely combines the morphology, sketch and portrait modules). First, the parser is comprised of human authored rules. This will cause incredulity among those who are only familiar with machine-learned parsers that have been trained on the PennTreeBank (opens in new tab). It should be kept in mind that the NLPwin parser was constructed before the first parser was trained on the PennTreeBank, that the parser had to be fast (to support the grammar checker) and that grammar rule-writing was the norm pre-PennTreeBank grammars. Furthermore, the grammarian tasked with writing rules was supported by a sophisticated array of NLP developer tools (opens in new tab) (created by George Heidorn), much as a programmer is now supported in Visual Studio, where grammar rules can be run to and from specific points in the code, variables can be changed interactively for exploration purposes, and most importantly, the developer environment supported running a suite of test files with interfaces for the grammarian to update the target files with improved parses. Secondly, the lead grammarian, Karen Jensen, broke with the implicit tradition where the constituent structure is implied by application of the parsing rules[1]. Jensen observed that binary rules are required to handle even common language phenomena such as free word order, and adverbial and prepositional phrase placement. Thus, in NLPwin, we use binary rules in an augmented phrase structure grammar formalism (APSG), computing the phrase structure as part of the actions of the rules, thereby creating nodes with unbounded modifiers, while maintaining binary rules, illustrated in Figure 3.

nlpwin_nlp_derivation_tree

Figure 3: The derivation tree displays the history of rule application, while the computed tree provides a useful visualization of phrase structure.

Another important aspect of NLPwin is that it is the record structure, not the trees, that is the fundamental output of the analysis component (shown in Figure 4). Trees are merely a convenient form of display, using only 5 of the many attributes that make up the representation of the analysis (premodifiers (PRMODS), HEAD, postmodifiers (PSMODS), segment-type (SEGTYPE), and string value. Here is the record, a collection of attributes and values, for the node DECL1:

nlpwin_nlp_records

Figure 4: The record structure of any constituent is the heart of the NLPwin analysis.

Once the basic shape of the constituency tree has been determined, it is possible to compute what the Logical Form is. The goal of Logical Form is twofold: to compute the predicate-argument structure for each clause (“who did what to whom when where and how?”) and to normalize differing syntactic realizations of what can be considered the same “meaning”. In so doing, concepts that are possibly distant in the sentence and in the constituent structure can be brought together, in large part because the Logical Form is represented as a graph, where linear order is no longer primary. The Logical Form is a directed, labeled graph, where arcs are labeled with those relations that are defined to be semantic and surface words that convey syntactic information only are represented not as nodes in the graph but rather as annotations on the nodes, preserving their syntactic information (not shown in the graph representation below). Consider the following Logical Form:

nlpwin_nlp_logical_form

Figure 5: A Logical Form example.

The Logical Form graph in Figure 5 represents the direct connection between “elephants” and “have”, which is interrupted by a relative clause at the surface syntax. Moreover, in analyzing the relative clause, the Logical Form has performed two operations: Logical Form normalizes the passive construction as well as assigns the referent of the relative pronoun “which”. Other operations commonly performed by Logical Form include (but are not limited to): unbounded dependencies, functional control, indirect object paraphrase, assigning modifiers.

Figure 5 also demonstrates some of the shortcomings of Logical Form: 1) should “have” be a concept node in this graph or should it be interpreted as an arc labeled Part between “elephant” and “tusk”? More generally: what should the inventory of relation labels be, and how should that inventory be determined? And 2) should we infer from this sentence only that “African elephants have been hunted” and that “African elephants have large tusks”, or can we infer that “elephants have been hunted” and that they happen to be “African elephants”. Deciding this question of scoping was postponed till discourse processing[2], when such questions may be addressed, and Logical Form does not represent the ambiguity in scoping.

During development of the NLPwin pipeline (see Figure 2), we considered that there would be a separate component determining word senses following the syntactic analysis of the input. This component was meant to select and/or collate lexical information from multiple dictionaries to represent and expand the lexical meaning of each content word.  This view on Word Sense Disambiguation (WSD) was in contrast to the then-nascent interest in WSD in the academic community, which formulated the WSD task as selecting one sense of a fixed inventory of word senses as being correct. Our primary objection to this formulation is that any fixed inventory will necessarily not be sufficient as the foundation for a broad-coverage grammar (see Dolan, Vanderwende and Richardson, 2000 (opens in new tab)). For similar reasons, we elected to abandon the pursuit of assigning Word Senses in NLPwin as well. Today, the field has made great strides in exploring a more flexible notion of lexical meaning with the advent of vector space, which would be promising to combine with the output of this parser.

While we did not view Word Sense Disambiguation as a separate task, we did design our parser and subsequent components to make use of ever richer lexical information. The sketch grammar relies on the subcategorization frames and other syntactic-semantic codes available from two dictionaries: Longman Dictionary of Contemporary English (LDOCE) and American Heritage Dictionary, 3rd edition, for which Microsoft acquired the digital rights. LDOCE in particular provides rich lexical information that facilitates the construction of Logical Form[3]. Such codes, rich as they are, do not support full semantic processing as is necessary when, for example, determining the correct attachment of prepositional phrases or nominal co-reference. The question was: is it possible to acquire such semantic knowledge automatically, in order to support a broad-coverage parser?

In the early to mid-90s, there was considerable interest in mining dictionaries and other reference works for semantic information broadly-speaking. For this reason, we envisioned that where lexical information was not sufficient to support the decisions that needed to be made in the Portrait component, we would acquire such information in machine readable reference works.

At the time, few broad-coverage parsers were available so the main thrust was to develop string patterns (regexes) that could be used to identify specific types of semantic information; Hearst (1992) describes the use of such patterns for the acquisition of Hypernymy (is-a terms). Alshawi (1989) parses dictionary definitions using a grammar especially designed for that dictionary (“Longmanese”).  We encountered two concerns about using this approach: first, as the need for greater recall increases, writing and refining string patterns becomes more and more complex, in the limit, approaching the complexity of full-grammar writing and so straying far from the straightforward string patterns you started with, and second, when extracting semantic relations beyond Hypernymy, we found string patterns to be insufficient (see Montemagni and Vanderwende 1992 (opens in new tab)).

Instead, we proposed to parse the dictionary text using the linguistic components already developed, Sketch, Portrait and Logical Form, ensuring access to robust parsing, in order to bootstrap the knowledge acquisition of the semantic information needed to improve the Portrait. This bootstrapping is possible because some linguistic expressions are unambiguous, and so, at each iteration, we can extract from unambiguous text to improve the parsing of ambiguous text (see Vanderwende 1995 (opens in new tab)).

As each definition in the dictionary and on-line encyclopedia was being processed and the semantic information was being stored for access by Portrait, a picture emerged from connecting all of the graph fragments. When viewed as a database rather than a look-up table (which is how people use dictionaries), the graph fragments are connected and interesting paths/inferences emerge. To enrich the data further, we then took the step of viewing each graph fragment from the perspective of each content node. Imagine looking at the graph as a mobile and picking it up at each of the objects in turn – the nodes under the object remain the same, but the nodes above that object become inverted (illustrated in Figure 6). For example, for the definition of elephant: :an animal with ivory tusks”, MindNet stores not only the graph fragment “elephant PART (tusk MATR ivory)” but also “tusk PART-OF elephant” and “ivory MATR-OF tusk”[4].

nlpwin_nlp_inverted_lf

Figure 6: Logical Form and its inversions.

We called this collection of intersecting graphs MindNet. Figure 7 reflects the picture we saw for the word “bird” when looking at all of the pieces of information that were automatically gleaned from dictionary text:

nlpwin_nlp_mindnet

Figure 7: A fragment NLPwin MindNet, centered on the word “bird”

As a person using only the dictionary, it would be very difficult to construct a list of all the different types of birds, all of the parts of a bird, all of the places that a bird may be found, or the types of actions that a bird may do. But by converting the dictionary to a database, and inverting all the semantic relations as shown in Figure 6, MindNet contains rich semantic information for any concept that occurs in text, esp. because it is produced by automated methods using a broad-coverage grammar, a grammar that parses fragments as well as it parses complete grammatical input.

We computed a similarity metric for MindNet (opens in new tab) by using Roget’s Thesaurus as annotated training data. Given a pair of words from Roget, we computed all paths in MindNet between those synonyms and then observed how often path patterns occur (patterns of relation types with and without the specific concepts linking those relations types). Thus, we learn that if X and Y are connected using the path pattern: (X – Hypernym – z – HypernymOf Y) or (X – HasObject – z – ObjectOf – Y), that X and Y are deemed to be similar with high weight.  We can then query arbitrary word pairs for their similarity, finding that “gold” and “zinc” are similar, while “gold” and “bicycle” are not.

A priori, there is no reason that MindNet cannot be produced from text other than dictionary or encyclopedia text. Indeed, if MindNet was being developed today, we would aim to automatically acquire semantic information from the web. The notable engineering concern is processing time, though the availability of massively parallel web services mitigates that concern in large part. The other notable concern is to establish the veracity of the source material (part of IBM Watson’s success in the Jeopardy game can be attributed to careful selection of its information sources (opens in new tab)). Even in the case where sources are equally trustworthy, what should happen to (apparent) contradictions? Weights computed for specific pieces of the knowledge graph can be used to balance how frequently that information is encountered, but the source itself should also be considered in the weight scheme. Moreover, MindNet is not simply a database of triples; we preserve the context from which the semantic relations were extracted, and so in theory, we could resolve apparent contradictions by taking context into account. We did not encounter these concerns as MindNet has only been computed from sources that are categorically true (dictionaries and encyclopedias), but these concerns should be addressed going forward with knowledge acquisition from the web.

The original intent, as shown in Figure 2, was to reduce paraphrases to a canonical representation in a module that we tentatively named “Concepts”, though “Concept Detection” would have been more descriptive. As with Word Sense Disambiguation, we abandoned this module as we were dissatisfied with the underlying assumption that one representation of a concept or complex event would be primary over others, while in reality, both expressions are equivalent; equivalence should be fluid and allow to vary depending on the need of the application. Here again, we believe that the current research which aims to represent parse fragments in vector space is a promising approach, while emphasizing that it is essential to take the parse and logical form structure into account.

Finally, a few words about the generation grammar (shown on the right hand side of the rainbow in Figure 1). In NLPwin, we developed two types of generation grammars: rule-based generation components (including those that shipped with Microsoft Word to enable the re-write of passive to active, e.g.) and Amalgam, a set of machine-learned generation modules. Both types of generation grammars were used in production for Machine Translation.

In Summary …

We’ve described some of the aspects of the NLPwin project at Microsoft Research[5]. The lexical and syntactic processing components are designed to be broad-coverage and robust to grammatical errors, allowing for parses to be constructed for fragmented, ungrammatical as well as grammatical inputs. These components are largely rule-based grammars, making use of rich lexical and semantic resources derived from online dictionaries. The output of the parsing component, a tree analysis, is converted to a graph-based representation called Logical Form. The goal of Logical Form is to compute the predicate-argument structure for each clause and to normalize differing syntactic realizations of what can be considered the same “meaning”. In so doing, the distance between concepts reflects the semantic distance and no longer the linear distance in the surface realization, bringing related concepts closer together than they might appear at the surface. MindNet is the automatic construction of the database of connected Logical Forms. When reference resources are the source text for MindNet, MindNet can be viewed as a traditional Knowledge Acquisition method and object, but when MindNet is constructed by processing arbitrary text input, MindNet represents a global representation of all the Logical Forms of that text which allows the browsing of the concepts and their semantic connections in that text. In fact, MindNet was considered most compelling as a means for browsing and exploring specific relations mined from a text collection.

[1] see Jensen, Karen. 1987. Binary rules and non-binary trees: Breaking down the concept of phrase structure. In Mathematics of language, ed. A. Manaster-Ramer, 65-86. Amsterdam: John Benjamins Pub.Co.

[2] In fact, the NLPwin system has not (yet) addressed this issue till today.

[3] The LDOCE box codes, for instance, provide information on type restrictions and the arguments for verbs. In LDOCE, “persuade” is marked ObjC, indicating, that “persuade” has Object Control (i.e. that the object of “persuade” is understood to be the subject of the verb complement). Thus, it is possible to construct a Logical Form with “John” as the subject of “go to the library” from the input sentence: “I persuaded John to go to the library”, while for the input sentence “I promised John to go to the library”, the Logical Form is constructed with “I” as the subject of “go to the library”.

[4] The algorithm of course also identifies the relation “elephant HYPERNYM animal”, but, in dictionary processing, the information extracted from the differentiae of the definition (the specifications on the hypernym), are true of the word being defined rather than true of the hypernym, and so we do not extract that “animals have tusks” but rather that “elephants have tusks”.

[5] At the time of this writing (2014) NLPwin is considered a mature system, with only limited development of the generation and logical form components.