|
|||||
Awards
AT&T Fellow, 2001 Education
Received Ph.D., MS, and SB degrees in 1983, 1980, and 1978, all
from MIT in Computer Science. Undergraduate thesis, supervised by
Richard Greenblatt, on chess programming. Thanks to considerable help and
encouragement from Hans Berliner, it was published in IJCAI-79 as Co-ordinate
Squares: A Solution to Many Chess Pawn Endgames. Masters thesis, supervised by Peter Szolovits,
argued that a finite state machine (FSM) was sufficient, and that one did not
need much more complex mechanisms (ATNs or Turing Machines) that were in
vogue at the time. The thesis was
originally intended to help Szolovits' Expert System MEDG (Medical Decision
Making Group) generate better explanations. Became convinced that the MEDG
explanations were difficult to understand because they were full of
center-embedding, a kind of recursion that requires unbounded memory to
process. This argument was based
on Chomsky's 1959 proof that center-embedded characterizes the difference
between push down automata and finite state machines (FSMs). The thesis ultimately evolved into a
demonstration that language could be processed on a FSM by implementing a
parser based on Mitch Marcus' thesis and Kaplan-Bresnan's Lexical-Functional
Grammar, and showing that the stack did not need more than about three cells
as long as one used certain compiler optimizations such as tail recursion to
conserve memory. Doctoral thesis, supervised by Jon
Allen, argued that well-known parsing methods could be used help take
advantage of allophonic variation in speech. The phoneme /t/, for example, is
typically realized with a heavily aspirated burst at the beginning of a syllable
as in Tom, but not at the end of a syllable as in atlas. At the
time, these sorts of variations were seen as a kind of noise that a speech
recognition machine would have to overcome. I preferred to view the speech system
as consisting of two types of clues, those that varied systematically with
the suprasegmental (syllable structure and stress) context such as aspiration
and those that tended to be relatively invariant to contextual influences
such as place, manner and voicing.
Both types of clues are useful.
Variant clues can be used in a parser to discover the underlying
context, and therefore should not be considered noise. An abbreviated version of the thesis was
published in Cognition in 1987; the complete thesis was published by
Kluwer the following year. Work
Experience
1983-1990 Joined AT&T Bell Laboratories
research in 1983 to work with Mitch Marcus and Mark Liberman in Osamu
Fujimura's Linguistic Research and Artificial Intelligence department. Worked on a number of problems in speech
and language, especially part of speech tagging, pronunciation of surnames
for speech synthesis and text analysis.
Published a statistical trigram part of speech tagger in 1988. The approach has since become
standard, but at the time it was believed that much more complex mechanisms
such as ATNs or Turing Machines were required. (There was also a strong bias
at the time against empirical/statistical approaches to language.) My interest in text analysis grew out of
a collaboration with the lexicographer Patrick
Hanks. I ran into a number of
lexicographers while working on letter-to-sound rules for speech
synthesis. At the time,
letter-to-sound rules were being used to guess when the letter
“t” should be pronounced with the phoneme /t/ and when it
shouldn't. Word accuracy rates
were unacceptably low. After
acquiring the rights to a dictionary, word accuracy quickly climbed from
about 80% to 99.9% or better. The
dictionary-based approach has since become standard, but at the time there
was considerable controversy over the “larger” memory requirement
(0.25 megabytes). Ever since I
have been acquiring data just about as fast as I have been able to acquire disks. These days, we buy disks by the
terabyte. After this successful interaction with
lexicography, Patrick Hanks visited Bell Laboratories for a month in early
1989. He showed me how
lexicographers were beginning to use large corpora to find collocations,
words that are often found together like doctor/nurse, or save...from. We found ways to automate some of the
drudgery by using well-understood statistical measures like mutual
information. Began working with
William Gale, a statistician at Bell Laboratories, on estimating the
probability of rare events in large corpora, comparing the Good-Turing method
with a held-out estimator. Over
the next five years, ending with Gale's retirement, we published a series of
papers together on spelling correction, word-sense disambiguation, aligning
parallel corpora, and alternatives to the Poisson for modeling the
distribution of words in text. 1990-1991 Visited Ed Hovy
at USC/ISI for an academic year, with support from ARPA for work on machine
translation and parallel corpora (texts such as the Canadian Parliamentary
debates that are published in two languages). Spent the time reading in many areas,
especially in machine translation (MT) and information retrieval (IR). Started attending meetings in both
areas. Became very active in
corpus lexicography, publishing a number of papers with Hanks, Gale, Yarowsky and others. Also used the time to help members of
the community get started in corpus-based research by serving on data
collection committees such as the ACL/DCI (Association for Computational Data
Collection Initiative), writing review articles (for Computational
Linguistics, Machine Translation, and CACM) and tutorials (Coling-90, 1992
European Summer School, ACL-95).
Taught a graduate seminar course at USC with Ed Hovy in the spring
term. Started reviewing large
numbers of papers for conferences, journals and funding agencies, perhaps as
many as a hundred papers in some years. 1991-1994 Returned to Bell Laboratories in fall
1991 and switched into a software engineering organization, headed by Peter
Weinberger. Continued my interest in large corpora, by showing how a large
multi-million line program could be thought of as a large corpora, and how
corpus-based techniques could be used in the discovery process, the process
of reading a large program for the first time. Wrote a program with Jon Helfman
called dotplot that helps find
self-similarity (copied regions) in text (large corpora, source code, object
code) using methods borrowed from biology for matching long sequences of DNA. Developed the Termworks platform with AT&T Language Line, a commercial
translation service. The platform
was designed to help translators with technical terminology and translation
reuse. Translators don't need
help with easy vocabulary and easy grammar (the focus of most machine
translation research) but they do need help with technical terms like dialog
box. The solutions to these
terminology puzzles can be surprising, even to a native speaker of the
language. Microsoft uses finestra(window)
in their Italian manuals, but boite(box) in their French manuals. The tools help translators conform to
house standards by extracting examples of the term and its translations from
a previous release of the manual that has already been published in both
languages. A statistical
alignment program is used to figure out which parts of the source text
correspond to which parts of the translation. The tool and its implications for
machine translation research were presented at an invited talk at EACL-93.
Pierre Isabelle and I edited a special issue of the journal Machine Translation
(1997 12:1/2) on tool-based alternatives to machine translation. Worked on a digital library project
based on a collection of 500,000 pages of AT&T memos supplied by the
Murray Hill Library in 400 dots-per-inch bitmap format (20 gigabytes). Demonstrated how off-the-shelf optical
character recognition (OCR) and information retrieval (IR) programs could be
used to implement a ``fax-a-query'' information retrieval service where both
the input and output could be faxes or bitmaps or any format that could be
converted to a fax or bitmap. Argued at Coling-94 that fax had advantages
over proposed ascii-based standards such as SGML (availability, ease-of-use,
longevity). Began work on
bitmap-level matching, avoiding the need for OCR. Found that bitmap-level matching could
also be used to improve up-sampling and down-sampling of documents, which is
important if the scanner, the fax transmission line and the printer use
different sampling rates. 1995-1996 Taught a seminar course on corpus-based
methods at Penn with Mitch Marcus.
Served on thesis committees for Pascale Fung and Vasileios
Hatzivassiloglou, both at 1996- Promoted to head a new department in
research. AT&T has vast
quantities of data, some is textual, but most is not. The department uses visualization and
datamining, methods that have become popular in corpus-based natural language
processing, to find patterns of interest. Imagine that we have a large steady
stream of usage records indicating which customers are buying which services
and when. This kind of data feed
is usually developed for billing purposes, but many companies are discovering
a broad range of other applications such as fraud detection, inventory
control, provisioning scarce resources, forecasting demand, etc. The problems are challenging from a
research point of view because of the massive scale. The telephone network, for example,
has about 100 million customers (nodes), who purchase about 100 million calls
(edges) per day. This data stream
is much larger than the very large corpora that I had just a few years ago. We used to think that a billion words
of text was a lot of data, but we are currently receiving a couple billion
records a week. 1998-1999 Developed with Glenn Fowler and Don
Caldwell a program called pzip that is like
gzip, but uses an automatic training procedure to induce an optimal
schema for compressing tabular data such as database records and spread
sheets. Sometimes it is better to
compress a table in row major order and sometimes it is better to compress a
table in column major order. Pzipuses
a schema to numerate the values in the table and then passes the results to
gzip. Another program, pin
(partition inducer), uses dynamic programming to find the optimal partition
(or schema) by applying gzip to all possible partitions of a sample of
the table. Pzip typically
saves a factor of two in both space and time over gzip, which has been
very much appreciated by several large data warehousing applications. A paper was published in SODA-2000 pdf. Won a “Circle of
Excellence” award in 1998 from the AT&T CIO (Chief Information
Office) for this work. Established excellent relationships with
AT&T marketing, especially the folks behind the new Lucky Dog flanker brand. My team maintains a web site that
tracks sales in “real time” (faster than previously
possible). This web site is
changing the culture in marketing.
Now that they know they can get quick feedback, they are much more
willing to experiment. Try a
little bit of this and a little bit of that, and do more of what works and
less of what doesn't. The web
site has moved SWIFT from a research prototype into a highly visible business
reality. SWIFT was a very nice
demo that visualized traffic on the AT&T network on a map projected on a
powerwall, a huge wall-sized terminal.
The point of the demo is that we ought to be able to react to a change
in the marketplace in less than a day.
To make the idea a business reality, we simplified the interface so
that marketing managers could drive it, and replaced the fancy powerwall
hardware with a standard web browser.
In addition, we worked closely with marketing managers to understand
what matters to them, a task that reminded me of my days in Peter Szolovits'
expert systems group at MIT. 1999-2003 Built upon our success with Lucky Dog to
establish relationships with other marketing groups, especially the launch of
AT&T local service in Despite my new interests in datamining, visualization, management and business, I remain active in computational linguistics, my first love. With so many responsibilities, it has been more difficult to publish, but nevertheless, together with Yamamoto, we showed how to compute term frequencies and document frequencies, two statistics used in Information Retrieval, for all substrings in a large corpus. It is well known how to compute these statistics for short substrings, e.g., unigrams, bigrams and trigrams, but we showed how to compute many of these kinds of statistics for all substrings including very long ones, e.g., million-grams. One might have thought that this would be impractical since there are n2 substrings, but it turns out that the n2 substrings can be grouped into n classes using a data structure called suffix arrays, and many of the statistics of interest can be computed over the n classes rather than the n2 substrings. Since we were able to compute these statistics over a larger set of words and phrases than previously possible, we found an interesting new heuristic for distinguishing good keywords from general vocabulary, which seems promising for both Japanese and English. Both mutual information and a new measure called residual IDF (inverse document frequency) compare an observation to chance, but they use different notions of chance. Mutual information looks for deviations from compositionality. It grew out of work with lexicographers (Hanks) and so it isn't surprising that it is better for finding general vocabulary, the kind of words and phrases that one would expect to find in a dictionary. On the other hand, residual IDF looks for words and phrases that are not randomly distributed across documents. It grew out of the information retrieval literature and so it isn't surprising that it is better for finding keywords, the words and phrases that distinguish one document from another.
2003- Joined Community Service: Served on editorial board of Computational Linguistics (CL), Machine Translation
(MT), Quantitative
Linguistics (QL), International
Journal of Corpus Linguistics (IJCL). Served on program committees for
numerous conferences including SIGIR
and ACL, both more than once. Helped to create SIGDAT, a special
interest group focusing on corpus data, and organized, chaired, or co-chaired
many of its annual workshops on very large corpora, which have evolved into
conferences with 100-200 participants per meeting, about equally distributed
over Europe, Asiaand
|
|||||