The internet has revolutionized the way we communicate, leading to a constant flood of informal text available in electronic format, including: email, Twitter, SMS and also informal text produced in professional environments such as the clinical text found in electronic medical records. This presents a big opportunity for Natural Language Processing (NLP) and Information Extraction (IE) technology to enable new large scale data-analysis applications by extracting machine-processable information from unstructured text at scale.
In this talk I will discuss several challenges and opportunities which arise when applying NLP and IE to informal text, focusing specifically on Twitter, which has recently rose to prominence, challenging the mainstream news media as the dominant source of real-time information on current events. I will describe several NLP tools we have adapted to handle Twitter’s noisy style, and present a system which leverages these to automatically extract a calendar of popular events occurring in the near future (http://statuscalendar.cs.washington.edu).
I will further discuss fundamental challenges which arise when extracting meaning from such massive open-domain text corpora. Several probabilistic latent variable models will be presented, which are applied to infer the semantics of large numbers of words and phrases and also enable a principled and modular approach to extracting knowledge from large open-domain text corpora.