
Topic:
Information Extraction: Extracting Facts from Unstructured Data
Abstract:
Information Extraction (IE) is the task of extracting facts in relational form from unstructured data. The facts to be extracted are typically pieces of semantic information of the type 'who did what to whom?'. In this talk, the speaker will introduce the typical subtasks addressed by state-of-the-art IE systems: named entity detection, mention detection, coreference resolution, relation extraction and event extraction. The talk will then describe the evolution of standardized evaluations of IE systems from MUC to ACE to GALE, the current top systems and the approaches they take, and the characteristics of some commercial IE offerings. Finally, the talk will discuss where IE systems are headed in the future.
Bio of the Speaker:
Nanda Kambhatla was born in Hyderabad, India in 1969. He received a B.Tech. degree with first class honors in Computer Science and Engineering from the Institute of Technology, Benaras Hindu University, India, in 1990, and a Ph.D. degree in Computer Science and Engineering from the Oregon Graduate Institute of Science & Technology, Oregon, USA, in 1996. After 1996, Nanda worked as a postdoctoral fellow under Prof. Simon Haykin at McMaster University, Canada, and as a senior research scientist at WiseWire Corporation, Pittsburgh. He joined IBM's T. J. Watson Research Center in 1997 and worked on spoken dialog systems for call routing, telephony banking and stock trading, the universal interaction middleware architecture, and a web-based dialog system for helping people shop for ThinkPads online.
Since 2000, Nanda has led a team working on Information Extraction and text mining. Since 2002, the team has achieved top-tier results in successive Automatic Content Extraction (ACE) evaluations conducted by NIST for extracting entities, events and relations from text from multiple sources, in multiple languages and forms. The team has created a toolkit for information extraction that includes a tool for human annotation of documents, a trainer for building machine learning models, and a fast, scalable decoder for extracting information from raw text using the trained models.
Until February 16, 2007, Nanda was the manager of the Statistical Text Analytics Group at Watson and co-chair of the NLP PIC at Watson. He was also a co-PI and the task lead for the Language Exploitation Environment (LEE) subtask of the DARPA GALE project. The goal of the LEE task was to create, leveraging UIMA, an architecture and a framework to enable interoperability of heterogeneous analysis components (e.g. speech recognition and translation engines running on remote machines, each in its own OS/language environment). Nanda transferred to IRL Bangalore on February 19, 2007.
Nanda's research interests focus on technology solutions for creating, storing, searching, and processing large volumes of unstructured data (text, audio, video, etc.), and specifically on applications of statistical learning algorithms to these tasks.
Homepage: http://www.research.ibm.com/cweb/kambhatla.html
E-Mail: nkambhat@in.ibm.com
Additional Material (References, Slides & Lecture Notes):