
Topic:
Developing Scalable, Domain Portable Information Extraction Systems
Abstract:
Information extraction (IE)
systems are the backbone of data mining solutions involving unstructured
textual data. Information extraction
technology has matured to the point that it is being used in both commercial
and military/intelligence applications. To be effective, in terms of both
accuracy and the types of information extracted, such systems must be
robust, scalable, and domain portable.
This talk begins with a discussion of robust, scalable architectures for
IE systems, including hybrid architectures that combine machine learning
and grammar-based approaches. This is followed
by a discussion of domain porting techniques for IE, including manual,
semi-automated and automated approaches with applications to verticals such as
biomedical, business, manufacturing, and intelligence. Handling challenging text modalities such as
chat and e-mail will also be discussed.
The talk concludes with some future directions for the field.
Recommended Reading:
· IJCAI-99 Tutorial on Information Extraction (Appelt, Israel), provides good overview: http://www.ai.sri.com/~appelt/ie-tutorial/
· Wikipedia page (contains many useful links): http://en.wikipedia.org/wiki/Information_extraction
· ACE (state-of-the-art evaluation): http://www.itl.nist.gov/iad/894.01/tests/ace/
· GATE: http://gate.ac.uk/ie/
· Semantex: R. K. Srihari, W. Li, C. Niu and T. Cornell, "InfoXtract: A Customizable Intermediate Level Information Extraction Engine," Journal of Natural Language Engineering, Cambridge U. Press, 2006.
Bio of the Speaker:

Dr. Srihari received a B. Math. in Computer Science from the University of Waterloo, Ontario, Canada, and a Ph.D. in Computer Science from the University at Buffalo in 1992 for a dissertation on using collateral text in interpreting photographs. Dr. Srihari's research is in text mining, particularly advanced representations for information retrieval. Her current research is in two areas of text analytics: unapparent information revelation and information extraction. Dr. Srihari has also worked in statistical language modeling, in particular on statistical language models for recognizing handwritten text. She is a member of the Association for Computing Machinery (ACM) and the Association for Computational Linguistics (ACL).
Homepage: http://www.cedar.buffalo.edu/~rohini/
E-mail: rohini@cedar.buffalo.edu
Additional Material (References, Slides & Lecture Notes):