Topic:

Developing Scalable, Domain Portable Information Extraction Systems

Abstract:

Information extraction (IE) systems are the backbone of data mining solutions involving unstructured textual data. The technology has matured to the point that it is used in commercial as well as military and intelligence applications. To be effective, both in accuracy and in the types of information extracted, such systems must be robust, scalable, and domain portable. This talk begins with a discussion of robust, scalable architectures for IE systems, including hybrid architectures that combine machine learning and grammar-based approaches. This is followed by a discussion of domain porting techniques for IE, including manual, semi-automated, and automated approaches, with applications to verticals such as biomedical, business, manufacturing, and intelligence. Handling challenging text modalities such as chat and e-mail will also be discussed. The talk concludes with some future directions for the field.

Recommended Reading:

·         IJCAI-99 Tutorial on Information Extraction (Appelt, Israel), which provides a good overview: http://www.ai.sri.com/~appelt/ie-tutorial/

 

·         Wikipedia page (contains many useful links): http://en.wikipedia.org/wiki/Information_extraction

 

·         ACE (state-of-the-art evaluation): http://www.itl.nist.gov/iad/894.01/tests/ace/

 

·         GATE:  http://gate.ac.uk/ie/

 

·         Semantex:  R. K. Srihari, W. Li, C. Niu and T. Cornell, "InfoXtract: A Customizable Intermediate Level Information Extraction Engine," Journal of Natural Language Engineering, Cambridge U. Press, 2006.

Bio of the Speaker:

http://www.cedar.buffalo.edu/~rohini/Rohini-Bio-2.jpg

Dr. Srihari received a B. Math. in Computer Science from the University of Waterloo, Ontario, Canada, and a Ph.D. in Computer Science from the University at Buffalo in 1992 for a dissertation on using collateral text in interpreting photographs. Her research is in text mining, particularly advanced representations for information retrieval. Her current work covers two areas of text analytics: unapparent information revelation and information extraction. She has also worked in statistical language modeling, in particular on statistical language models for recognizing handwritten text. She is a member of the Association for Computing Machinery (ACM) and the Association for Computational Linguistics (ACL).

Homepage: http://www.cedar.buffalo.edu/~rohini/

E-mail: rohini@cedar.buffalo.edu


Additional Material (References, Slides & Lecture Notes):