Topic:

Shallow Parsing

Abstract:

In this talk, the speakers will introduce part-of-speech (POS) tagging and shallow parsing. Tokenization and POS tagging form the first step in any natural language analysis task. POS tagging is the process of labeling each word in a text with its grammatical category. This is followed by syntactic analysis, or parsing, which determines the grammatical structure of a sentence with respect to the grammar of the language.

Full parsing is a challenging task: it requires developing a full grammar of the language, and it poses the computational problem of identifying the most plausible parse of a given sentence. However, many applications in NLP do not require a complete syntactic analysis. Many tasks in Information Retrieval, Information Extraction, and Question Analysis can be performed adequately by identifying the noun phrases, verb phrases, etc., and the relationships between these entities. Shallow parsing can be used to recover some limited syntactic information from natural language sentences. This often involves chunking, which refers to the process of grouping words into chunks given their POS tags.

In this talk, the speakers will discuss various approaches that have been used for POS tagging. POS taggers can be developed using linguistic rules; however, most taggers developed in recent times are based on machine learning approaches. The speakers will explain how different machine learning approaches such as Hidden Markov Models, Maximum Entropy models, and Conditional Random Fields can be used for POS tagging. Different contextual and lexical features are used with these methods to improve tagging accuracy. It has also been observed that the morphological information of words plays an important role in POS tagging. Some results will be discussed. Unsupervised POS tagging is an interesting direction of work that automatically induces word groups based on similarity in the contexts and forms of the words in a raw corpus. Some methods will be discussed.
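As a rough illustration of the Hidden Markov Model approach, the sketch below decodes the most likely tag sequence with the Viterbi algorithm. The three-tag tagset and all probabilities are hand-set assumptions chosen for the example, not values estimated from any corpus or taken from the talk; the names `viterbi`, `START`, `TRANS`, `EMIT`, and `UNSEEN` are hypothetical.

```python
import math

# Toy model: every number below is an illustrative assumption.
TAGS = ["DT", "NN", "VB"]
START = {"DT": 0.6, "NN": 0.3, "VB": 0.1}           # P(first tag)
TRANS = {                                           # P(next tag | previous tag)
    "DT": {"DT": 0.05, "NN": 0.90, "VB": 0.05},
    "NN": {"DT": 0.10, "NN": 0.30, "VB": 0.60},
    "VB": {"DT": 0.50, "NN": 0.40, "VB": 0.10},
}
EMIT = {                                            # P(word | tag)
    "DT": {"the": 0.7, "a": 0.3},
    "NN": {"dog": 0.4, "walk": 0.2, "park": 0.4},
    "VB": {"walk": 0.6, "barks": 0.4},
}
UNSEEN = 1e-6  # small floor for word/tag pairs missing from EMIT


def viterbi(words):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    if not words:
        return []
    # Each column maps tag -> (best log-probability so far, previous tag).
    columns = [{
        t: (math.log(START[t]) + math.log(EMIT[t].get(words[0], UNSEEN)), None)
        for t in TAGS
    }]
    for word in words[1:]:
        column = {}
        for t in TAGS:
            # Pick the predecessor tag that maximizes the path score into t.
            prev = max(TAGS, key=lambda p: columns[-1][p][0] + math.log(TRANS[p][t]))
            score = columns[-1][prev][0] + math.log(TRANS[prev][t])
            column[t] = (score + math.log(EMIT[t].get(word, UNSEEN)), prev)
        columns.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


print(viterbi(["the", "dog", "barks"]))
print(viterbi(["a", "walk"]))
print(viterbi(["dog", "walk"]))
```

In this toy model the word "walk" can be emitted by either NN or VB, so the last two calls show how the transition probabilities let the preceding tag decide between the two readings.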

The speakers will also discuss several methods for chunking or shallow parsing. A variety of methods have been employed for the chunking task, ranging from patterns in the form of rules or regular expressions to various types of graphical models. The rules may be handcrafted or learned. The speakers will discuss how rules for chunks can be encoded in the NLTK system, and will also cover some machine learning methods and a few other interesting models that have been used for this task.
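To make the idea of a handcrafted chunk rule concrete, here is a minimal sketch in plain Python that applies a regular expression over the sequence of POS tags. The rule itself (optional determiner, any adjectives, one or more nouns), the Penn Treebank-style tags, and the names `NP_RULE` and `chunk_noun_phrases` are assumptions made for illustration. NLTK's `RegexpParser` accepts tag patterns in a comparable notation, for example `NP: {<DT>?<JJ>*<NN.*>+}`, but this sketch does not use NLTK.

```python
import re

# Hypothetical noun-phrase rule over tags written as "<TAG>" tokens:
# optional determiner, any number of adjectives, one or more nouns.
NP_RULE = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN[A-Z]*>)+")


def chunk_noun_phrases(tagged):
    """Return noun-phrase chunks from a list of (word, tag) pairs."""
    tag_string = ""
    starts = {}  # character offset where a token's tag begins -> token index
    ends = {}    # character offset where a token's tag ends -> token index
    for i, (_, tag) in enumerate(tagged):
        starts[len(tag_string)] = i
        tag_string += f"<{tag}>"
        ends[len(tag_string)] = i
    chunks = []
    for match in NP_RULE.finditer(tag_string):
        # Every match begins at "<" and ends at ">", so both offsets
        # fall on token boundaries recorded above.
        first, last = starts[match.start()], ends[match.end()]
        chunks.append(" ".join(word for word, _ in tagged[first:last + 1]))
    return chunks


sentence = [
    ("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
    ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]
print(chunk_noun_phrases(sentence))
```

The chunker only sees the tags, not the words, which is why its output depends directly on the quality of the preceding POS tagging step.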

Bio of the Speakers:

Sudeshna Sarkar is currently a Professor in the Department of Computer Science & Engineering at Indian Institute of Technology Kharagpur. She completed her B.Tech. in Computer Science & Engineering in 1989 from IIT Kharagpur, her MS in Computer Science from University of California, Berkeley in 1991, and her PhD from IIT Kharagpur in 1995. She served on the faculty at IIT Guwahati and IIT Kanpur before joining IIT Kharagpur. Her broad research interests are in Artificial Intelligence and Machine Learning. She currently works mainly in the areas of natural language processing and personalization issues in information retrieval and recommender systems. She has executed several research projects in related fields. Some of the NLP-related projects on which she is currently working are cross-language information access, Machine Translation between Indian languages, and NER and POS tagging for Hindi. She has also been the principal scientist of Minekey, a company incubated at IIT Kharagpur that provides personalization services for content discovery.

Homepage: http://www.facweb.iitkgp.ernet.in/~sudeshna/

Email: sudeshna@cse.iitkgp.ernet.in


Additional Material (References, Slides & Lecture Notes):

&

Mr. Monojit Choudhury has recently submitted his PhD thesis, "Computation models of real world phonological change", to the Department of Computer Science and Engineering, IIT Kharagpur. Earlier, he received his B.Tech. from the same department in 2002. Mr. Choudhury's research interests include computational models of language evolution and change, complex networks, NLP for resource-poor languages, and computational musicology. He will be joining MSRI as a post-doctoral researcher in June 2007.

Email: monojit@cse.iitkgp.ernet.in
