Designing a Common POS-Tagset Framework for Indian Languages

  • Sankaran Baskaran
  • Kalika Bali
  • Tanmoy Bhattacharya
  • Pushpak Bhattacharyya
  • Girish Nath Jha
  • Rajendran S
  • Saravanan K
  • Sobha L
  • Subbarao K V

Proceedings of the 6th Workshop on Asian Language Resources, 2008

Research on part-of-speech (POS) tagset design for European and East Asian languages began with simple listings of the important morphosyntactic features of a single language and has since matured towards hierarchical tagsets, decomposable tags, and common frameworks for multiple languages (such as EAGLES). Several tagsets have been developed for these languages, along with large amounts of annotated data to further research. Indian Languages (ILs) present a contrasting picture, with very little research on tagset design issues. We present our work on designing a common POS-tagset framework for ILs, the result of an in-depth analysis of eight languages from two major families, Indo-Aryan and Dravidian. Our framework follows a hierarchical tagset layout similar to the EAGLES guidelines, but with significant changes as needed for the ILs.