William D. Lewis and Fei Xia
In this paper we explore the potential for identifying computationally relevant typological features from a multilingual corpus built from readily available language data collected off the Web. Our work builds on previous structural projection work, which we extend to build individual context-free grammars (CFGs) for approximately 100 languages. We then use the CFGs to discover the values of typological parameters such as word order and the presence or absence of definite and indefinite determiners. Our methods can potentially be extended to many more languages and parameters, and could significantly affect current research on tool and resource development for low-density languages and on grammar induction from raw corpora.
In Proceedings of The Third International Joint Conference on Natural Language Processing (IJCNLP)
Publisher: Asian Federation of Natural Language Processing
Copyright 2007 by AFNLP