A Web-based Text Corpora Development System

LREC 2000 2nd International Conference on Language Resources & Evaluation

One of the most important starting points for any NLP endeavor is the construction of text corpora of appropriate size and quality. This paper presents a web-based text corpora development system that addresses both the size and the quality of these corpora. The quantitative problem is solved by using the Internet as a practically limitless source of texts. To ensure quality, we enrich the text with information relevant to further use, resolving in an integrated manner the problems of diacritic restoration, lexical ambiguity resolution and morphosyntactic annotation. Although the system currently targets texts in Romanian, it provides a number of mechanisms that allow it to be easily adapted to other languages.
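
To make the diacritic restoration problem concrete, the following is a minimal sketch and not the method used by the system described here. It assumes a small hypothetical lexicon of correctly spelled Romanian word forms, and the function names (strip_diacritics, build_index, restore) are illustrative only. A token typed without diacritics is restored only when exactly one lexicon entry matches it; ambiguous tokens are left unchanged, which is the case where lexical ambiguity resolution and morphosyntactic information become necessary.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word: str) -> str:
    """Remove combining marks, e.g. 'pește' -> 'peste'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(lexicon):
    """Map each diacritic-free form to the set of lexicon words that produce it."""
    index = defaultdict(set)
    for word in lexicon:
        index[strip_diacritics(word)].add(word)
    return index


def restore(tokens, index):
    """Restore diacritics only where the lexicon offers a single candidate."""
    restored = []
    for token in tokens:
        candidates = index.get(token, set())
        if len(candidates) == 1:
            restored.append(next(iter(candidates)))
        else:
            # Unknown or ambiguous token: leave it for a later disambiguation step.
            restored.append(token)
    return restored


# Hypothetical toy lexicon (assumption, not taken from the paper).
LEXICON = ["și", "țară", "fată", "fata", "peste", "pește"]

if __name__ == "__main__":
    index = build_index(LEXICON)
    print(restore(["si", "tara", "fata"], index))
```

In this toy lexicon, "fata" maps to both "fata" and "fată", so the lookup alone cannot decide between them; a choice of that kind depends on context, which suggests why the three problems are treated together rather than in isolation.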