Building a Persistent Workforce on Mechanical Turk for Multilingual Data Collection

David L. Chen and William B. Dolan

Abstract

Traditional methods of collecting translation and paraphrase data are prohibitively expensive, making the construction of large, new corpora difficult. While crowdsourcing offers a cheap alternative, quality control and scalability can become problematic. We discuss a novel annotation task that uses videos as the stimulus, which discourages cheating. In addition, our approach requires only monolingual speakers, making it easier to scale because more workers are qualified to contribute. Finally, we employ a multi-tiered payment system that helps retain good workers over the long term, resulting in a persistent, high-quality workforce. We present the results of one of the largest linguistic data collection efforts to date using Mechanical Turk, yielding 85K English sentences and more than 1K sentences for each of a dozen more languages.

Details

Publication type: Proceedings
Published in: Building a Persistent Workforce on Mechanical Turk for Multilingual Data Collection
Publisher: Human Computer Interaction International Conference