Automatically Constructing a Corpus of Sentential Paraphrases

Third International Workshop on Paraphrasing (IWP2005)

Published by Asia Federation of Natural Language Processing

An obstacle to research in automatic paraphrase identification and generation is the lack of large-scale, publicly available labeled corpora of sentential paraphrases. This paper describes the creation of the recently released Microsoft Research Paraphrase Corpus, which contains 5801 sentence pairs, each hand-labeled with a binary judgment as to whether the pair constitutes a paraphrase. The corpus was created using heuristic extraction techniques in conjunction with an SVM-based classifier to select likely sentence-level paraphrases from a large corpus of topic-clustered news data. These pairs were then submitted to human judges, who confirmed that 67% were in fact semantically equivalent. In addition to describing the corpus itself, we explore a number of issues that arose in defining guidelines for the human raters.

Publication Downloads

Microsoft Research Paraphrase Corpus

March 3, 2005

This download consists of data only: a text file containing 5800 pairs of sentences extracted from news sources on the web, along with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship. No more than one sentence has been extracted from any given news article. We have made a concerted effort to associate each sentence with accurate information about its provenance and its author.
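
For readers who want to inspect the data programmatically, the following is a minimal Python sketch of loading the pairs and computing the share labeled as paraphrases. The file layout is an assumption, not something stated on this page: the sketch presumes a tab-separated text file with one header row, the binary label ("1" or "0") in the first column, and the two sentences in the last two columns. The names `SentencePair`, `parse_pairs`, and `paraphrase_rate` are hypothetical helpers introduced for illustration.

```python
import csv
from typing import Iterable, List, NamedTuple


class SentencePair(NamedTuple):
    """One labeled pair (hypothetical container for illustration)."""
    is_paraphrase: bool
    sentence1: str
    sentence2: str


def parse_pairs(lines: Iterable[str]) -> List[SentencePair]:
    """Parse tab-separated lines into labeled sentence pairs.

    ASSUMED layout (not confirmed by the source): a header row, then one
    pair per row with the label in the first column and the two sentences
    in the last two columns.
    """
    # QUOTE_NONE keeps literal quotation marks in news text intact.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the assumed header row
    pairs = []
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or malformed rows
        pairs.append(SentencePair(row[0] == "1", row[-2], row[-1]))
    return pairs


def paraphrase_rate(pairs: List[SentencePair]) -> float:
    """Return the fraction of pairs labeled as paraphrases (0.0 if empty)."""
    if not pairs:
        return 0.0
    return sum(p.is_paraphrase for p in pairs) / len(pairs)
```

With a local copy of the file, `parse_pairs` can be given an open file handle directly, for example `with open(path, encoding="utf-8") as f: pairs = parse_pairs(f)`, where `path` is whatever name the downloaded file has. If the assumed layout holds, `paraphrase_rate(pairs)` over the full corpus should come out near the 67% figure reported in the abstract.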