Support Vector Machines for Paraphrase Identification and Corpus Construction

Chris Brockett and William B. Dolan

Abstract

The lack of readily available large corpora of aligned monolingual sentence pairs is a major obstacle to the development of Statistical Machine Translation-based paraphrase models. In this paper, we describe the use of annotated datasets and Support Vector Machines to induce larger monolingual paraphrase corpora from a comparable corpus of news clusters found on the World Wide Web. Features include: morphological variants; WordNet synonyms and hypernyms; log-likelihood-based word pairings dynamically obtained from baseline sentence alignments; and formal string features such as word-based edit distance. Use of this technique dramatically reduces the Alignment Error Rate of the extracted corpora compared with heuristic methods based on the position of the sentences in the text.
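As an illustration of one of the string features named in the abstract, word-based edit distance can be sketched as a Levenshtein distance computed over tokens rather than characters. This is a minimal sketch only, not the authors' implementation: the function name `word_edit_distance`, the whitespace tokenization, and the lowercasing are all assumptions made for the example.

```python
def word_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over whitespace tokens (hypothetical helper).

    Assumes lowercasing and whitespace tokenization; the paper's own
    tokenization and normalization are not specified in the abstract.
    """
    x, y = a.lower().split(), b.lower().split()
    # prev[j] holds the distance between the first i-1 words of x
    # and the first j words of y.
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, start=1):
        curr = [i]  # deleting all i words of x to reach an empty prefix of y
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(
                prev[j] + 1,         # delete a word from x
                curr[j - 1] + 1,     # insert a word from y
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

A value such as this would be one numeric feature among several passed to a classifier for each candidate sentence pair; a low distance suggests the two sentences share most of their wording.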

Details

Publication type: Inproceedings
Published in: Third International Workshop on Paraphrasing (IWP2005)
Publisher: Asian Federation of Natural Language Processing