Support Vector Machines for Paraphrase Identification and Corpus Construction

The lack of readily available large corpora of aligned monolingual sentence pairs is a major obstacle to the development of Statistical Machine Translation-based paraphrase models. In this paper, we describe the use of annotated datasets and Support Vector Machines to induce larger monolingual paraphrase corpora from a comparable corpus of news clusters found on the World Wide Web. Features include: morphological variants; WordNet synonyms and hypernyms; log-likelihood-based word pairings dynamically obtained from baseline sentence alignments; and formal string features such as word-based edit distance. Use of this technique dramatically reduces the Alignment Error Rate of the extracted corpora compared with heuristic methods based on the position of the sentences in the text.
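
As an illustration of the string features mentioned above, the following is a minimal sketch of a word-based edit distance between two candidate sentences. The abstract does not give the exact feature definition, so the whitespace tokenization, the lowercasing, the length normalization, and the names `word_edit_distance` and `string_features` are all assumptions made for this example, not the authors' implementation.

```python
def word_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over whitespace-separated, lowercased tokens.

    Assumption: simple whitespace tokenization; the paper's tokenizer
    is not specified in the abstract.
    """
    x, y = a.lower().split(), b.lower().split()
    # prev[j] holds the distance between the first i-1 words of x
    # and the first j words of y.
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, start=1):
        curr = [i]  # deleting all i words of x to reach an empty prefix of y
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            curr.append(min(prev[j] + 1,          # delete wx
                            curr[j - 1] + 1,      # insert wy
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def string_features(a: str, b: str) -> dict:
    """Hypothetical feature dictionary for one sentence pair."""
    dist = word_edit_distance(a, b)
    longest = max(len(a.split()), len(b.split()), 1)  # avoid division by zero
    return {
        "edit_distance": dist,
        "normalized_edit_distance": dist / longest,
    }
```

A feature dictionary of this kind could be combined with the lexical features (morphological variants, WordNet relations, word pairings) and passed to a classifier such as an SVM to label each sentence pair as a paraphrase or not.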

I05-5001[1].pdf
PDF file

In  Third International Workshop on Paraphrasing (IWP2005)

Publisher  Asian Federation of Natural Language Processing
copyright 2005 by AFNLP

Details

Type  Inproceedings