Learning Vector Representations for Similarity Measures

Conventional vector-based similarity measures consider each term separately. In measures such as cosine or overlap, only identical terms that occur in both term vectors are matched and contribute to the final similarity score. Non-identical but semantically related terms, such as "car" and "automobile", are ignored entirely. To address this problem, we propose a novel approach that learns a new vector construction from the original term vectors. The weight of each element in the output vector is a linear combination of the term-weighting scores of related terms. Depending on the configuration, our method can learn extended term vectors over the same vocabulary, as well as "concept" vectors with reduced dimensionality. In both settings, it significantly outperforms existing methods at measuring document similarity, and the improvement is consistent across various metrics.
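
The idea in the abstract can be illustrated with a minimal sketch. It is not the report's implementation: the three-word vocabulary, the document vectors, and the matrix `CONCEPT_WEIGHTS` are hypothetical and set by hand, whereas the proposed method learns such weights from data. The sketch only shows why plain cosine scores "car" and "automobile" as unrelated, and how making each output element a linear combination of input term weights changes that.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def project(term_vector, weights):
    """Map a term vector to an output vector.

    weights[i][j] is the contribution of input term i to output element j,
    so each output element is a linear combination of input term weights.
    """
    n_out = len(weights[0])
    return [
        sum(term_vector[i] * weights[i][j] for i in range(len(term_vector)))
        for j in range(n_out)
    ]


# Hypothetical vocabulary, for illustration only.
VOCAB = ["car", "automobile", "banana"]

# One-term documents expressed as term-weight vectors over VOCAB.
doc_a = [1.0, 0.0, 0.0]  # "car"
doc_b = [0.0, 1.0, 0.0]  # "automobile"
doc_c = [0.0, 0.0, 1.0]  # "banana"

# Hand-set weights (an assumption; the report learns these instead).
# Rows follow VOCAB; columns are two "concept" dimensions.
CONCEPT_WEIGHTS = [
    [1.0, 0.0],  # car        -> concept 0
    [1.0, 0.0],  # automobile -> concept 0
    [0.0, 1.0],  # banana     -> concept 1
]

# No shared term, so plain cosine sees no similarity.
raw_ab = cosine(doc_a, doc_b)

# After projection, related terms land on the same concept dimension.
proj_ab = cosine(project(doc_a, CONCEPT_WEIGHTS), project(doc_b, CONCEPT_WEIGHTS))
proj_ac = cosine(project(doc_a, CONCEPT_WEIGHTS), project(doc_c, CONCEPT_WEIGHTS))
```

Here the output has fewer dimensions than the vocabulary, which corresponds to the reduced-dimensionality "concept" setting. Using a square weight matrix over the same vocabulary would correspond to the extended term vector setting.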

TR-S2Net.pdf
PDF file

Details

Type: TechReport
Number: MSR-TR-2010-139