Wen-tau Yih and Chris Meek
25 October 2010
Conventional vector-based similarity measures treat each term independently. In measures such as cosine or overlap, only identical terms that occur in both term vectors are matched and contribute to the final similarity score. Non-identical but semantically related terms, such as "car" and "automobile", are ignored entirely. To address this problem, we propose a novel approach that learns a new vector representation from the original term vectors. The weight of each element in the output vector is a linear combination of the term-weighting scores of related terms. Depending on the configuration, our method can learn extended term vectors over the same vocabulary, or "concept" vectors with reduced dimensionality. In both settings, it significantly outperforms existing methods at measuring document similarity, and the improvement is consistent across several evaluation metrics.
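
To make the idea concrete, the following minimal Python sketch shows how a linear mapping over term weights can let "car" and "automobile" match under cosine similarity. It is an illustration only and not the learning procedure itself: the three-word vocabulary, the matrices `EXTEND` and `CONCEPT`, and the helper names `cosine` and `project` are all hypothetical, and the weights are set by hand here, whereas the proposed method learns them from data.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def project(matrix, vec):
    """Each output element is a linear combination of the input term weights."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


# Toy vocabulary (assumption): index 0 = "car", 1 = "automobile", 2 = "banana".
doc_a = [1.0, 0.0, 0.0]  # mentions only "car"
doc_b = [0.0, 1.0, 0.0]  # mentions only "automobile"
doc_c = [0.0, 0.0, 1.0]  # mentions only "banana"

# Hand-set weights (assumption; a real model would learn these).
# Same-vocabulary setting: related terms share part of their weight.
EXTEND = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# Reduced-dimensionality setting: two "concepts", vehicle and fruit.
CONCEPT = [
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

raw_sim = cosine(doc_a, doc_b)
extended_sim = cosine(project(EXTEND, doc_a), project(EXTEND, doc_b))
concept_sim = cosine(project(CONCEPT, doc_a), project(CONCEPT, doc_b))
unrelated_sim = cosine(project(CONCEPT, doc_a), project(CONCEPT, doc_c))

print(raw_sim, extended_sim, concept_sim, unrelated_sim)
```

With the raw term vectors, the "car" and "automobile" documents share no terms and score zero. After either mapping they score close to or exactly one, while the "car" and "banana" documents remain dissimilar.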