Fast Top-K Similarity Queries Via Matrix Compression

Yucheng Low and Alice X. Zheng

Abstract

In this paper, we propose a novel method to efficiently compute the top-K most similar items given a query item, where the most similar items are those with the highest vector inner products with the query. The task is related to the classical k-Nearest-Neighbor problem, and is widely applicable in a number of domains such as information retrieval, online advertising and collaborative filtering. Our method assumes an in-memory representation of the dataset and is designed to scale to queries with hundreds of thousands of terms. Our algorithm uses a generalized Hölder’s inequality to upper-bound the inner product by the norms of the constituent vectors. We also propose a novel compression scheme that computes bounds for groups of candidate items, thereby speeding up computation and minimizing memory requirements per query. We conduct extensive experiments on the publicly available Wikipedia dataset, and demonstrate that, with a memory overhead of 21%, our method can provide 1-3 orders of magnitude improvement in query run-time compared to naive methods and state-of-the-art competing methods. Our median top-10 word query time is 25 s on 7.5 million words and 2.3 million documents.
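
The bounding idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it uses only the p = q = 2 special case of Hölder's inequality (Cauchy-Schwarz), |<q, x>| <= ||q||_2 ||x||_2, and omits the generalized bound and the compression scheme entirely. The function name `top_k_inner_product` and its return shape are hypothetical, chosen for illustration.

```python
import heapq
import math


def top_k_inner_product(query, items, k):
    """Return (top, scanned): the k best (score, index) pairs and the
    number of exact inner products that had to be computed.

    Illustrative assumption: candidates are visited in order of decreasing
    L2 norm, so the bound ||query|| * ||item|| only shrinks during the scan
    and the loop can stop once it falls to the current k-th best score.
    """
    if k <= 0:
        return [], 0
    q_norm = math.sqrt(sum(v * v for v in query))
    norms = [math.sqrt(sum(v * v for v in x)) for x in items]
    order = sorted(range(len(items)), key=lambda i: norms[i], reverse=True)

    heap = []  # min-heap of (score, index); heap[0] is the k-th best so far
    scanned = 0
    for i in order:
        bound = q_norm * norms[i]  # upper bound on <query, items[i]>
        if len(heap) == k and bound <= heap[0][0]:
            break  # no remaining item can strictly beat the k-th best
        scanned += 1
        score = sum(a * b for a, b in zip(query, items[i]))
        if len(heap) < k:
            heapq.heappush(heap, (score, i))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, i))
    return sorted(heap, reverse=True), scanned
```

In this sketch the norms are recomputed on every call for brevity; a practical version would precompute and store them once per dataset, which is where a memory overhead of the kind the abstract mentions would come from.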

Details

Publication type: Inproceedings
Published in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012)
Publisher: ACM International Conference on Information and Knowledge Management (CIKM)