Learning Query and Document Similarities from Click-through Bipartite Graph with Metadata

  • Wei Wu ,
  • Hang Li ,
  • Jun Xu

MSR-TR-2011-126 |

We consider learning query and document similarities from a click-through bipartite graph with metadata on the nodes. The metadata consists of multiple types of features of queries and documents. We aim to leverage both the click-through bipartite and the features to learn similarity functions that can effectively measure query-document, document-document, and query-query similarities. The challenges include how to model the similarity functions based on the graph data, and how to accurately and efficiently learn the similarity functions. We propose a method that can solve the problems in a principled way. Specifically, we use two different linear mappings to project queries and documents in two different feature spaces into the same latent space, and take the dot product in the latent space as the similarity function between queries and documents. Query-query and document-document similarities can also be naturally defined as dot products in the latent space. We formalize the learning of similarity functions as learning of the mappings that maximize the similarities of the observed query-document pairs on the enriched click-through bipartite. When queries and documents have multiple types of features, the similarity function is defined as a linear combination of multiple similarity functions, each based on one type of features. We further solve the learning problem by using a new technique called Multi-view Partial Least Squares (M-PLS) developed by us. We prove that the learning problem has a global optimal solution. The global optimum can be efficiently obtained through Singular Value Decomposition (SVD).We also theoretically demonstrate that the proposed method is also capable of finding high quality similar queries. We conducted large scale experiments on enterprise search data and web search data. The experimental results on relevance ranking and similar query finding demonstrate that the proposed method works significantly better than the baseline methods.