Mining Query Subtopics from Search Log Data

Yunhua Hu; Yanan Qian; Hang Li; Daxin Jiang; Jian Pei; Qinghua Zheng

International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) | January 2012

Most queries in web search are ambiguous and multifaceted. Identifying the major senses and facets of queries from search log data, referred to as query subtopic mining in this paper, is a very important issue in web search. Through search log analysis, we show that there are two interesting phenomena of user behavior that can be leveraged to identify query subtopics, referred to as ‘one subtopic per search’ and ‘subtopic clarification by keyword’. One subtopic per search means that if a user clicks multiple URLs in one search, then the clicked URLs tend to represent the same sense or facet. Subtopic clarification by keyword means that users often add one or more keywords to expand the query in order to clarify their search intent. Thus, the keywords tend to be indicative of the sense or facet. We propose a clustering algorithm that can effectively leverage the two phenomena to automatically mine the major subtopics of queries, where each subtopic is represented by a cluster containing a number of URLs and keywords. The mined subtopics of queries can be used in multiple tasks in web search, and we evaluate them on aspects of search result presentation such as clustering and re-ranking. We demonstrate that our clustering algorithm can effectively mine query subtopics with an F1 measure in the range of 0.896-0.956. Our experimental results show that the use of the subtopics mined by our approach can significantly improve the state-of-the-art methods used for search result clustering. Experimental results based on click data also show that re-ranking search results based on our method can significantly improve the efficiency with which users find information.
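
The two phenomena can be illustrated with a small sketch. The abstract does not specify the clustering algorithm itself, so the code below is not the authors' method: it substitutes a plain connected-components grouping (union-find) for the clustering step. The log format (pairs of issued query and clicked URLs), the function name `mine_subtopics`, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def mine_subtopics(query, log):
    """Group clicked URLs and expansion keywords into rough subtopics.

    `log` is a hypothetical, simplified search log: an iterable of
    (issued_query, clicked_urls) pairs. This is a connected-components
    simplification, not the clustering algorithm from the paper.
    """
    parent = {}

    def find(u):
        # Union-find lookup with path halving.
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    keywords_by_url = defaultdict(set)
    prefix = query + " "

    for issued, urls in log:
        if not urls:
            continue
        if issued == query:
            extra = []
        elif issued.startswith(prefix):
            # Subtopic clarification by keyword: the words appended
            # to the original query hint at the intended subtopic.
            extra = issued[len(prefix):].split()
        else:
            continue  # unrelated query
        # One subtopic per search: URLs clicked in the same search
        # are assumed to share a sense or facet.
        for u in urls:
            union(urls[0], u)
            keywords_by_url[u].update(extra)

    clusters = defaultdict(lambda: (set(), set()))
    for u in parent:
        cluster_urls, cluster_keywords = clusters[find(u)]
        cluster_urls.add(u)
        cluster_keywords.update(keywords_by_url[u])

    # Sort for a deterministic result.
    return sorted((sorted(us), sorted(ks)) for us, ks in clusters.values())


# Made-up sample log for the ambiguous query "jaguar".
sample_log = [
    ("jaguar", ["jaguar.com", "jaguarusa.com"]),
    ("jaguar car", ["jaguar.com"]),
    ("jaguar animal", ["wikipedia.org/Jaguar", "nationalgeographic.com/jaguar"]),
    ("jaguar", ["wikipedia.org/Jaguar"]),
    ("python", ["python.org"]),
]

for urls, keywords in mine_subtopics("jaguar", sample_log):
    print(urls, keywords)
```

On this sample the sketch yields two groups, one holding the car-related URLs with the keyword "car" and one holding the animal-related URLs with the keyword "animal". A single noisy search that clicks URLs from both senses would merge the groups, which is one reason a real system would need a weighted clustering method rather than plain connected components.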