Identifying Salient Entities in Web Pages

  • Michael Gamon
  • Tae Yano
  • Xinying Song
  • Johnson Apacible
  • Patrick Pantel

International Conference on Information and Knowledge Management (CIKM'13)

Published by ACM


We propose a system that determines the salience of entities within web documents. Many recent advances in commercial search engines leverage the identification of entities in web pages. However, for many pages, only a small subset of entities is central to the document, which can lead to degraded relevance for entity-triggered experiences. We address this problem by devising a system that scores each entity on a web page according to its centrality to the page content. We propose salience classification functions that incorporate various cues from document content, web search logs, and a large web graph. To cost-effectively train the models, we introduce a soft labeling methodology that generates a set of annotations based on user behaviors observed in web search logs. We evaluate several variations of our model via a large-scale empirical study conducted over a test set, which we release publicly to the research community. We demonstrate that our methods significantly outperform competitive baselines and the previous state of the art, while keeping the human annotation cost to a minimum.
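To make the soft labeling idea concrete, the sketch below shows one plausible way to derive a salience score from search logs. It is an illustration under stated assumptions, not the method from the paper: the function name `soft_salience_labels`, the `(query, url, clicks)` log layout, and the substring match between entity and query are all hypothetical. The score for an entity/URL pair is the share of clicks on that URL that came from queries mentioning the entity.

```python
from collections import defaultdict


def soft_salience_labels(click_log, page_entities):
    """Return {(url, entity): score in [0, 1]} from a click log.

    click_log:     iterable of (query, url, clicks) tuples (assumed layout).
    page_entities: dict mapping url -> list of entity strings found on the page.
    The score is the fraction of a URL's clicks whose query mentions the entity.
    """
    total_clicks = defaultdict(int)   # clicks per URL
    entity_clicks = defaultdict(int)  # clicks per (URL, entity) pair

    for query, url, clicks in click_log:
        if url not in page_entities:
            continue
        total_clicks[url] += clicks
        query_lower = query.lower()
        for entity in page_entities[url]:
            # Naive substring match; a real system would need entity linking.
            if entity.lower() in query_lower:
                entity_clicks[(url, entity)] += clicks

    labels = {}
    for url, entities in page_entities.items():
        for entity in entities:
            if total_clicks[url] == 0:
                labels[(url, entity)] = 0.0  # no click evidence for this URL
            else:
                labels[(url, entity)] = entity_clicks[(url, entity)] / total_clicks[url]
    return labels


# Hypothetical toy data, for illustration only.
log = [
    ("seattle weather", "http://example.com/a", 3),
    ("weather forecast", "http://example.com/a", 1),
]
entities = {"http://example.com/a": ["Seattle", "Microsoft"]}
labels = soft_salience_labels(log, entities)
```

Scores of this kind could serve as training targets in place of manual annotations, which is the cost saving the abstract points to.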

Publication Downloads

Microsoft Document Aboutness Dataset

November 19, 2012

The Microsoft Document Aboutness Dataset consists of randomly sampled URLs (from a HEAD and TAIL distribution), all entities recognized in those documents, and a relevance assessment for each entity/URL pair indicating whether the entity is salient to the content of the URL.
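As a rough illustration of how such data might be handled once loaded, the sketch below models one assessment per entity/URL pair. The on-disk format of the dataset is not described here, so the `Assessment` class, its field names, and the `salient_entities_by_url` helper are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    """One entity/URL judgment (hypothetical in-memory form, not the file format)."""
    url: str
    entity: str
    bucket: str    # "HEAD" or "TAIL" sample, assumed field
    salient: bool  # True if the entity was judged salient to the page


def salient_entities_by_url(rows):
    """Group the entities judged salient under the URL they appear on."""
    grouped = defaultdict(list)
    for row in rows:
        if row.salient:
            grouped[row.url].append(row.entity)
    return dict(grouped)


# Hypothetical rows, for illustration only.
rows = [
    Assessment("http://example.com/a", "Seattle", "HEAD", True),
    Assessment("http://example.com/a", "Microsoft", "HEAD", False),
    Assessment("http://example.com/b", "CIKM", "TAIL", True),
]
grouped = salient_entities_by_url(rows)
```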