Pseudo-Anchor Text Extraction for Vertical Search

  • Shuming Shi
  • Fei Xing
  • Mingjie Zhu
  • Zaiqing Nie
  • Ji-Rong Wen

MSR-TR-2006-122

Anchor text plays an especially important role in improving the performance of general Web search. Its importance comes from the fact that it is a fairly objective description of a Web page, contributed by a potentially large number of other Web pages. Vertical search provides indexing and search functionality for objects in a specific domain and is becoming an important supplement to general Web search. It is desirable to utilize anchor text in vertical search as well to improve search performance. However, vertical objects typically lack explicit URLs that identify them accurately, and the anchor text of a vertical object is hard to acquire explicitly. This paper proposes the concepts of pseudo-URL and pseudo-anchor-text for vertical objects, corresponding to the URL and anchor text of a general Web page. To extract and utilize the pseudo-anchor-text of vertical objects, we focus on two steps: candidate anchor block accumulation and pseudo-anchor extraction. State-of-the-art data integration techniques are used to accumulate the candidate anchor blocks that belong to the same object. The pseudo-anchor text for each object is then extracted from its candidate anchor blocks using a machine-learning-based approach. A case study in the academic search domain indicates that our approach can dramatically improve search performance.
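As a rough illustration of the accumulation step only, the sketch below groups candidate anchor blocks under a shared object key. It is not the method of the report: the report relies on data integration techniques to decide which blocks refer to the same object, whereas this sketch substitutes a naive title normalization. The names `pseudo_url` and `accumulate_blocks`, the key format, and the sample data are all hypothetical.

```python
import re
from collections import defaultdict


def pseudo_url(title: str) -> str:
    """Hypothetical pseudo-URL: the lowercased alphanumeric words of a title.

    Assumption: a normalized title is enough to identify an object. The report
    uses data integration techniques for this step instead.
    """
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)


def accumulate_blocks(citations):
    """Group candidate anchor blocks (citing contexts) by pseudo-URL."""
    blocks = defaultdict(list)
    for cited_title, context in citations:
        blocks[pseudo_url(cited_title)].append(context)
    return dict(blocks)


# Toy data, invented for illustration: (cited title, surrounding text) pairs.
citations = [
    ("Pseudo-Anchor Text Extraction for Vertical Search", "context A citing the paper"),
    ("Pseudo-anchor text extraction for vertical search.", "context B citing the paper"),
    ("Some Other Paper", "context C citing another paper"),
]

grouped = accumulate_blocks(citations)
for key, contexts in grouped.items():
    print(key, len(contexts))
```

The two differently formatted titles normalize to the same key, so their contexts end up in one group. Selecting the pseudo-anchor text from each group would be a separate, learned step that this sketch does not attempt.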