Jian-Tao Sun, Dou Shen, Huajun Zeng, Qiang Yang, Yuchang Lu, and Zheng Chen
Most previous Web-page summarization methods treat a Web page as plain text. However, such methods fail to uncover the full knowledge associated with a Web page to build a high-quality summary, because the Web contains many hidden relationships that are not used in these methods. Uncovering the inherent knowledge is important to building good Web-page summarizers. In this paper, we extract the extra knowledge from the clickthrough data of a Web search engine to improve Web-page summarization. We first analyze the feasibility to utilize clickthrough data in text summarization, and then propose two adapted summarization methods that take advantage of the relationships discovered from the clickthrough data. For those pages not covered by the clickthrough data, we put forward a thematic lexicon approach to generate implicit knowledge for them. Our methods are evaluated on a relatively small dataset consisting of manually annotated pages as well as a large dataset that is crawled from the Open Directory Project website. The experimental results indicate that significant improvements can be achieved through our proposed summarizer as compared with summarizers without using the clickthrough data.
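The abstract does not spell out the two adapted methods, so the following is only a minimal sketch of the general idea, assuming a simple extractive summarizer that scores sentences by average term weight and gives extra weight to terms that also occur in queries users clicked through to the page. The names `tokenize`, `summarize`, and `boost`, as well as the scoring formula itself, are illustrative assumptions and not the authors' method.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def summarize(sentences, clicked_queries, boost=2.0, top_k=1):
    """Pick the top_k sentences by average term weight.

    A term's weight is its frequency on the page plus `boost` times its
    frequency in the clicked queries. Both `boost` and this formula are
    assumptions made for illustration, not values from the paper.
    """
    page_tf = Counter(t for s in sentences for t in tokenize(s))
    query_tf = Counter(t for q in clicked_queries for t in tokenize(q))

    def weight(term):
        # Counter returns 0 for terms that never appear in the queries.
        return page_tf[term] + boost * query_tf[term]

    def score(sentence):
        terms = tokenize(sentence)
        if not terms:
            return 0.0
        return sum(weight(t) for t in terms) / len(terms)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]),
                    reverse=True)
    # Return the selected sentences in their original page order.
    return [sentences[i] for i in sorted(ranked[:top_k])]


if __name__ == "__main__":
    page = [
        "Welcome to our site.",
        "We sell cheap flight tickets.",
        "Contact us by phone.",
    ]
    queries = ["cheap flight", "flight tickets"]
    print(summarize(page, queries))
```

In this toy input the query terms "cheap", "flight", and "tickets" raise the score of the second sentence above the others, which illustrates how clickthrough queries can steer sentence selection toward what users were actually looking for on the page.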
Publisher: Association for Computing Machinery, Inc.
Copyright © 2004 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or email@example.com. The definitive version of this paper can be found at ACM’s Digital Library: http://www.acm.org/dl/.