HTM: A Topic Model for Hypertexts

Congkai Sun, Bin Gao, Zhenfu Cao, and Hang Li

Abstract

Previously, topic models such as PLSI (Probabilistic Latent Semantic Indexing) and LDA (Latent Dirichlet Allocation) were developed for modeling the content of plain texts. Recently, topic models for processing hypertexts such as web pages have also been proposed. These hypertext models are generative models that give rise to both words and hyperlinks. This paper argues that to better represent the content of hypertexts, it is more essential to assume that the hyperlinks are fixed and to define the topic model as one that generates words only. The paper then proposes a new topic model for hypertext processing, referred to as the Hypertext Topic Model (HTM). HTM defines the distribution of words in a document (i.e., the content of the document) as a mixture over the latent topics of the document itself and the latent topics of the documents it cites. The topics are further characterized as distributions over words, as in conventional topic models. The paper also proposes a method for learning the HTM model. Experimental results show that HTM outperforms the baselines on topic discovery and document classification on three datasets.
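
The mixture structure described in the abstract can be illustrated with a short sketch of the generative step for a single document. This is a minimal, hypothetical rendering of what the abstract implies, not the paper's actual model specification: the mixing weight `lambda_self`, the Dirichlet priors, and the topic counts are all assumptions introduced here for illustration.

```python
import numpy as np

def generate_document(phi, theta_self, theta_cited, lambda_self, n_words, rng):
    """Hypothetical HTM-style word generation for one document.

    phi         : (K, V) array, one topic-word distribution per row
    theta_self  : (K,) topic mixture of the document itself
    theta_cited : list of (K,) topic mixtures of the cited documents
    lambda_self : assumed probability of drawing a topic from the document
                  itself rather than from one of the documents it cites
    """
    words = []
    for _ in range(n_words):
        if not theta_cited or rng.random() < lambda_self:
            theta = theta_self  # topic drawn from the document's own mixture
        else:
            # topic drawn from a cited document's mixture (links are fixed)
            theta = theta_cited[rng.integers(len(theta_cited))]
        z = rng.choice(len(theta), p=theta)     # sample a latent topic
        w = rng.choice(phi.shape[1], p=phi[z])  # sample a word from that topic
        words.append(w)
    return words

# Toy usage with randomly drawn parameters
rng = np.random.default_rng(0)
K, V = 4, 50
phi = rng.dirichlet(np.ones(V), size=K)
theta_self = rng.dirichlet(np.ones(K))
theta_cited = [rng.dirichlet(np.ones(K))]
print(generate_document(phi, theta_self, theta_cited, 0.7, 10, rng))
```

The point of the sketch is the branch at the top of the loop: hyperlinks are treated as fixed structure, so the only generative choice is whether each word's topic comes from the document's own topic mixture or from that of a cited document.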

Details

Publication type: Article
Published in: Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing