Congkai Sun, Bin Gao, Zhenfu Cao, and Hang Li
2008
Previously, topic models such as PLSI (Probabilistic Latent Semantic
Indexing) and LDA (Latent Dirichlet Allocation) were developed for
modeling the contents of plain texts. Recently, topic models
for processing hypertexts such as web pages were also
proposed. The proposed hypertext models are generative models giving
rise to both words and hyperlinks. This paper argues
that, to better represent the contents of hypertexts, it is more
important to treat the hyperlinks as fixed and to define the
topic model as one that generates words only. The paper then
proposes a new topic model for hypertext processing, referred to as
Hypertext Topic Model (HTM). HTM defines the distribution of words
in a document (i.e., the content of the document) as a mixture over
latent topics in the document itself and latent topics in the
documents which the document cites. The topics are further
characterized as distributions of words, as in the conventional
topic models. This paper further proposes a method for learning the
HTM model. Experimental results show that HTM outperforms the
baselines on topic discovery and document classification in three
datasets.
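
As a rough illustration of the mixture described above, the following minimal sketch computes a word distribution for one document from its own topics and the topics of the documents it cites. It is not the paper's formulation or learning method: the mixing weight `lam`, the uniform averaging over cited documents, and the names `theta`, `beta`, `cites`, and `htm_word_distribution` are all assumptions introduced here for illustration.

```python
def mix_topics(topic_weights, beta):
    """Return p(w) = sum_k topic_weights[k] * beta[k][w] for every word w."""
    vocab_size = len(beta[0])
    return [
        sum(topic_weights[k] * beta[k][w] for k in range(len(beta)))
        for w in range(vocab_size)
    ]


def htm_word_distribution(doc, theta, beta, cites, lam=0.5):
    """Word distribution of `doc` as a mixture over its own topics and
    the topics of the documents it cites (hyperlinks are treated as fixed).

    theta: per-document topic proportions (hypothetical values)
    beta:  per-topic word distributions (hypothetical values)
    cites: dict mapping a document index to the documents it cites
    lam:   assumed weight on the document's own topics
    """
    own = mix_topics(theta[doc], beta)
    cited = cites.get(doc, [])
    if not cited:
        # No outgoing links: fall back to the document's own topics.
        return own
    # Assumption: cited documents contribute equally.
    num_topics = len(beta)
    avg_cited = [
        sum(theta[c][k] for c in cited) / len(cited) for k in range(num_topics)
    ]
    from_cited = mix_topics(avg_cited, beta)
    return [lam * o + (1 - lam) * c for o, c in zip(own, from_cited)]


# Toy data: 3 documents, 2 topics, 3 vocabulary words.
theta = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
beta = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
cites = {0: [1, 2]}  # document 0 cites documents 1 and 2

p = htm_word_distribution(0, theta, beta, cites, lam=0.6)
```

Because every component is itself a probability distribution over the vocabulary, the mixture `p` also sums to one; only words are generated, while the link structure in `cites` is taken as given.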
In Proceedings of the 2008 conference on Empirical Methods in Natural Language Processing
| Type | Article |