Yangqiu Song, Haixun Wang, Zhongyuan Wang, and Hongsong Li
Most of the text mining tasks, such as clustering, is dominated by statistical approaches that treat text as a bag of words. Semantics in the text is largely ignored in the mining process, and the mining results are often not easily interpretable. One particular challenge faced by such approaches is short text understanding, as short text lacks enough content from which a statistical conclusion can be drawn. We enhance machine learning algorithms by first giving the machine a probabilistic knowledgebase that contains as big, rich, and consistent concepts (of worldly facts) as those in our mental world. Then a Bayesian inference mechanism is developed to conceptualize words and short text. We conducted comprehensive tests of our method on conceptualizing set of text terms, as well as clustering Twitter messages. We show significant improvements in terms of tweets clustering accuracy.