Ye-Yi Wang, Xiao Li, and Alex Acero
Text classification has been widely applied to many practical tasks. Inductive models trained from labeled data are the most commonly used technique. The basic assumption underlying an inductive model is that the training data are drawn from the same distribution as the test data. However, labeling such a training set is often expensive in practical applications. On the other hand, a large amount of labeled data drawn from a different distribution is often available in the same application domain. It is therefore desirable to take advantage of these data despite the discrepancy between their underlying distribution and that of the test set. This paper compares three text classification algorithms in this scenario: two inductive Maximum Entropy (MaxEnt) models, one flatly initialized and the other initialized with a term frequency/inverse document frequency (Tf*Idf) weighted vector space model, and an example-based learning algorithm that assigns a class label to a text by learning from the labels of the training examples that are similar to it. Experimental results show that example-based learning improves precision by more than 5% at almost all coverage levels.
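To make the example-based idea concrete, the following is a minimal sketch of one way such a classifier could look: it labels a text by a similarity-weighted vote among its most similar training examples, using Tf*Idf vectors and cosine similarity. It is an illustration only, not the algorithm evaluated in the paper. The function names, the choice of `k`, the `+ 1.0` term in the Idf weight, and the voting rule are all assumptions introduced here.

```python
import math
from collections import Counter, defaultdict


def weigh(tokens, idf):
    """Build a sparse Tf*Idf vector; terms unseen in training are dropped."""
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def tfidf_vectors(docs):
    """Return (vectors, idf) for a list of tokenized training documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # The "+ 1.0" keeps terms that occur in every document from
    # getting zero weight; this smoothing is an assumption of the sketch.
    idf = {t: math.log(n / c) + 1.0 for t, c in df.items()}
    return [weigh(doc, idf) for doc in docs], idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, train_vectors, train_labels, idf, k=3):
    """Label `text` by a similarity-weighted vote of its k nearest examples.

    Returns (label, score). The score is the summed similarity of the
    winning label's neighbors and could serve as a confidence value
    for trading coverage against precision.
    """
    query = weigh(text.lower().split(), idf)
    neighbors = sorted(
        ((cosine(query, vec), label)
         for vec, label in zip(train_vectors, train_labels)),
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for similarity, label in neighbors:
        votes[label] += similarity
    best = max(votes, key=votes.get)
    return best, votes[best]
```

A caller would tokenize the labeled training texts, call `tfidf_vectors` once, and then call `classify` per query. Thresholding the returned score is one simple way to reject low-confidence predictions, which lowers coverage in exchange for higher precision.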
Publisher: International Speech Communication Association
© 2007 ISCA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the ISCA and/or the author.