Reducing the Human Overhead in Text Categorization

Many applications in text processing require significant human effort for either labeling large document collections (when learning statistical models) or extrapolating rules from them (when using knowledge engineering). In this work, we describe a way to reduce this effort, while retaining the methods’ accuracy, by constructing a hybrid classifier that utilizes human reasoning over automatically discovered text patterns to complement machine learning. Using a standard sentiment-classification dataset and real customer feedback

data, we demonstrate that the resulting technique results in significant reduction of the human effort required to obtain a given classification accuracy. Moreover, the hybrid text classifier also results in a significant boost in accuracy over machine-learning based

classifiers when a comparable amount of labeled data is used.

PDF file

In  Proceedings of KDD 2006

Publisher  Association for Computing Machinery, Inc.
Copyright © 2007 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or The definitive version of this paper can be found at ACM’s Digital Library --


> Publications > Reducing the Human Overhead in Text Categorization