Large Linear Classification When Data Cannot Fit in Memory

Linear classification is a useful tool for dealing with large-scale data in applications such as document classification and natural language processing. Recent developments in linear classification have shown that the training process can be conducted efficiently. However, when the data size exceeds the memory capacity, most training methods suffer from very slow convergence due to severe disk swapping. In the first part of the talk, we describe a block minimization framework for data larger than memory. Under this framework, a solver splits the data into blocks and stores them in separate files; then, at each step, it loads one data block from disk and trains on it. In the second part of the talk, we introduce a selective block minimization (SBM) algorithm, a block minimization method that makes use of selective sampling. At each step, SBM updates the model using data consisting of two parts: (1) a new block loaded from disk and (2) a set of informative samples already in memory from previous steps. We prove that, by updating the linear model in the dual form, the proposed method fully utilizes the data in memory and converges to a globally optimal solution on the entire data set. Experiments show that the SBM algorithm dramatically reduces the number of blocks loaded from disk and consequently obtains an accurate and stable model quickly on both binary and multi-class classification tasks.

Joint work with Hsiang-Fu Yu, Cho-Jui Hsieh, Chih-Jen Lin and Dan Roth.
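To make the dual-form update concrete, here is a minimal Python sketch of the SBM loop for the binary case. It is an illustration under stated assumptions, not the authors' implementation: the inner solver is standard dual coordinate descent for an L1-loss linear SVM (Hsieh et al., 2008), blocks are assumed to be dense `.npz` files, and all names (`sbm_train`, `dual_cd_pass`, `cache_size`) are hypothetical. The real system (LIBLINEAR) works with sparse data.

```python
import numpy as np

def dual_cd_pass(X, y, alpha, w, C, rng):
    """One randomized pass of dual coordinate descent for the L1-loss
    linear SVM (Hsieh et al., 2008).  Updates alpha and w in place,
    maintaining the invariant w = sum_i alpha_i * y_i * x_i."""
    for i in rng.permutation(len(y)):
        Qii = X[i] @ X[i]
        if Qii == 0.0:
            continue
        G = y[i] * (X[i] @ w) - 1.0              # gradient of the dual objective
        new_ai = min(max(alpha[i] - G / Qii, 0.0), C)
        w += (new_ai - alpha[i]) * y[i] * X[i]
        alpha[i] = new_ai

def sbm_train(block_files, n_features, C=1.0, outer_iters=5,
              cache_size=1000, passes=3, seed=0):
    """Sketch of selective block minimization for binary classification.

    Each file in `block_files` is assumed to be an .npz archive holding
    one block: a dense float array 'X' (rows are examples) and labels 'y'
    in {-1, +1}.  This layout is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)
    # Dual variables are scalars, one per example, so they can stay in
    # RAM even when the feature vectors themselves cannot.
    alphas = {f: None for f in block_files}
    # In-memory cache of informative samples kept across block loads.
    cache_X = np.empty((0, n_features))
    cache_y = np.empty(0)
    cache_a = np.empty(0)
    cache_id = []                                # (block_file, row) per cached sample

    for _ in range(outer_iters):
        for f in block_files:
            blk = np.load(f)
            X_new, y_new = blk['X'], blk['y']
            if alphas[f] is None:
                alphas[f] = np.zeros(len(y_new))
            # A cached sample from this very block would otherwise appear
            # twice, so write its (fresher) alpha back and drop it.
            dup = np.array([bid == f for bid, _ in cache_id], dtype=bool)
            for k in np.flatnonzero(dup):
                alphas[f][cache_id[k][1]] = cache_a[k]
            keep = ~dup
            cache_X, cache_y, cache_a = cache_X[keep], cache_y[keep], cache_a[keep]
            cache_id = [cid for cid, m in zip(cache_id, keep) if m]
            # Optimize over the loaded block together with the cached samples.
            X = np.vstack([X_new, cache_X])
            y = np.concatenate([y_new, cache_y])
            a = np.concatenate([alphas[f], cache_a])
            for _ in range(passes):
                dual_cd_pass(X, y, a, w, C, rng)
            n_new = len(y_new)
            alphas[f] = a[:n_new]
            # Retain the most informative samples: those whose dual variable
            # lies strictly inside (0, C), i.e. examples on or near the margin.
            ids = [(f, i) for i in range(n_new)] + cache_id
            sel = np.flatnonzero((a > 0) & (a < C))[:cache_size]
            cache_X, cache_y, cache_a = X[sel], y[sel], a[sel]
            cache_id = [ids[i] for i in sel]
    return w
```

The sketch mirrors the point made in the abstract: because the update is done in the dual form, evicted examples leave their contribution inside w, and the samples worth keeping in memory are those whose dual variable is at neither bound, since those are the ones still actively shaping the model.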

Speaker Details

Kai-Wei Chang is a Ph.D. student at the University of Illinois at Urbana-Champaign under the supervision of Prof. Dan Roth. Prior to this, he obtained his master's and undergraduate degrees from National Taiwan University under Prof. Chih-Jen Lin. His research interests focus on machine learning and its applications to natural language processing. Kai-Wei received the KDD Best Paper Award in 2010 and the Yahoo! Key Scientific Challenges Award in 2011. In addition, he is one of the main contributors to the widely used linear classification package LIBLINEAR.

Speakers:
Kai-Wei Chang
Affiliation:
University of Illinois at Urbana-Champaign

Series: Microsoft Research Talks