On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases

Surajit Chaudhuri, Usama Fayyad, and Goetz Graefe

Abstract

For a wide variety of classification algorithms, scalability to large databases can be achieved by observing that most algorithms are driven by a set of sufficient statistics that significantly smaller than the data. By relying on a SQL backend to compute the sufficient statistics, we leverage the query processing system of SQL databases and avoid the eed for moving data to the client. We present a new SQL operator (Unpivot) that enables efficient gathering of statistics with minimal changes to the SQL backend. Our approach results in significant increase in performance data. We show analytically how this approach outperforms an alternative that requires changing in the data layout. We also compare effect of data representation and show that a "dense" representation may be preferred to a "sparse" one, even when the data are fairly sparse.

Details

Publication typeInproceedings
URLhttp://www.aaai.org/
Pages5
NumberMSR-TR-98-39
InstitutionMicrosoft Research
PublisherAmerican Association for Artificial Intelligence
> Publications > On the Efficient Gathering of Sufficient Statistics for Classification from Large SQL Databases