Zhimin Chen and Vivek Narasayya
Data analysts need to understand the quality of data in the warehouse. This is often done by issuing many Group By queries on the sets of columns of interest. Since the volume of data in these warehouses can be large, and tables in a data warehouse often contain many columns, this analysis typically requires executing a large number of Group By queries, which can be expensive. We show that the performance of today’s database systems for such data analysis is inadequate. We also show that the problem is computationally hard, and develop efficient techniques for solving it. We demonstrate significant speedup over existing approaches on today’s commercial database systems.
Publisher: Association for Computing Machinery, Inc.
Copyright © 2007 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or firstname.lastname@example.org. The definitive version of this paper can be found at ACM’s Digital Library: http://www.acm.org/dl/.