Arnd Christian König and Gerhard Weikum
Data distribution statistics are vital for database systems and other data-mining platforms to predict the running time of complex queries for data filtering and extraction. State-of-the-art database systems are inflexible in that they maintain histograms on a fixed set of single attributes, each with a fixed number of buckets regardless of the underlying distribution and the precision requirements for selectivity estimation. Despite many proposals for more advanced types of "data synopses", research seems to have ignored the critical tuning issue of deciding which attribute combinations synopses should be built on and how many buckets (or, analogously, transform coefficients, etc.) each should have, given a fixed amount of memory available for statistics management overall. This paper develops a method for the automatic tuning of variable-size spline-based data synopses for multidimensional attribute-value frequency as well as density distributions, such that an overall error metric is minimized for a given amount of memory. Our method automatically uses more space for distributions that are harder or more important to capture with good precision. Experimental studies with synthetic and real data demonstrate the viability of the developed auto-tuning method.
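The tuning problem described above, dividing a fixed memory budget among several synopses so that a weighted overall error is minimized, can be illustrated with a small sketch. This is not the spline-based method of the paper: it assumes plain equi-width histograms, a sum-of-squared-errors metric, and a simple greedy allocation, and the names `equi_width_error` and `allocate_buckets` are hypothetical.

```python
def equi_width_error(freqs, buckets):
    """Sum of squared errors when each equi-width bucket stores only its mean frequency."""
    n = len(freqs)
    err = 0.0
    for b in range(buckets):
        chunk = freqs[b * n // buckets:(b + 1) * n // buckets]
        if chunk:
            mean = sum(chunk) / len(chunk)
            err += sum((f - mean) ** 2 for f in chunk)
    return err


def allocate_buckets(distributions, weights, total_buckets):
    """Greedy heuristic (an assumption, not the paper's algorithm).

    Every distribution starts with one bucket; each remaining bucket goes to
    the distribution whose weighted error drops the most by receiving it.
    """
    if total_buckets < len(distributions):
        raise ValueError("need at least one bucket per distribution")
    alloc = {name: 1 for name in distributions}

    def gain(name):
        freqs, k = distributions[name], alloc[name]
        if k >= len(freqs):  # one bucket per value is already exact
            return 0.0
        return weights[name] * (equi_width_error(freqs, k) - equi_width_error(freqs, k + 1))

    for _ in range(total_buckets - len(distributions)):
        best = max(alloc, key=gain)  # ties go to the first distribution
        alloc[best] += 1
    return alloc


if __name__ == "__main__":
    dists = {"uniform": [5] * 8, "skewed": [1, 1, 1, 1, 100, 1, 1, 1]}
    print(allocate_buckets(dists, {"uniform": 1.0, "skewed": 1.0}, 5))
```

In this toy setting the uniform distribution is already captured exactly by a single bucket, so the spare buckets flow to the skewed one, mirroring the idea of spending more space on distributions that are harder to capture. Raising a distribution's weight models one that is more important to estimate precisely.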
Published in: 10th International Conference on Management of Data