Arnd Christian König and Gerhard Weikum
This paper aims to improve the accuracy of query result-size estimations in query optimizers by leveraging the dynamic feedback obtained from observations on the executed query workload. To this end, an approximate representation of data-value distributions is devised that combines histograms with parametric curve fitting, leading to a specific class of linear splines. The approach reconciles the benefits of histograms, namely simplicity and versatility, with those of parametric techniques, especially their adaptivity to statistically biased and dynamically evolving query workloads.
The paper presents efficient algorithms for constructing the linear-spline representation for data-value distributions from a moving window of the most recent observations on (the most critical) query executions. The approach is worked out in full detail for capturing frequency as well as density distributions of data values, and it is shown how result-size estimations are inferred for exact-match and range queries. The developed methods generalize to multi-dimensional distributions in a straightforward manner and can therefore capture correlations among attributes as well. In addition, an extension is developed for capturing equi-join result sizes directly in the spline-based approximate representation. Intensive experiments underline the accuracy of the developed estimation methods, which outperform the best known classes of histograms.
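As a rough illustration of the general idea, and not the authors' algorithm, the following Python sketch fits one least-squares line per histogram bucket to observed (value, frequency) pairs and uses the resulting piecewise-linear function to estimate exact-match and range-query result sizes over an integer domain. The class name `SplineSketch`, the helper `fit_line`, the fixed bucket boundaries, and the sample observations are all hypothetical; the moving-window maintenance and the choice of bucket boundaries described in the paper are not modeled here.

```python
from bisect import bisect_right


def fit_line(points):
    """Ordinary least-squares line through (x, y) points; returns (slope, intercept)."""
    n = len(points)
    if n == 0:
        return 0.0, 0.0  # no feedback for this bucket: estimate zero
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return 0.0, mean_y  # a single distinct x value: fall back to a flat line
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


class SplineSketch:
    """Hypothetical piecewise-linear frequency model: one fitted line per bucket."""

    def __init__(self, boundaries, observations):
        # boundaries: sorted bucket edges [b0, ..., bk]; bucket i covers [b_i, b_{i+1}).
        # observations: (value, observed frequency) pairs from executed queries.
        self.boundaries = boundaries
        buckets = [[] for _ in range(len(boundaries) - 1)]
        for x, freq in observations:
            buckets[self._bucket(x)].append((x, freq))
        self.lines = [fit_line(pts) for pts in buckets]

    def _bucket(self, x):
        # Locate the bucket containing x, clamping out-of-range values to the ends.
        i = bisect_right(self.boundaries, x) - 1
        return min(max(i, 0), len(self.boundaries) - 2)

    def exact_match(self, x):
        # Estimated frequency of value x; negative estimates are clipped to zero.
        slope, intercept = self.lines[self._bucket(x)]
        return max(0.0, slope * x + intercept)

    def range_query(self, lo, hi):
        # Estimated result size of lo <= value <= hi over an integer domain.
        return sum(self.exact_match(v) for v in range(lo, hi + 1))


model = SplineSketch(
    boundaries=[0, 10, 20],
    observations=[(0, 10), (2, 14), (4, 18), (10, 5), (15, 5)],
)
print(model.exact_match(3))      # estimate for an exact-match query on value 3
print(model.range_query(8, 11))  # estimate for a range query on [8, 11]
```

Fitting each bucket independently keeps the sketch short; it does not force the line segments to meet at bucket boundaries, and the abstract does not say whether the paper's spline class requires that.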
In 25th International Conference on Very Large Data Bases
Publisher Very Large Data Bases Endowment Inc.