Grouping and Duplicate Elimination: Benefits of Early Aggregation

  • Paul Larson

MSR-TR-97-36 |

Early aggregation is a technique for speeding up the processing of GROUP BY queries by reducing the amount of intermediate data transferred between main memory and disk. It can also be applied to duplicate elimination because duplicate elimination is equivalent to grouping with no aggregation functions. This paper describes six different algorithms for grouping and aggregation, shows how to incorporate early aggregation in each of them, and analyzes the resulting reduction in intermediate data. In addition to the grouping algorithm used, the reduction depends on several factors: the number of groups, the skew in group size distribution, the input size, and the amount of main memory available. All six algorithms benefit from early aggregation with grouping by hash partitioning producing the least amount of intermediate data. If the group size distribution is skewed, the overall reduction can be very signifcant, even with a modest amount of additional main memory.