Sampling Based Range Partition Methods for Big Data Analytics

  • Milan Vojnovic ,
  • Fei Xu ,
  • Jingren Zhou

MSR-TR-2012-18 |

Big Data Analytics requires partitioning datasets into thousands of partitions according to a specific set of keys so that different machines can process different partitions in parallel. Range partition is one of the ways to partition the data that is needed whenever global ordering is required. It partitions the data according to a pre-defined set of exclusive and continuous ranges that covers the entire domain of the partition key. Providing high-quality (approximately equal-sized) partitions is a key problem for the big data analytics because the job latency is determined by the most loaded node. This problem is especially challenging because typically no statistics about the key distribution over machines for an input dataset is available at the beginning of a range partition. The system needs to find a way to determine the partition boundaries that is both cost-effective and accurate. This paper presents a weighted-sampling based approach, implemented in Cosmos–the cloud infrastructure for big data analytics used by Microsoft Online Service Division. The approach has been used by many jobs daily and was found to be both efficient and providing desired partition quality.