Camdoop: Exploiting In-network Aggregation for Big Data Applications

Paolo Costa, Austin Donnelly, Antony Rowstron, and Greg O'Shea

Abstract

Large companies like Facebook, Google, and Microsoft as well as a number of small and medium enterprises daily process massive amounts of data in batch jobs and in real time applications. This generates high network traffic, which is hard to support using traditional, oversubscribed, network infrastructures. To address this issue, several alternative network topologies have been proposed, aiming to increase the bandwidth available in enterprise clusters. We observe that in many of the commonly used workloads, data is aggregated during the process and the output size is a fraction of the input size. This motivated us to explore a different point in the design space. Instead of increasing the bandwidth, we focus on decreasing the traffic by pushing aggregation from the edge into the network. We built Camdoop, a MapReduce-like system running on CamCube, a cluster design that uses a direct-connect network topology with servers directly linked to other servers. Camdoop exploits the property that CamCube servers forward traffic, to perform in-network aggregation of data during the shuffle phase. Camdoop supports the same functions used in MapReduce and is compatible with existing MapReduce applications. We demonstrate that, in common cases, Camdoop significantly reduces the network traffic and provides high performance increase over a version of Camdoop running over a switch and against two production systems, Hadoop and Dryad/DryadLINQ.
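The idea of aggregating on the forwarding path, rather than only at the destination, can be illustrated with a toy model. The Python sketch below is not Camdoop's code or API. The tree shape, the word-count workload, and the names `aggregate_on_tree` and `forward_without_aggregation` are all assumptions made for illustration. It counts how many key/value pairs cross a link when each server merges its children's partial results before forwarding, compared with forwarding every pair unmerged to the root.

```python
from collections import Counter


def aggregate_on_tree(children, local, root):
    """Merge partial results at every hop (hypothetical helper).

    children: dict mapping a server to the list of its child servers
    local:    dict mapping a server to its local key -> count pairs
    Returns (result at root, number of key/value pairs sent over links).
    """
    sent = 0

    def visit(node):
        nonlocal sent
        partial = Counter(local[node])
        for child in children.get(node, []):
            child_partial = visit(child)
            # The child forwards one pair per distinct key it holds.
            sent += len(child_partial)
            # Merging is safe because addition is associative and commutative.
            partial.update(child_partial)
        return partial

    return visit(root), sent


def forward_without_aggregation(children, local, root):
    """Baseline: every pair travels unmerged, hop by hop, to the root."""
    sent = 0
    total = Counter()

    def visit(node, depth):
        nonlocal sent
        # Each local pair crosses `depth` links on its way to the root.
        sent += len(local[node]) * depth
        total.update(local[node])
        for child in children.get(node, []):
            visit(child, depth + 1)

    visit(root, 0)
    return total, sent


# Toy topology (an assumption): A is the root, B and C forward to A, D to B.
children = {"A": ["B", "C"], "B": ["D"]}
local = {
    "A": {"x": 1},
    "B": {"x": 2, "y": 1},
    "C": {"x": 1, "y": 1},
    "D": {"x": 3, "y": 2, "z": 1},
}

agg_result, agg_sent = aggregate_on_tree(children, local, "A")
base_result, base_sent = forward_without_aggregation(children, local, "A")
print(agg_result == base_result, agg_sent, base_sent)
```

Both strategies produce the same final counts, but in this toy input the aggregating version sends 8 pairs over links against 10 for the baseline. The gap grows with tree depth and with the amount of key overlap between servers, which matches the abstract's observation that the output is often a fraction of the input.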

Details

Publication type: Inproceedings
Published in: 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI'12)
Publisher: USENIX