Ying Yan, Jiaxing Zhang, Bojun Huang, Jiaqi Mu, Zheng Zhang, and Thomas Moscibroda
In this paper, we study the computation of statistics over data that is stored across a distributed set of nodes. In this setting, a data vector of size N is distributed into L nodes such that each node holds a partial data in such a way that the global data vector is the sum of the locally stored data vectors. The goal is that each node transmits information to some global aggregator node, which can then compute the desired statistical function. For this setting, there are well-known lower bounds that show that even for simple aggregation functions such as the max-value, the total amount of communication required is at least linear in the size of the data N in the worst-case. In this paper, we show that these lower bounds can be beaten if we assume that the underlying data has sparsity properties. Specifically, we devise a new algorithm for the distributed outlier detection problem that is based on compressive-sensing and exploits the sparse structure of the underlying data. We show both empirically on real web-scale production data as well as theoretically that such use of compressive sensing-based techniques can result in substantial improvements for distributed computing problems. Specifically, we prove under some conjecture that the algorithm can succeed in recovering outliers using only O(sc · logN) constant size messages, where c is a small constant and s is the sparsity of the underlying data.