Erasure Codes for Big Data over Hadoop and Large-scale Sparse PCA for Twitter Analysis

Speaker  Alex Dimakis and Dimitris S. Papailiopoulos

Host  Cheng Huang

Affiliation  University of Southern California

Duration  01:40:36

Date recorded  30 April 2012

1). Erasure Codes for Big Data over Hadoop

As big data grows faster than infrastructure, triple replication becomes prohibitively expensive for distributed storage systems. For this reason most large systems use erasure codes for archival files to provide high reliability with small storage overheads. Reed-Solomon codes are the common choice, a classical error-correcting construction relying on polynomials over finite fields. Unfortunately, classical error-correcting codes are not sufficient for distributed systems since repairing a single failure requires large data disk IO and network transfers. We will present a new family of erasure codes that minimize repair communication and disk IO during single node failures. An implementation over HDFS will be described and several tradeoffs and research directions will be discussed.

2). Large-scale Sparse PCA for Twitter Analysis

Large-scale data analytics are now becoming a trend in big data set applications, such as event detection and sentiment analysis in social networks that host millions of users. Sparse PCA, a new tool in dimensionality reduction, generates sparse collections of data features that identify major trends in a data set. The sparsity of this tool is fundamental to the high interpretability of the results, e.g., in event detection applications, a small collection of words that "describe" major events are easier to interpret than a bag of thousands of words. Unfortunately, sparse PCA is NP-hard and most of its polynomial-time relaxations involve solving subproblems that become intractable in high dimensions.

We will present a new, parallelizable spectral algorithm for sparse-PCA that inspires a novel feature elimination technique, which on real Twitter data sets reduces the problem size from hundreds of thousands to hundreds of features. Our algorithm comes with specific optimality and approximation guarantees and is based on a new way to visualize the span of matrices using auxiliary spherical variables. Future directions on implementing our algorithm in the Hadoop MapReduce framework will be discussed.

©2012 Microsoft Corporation. All rights reserved.
> Erasure Codes for Big Data over Hadoop and Large-scale Sparse PCA for Twitter Analysis