I am a member of the Cloud and Information Services Lab at Microsoft.
I am broadly interested in building storage and compute infrastructure for datacenter settings. I enjoy building and deploying systems in practice, and releasing the software I build as open source. My work builds on technology trends in datacenter computing.
My recent work has focused on predictable resource management in shared clusters. We have built Rayon, a layer that supports resource reservation and planning for big-data frameworks and is integrated with Apache YARN. Given knowledge of the future workload, Rayon plans the cluster's agenda, and the online scheduler executes that agenda. The combination of Rayon and YARN enables the cluster framework to meet allocation SLOs for jobs. Rayon has been released as open source, and its code ships as part of Apache Hadoop 2.6.
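To make the idea of planning an agenda from known future workload concrete, here is a minimal sketch of reservation admission. It is a simplified earliest-fit planner written for illustration, not Rayon's actual algorithm or API; the `Reservation` and `Plan` classes, their fields, and the example job names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reservation:
    """Hypothetical request: `demand` containers for `duration` consecutive steps."""
    name: str
    demand: int      # containers needed at every step while the job runs
    duration: int    # number of consecutive time steps
    arrival: int     # earliest step at which the job may start
    deadline: int    # step by which the job must have finished (exclusive)


class Plan:
    """Per-time-step record of how much cluster capacity is already promised."""

    def __init__(self, capacity: int, horizon: int) -> None:
        self.capacity = capacity
        self.used: List[int] = [0] * horizon

    def place(self, r: Reservation) -> Optional[int]:
        """Commit the earliest feasible window and return its start step.

        Returns None when no window between arrival and deadline fits,
        which means the request is rejected up front.
        """
        last_start = min(r.deadline, len(self.used)) - r.duration
        for start in range(r.arrival, last_start + 1):
            window = range(start, start + r.duration)
            if all(self.used[t] + r.demand <= self.capacity for t in window):
                for t in window:
                    self.used[t] += r.demand
                return start
        return None


# Usage: a 10-container cluster planned over 8 time steps.
plan = Plan(capacity=10, horizon=8)
a = plan.place(Reservation("etl", demand=6, duration=3, arrival=0, deadline=8))
b = plan.place(Reservation("report", demand=6, duration=2, arrival=0, deadline=5))
c = plan.place(Reservation("adhoc", demand=6, duration=4, arrival=0, deadline=6))
print(a, b, c)      # 0 3 None
print(plan.used)    # [6, 6, 6, 6, 6, 0, 0, 0]
```

In this sketch the third request is rejected at submission time because no four-step window before its deadline has room, which is the sense in which a plan lets the cluster promise only what it can deliver.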
In the past, I designed, implemented, and deployed the Kosmos distributed filesystem (KFS) to manage petabytes of storage. KFS is currently deployed on a cluster of over 1000 nodes. Taking advantage of faster processors and increasing network connectivity in the datacenter, KFS has since been extended to support erasure codes, which are used to archive "cold" data with R+S encoding.
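As a rough illustration of why erasure codes suit cold data, the sketch below uses the simplest possible code: a single XOR parity block over a stripe of equal-sized data blocks. This is an assumption made for brevity and is not the encoding KFS uses; the function names are hypothetical. It tolerates the loss of any one block at a storage cost well below that of full replication.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def encode(data_blocks: List[bytes]) -> List[bytes]:
    """Return the stripe: the data blocks followed by one parity block."""
    return data_blocks + [xor_blocks(data_blocks)]


def recover(stripe: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing block (marked None) from the survivors."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost block")
    repaired = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        repaired[missing[0]] = xor_blocks(survivors)
    return repaired


# Usage: three data blocks plus one parity block, then lose one data block.
data = [b"cold", b"data", b"blok"]
stripe = encode(data)
damaged: List[Optional[bytes]] = list(stripe)
damaged[1] = None
restored = recover(damaged)
overhead = len(stripe) / len(data)   # 4 blocks stored for 3 blocks of data
print(restored[1])                   # b'data'
```

Recovery here requires reading every surviving block of the stripe over the network, which is why faster processors and better datacenter connectivity make this trade-off attractive.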
I have also designed and implemented Sailfish, a compute infrastructure that improves the handling of intermediate data (i.e., the "shuffle" phase of a Map-Reduce computation). Sailfish is based on the observation that the bandwidth within a datacenter will increase substantially in the next few years (viz., 10Gbps between pairs of nodes will be commonplace). We leverage this expected increase to perform network-wide data aggregation, which improves disk subsystem performance during the shuffle step. Our results show that Sailfish can improve job completion times at scale by 20% to 5x.
At CISL, I am working on building Hadoop-related services on Windows Azure. I also collaborate extensively with colleagues in MSR-SVC, MSR-Redmond, and the MSR Extreme Computing Group (XCG).
A full list of my publications is here.
- Carlo Curino, Djellel E. Difallah, Chris Douglas, Subru Krishnan, Raghu Ramakrishnan, and Sriram Rao, Reservation-based Scheduling: If You’re Late Don’t Blame Us!, in SoCC'14, ACM – Association for Computing Machinery, November 2014
- Robert Grandl, Ganesh Ananthanarayanan, Srikanth Kandula, Sriram Rao, and Aditya Akella, Multi-resource Packing for Cluster Schedulers, ACM SIGCOMM, August 2014
- Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, and Aviad Zuck, Tango: Distributed Data Structures over a Shared Log, in SOSP, November 2013
- Carlo Curino, Djellel E. Difallah, Chris Douglas, Subru Krishnan, Raghu Ramakrishnan, and Sriram Rao, Reservation-based Scheduling: If You’re Late Don’t Blame Us!, no. MSR-TR-2013-108, October 2013
- Silvius Rus, Michael Ovsiannikov, Damian Reeves, Paul Sutter, Sriram Rao, Jim Kelly, Chris Zimmerman, Dan Adkins, and Thilee Subramaniam, The Quantcast File System, in 39th International Conference on Very Large Data Bases (VLDB'13), August 2013
- Sriram Rao, Benjamin Reed, and Adam Silberstein, HotROD: Managing Grid Storage With On-Demand Replication, Workshop on Data Management in the Cloud (DMC'13), April 2013
- Ganesh Ananthanarayanan, Christopher Douglas, Raghu Ramakrishnan, Sriram Rao, and Ion Stoica, True Elasticity in Multi-Tenant Clusters through Amoeba, in ACM Symposium on Cloud Computing, October 2012
- Sriram Rao, Raghu Ramakrishnan, Adam Silberstein, Mike Ovsiannikov, and Damian Reeves, Sailfish: A Framework For Large Scale Data Processing, in ACM Symposium on Cloud Computing, October 2012
- Jianjun Chen, Chris Douglas, Michi Mutsuzaki, Patrick Quaid, Raghu Ramakrishnan, Sriram Rao, and Russell Sears, Walnut: a unified cloud object store, in SIGMOD Conference, May 2012