Navendu Jain
Researcher
Microsoft Research
Research areas:
Cloud computing, data management and large-scale computer systems: distributed systems, networking, operating systems, applied learning and security.

E-mail: navendu [AT] microsoft.com



Research

My research focuses on designing and building distributed networked systems to improve their scalability, reliability and security. My current work spans several areas of data center systems, from designing network architectures to managing geo-distributed cloud services.

Current Projects

    SysSieve (Internal Link here): Automatically understanding the semantics of human-written free-form text (aka unstructured data) such as software bug reports and build errors in Windows/Windows Phone, knowledge base (KB) articles, Customer Service and Support (CSS) tickets, cloud post-mortems/incident reports, network tickets, and security reports.
    • ConfSeer: Automated detection of software misconfigurations by accurately matching configuration snapshots against Knowledge Base (KB) articles that describe the problems and their solutions in free-form text. [Under submission].
    • NetSieve: Automated problem inference from network trouble tickets to uncover the 'big picture' of network problems and to develop best practices for their fast and accurate resolution. [NSDI 2013].

    NetWiser (featured here): Building scalable, cost-efficient, agile, and reliable network architecture for next-generation data centers.
    • Service impact of intra-dc and inter-dc network failures: A field study on understanding how failures at the intra-dc level (Top-of-Rack switches, Aggregation switches and Access Routers) and at the inter-dc level (long-haul WAN links) impact availability of online services, and deriving best practices to improve service availability. [SoCC 2013, SIGMETRICS 2013 (Extended Abstract)].
    • Middlebox reliability analysis: Characterizing the reliability of middleboxes in datacenters such as load balancers, firewalls, intrusion detection and prevention systems, and VPNs, and analyzing their implications to improve middlebox reliability. [IMC 2013].
    • Network failure characterization: Understanding network failures in data centers by analyzing failure incidents and correlating them with network traffic, estimating impact of failures, and deriving implications for designing future network architectures. [SIGCOMM 2011].
    • VL2: A scalable and flexible data center network architecture for hundreds of thousands of servers, built from commodity switches, that provides high bisection bandwidth between all communicating server pairs, agility in mapping any service to any server, and graceful performance degradation under failures [SIGCOMM 2009, CACM 2011].

Prior Projects

    Marlowe: Automated and adaptive resource management in data centers.
    • Cloud Auto-scaling: Automated scale-out/in of batch workloads on the cloud to minimize the execution cost and the job completion time. [SPAA 2013].
    • URSA: Scalable load balancing and power management for large-scale cluster storage systems that aims to alleviate hot-spots while minimizing reconfiguration costs [Middleware 2011, TOS 2012].
    • WAVE: Topology-Aware VM Migration in Bandwidth Oversubscribed Datacenter Networks [ICALP 2012].
    • ACES: An adaptive power controller that manages the cost, performance, and reliability tradeoffs for energy-aware server provisioning [INFOCOM 2011].
    • Volley: Automated data placement for cloud services across geographically distributed data centers [NSDI 2010].
    • CloudSeer: Integrating Monitoring and Policy Enforcement for Cloud-Hosted Applications.

    Cloud Chakra (C2): Developing new pricing and application management frameworks for cloud services across geo-distributed data centers.
    • Batch job pricing and scheduling: A new pricing model and a truthful-in-expectation mechanism that performs efficient resource allocation for executing batch applications on cloud computing systems [TOPC 2014, ICAC 2014, TOCS 2012, SAGT 2011, SPAA 2012].
    • EOA: Online job migration algorithms for reducing the electricity bill of running cloud services across multiple data centers [Networking 2011].

Technology Transfers to Microsoft Business Groups

We work closely with several business groups in Microsoft, and we've been fortunate that some of our research has been incorporated in the following engineering innovation efforts (which have been publicly disclosed):
  • NetWiser: NetWiser is a first-of-its-kind scalable service that enables automated real-time correlation and analysis of network failures across multiple datacenters. Specifically, the business groups use the NetWiser dashboard to answer three key questions: (1) Is a network problem causing a service outage? Did redundancy work? (2) Can we localize the fault and get details about the problem to perform fast troubleshooting? and (3) How can alarms be correlated to identify high severity outages? NetWiser has been featured here.
  • ConfSeer configuration diagnosis service: The configuration diagnosis service previously relied on human-defined rules to detect configuration errors in software (e.g., Exchange, Lync, SharePoint, SQL Server) deployed on customer machines. This human-driven process was expensive and time-consuming, and it produced only a limited number of expert rules. In collaboration with the Windows System Center and Advisor team, we developed a scalable learning engine running in production (service URL) that automatically analyzes the technical solutions in Knowledge Base (KB) articles and detects misconfigurations with high accuracy and in near real-time.
  • Reliability Analysis Framework: Our reliability analysis framework has been used to analyze the network telemetry in order to (a) take key decisions on network capacity upgrade, (b) build network domains for a major online service to deliver >99.99% availability, (c) compare reliability across device platforms and vendors, and (d) perform root-cause analysis of high-impact network failures; see the technical details here.
  • Flat, commodity-switch based datacenter networks: The VL2 project laid the foundation for the flat, agile, commodity-switch based datacenter networks that have been deployed in Windows Azure and Bing. The key contributions of VL2 were: (a) building an overlay network on top of the physical topology by separating infrastructure addresses from application addresses, (b) applying traffic oblivious routing to improve link utilization while avoiding out-of-order delivery and congestion, (c) designing and building the scalable directory service on top of Paxos that provides address resolution and access control to enable dynamic scale-out/in for applications, and (d) analyzing datacenter network failures and deriving their implications to build a fault-tolerant datacenter network. The paper appeared in SIGCOMM 2009 and it has been recognized by ACM as one of "the most important research results published in CS in recent years" and appeared as an invited paper in the Research Highlights section of the Communications of the ACM (CACM). This work has been featured here: here.

Professional Service

  • Judge:
    2014: ACM Student Research Competition Grand Finals
    2013: ACM Student Research Competition Grand Finals
    2012: ACM Student Research Competition Grand Finals
    2011: SIGCOMM Posters and Demo session
    2009: Open Source Software Award
  • Program Committee:
    2014: SODA 2014 (Invited Reviewer), NSF Panel Reviewer, NSERC Reviewer, WPBA 2014, SPAA 2014 (Invited Reviewer)
    2013: IWQoS 2013, NSERC reviewer
    2012: CCSW 2012, SMTPS 2012
    2011: ICDE 2011, SMTPS 2011, DISC 2011 (External reviewer)
    2010: Eurosys 2010 (Shadow PC), IWSC 2010, DEBS 2010, SMTPS 2010
    2009: DEBS 2009, SMTPS 2009
    (Please submit your best papers!)
  • Journal Reviewer: ACM/IEEE Transactions on Networking, IEEE Transactions on the Cloud, IEEE Transactions on Systems, IEEE Transactions on Dependable and Secure Computing, IEEE Transactions on Knowledge and Data Engineering, Journal of Computer Networks and ISDN Systems, IEEE Transactions on the Web, IEEE Transactions on Parallel and Distributed Systems, KSII Transactions on Internet and Information Systems.

Selected Publications (Full list)

2014
  • Near-Optimal Scheduling Mechanisms for Deadline-Sensitive Jobs in Large Computing Clusters
    Jonathan Yaniv, Seffi Naor, Ishai Menache and Navendu Jain
    ACM Transactions on Parallel Computing (TOPC-D-13-00049R1), 2014.
    [PDF] [Bibtex] [Project Page]

  • NIMBUS: Cloud-scale Attack Detection and Mitigation
    Rui Miao, Minlan Yu and Navendu Jain
    Proceedings of SIGCOMM'14 (Extended Abstract) Chicago, IL.
    [PDF] [Bibtex] [Project Page]

  • On-demand, Spot, or Both: Dynamic Resource Allocation for Executing Batch Jobs in the Cloud
    Ishai Menache, Ohad Shamir and Navendu Jain
    Proceedings of the USENIX 11th International Conference on Autonomic Computing (ICAC '14), Philadelphia, PA.
    [PDF] [Bibtex] [Project Page]
2013
  • When the Network Crumbles: An Empirical Study of Cloud Network Failures and their Impact on Services
    Rahul Potharaju and Navendu Jain
    Proceedings of the ACM Symposium on Cloud Computing (SoCC '13), Santa Clara, CA.
    [PDF] [Bibtex] [Project Page]

  • Demystifying the Dark Side of the Middle: A Field Study of Middlebox Failures in Datacenters
    Rahul Potharaju and Navendu Jain
    Proceedings of the Internet Measurement Conference (IMC '13), Barcelona, Spain.
    Runners-up for the Best Paper Award.
    [PDF] [Bibtex] [Project Page]

  • Cloud Scheduling with Setup Cost
    Yossi Azar, Naama Ben-Aroya, Nikhil Rangarajan and Navendu Jain
    Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA '13), Montreal, QC, July 2013.
    [PDF] [Bibtex] [Project Page]

  • Juggling the Jigsaw: Towards Automated Problem Inference from Network Trouble Tickets
    Rahul Potharaju, Navendu Jain and Cristina Nita-Rotaru
    Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI '13).
    [PDF] [Bibtex] [Project Page]
2012
  • A Truthful Mechanism for Value-Based Scheduling in Cloud Computing
    Navendu Jain, Ishai Menache, Seffi Naor, and Jonathan Yaniv
    Published in the Theory of Computing Systems (TOCS), 2012.
    [PDF] [Bibtex] [Project Page]

  • Ursa: Scalable Load Balancing and Power Management in Cloud Storage Systems
    Gae-Won You, Seung-Won Hwang, and Navendu Jain.
    Published in ACM Transactions on Storage (TOS), July 2012.
    [PDF] [Bibtex] [Project Page]

  • Near-Optimal Scheduling Mechanisms for Deadline-Sensitive Jobs
    Navendu Jain, Ishai Menache, Seffi Naor, and Jonathan Yaniv
    Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA '12), Pittsburgh, PA, May 2012.
    [PDF] [Bibtex] [Project Page]

  • Topology-Aware VM Migration in Bandwidth Oversubscribed Datacenter Networks
    Navendu Jain, Ishai Menache, Seffi Naor, and F. Bruce Shepherd
    Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP '12), Warwick, UK, July 2012.
    [PDF] [Bibtex] [Project Page]
2011
  • Scalable Load Balancing in Cluster Storage Systems
    Gae-Won You, Seung-Won Hwang, and Navendu Jain.
    Proceedings of the ACM/IFIP/USENIX 12th International Middleware Conference (Middleware '11), Lisbon, Portugal, December 2011.
    [PDF] [Bibtex] [Project Page]

  • Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications.
    Phillipa Gill, Navendu Jain, and Nachi Nagappan.
    Proceedings of the ACM Special Interest Group on Data Communications (SIGCOMM '11), Toronto, Canada, August 2011.
    [PDF] [Bibtex] [Project Page]

  • A Scalable and Flexible Data Center Network.
    Albert Greenberg, James Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, Dave Maltz, Praveen Patel, and Sudipta Sengupta.
    Communications of the ACM (CACM '11), Research highlights.
    [PDF] [PS] [Bibtex] [Project Page]
2010
  • Volley: Automated Data Placement for Geo-Distributed Cloud Services.
    Sharad Agarwal, John Dunagan, Navendu Jain, Stefan Saroiu, Alec Wolman, and Harbinder Bhogan.
    7th USENIX Symposium on Networked Systems Design and Implementation (NSDI '10).
    San Jose, CA, April 2010.
    [PDF] [Bibtex] [Project Page]
2009
  • VL2: A Scalable and Flexible Data Center Network.
    Albert Greenberg, James Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, Dave Maltz, Praveen Patel, and Sudipta Sengupta.
    ACM Special Interest Group on Data Communications (SIGCOMM '09), Barcelona, Spain, August 2009.
    [PDF] [PS] [Bibtex] [Project Page]
2008
  • Network Imprecision: A New Consistency Metric for Scalable Monitoring.
    Navendu Jain, Dmitry Kit, Prince Mahajan, Praveen Yalagandula, Mike Dahlin, and Yin Zhang.
    8th USENIX Symposium on Operating Systems Design and Implementation (OSDI '08)
    San Diego, CA, December 2008.
    [PDF] [PS] [Technical Report] [Bibtex]
2007
  • STAR: Self-Tuning Aggregation for Scalable Monitoring.
    Navendu Jain, Dmitry Kit, Prince Mahajan, Praveen Yalagandula, Mike Dahlin, and Yin Zhang.
    33rd International Conference on Very Large Data Bases (VLDB '07)
    Vienna, Austria, September 2007.
    [PDF] [PS] [Bibtex] [Technical Report]
2006
  • Design, Implementation, and Evaluation of the Linear Road Benchmark on the Stream Processing Core.
    Navendu Jain, Lisa Amini, Henrique Andrade, Richard King, Yoonho Park, Philipe Selo, Chitra Venkatramani.
    25th ACM SIGMOD International Conference on Management of Data (SIGMOD '06)
    Chicago, IL, June 2006.
    [PDF] [PS] [Bibtex]
2005
  • TAPER: Tiered Approach for Eliminating Redundancy in Replica Synchronization.
    Navendu Jain, Mike Dahlin, and Renu Tewari.
    4th USENIX Conference on File and Storage Technologies (FAST '05)
    San Francisco, CA, December 2005.
    [PDF] [PS] [Bibtex] [Project Page]
2004
  • Scaling Real-Time Telematics Applications using Programmable Middleboxes.
    Annie Chen, Navendu Jain, Tadeusz Pietraszek, Angelo Perniola, Sean Rooney, and Paolo Scotton.
    IEEE Consumer Communications and Networking 2004 (CCNC '04)
    Las Vegas, Nevada, January 2004.
    [PDF] [PS] [Bibtex]
2003
  • An Architectural Framework to deploy Scatternet-based Applications over Bluetooth.
    Nitin Pabuwal, Navendu Jain, and B. N. Jain.
    IEEE International Conference on Communications (ICC '03)
    Anchorage, Alaska, May 2003.
    [PDF] [PS] [Bibtex]
2002
  • Verification of Timed Automata via Satisfiability Checking.
    P. Niebert, M. Mahfoudh, Eugene Asarin, Marius Bozga, Navendu Jain and Oded Maler.
    7th International Symposium on Formal Techniques in Real-Time and Fault Tolerant Systems (FTRTFT'02)
    Oldenburg, Germany, September 2002.
    [PDF] [PS] [Bibtex]
2001
  • Improving Image Retrieval Performance using Negative Relevance Feedback.
    T.V. Ashwin, Navendu Jain, and Sugata Ghosal
    IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '01)
    Salt Lake City, Utah, May 2001.
    [PDF] [PS] [Bibtex]

Students

I've worked with an outstanding group of students:
  • Srinivas Narayana, Ph.D. student, Princeton University (Summer 2014).
  • Mobin Javed, Ph.D. student, UC Berkeley (Summer 2014).
  • Rahul Potharaju, Ph.D. student, Purdue University (Fall 2011, Summer 2012, Summer 2013); current employment: Microsoft, Redmond, WA.
  • Rui Miao, Ph.D. student, University of Southern California (Summer 2013).
  • Anurag Khandelwal (co-mentored with Seny Kamara): IIT Kharagpur (Summer 2013); currently a Ph.D. student at UC Berkeley.
  • Ahmed Khurshid (co-mentored with Ravi Pandya), Ph.D. student, UIUC (Summer 2011).
  • Phillipa Gill, Ph.D. student, University of Toronto, Canada (Fall 2010); current employment: Assistant Professor, SUNY Stony Brook.
  • Gae-won You, Ph.D. student, Postech University, South Korea (Summer 2010).
  • Jianting Cao, Ph.D. student, Georgia Institute of Technology (Summer 2010); current employment: Pure Storage.
  • Marcel Dischinger, Ph.D. student, MPI, Germany (Summer 2009); current employment: Barracuda Networks.

Bio

I'm a researcher with Microsoft Research, Redmond. I received my Ph.D. in Computer Sciences from the University of Texas at Austin, working with Prof. Mike Dahlin. I received my B.Tech and M.Tech in CSE from IIT Delhi. After IIT, I spent a fun summer visiting IBM Zurich Research Laboratory. My research interests are broadly in cloud computing, data management, machine learning and distributed networked systems. My work has received the Microsoft Trustworthy Computing Award, the Open Source Software Award, second rank in the U360 ML Hackathon, the IBM PhD Fellowship and the Microsoft Graduate Fellowship.


Contact

1 Microsoft Way
Redmond, WA 98052

E-mail: navendu [AT] microsoft.com