Load Management and Fault-Tolerance in a Distributed Stream Processing System

Recently, a new class of data management applications has emerged in areas such as sensor-based environmental monitoring, financial services, network monitoring, and military applications. These “stream processing applications” require low-latency processing of large-volume data streams. Because traditional database management systems are ill-suited for high-volume, low-latency stream processing, new systems, called stream processing engines (SPEs), have been developed.

In this talk, we present the software architecture and algorithms in Borealis, one of the first distributed stream processing engines. We discuss how our system meets two important challenges: (1) distributed load management, and (2), fault-tolerant operation in the face of node failures, network failures, and network partitions.

We present a mechanism that enables autonomous participants to collaboratively handle load. Our approach is based on contracts that participants negotiate offline. At runtime, participants move load only to partners with whom they have a contract and pay each other the contracted price, making the mechanism lightweight. We show that our approach provides incentives that foster participation and leads to good system-wide load balance properties.

For fault-tolerance, we present a replication-based scheme that masks most node and network failures. When network partitions occur, our approach addresses the traditional availability-consistency trade-off by striving to minimize inconsistencies, while ensuring that the system meets the desired availability specified by the application or user.

Speaker Details

Magdalena Balazinska is an assistant professor in the Computer Science and Engineering Department at the University of Washington. She received a PhD from MIT in February 2006 and was selected as one of five Microsoft New Faculty Fellows in 2007. Magdalena’s research interests are broadly in the fields of databases and distributed systems. She is currently working on Moirae, a system that integrates historical information into continuous monitoring engines, and the RFID Ecosystem, a system for managing RFID data, detecting probabilistic events from that data, and studying building-scale RFID deployments.

Date:
Speakers:
Magdalena Balazinska
Affiliation:
Massachusetts Institute of Technology
    • Portrait of Jeff Running

      Jeff Running