ExPert: Pareto-Efficient Replicated Task EXecution

Many large-scale distributed environments, aka “clouds”, “grids”, or “batch systems”, execute Bags-of-(millions of) Tasks (BOTs). These include Google’s Map-Reduce, Intel’s NetBatch, and resource-demanding extreme e-Science computations.
However, any large-scale distributed system incurs faults, and tasks executed on a non-dedicated system are also preempted by higher-priority activities. These faults and preemptions cause, for example, the long-tail phenomenon of BOT execution, which may lengthen the turnaround time of a BOT computation by hundreds of percent.

To reduce task turnaround time and to cope with the unreliability of the environment, users employ task replication. However, replication wastes resources and raises non-trivial trade-offs between turnaround time and cost. Moreover, users can usually choose among several environments with different reliability and cost characteristics, which forces them to make hard and often suboptimal decisions about where to send (replicas of) their tasks for execution.

To address this problem we introduce ExPert, a general framework with associated algorithms and tools for optimizing the turnaround-cost trade-offs of BOT execution in a mixture of environments. Our framework allows the selection of a Pareto-efficient task replication strategy, subject to a user-specified utility function, thus minimizing the budget (in terms of energy, cost, etc.) that replication policies would otherwise waste. We show through mathematical and trace-based analysis that, in realistic scenarios, a user working with our framework may expect a significant cost reduction (of as much as an order of magnitude) with no performance loss.
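To make the notion of a Pareto-efficient strategy concrete, here is a minimal sketch in Python. It is not ExPert's algorithm: the strategy names, the turnaround and cost figures, and the utility function are all hypothetical values chosen for illustration. It only shows the general idea of discarding dominated strategies and then picking from the remaining frontier with a user-supplied utility function.

```python
from typing import Callable, List, NamedTuple


class Strategy(NamedTuple):
    name: str          # hypothetical label for a replication strategy
    turnaround: float  # BOT turnaround time (illustrative units, e.g. hours)
    cost: float        # total cost (illustrative budget units)


def dominates(a: Strategy, b: Strategy) -> bool:
    """a dominates b if it is no worse on both criteria and better on one."""
    no_worse = a.turnaround <= b.turnaround and a.cost <= b.cost
    better = a.turnaround < b.turnaround or a.cost < b.cost
    return no_worse and better


def pareto_frontier(strategies: List[Strategy]) -> List[Strategy]:
    """Keep only the strategies that no other strategy dominates."""
    return [s for s in strategies
            if not any(dominates(other, s) for other in strategies)]


def best_by_utility(strategies: List[Strategy],
                    utility: Callable[[Strategy], float]) -> Strategy:
    """Pick the frontier strategy with the lowest utility value (lower is better)."""
    return min(pareto_frontier(strategies), key=utility)


# Made-up candidates; real values would come from modeling or traces.
candidates = [
    Strategy("no replication", 10.0, 1.0),
    Strategy("2 replicas, cheap pool", 6.0, 2.0),
    Strategy("2 replicas, reliable pool", 6.0, 5.0),
    Strategy("3 replicas, mixed pools", 4.0, 4.0),
    Strategy("5 replicas, reliable pool", 4.0, 9.0),
]

frontier = pareto_frontier(candidates)
# Assumed utility: one hour of turnaround is worth two budget units of cost.
choice = best_by_utility(candidates, lambda s: s.turnaround + 2 * s.cost)

for s in frontier:
    print(f"{s.name}: turnaround={s.turnaround}, cost={s.cost}")
print("chosen:", choice.name)
```

In this toy data, the two "reliable pool" strategies are dominated: each is matched on turnaround by a cheaper alternative, so paying for them would waste budget without improving performance. The utility function only decides among the strategies that remain.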

Speaker Details

Prof. Assaf Schuster's (http://www.cs.technion.ac.il/~assaf) interests are in the areas of data streams, data mining, and parallel, distributed, and grid computing. Since 1991 he has been with the Computer Science department at the Technion – Israel Institute of Technology. At the Technion he established and heads the Distributed Systems Laboratory (DSL, http://dsl.cs.technion.ac.il/). He has published over 140 papers in his areas of expertise in prestigious conferences and high-quality journals. He regularly participates in program committees for conferences on knowledge discovery in large systems and conferences on parallel and distributed computing. He consults for the hi-tech industry and government agencies on related issues and is the inventor of several patents. He serves as an associate editor of two distinguished journals: Journal of Parallel and Distributed Computing, and IEEE Transactions on Computers. He supervises fifteen master's and doctoral students, and takes part in large national and international projects as an expert on data management, knowledge discovery in databases, and grid and distributed computing. His group participates as a partner (specializing in core HPC, grid, and data mining technologies) in several national and European projects.

Date:
Speakers:
Assaf Schuster
Affiliation:
Technion, Haifa