Brendan Lucier, Ishai Menache, Joseph Naor, and Jonathan Yaniv
We consider mechanisms for online deadline-aware scheduling in large computing clusters. Batch jobs that run on such clusters often require guarantees on their completion time (i.e., deadlines). However, most existing scheduling systems implement fair-share resource allocation between users, an approach that ignores heterogeneity in job requirements and may cause deadlines to be missed. In our framework, jobs arrive dynamically and are characterized by their value and total resource demand (or an estimate thereof), along with their reported deadlines. The scheduler’s objective is to maximize the aggregate value of jobs completed by their deadlines. We circumvent known lower bounds for this problem by assuming that the input has slack, meaning that any job could be delayed and still finish by its deadline. Under the slackness assumption, we design a preemptive scheduler with a constant-factor worst-case performance guarantee. Along the way, we pay close attention to practical aspects, such as runtime efficiency, data locality, and demand uncertainty. We evaluate the algorithm via simulations over real job traces taken from a large production cluster, and show that its actual performance is significantly better than that of other heuristics used in practice.
We then extend our framework to handle provider commitments: the requirement that jobs admitted to service must be executed until completion. We prove that no algorithm can obtain worst-case guarantees when the commitment decision must be made at the job's arrival time. Nevertheless, we design efficient heuristics that commit on job admission, in the spirit of our basic algorithm. We show empirically that these heuristics perform just as well as (or better than) the original algorithm. Finally, we discuss how our scheduling framework can be used to design truthful scheduling mechanisms, motivated by applications to commercial public cloud offerings.
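As an informal illustration of the job model and the slackness assumption described above, the following Python sketch may help. It is not the paper's formal definition. The names `Job`, `has_slack`, and `aggregate_value`, the slack factor `s`, and the per-job rate limit `max_rate` are all hypothetical and introduced only for this example. The sketch assumes that a job's shortest possible running time is its demand divided by the maximum rate at which it can consume resources, and that "slack" means the window between arrival and deadline is at least `s` times that running time.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Job:
    arrival: float   # time at which the job is submitted
    deadline: float  # reported deadline
    demand: float    # total resource demand (e.g., machine-hours)
    value: float     # value gained only if the job completes by its deadline


def has_slack(job: Job, max_rate: float, s: float) -> bool:
    """Check an assumed form of the slackness condition for one job.

    Assumption (not taken from the text): the job can use at most
    `max_rate` resource units per time unit, so it needs at least
    demand / max_rate time units to finish.
    """
    min_runtime = job.demand / max_rate
    return job.deadline - job.arrival >= s * min_runtime


def aggregate_value(jobs: Sequence[Job],
                    completion_times: Sequence[Optional[float]]) -> float:
    """Sum the values of jobs completed by their deadlines.

    A completion time of None means the job never finished.
    """
    total = 0.0
    for job, done in zip(jobs, completion_times):
        if done is not None and done <= job.deadline:
            total += job.value
    return total
```

For example, a job with demand 20 and `max_rate` 4 needs at least 5 time units, so a window of 10 time units satisfies the assumed condition for `s = 2` but not for `s = 3`. A scheduler's objective in this model is then to choose completion times that maximize `aggregate_value`.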