Adaptive Work-Stealing With Parallelism Feedback

  • Kunal Agrawal,
  • Charles E. Leiserson,
  • Yuxiong He,
  • Wen-Jing Hsu

ACM Transactions on Computer Systems

Multiprocessor scheduling in a shared multiprogramming environment can be structured as two-level scheduling, where a kernel-level job scheduler allots processors to jobs and a user-level thread scheduler schedules the work of a job on its allotted processors. We present a randomized work-stealing thread scheduler for fork-join multithreaded jobs that provides continual parallelism feedback to the job scheduler in the form of requests for processors. Our A-STEAL algorithm is appropriate for large parallel servers where many jobs share a common multiprocessor resource and where the number of processors available to a particular job may vary during the job's execution. Assuming that the job scheduler never allots a job more processors than requested by the job's thread scheduler, A-STEAL guarantees that the job completes in near-optimal time while utilizing at least a constant fraction of the allotted processors. We model the job scheduler as the thread scheduler's adversary, challenging the thread scheduler to be robust to the operating environment as well as to the job scheduler's administrative policies. For example, the job scheduler might make a large number of processors available exactly when the job has little use for them. To analyze the performance of our adaptive thread scheduler under this stringent adversarial assumption, we introduce a new technique called trim analysis, which allows us to prove that our thread scheduler performs poorly on no more than a small number of time steps, exhibiting near-optimal behavior on the vast majority. More precisely, suppose that a job has work T₁ and span T∞. On a machine with P processors, A-STEAL completes the job in an expected duration of O(T₁/P̃ + T∞ + L lg P) time steps, where L is the length of a scheduling quantum and P̃ denotes the O(T∞ + L lg P)-trimmed availability. This quantity is the average of the processor availability over all time steps except the O(T∞ + L lg P) time steps with the highest processor availability.
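To make the bound concrete, the trimmed availability described in the last sentence can be written out as follows; the symbols p_t, T, and S are notation introduced here for exposition and do not appear in the abstract.

```latex
% Trimmed availability: the average availability p_t over all T time
% steps of the job's execution, ignoring the set S of steps with the
% highest availability.  (p_t, T, and S are expository notation.)
\widetilde{P} \;=\; \frac{1}{T - |S|} \sum_{t \notin S} p_t ,
\qquad |S| \,=\, O(T_\infty + L \lg P).
```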
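The abstract describes the feedback loop only at a high level: each scheduling quantum, the thread scheduler sends a processor request to the job scheduler, receives an allotment, and runs the quantum with randomized work stealing. The Python sketch below shows one plausible shape for that loop, assuming a multiplicative-increase/multiplicative-decrease desire estimator driven by the previous quantum's utilization; the class name, the rho and delta parameters, and the exact update rule are illustrative assumptions, not details given in the abstract.

```python
import math


class FeedbackDesireEstimator:
    """Sketch of a quantum-by-quantum parallelism-feedback loop.

    Assumes a multiplicative-increase / multiplicative-decrease rule
    driven by how efficiently the previous quantum's allotment was
    used.  The threshold `delta` and factor `rho` are hypothetical
    parameters chosen for illustration, not values from the paper.
    """

    def __init__(self, rho=2.0, delta=0.8, initial_desire=1.0):
        self.rho = rho        # multiplicative adjustment factor (assumed)
        self.delta = delta    # utilization threshold (assumed)
        self.desire = initial_desire  # processors desired next quantum

    def request(self):
        """Parallelism feedback sent to the kernel-level job scheduler."""
        return max(1, math.ceil(self.desire))

    def end_of_quantum(self, allotment, useful_steps, quantum_len):
        """Update desire from the finished quantum's utilization.

        allotment    -- processors the job scheduler actually granted
        useful_steps -- processor-steps spent on work rather than idling
        quantum_len  -- length L of the scheduling quantum
        """
        utilization = useful_steps / max(1, allotment * quantum_len)
        if utilization < self.delta:
            # Inefficient: allotted processors were wasted; ask for fewer.
            self.desire = max(1.0, self.desire / self.rho)
        elif allotment >= self.request():
            # Efficient and satisfied: the job could likely use more.
            self.desire *= self.rho
        # Efficient but deprived (allotment < desire): leave desire as is.
```

A loop like this is what lets the thread scheduler stay robust against an adversarial job scheduler: the request shrinks quickly when granted processors go idle, so only a bounded number of quanta can be badly utilized, which is the behavior the trim analysis quantifies.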