Implementing Dataflow With Threads

MSR-TR-2006-181

In the summer of 2005, I was writing an algorithm in PlusCal [161] and essentially needed barrier synchronization as a primitive. The easiest way to do this in PlusCal was to write a little barrier synchronization algorithm. I used the simplest algorithm I could think of, in which each process maintains a single 3-valued variable: the Barrier1 algorithm of this paper. The algorithm seemed quite nice, and I wondered if it was new. A Web search revealed that it was. (In 2008, Wim Hesselink informed me that he had discovered this algorithm in 2001, but he had "published" it only in course notes.) I was curious about what barrier synchronization algorithm was used inside the Windows operating system and how it compared with mine, so I asked Neill Clift. He and John Rector found that my algorithm outperformed the one inside Windows. Meanwhile, I showed my algorithm to Dahlia Malkhi, who suggested some variants, including the paper's Barrier2 algorithm.
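
The paper presents Barrier1 in PlusCal; the following C11 sketch is a reconstruction of the idea rather than the paper's code (the names cnt, N, and barrier, the memory orderings, and the exact wait condition are illustrative assumptions). Each thread advances its own 3-valued counter and then waits until no counter still holds the previous phase's value:

    #include <stdatomic.h>

    #define N 4                        /* number of threads; 4 is an arbitrary choice */

    static _Atomic int cnt[N];         /* one 3-valued phase counter per thread */

    void barrier(int self)
    {
        /* Advance this thread's phase; phases cycle through 0, 1, 2. */
        int p = (atomic_load_explicit(&cnt[self], memory_order_relaxed) + 1) % 3;
        atomic_store_explicit(&cnt[self], p, memory_order_release);

        /* Wait until no thread still holds the previous phase (p + 2) % 3. */
        int prev = (p + 2) % 3;
        for (int j = 0; j < N; j++)
            while (atomic_load_explicit(&cnt[j], memory_order_acquire) == prev)
                ;                      /* spin: cnt[j] is written only by thread j */
    }

The third value is what makes this work: a fast thread that has already passed the barrier and entered the next one sits at phase (p + 1) % 3, which is distinct from the previous phase (p + 2) % 3, so it cannot be mistaken for a thread that has not yet arrived. With only two values those phases would coincide.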

By around 1980, I knew that the producer/consumer algorithm introduced in [23] should generalize to an arbitrary marked graph, but I never thought it important enough to bother working out the details. (Marked graphs, which specify dataflow computation, are described in the discussion of [141].) I realized that these new barrier synchronization algorithms should also be instances of that generalization. The fact that the barrier algorithms worked well on a real multiprocessor made the general algorithm seem more interesting. Further thought revealed that the good performance of these barrier algorithms was not an accident. They have optimal caching behavior, and that optimal behavior can be achieved in the general case. All this makes the general synchronization algorithm relevant for the coming generation of multicore processor chips.
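
For two processes, the algorithm of [23] already has the shape described here: the producer and consumer each advance a counter that only they write, and each waits by reading only the other's counter. The following single-producer, single-consumer sketch in C11 is one plausible rendering, not the paper's code; the names in, out, and B, and the choice of C atomics, are assumptions for illustration:

    #include <stdatomic.h>

    #define B 8                                 /* buffer capacity; a power of two so
                                                   indexing stays consistent if the
                                                   counters wrap around */

    static int buf[B];
    static _Atomic unsigned in;                 /* written only by the producer */
    static _Atomic unsigned out;                /* written only by the consumer */

    void produce(int item)
    {
        /* Spin while the buffer is full; only `out` can change this condition. */
        while (atomic_load_explicit(&in, memory_order_relaxed) -
               atomic_load_explicit(&out, memory_order_acquire) == B)
            ;
        buf[atomic_load_explicit(&in, memory_order_relaxed) % B] = item;
        atomic_fetch_add_explicit(&in, 1, memory_order_release);
    }

    int consume(void)
    {
        /* Spin while the buffer is empty; only `in` can change this condition. */
        while (atomic_load_explicit(&in, memory_order_acquire) -
               atomic_load_explicit(&out, memory_order_relaxed) == 0)
            ;
        int item = buf[atomic_load_explicit(&out, memory_order_relaxed) % B];
        atomic_fetch_add_explicit(&out, 1, memory_order_release);
        return item;
    }

Because every shared variable here has a single writer, a spinning process generates cache-coherence traffic only when the value it is waiting on actually changes; this is, roughly, the sense in which such algorithms can achieve the optimal caching behavior mentioned above.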