Xuezheng Liu, Zhenyu Guo, Xi Wang, Feibo Chen, Xiaochen Lian, Jian Tang, Ming Wu, M. Frans Kaashoek, and Zheng Zhang
Testing large-scale distributed systems is challenging because some errors manifest only after a distributed sequence of events that involves machine and network failures. D3S is a checker that allows developers to specify predicates on distributed properties of a deployed system and that checks these predicates while the system is running. When D3S finds a problem, it produces the sequence of state changes that led to the problem, allowing developers to quickly find the root cause. Developers write predicates in a simple, sequential programming style, while D3S checks these predicates in a distributed and parallel manner, which makes checking both scalable to large systems and fault tolerant. By using binary instrumentation, D3S works transparently with legacy systems and can change the predicates being checked at runtime. An evaluation with five deployed systems shows that D3S can detect non-trivial correctness and performance bugs at runtime with low performance overhead (less than 8%).
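
To make the idea of a sequentially written predicate over distributed state concrete, the following is a minimal sketch in Python. It is not the D3S interface: the names `lock_conflicts` and `check_trace`, the event tuple layout, and the lock-service scenario are all assumptions chosen for illustration. The predicate flags any lock that one client holds exclusively while another client also holds it, and the checker returns the state changes observed up to the first violation.

```python
from collections import defaultdict


def lock_conflicts(snapshot):
    """Hypothetical predicate, written sequentially over a global snapshot.

    `snapshot` is an iterable of (client, lock, mode) tuples. Returns a dict
    mapping each conflicting lock to the sorted list of clients holding it.
    """
    holders = defaultdict(list)
    for client, lock, mode in snapshot:
        holders[lock].append((client, mode))

    conflicts = {}
    for lock, held in holders.items():
        clients = {client for client, _ in held}
        # Violation: more than one holder and at least one holds exclusively.
        if len(clients) > 1 and any(mode == "EXCLUSIVE" for _, mode in held):
            conflicts[lock] = sorted(clients)
    return conflicts


def check_trace(events):
    """Replay state-change events in order and evaluate the predicate each time.

    Returns (conflicts, history) at the first violation, where `history` is
    the list of events that led to it, or None if no violation occurs.
    """
    state = set()
    history = []
    for op, client, lock, mode in events:
        history.append((op, client, lock, mode))
        if op == "acquire":
            state.add((client, lock, mode))
        else:  # any other op is treated as a release in this sketch
            state.discard((client, lock, mode))
        conflicts = lock_conflicts(state)
        if conflicts:
            return conflicts, list(history)
    return None


# Illustrative trace (made up for this sketch, not taken from the evaluation).
events = [
    ("acquire", "A", "L1", "EXCLUSIVE"),
    ("acquire", "B", "L2", "SHARED"),
    ("acquire", "C", "L1", "SHARED"),
    ("release", "A", "L1", "EXCLUSIVE"),
]
result = check_trace(events)
```

In this trace the third event triggers the violation, so `result` pairs the conflicting lock with the three events that preceded it. The sketch runs in a single process over an already ordered event list; collecting consistent snapshots from many machines and evaluating the predicate in parallel, as the abstract describes, is outside its scope.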