The limits of single-thread performance and the demands of emerging applications have caused a shift toward increasingly concurrent and parallel software. For example, concurrency and parallelism unlock the performance and energy benefits of multi-core architectures, and many domains, such as servers, mobile devices, and cloud applications, require concurrency. Unfortunately, writing correct, reliable concurrent software is extremely difficult. In this talk, I will discuss my research on using architecture and system support to make programs easier to debug and less prone to failure.
First, I will present Recon, a new technique for concurrency debugging. Using a simple statistical model, Recon isolates and reconstructs the root cause of failures to help programmers understand their errors. With hardware support, Recon works efficiently even in production. In experiments with real, buggy programs (e.g., MySQL, Apache), we showed that Recon reveals bug root causes with few, and often zero, false positives.
Second, I will present Aviso, a new technique for avoiding failures in buggy concurrent programs. Aviso traces events as programs run. When an execution fails, Aviso uses the failing event trace and a statistical model to generate thread schedule constraints that prevent the same failure from occurring in the future. Collections of systems running Aviso can work cooperatively to find and share effective constraints. Our experiments with real software show that Aviso decreases failure rates by up to two orders of magnitude with performance overheads tolerable for production use.
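To make the idea of deriving schedule constraints from a failing event trace more concrete, here is a toy Python sketch. It is an illustration only and not Aviso's actual algorithm or model: the trace format (a list of `(thread_id, event_name)` pairs), the `window` parameter, the function names, and the scoring rule (how much more often a cross-thread event pair appears in failing runs than in passing runs) are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical trace format: a list of (thread_id, event_name) in observed order.

def candidate_constraints(trace, window=3):
    """Collect ordered pairs of events from different threads that occur
    within `window` positions of each other in the trace."""
    pairs = set()
    for i, (t1, e1) in enumerate(trace):
        for t2, e2 in trace[i + 1 : i + 1 + window]:
            if t1 != t2:
                pairs.add((e1, e2))
    return pairs

def rank_constraints(failing_traces, passing_traces, window=3):
    """Score each pair seen in a failing run by (fraction of failing runs
    containing it) minus (fraction of passing runs containing it)."""
    fail, ok = Counter(), Counter()
    for tr in failing_traces:
        fail.update(candidate_constraints(tr, window))
    for tr in passing_traces:
        ok.update(candidate_constraints(tr, window))
    scores = {
        pair: count / len(failing_traces) - ok[pair] / max(len(passing_traces), 1)
        for pair, count in fail.items()
    }
    # Highest score first; ties broken by pair name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy example: thread 1 checks a pointer and then uses it,
# while thread 2 may set the pointer to null in between.
failing = [[(1, "check"), (2, "nullify"), (1, "use")]]
passing = [
    [(1, "check"), (1, "use"), (2, "nullify")],
    [(2, "nullify"), (1, "check")],
]

ranked = rank_constraints(failing, passing)
print(ranked[0])  # (('nullify', 'use'), 1.0)
```

In this toy setting, the top-ranked pair says that "nullify" running shortly before "use" appears only in the failing run. A runtime could try to avoid that ordering, for instance by briefly delaying the thread about to perform the first event. How a real system selects, enforces, and shares such constraints is beyond what this sketch shows.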