Tolerating Latency in Replicated State Machines Through Client Speculation

  • Benjamin Wester ,
  • James Cowling ,
  • Edmund B. Nightingale ,
  • Peter M. Chen ,
  • Jason Flinn ,
  • Barbara Liskov ,
  • Ed Nightingale

The 6th USENIX Symposium on Networked Systems Design and Implementation (NSDI '09) |

Published by USENIX

Replicated state machines are an important and widelystudied
methodology for tolerating a wide range of
faults. Unfortunately, while replicas should be distributed
geographically for maximum fault tolerance,
current replicated state machine protocols tend to magnify
the effects of high network latencies caused by geographic
distribution. In this paper, we examine how to
use speculative execution at the clients of a replicated
service to reduce the impact of network and protocol latency.
We first give design principles for using client
speculation with replicated services, such as generating
early replies and prioritizing throughput over latency. We
then describe a mechanism that allows speculative clients
to make new requests through replica-resolved speculation
and predicated writes. We implement a detailed case
study that applies this approach to a standard Byzantine
fault tolerant protocol (PBFT) for replicated NFS and
counter services. Client speculation trades in 18% maximum
throughput to decrease the effective latency under
light workloads, letting us speed up run time on singleclient
micro-benchmarks 1.08–19× when the client is
co-located with the primary. On a macro-benchmark, reduced
latency gives the client a speedup of up to 5×.