Replicated state machines
Consensus: a set of processes decide on an input value
Paxos asynchronous consensus algorithm
AP Abstract Paxos: generic, non-local version
CP Classic Paxos: stopping failures, compare-and-swap
1989: Lamport, Liskov and
Oki
DP Disk
Paxos: stopping failures,
read-write
1999: Gafni and Lamport
BP Byzantine
Paxos: arbitrary failures
1999: Castro and Liskov
The paper is at research.microsoft.com/lampson
Lamport 1978: Time, clocks and the ordering of events …
Cast your problem as a deterministic state machine
Takes client input requests for state transitions, called steps
Performs the steps
Returns the output to the client.
Make n copies or ‘replicas’ of the state machine.
Use consensus to feed all the replicas the same inputs.
Steps must be deterministic, local to replica, atomic (use transactions)
Recover by replaying the steps (like transactions)
Even a read needs a step, unless the result is “as of step n”.
Reliable, available data storage system
Airplane flight control
Reflexive: Changing quorums of the consensus algorithm
Issuing a lease:
A lock on part of the state that times out, hence is fault tolerant
Leaseholder can work on its state without consensus
Like any lock, a lease can have modes or be hierarchical
A sequence of views; get a decision quorum in one of them.
Each view v chooses an anchored value cv: equals any earlier decision.
If a quorum accepts the choice, decision!
Decision is irrevocable,
may be invisible, but is any later
view’s choice.
Choice
is changeable, must be visible

· Communicate only stable predicates: once true always true
· Structure program as a set of atomic actions
· Make actions as non-deterministic as possible: weakest guards
Allows more freedom for the implementation
Makes it clear what is essential
· Separate safety, liveness, and performance
Safety first, then strengthen guards for liveness and scheduling
· Abstraction functions and simulation proofs
Subscripts and superscripts for function arguments: rva for r(v,
a)
State functions used like variables
Actions described like this:
|
Name |
Guard |
State change |
|
Closev |
cv = nil Ù x Î anchorv |
→cv := x |
A set M of processes (machines)
A faulty process can send arbitrary
messages: F m
A stopped process does nothing: S m
A failed process is faulty or stopped. Failure doesn’t lose state.
Limits on failure:
ZF = set of sets of processes that can all be faulty
ZS = set of sets of processes that can all be stopped
ZFS = set of sets of processes that can all be failed
Examples:
Fail-stop: n processes, ZF={}, ZS=ZFS=any set of size < (n+1)/2
Byzantine: n
processes, ZF = ZS=ZFS=any set of size < (n+1)/3
Intel-Microsoft: nI + nM processes, ZF=any subset of one side
Quorum: monotonic set of sets of processes: q in Þ any superset in.
Predicates g. Predicates on processes G, so Gm is a predicate.
A stable predicate once true remains true.
A predicate G
holds in a quorum Q: Q#G = {m | Gm Ú Fm} Î
Q
Shorthand: Q[rv*=x] for Q#(λ m | rvm = x).
A good quorum is not all faulty: Q~F = {q | q Ï ZF}
Q and Q′ exclusive: Q quorum for G Þ no Q′ quorum for its negation.
Means q Ç
q′ Î Q~F
for any two quorums. Ex: size >
(n + f )/2
Lifts local exclusion G1 Þ ~G2 to global: Q#G1 Þ ~Q′#G2
Q+: ensures Q even after failures: q+ – zFS Î Q for any q+, zFS
A live quorum has Q+ ≠ {}
type X = ... values to decide on
var d : (X È {nil}) := nil Decision
input : set X := {}
|
Name |
Guard |
State change |
|
Input(x) |
|
input := input È {x} |
|
Decision: X |
d ≠ nil |
→ret d |
|
|
|
|
|
Decide |
d = nil Ù x Î input |
→d := x |
A sequence of views; get a decision quorum in one of them.
Each view v chooses an anchored value cv: equals any earlier decision.
If a quorum accepts the choice, decision!
Decision is irrevocable,
may be invisible, but is any later
view’s choice.
Choice
is changeable, must be visible

Non-local Agents State
functions View is
rv d
cv 1: rv1
d 1 Qdec[rv*=x] x x decided
input 2: rv2
d 2
Qout[rv*=out] out nil out
activev 3: rv3
d 3
else nil nil open
to later views

rua=nil Closev xÎanchorv Choosev cv Acceptv rv=cv Finishv da=rv
rua:=out cv:=x rva:=cv da:=rv
for u < v
Each
value is nil or = the
previous one
Client INPUT x xÎinput

|
|
cv rva rvb
rvc |
cv rva rvb
rvc |
|
View 1 View 2 View 3 |
7 7 out out 8 8 out out 9 out out 9 |
8 8 out out 9 9 out 9 9 out out 9 |
|
input Ç anchor4 |
= {7, 8, 9} seeing a,
b, c |
{9} no matter what |
Two runs of AP with
agents a, b, c,
two agents in a quorum,
input = {7, 8, 9}
|
invariant rv = x Ù ru = x′
Þ x = x′ |
all results agree |
|
|
Ü rv = x Þ (" u < v | cu = x Ú Qout[ru*Î{x,out}]) |
assume u<v rua Î {x, out} Þ ~(rua = x′) |
|
|
sfunc anchorv
|
= |
{x | (" u < v | cu = x Ú Qout[ru*Î{x,out}])} |
|
|
= |
{x | (" w | v0 ≤ w <
u Þ cw = x Ú Qout[rw*Î{x,out}])} |
= anchoru = X if outu,v |
|
= |
{x | cu = x} È (anchoru Ç {x | Qout[ru*Î{x,out}]}) if outu,v |
since |
|
Ê |
if outu,v Ù rua = x then {x} elseif outv0,v then X else {} |
|
where outu,v = (" w | u < w < v Þ rw = out)
|
Startv |
u<v too slow |
→activev := true |
|
|
|
activev |
→for all u < v do |
post u<v Þ rua
≠ nil |
anchorv = {x
| cu = x}
È
(anchoru Ç {x | Qout[ru*Î{x,out}]}) if outu,v
|
Anchorv |
anchorv ≠ {} |
→no state
change |
|
|
Choosev |
cva = nil |
→cv := x |
|
|
Acceptva |
rva =
nil |
→rva := cv; Closeva |
|
|
Finishva |
rv ÎX |
→da := rv |
|
