The ABCDs of Paxos

Replicated state machines

Consensus: a set of processes decide on an input value

Paxos asynchronous consensus algorithm

AP   Abstract Paxos:     generic, non-local version


CP    Classic Paxos:       stopping failures, compare-and-swap
                                      1989: Lamport, Liskov and Oki

DP   Disk Paxos:           stopping failures, read-write
                                      1999: Gafni and Lamport

BP    Byzantine Paxos:  arbitrary failures
                                      1999: Castro and Liskov

 

The paper is at research.microsoft.com/lampson

Replicated State Machines

Lamport 1978: Time, clocks and the ordering of events …

Cast your problem as a deterministic state machine

Takes client input requests for state transitions, called steps

Performs the steps

Returns the output to the client.

Make n copies or ‘replicas’ of the state machine.

Use consensus to feed all the replicas the same inputs.

 

Steps must be deterministic, local to replica, atomic (use transactions)

Recover by replaying the steps (like transactions)

Even a read needs a step, unless the result is “as of step n”.

 

Applications of RSM

Reliable, available data storage system

Airplane flight control

Reflexive: Changing quorums of the consensus algorithm

 

Issuing a lease:

A lock on part of the state that times out, hence is fault tolerant

Leaseholder can work on its state without consensus

Like any lock, a lease can have modes or be hierarchical

The Idea of Paxos

A sequence of views; get a decision quorum in one of them.

Each view v chooses an anchored value cv: equals any earlier decision.

If a quorum accepts the choice, decision!

Decision is irrevocable, may  be invisible, but is any later view’s choice.
Choice    is changeable, must be visible

Design Methodology

·     Communicate only stable predicates: once true always true

·     Structure program as a set of atomic actions

·     Make actions as non-deterministic as possible: weakest guards

Allows more freedom for the implementation

Makes it clear what is essential

·     Separate safety, liveness, and performance

Safety first, then strengthen guards for liveness and scheduling

·     Abstraction functions and simulation proofs

Notation

Subscripts and superscripts for function arguments: rva for r(v, a)

State functions used like variables

Actions described like this:

Name

Guard

    State change

Closev

cv = nil Ù x Î anchorv

cv := x

 

Failure Model

A set M of processes (machines)

A faulty process can send arbitrary messages: F m

A stopped process does nothing: S m

A failed process is faulty or stopped. Failure doesn’t lose state.

Limits on failure:

ZF  = set of sets of processes that can all be faulty

ZS   = set of sets of processes that can all be stopped

ZFS = set of sets of processes that can all be failed

Examples:

Fail-stop: n processes,   ZF={}, ZS=ZFS=any set of size < (n+1)/2

Byzantine: n processes, ZF    =   ZS=ZFS=any set of size < (n+1)/3

Intel-Microsoft: nI + nM processes, ZF=any subset of one side

Quorums and Predicates

Quorum: monotonic set of sets of processes: q in Þ any superset in.

Predicates g. Predicates on processes G, so Gm is a predicate.

A stable predicate once true remains true.

 

A predicate G holds in a quorum Q: Q#G = {| Gm Ú Fm} Î Q

Shorthand: Q[rv*=x] for Q#(λ m | rvm = x).

A good quorum is not all faulty: Q~F = {q | q Ï ZF}

Q and Q exclusive: Q quorum for G Þ no Q quorum for its negation.

Means q Ç qÎ Q~F for any two quorums. Ex: size > (n + f )/2

Lifts local exclusion G1 Þ ~G2 to global: Q#G1 Þ ~Q#G2

Q+: ensures Q even after failures: q+ – zFS Î Q for any q+, zFS

A live quorum has Q+ {}

Specification

type X         =                                  ...         values to decide on

var    d          : (X È {nil})  := nil     Decision

          input    : set X := {}

         

Name

Guard

   State change

Input(x)

 

    input := input  È {x}

Decision: X

dnil

ret d

 

 

 

Decide

d = nil Ù  x Î input

d := x

 

The Idea of Paxos

A sequence of views; get a decision quorum in one of them.

Each view v chooses an anchored value cv: equals any earlier decision.

If a quorum accepts the choice, decision!

Decision is irrevocable, may  be invisible, but is any later view’s choice.
Choice    is changeable, must be visible

Abstract Paxos­—AP: State

 

Non-local        Agents                                State functions                        View is

                                                                        rv      d

cv               1:   rv1               
                        d
1                Qdec[rv*=x]           x       x             decided

 

input          2:   rv2               
                        d
2

                                            Qout[rv*=out]        out    nil                 out

activev       3:   rv3               
                        d
3

                                            else                      nil     nil                        open

AP: Data Flow

                                                to later views

 


rua=nil  Closev   xÎanchorv   Choosev   cv  Acceptv   rv=cv            Finishv      da=rv

            rua:=out                       cv:=x           rva:=cv         da:=rv

             for u < v   

                                                   Each value is nil or = the previous one

Client  INPUT x    xÎinput

Example

 

 

cv      rva        rvb      rvc

 cv     rva        rvb      rvc

View 1

View 2

View 3   

7        7       out      out

8        8       out      out

9      out      out      9 

8       8        out     out

9       9        out     9

9       out     out     9

input Ç  anchor4

= {7, 8, 9} seeing a, b, c
Ê{8}          seeing a, b
Ê{9}          seeing a, c or b, c

  {9} no matter what
        quorum we see

 

Two runs of AP with

agents a, b, c,

two agents in a quorum,

input  = {7, 8, 9}

Anchoring

 

invariant  rv = x Ù ru = x Þ x = x

all results agree

=  " x, u | rv = x Ù ru = x Þ x = x    

=  rv = x Þ (" u < v, xx | ~ Qdec[ru*=x])

Ü rv = x Þ (" u < v | cu = x Ú Qout[ru*Î{x,out}])     

assume u<v

rua Î {x, out} Þ ~(rua = x)

 

sfunc anchorv 

=

    {x | (" u < v |                   cu = x Ú Qout[ru*Î{x,out}])}

 

=

    {x | (" w | v0 w < u Þ cw = x Ú Qout[rw*Î{x,out}])}
Ç {x |                                   cu = x Ú Qout[ru*Î{x,out}]}
Ç {x | (" w | u0 < w < v Þ cw = x Ú Qout[rw*Î{x,out}])}

= anchoru

 

= X if outu,v

=

{x | cu = x} È (anchoru Ç         {x  |  Qout[ru*Î{x,out}]})

if outu,v

since
cu Îanchoru

Ê

if outu,v Ù rua = x then {x} elseif outv0,v then X else {}

 

      where outu,v = (" w | u < w < v Þ rw = out)

AP: Algorithm

Startv

u<v too slow

→activev := true

 

Closeva

activev

for all u < v do
      
if  rua = nil
      
then rua := out

post u<v Þ rua ≠ nil

  anchorv = {x | cu = x} È (anchoru Ç {x  |  Qout[ru*Î{x,out}]}) if outu,v

Anchorv

anchorv {}

→no state change

 

Choosev

   cva = nil
Ù x Î input Ç anchorv

→cv := x

 

Acceptva

   rva = nil
Ù cv nil

→rva := cv; Closeva

 

Finishva

rv ÎX

da  := rv

 

 

AP: Liveness