This document was made by OCR
from a scan of the technical report. It has not been edited or proofread and is
not meant for human consumption, but only for search engines. To see the
scanned original, replace OCR.htm with Abstract.htm or Abstract.html in the URL
that got you here.
Theoretical Computer Science 243 (2000) 35–91
Fundamental
Study
Revisiting the PAXOS algorithm *
Roberto De Prisco^{a,}^{*} ,
Butler Lampson^{b},
Nancy Lynch^{a}
^{a}MIT Laboratory for Computer Science, 545 Technology
Square, Cambridge, MA 02139, USA
^{b}Microsoft Corporation, 180 Lake View Ave,
Cambridge, MA 02138, USA
Received October 1998; revised July 1999
Communicated by M. Mavronicolas
Abstract
The PAXOS algorithm
is an efficient and highly faulttolerant algorithm, devised by Lamport, for reaching consensus in a distributed system.
Although it appears to be practical, it seems to be not widely known or understood. This paper contains a new
presentation of the PAXOS algorithm, based on a formal decomposition into several
interacting components. It also contains a
correctness proof and a time performance and faulttolerance analysis. The
formal framework used for the presentation of
the algorithm is provided by the Clock General Timed Automaton (Clock
GTA) model. The Clock GTA provides a systematic way of describing timingbased systems in which there is a notion of “normal”
timing behavior, but that do not necessarily always exhibit this “normal” timing behavior. ~c 2000 Elsevier Science
B.V. All rights reserved.
Keywords: I/O automata models; Formal verification; Distributed
consensus; Partially synchronous systems; Faulttolerance
Contents
1. Introduction...............................................................................................36
1.1. Related
work.........................................................................................38
1.2. Road
map............................................................................................39
2.
Models....................................................................................................39
2.1. I/O automata and the
GTA..........................................................................39
2.2. The Clock
GTA.....................................................................................40
2.3. Composition of
automata............................................................................42
* A preliminary version of this paper appeared in
Proceedings of the 11th International Workshop on Distributed Algorithms,
Saarbr^{}ucken, Germany, September 1997, Lecture Notes in Computer
Science, Vol. 1320, 1997, pp. 111–125. The
first author is on leave from the Dipartimento di Informatica ed Applicazioni, Universitâ
di Salerno, 84081 Baronissi (SA), Italy.
* Corresponding author.
Email addresses: robdep@theory.lcs.mit.edu (R. De Prisco), blampson@microsoft.com (B.
Lampson), lynch@theory.lcs.mit.edu (N.
Lynch).
03043975/00/$  see front matter ~c 2000 Elsevier
Science B.V. All rights reserved. PII:
S03043975(00)000426
3.
The distributed
setting....................................................................................43
3.1.
Processes.............................................................................................44
3.2.
Channels.............................................................................................44
3.3. Distributed systems
..................................................................................45
4. The consensus
problem...................................................................................46
5. Failure detector and leader elector
.......................................................................47 5.1. A failure
detector....................................................................................47
5.2. A leader
elector......................................................................................49
6.
The PAXOS algorithm
.....................................................................................52
6.1.
Overview.............................................................................................52
6.2. Automaton BASICPAXOS
..............................................................................54
6.3. Automaton SPAX .....................................................................................80
6.4. Correctness and analysis of ^{S}PAX ...................................................................
81 6.5.
Messages.............................................................................................86
6.6. Concluding
remarks..................................................................................86
7. The MULTIPAXOS algorithm
...............................................................................87
8. Application to data
replication............................................................................88
9. Conclusion................................................................................................89
Acknowledgement............................................................................................89
References...................................................................................................89
1.
Introduction
Reaching consensus is a fundamental problem in
distributed systems. Given a distributed system in which each
process starts with an initial value, to solve a consensus problem
means to give a distributed algorithm that enables each process to eventually output
a value of the same type as the input values, in such a way that three
conditions, called agreement, validity and
termination, hold. There are different
definitions of the problem depending on what these conditions
require. Distributed consensus has been extensively studied. A good
survey of early results is provided in [13]. We refer the reader to [24] for a
more recent treatment of consensus problems.
Real distributed systems are often partially
synchronous systems subject to process, channel and timing failures
and process recoveries. In a partially synchronous distributed
system, processes take actions within " time and messages are delivered
within d time, for given constants " and d. However, these
time bounds hold when the system exhibits a “normal” timing
behavior. Hence the abovementioned bounds of " and d can be occasionally
violated (timing failures). Processes may stop and recover; it is possible
to keep the state of a process, or part of it, in a stable storage so that the
state, or part of it, survives the failure. Messages can be
lost, duplicated or reordered. Any practical consensus algorithm needs to
consider the above practical setting. Moreover, the basic safety
properties must not be affected by the occurrence of failures. Also, the performance
of the algorithm must be good when there are no failures, while when failures occur, it is
reasonable to not expect efficiency.
Lamport’s PAXOS algorithm
[19] meets these requirements. The model considered is a
partially synchronous distributed system where each process has a direct
communication channel with each other process. The failures
allowed are timing failures, loss,
duplication and reordering of messages, and process stopping failures.
Process recoveries are allowed; some stable storage is needed. PAXOS is guaranteed to work safely, that is, to satisfy
agreement and validity, regardless of process, channel and timing failures
and process recoveries. When the distributed system stabilizes, meaning that there
are no failures, nor process recoveries, and a majority of the processes are
not stopped, for a sufficiently long time, termination is
also achieved and the performance of the algorithm is good.
Hence PAXOS has good faulttolerance
properties and when the system is stable it combines those
faulttolerance properties with the performance of an efficient
algorithm, so that it can be useful in practice. In the original paper [19],
the PAXOS algorithm is described as the
result of discoveries of archaeological studies of an ancient Greek
civilization. That paper contains also a proof of correctness and a
discussion of the performance analysis. The style used for the description of
the algorithm often diverts the reader’s attention. Because
of this, we found the paper hard to understand and we suspect
that others did as well. Indeed the PAXOS algorithm,
even though it appears to be a practical and elegant
algorithm, seems not widely known or understood.
In [19] a variation of PAXOS that considers multiple concurrent runs of PAXOS for reaching consensus on a sequence of values is also presented. We call this variation the MULTIPAXOS algorithm. ^{1}
This paper contains a new, detailed presentation
of the PAXOS algorithm, based on a formal
decomposition into several interacting components. It also contains a
correctness proof and a time performance and faulttolerance
analysis. The MULTIPAXOS algorithm
is also described, together with an application to data
replication. The formal framework used for the presentation is
provided by the Clock General Timed Automaton (Clock GTA),
which has been developed in [5]. The Clock GTA is a special type of Lynch and
Vaandrager’s General Timed Automaton (GTA) model [26–28]. The Clock GTA uses
the timing mechanisms of the GTA to provide a systematic way of describing both
the normal and the abnormal timing behaviors of a partially synchronous
distributed system subject to timing failures. The model is intended
to be used for performance and faulttolerance analysis of practical
distributed systems based upon the stabilization of the system.
The correctness proof uses automata composition and invariant assertion
methods. Automata composition is useful for representing a system
using separate components. We provide a modular presentation of the PAXOS algorithm, obtained by decomposing it
into several components. Each one of these components copes with a specific
aspect of the problem. In particular there is a “failure
detector” module that detects process failures and recoveries.
There is a “leader elector” module that copes with the problem of
electing a leader; processes elected leaders by this module, are used as
leaders
1 PAXOS is the name of the
ancient civilization studied in [19]. The actual algorithm is called the
“singledecree synod” protocol and its variation for
multiple consensus is called the “multidecree parliament” protocol.
We use the name PAXOS for the singledecree protocol
and the name MULTIPAXOS for the multidecree parliament protocol.
in PAXOS. The PAXOS algorithm is then split into a basic part that
ensures agreement and validity and into an additional part that
ensures termination when the system stabilizes; the basic part of the
algorithm, for the sake of clarity of presentation, is further subdivided
into three components. The correctness of each piece is proved by means of
invariants, i.e., properties of system states which are always true in any
execution.
The time performance and faulttolerance analysis
is conditional on the stabilization of the system behavior
starting from some point in an execution. Using the Clock GTA we
prove that when the system stabilizes PAXOS reaches
consensus in O(1) time and uses O(n) messages, where n is the number of
processes. We also briefly discuss the MULTIPAXOS
protocol and a data replication algorithm which uses MULTIPAXOS. With MULTIPAXOS the
high availability of the replicated data is combined with high fault tolerance.
1.1. Related work
The consensus algorithms of Dwork et al. [9] and
of Chandra and Toueg [2] bear some similarities with PAXOS. The algorithm of [9] also
uses “rounds” conducted by a leader, but the strategy used in each round is
different from the one used by PAXOS. Also,
[9] does not consider process restarts. The time analysis provided in [9] is
conditional on a “global stabilization time” after which
process response times and message delivery times satisfy the
time assumptions. This is similar to our stabilized analysis. A
similar time analysis, applied to the problem of reliable group communication,
can be
found in [12].
The algorithm of Chandra and Toueg is based on the
idea of an abstract failure detector [2]. It turns out that failure detectors
provide an abstract and modular way of incorporating partial synchrony assumptions
in the model of computation. A P failure
detector incorporates the partial synchrony considered in this paper. One of the
algorithms in [2] uses a 9' failure
detector, which is weaker than a P failure
detector. This algorithm is based on the rotating coordinator paradigm
and as PAXOS uses majorities to achieve
consistency. However it takes, in the worst case, longer time than PAXOS to achieve termination. Chandra et al. [1]
identified the “weakest” failure detector that can be used to solve the
consensus problem. This weakest failure detector is #" and it is equivalent to 9'. The Chandra and Toueg algorithm does not consider channel
failures (however, it can be modified to work with loss of messages but the
resulting algorithm is less efficient than PAXOS with respect to the number of messages sent).
The failure detector provided in this paper
differs from those classified by Chandra and Toueg in that it provides
reliability conditional on the system stabilization. If the system eventually
stabilizes then our failure detector can be classified in the class of the
eventually perfect failure detectors. However it should be noted that in order
for PAXOS to achieve termination it is
not needed that the system become stable forever but only for a sufficiently long time.
Dolev et al. [8] have adapted the Chandra and
Toueg’s definition of failure detector to consider also omission
failures and have given a distributed consensus protocol that allows majorities to
achieve consensus.
MULTIPAXOS can be easily used to
implement a data replication algorithm. The data replication algorithms
in [23, 30, 17,22] are based on ideas similar to the ones used in
PAXOS.
PAXOS bears
some similarities with the threephase commit protocol [34]. However, the
threephase commit protocol does not always guarantee majorities to progress.
The commit algorithm of Keidar and Dolev [18] is similar to PAXOS in that it always guarantees
majorities to progress. Also, PAXOS is
more efficient than the threephase commit protocol when the
system is stable and consensus has to be reached on a sequence of values
(a threephase protocol is needed only for the first consensus problem, while
all the subsequent ones can be solved with a twophase
exchange of messages).
Cristian’s timed asynchronous model [4] is
similar to the distributed setting considered in this paper. It
assumes, however, a bounded clock drift even when the system is
unstable. Our model is weaker in the sense that makes no assumption on clock
drift when the system is unstable. The Clock GTA provides a
formal way of modelling the stability property of the timed asynchronous
model. In [31] PattShamir introduces a special type of GTA used
for the clock synchronization problem. The Clock GTA considers
only the local time; our goal is to model good timing behavior starting from some
point on and thus we are not concerned with synchronization of the local
clocks.
In [20] Lampson provides a brief overview of the PAXOS algorithm together with key
ideas for proving the correctness of the algorithm. We used these ideas in the correctness proof
provided in this paper.
1.2. Road map
Section 2 describes the I/O automaton models used
and Section 3 describes the distributed system considered. Section 4 gives a
formal definition of the consensus problem. In Section 5 a
failure detector and a leader elector are presented; they are used by the PAXOS algorithm. The PAXOS algorithm itself is
described and analyzed in Section 6. Section
7 describes MULTIPAXOS and
Section 8 discusses how to use MULTIPAXOS to implement a data replication algorithm.
2. Models
Our formal framework is provided by I/O automaton models, specifically
by the Clock GTA model developed in [5]. In
this section we briefly describe essential notions about I/O automata needed to
read the rest of the paper. We refer the interested reader to [24, Chapters 8
and 23] for more information and references about I/O automaton models,
and to [5] for a more detailed presentation of the Clock GTA model.
2.1. I/O automata and the GTA
The I/O automata models are formal models suitable for describing
asynchronous and partially synchronous distributed systems. An I/O
automaton is a simple type of
state machine in which transitions are associated with named actions.
These actions are classified into categories, namely input,
output, internal and, for the timed models, timepassage.
Input and output actions are used for communication with the external environment,
while internal actions are local to the automaton. The timepassage actions are
intended to model the passage of time. The input actions are assumed not to be
under the control of the automaton, that is, they are controlled by the
external environment, which can force the automaton to execute the
input actions. Internal and output actions are controlled by the
automaton. The timepassage actions are also controlled by the
automaton (though this may at first seem somewhat strange, it is just
a formal way of modelling the fact that the automaton must perform some action before some amount of
time elapses).
The General Timed Automaton (GTA) uses timepassage
actions called v(t), t E + to model the passage of time. The timepassage
action v(t) represents the passage of time by the amount t.
A GTA consists of four components: (i) the signature,
consisting of four disjoint sets of actions, namely, the
input, output, internal and timepassage actions; (ii) the set
of states; (iii) the set of initial states, which
is a nonempty subset of the set of states; (iv) the statetransition
relation, which specifies all the possible state to state transitions.
A statetostate transition, usually called a step,
is a triple (s, 1r, s') where s and ^{s}^{'}^{ }are states of the automaton and 1r is an action
that takes the automaton from s to s'. If for a particular state s
and action 1r, there is some transition of the form (s, 1r, s'), then we say that 1r is enabled in s. Input actions are
enabled in every state.
A timed execution fragment of
a GTA is defined to be either a finite sequence
· = s0,1r1,s1,1r2,...,1r,.,s,. or an infinite sequence = s0,1r1,s1,1r2,.
..,1r,.,s,.,..., where the s’s are states, the 1r’s
are actions (either input, output, internal, or timepassage), and
(sk, 1rk+1 ,
sk+1) is a step for every k. Note
that if the sequence is finite, it must end with a state. The length
of a finite execution fragment = s0,
1r1,s1, 1r2,.
.. , 1r,., s,. ^{is },.. A timed execution
fragment beginning with a start state is called a timed
execution. If
· is
any timed execution fragment and _{1r,. }is any action in , then we say
that the time of occurrence of _{1r,. }is
the sum of all the reals in the timepassage actions preceding 1r,. ^{in }. A timed execution
fragment is said to be admissible if the sum of all the reals in the
timepassage actions in is oc. A
state is said to be reachable if it is the final state of a finite
timed execution of the GTA.
In the rest of the paper we will often refer to
timed executions (resp. timed execution fragments) simply as executions (resp. execution
fragments).
2.2. The Clock GTA
A Clock GTA is a GTA with a special component
included in the state; this special variable is called Clock and it can assume values in . The purpose of Clock is to model the local
clock of the process. The only actions that are allowed to modify Clock
are the timepassage actions v(t). When a timepassage action v(t) is
executed
by the automaton, the Clock is incremented by an amount
of time t' 0 independent of the amount t of time specified by the
timepassage action.^{2}
Since the occurrence of the timepassage action v(t) represents the
passage of (real) time by the amount t, by incrementing the
local variable Clock by an amount t' different from t we are able to
model the passage of (local) time by the amount t'. As a special case, we have
some timepassage actions in which t' = t; in these cases the
local clock of the process is running at the speed of real time.
In the following and in the rest of the paper, we
use the notation s .x to denote the value of state component x in state s.
Definition 2.1. A step (sk 1,V(t),sk) of a
Clock GTA is called regular if sk.Clock  sk 1. Clock = t; it is called irregular if it is not regular.
That is, a timepassage step executing action
v(t) is regular if it increases Clock by
t' = t. In a regular timepassage step, the local clock is increased by the
same amount as the real time, whereas in an irregular
timepassage step v(t) that represents the passage of real time by
the amount t, the local clock is increased either by t' <t (the
local clock is slower than the real time) or by t' > t (the local clock is
faster than the
real time).
Definition 2.2. A timed execution fragment of a Clock GTA is
called regular if all the
timepassage steps of are regular. It is called irregular if
it is not regular, i.e., if at least one of its timepassage step is irregular.
In a partially synchronous distributed system
processes are expected to respond and messages are expected to be
delivered within given time bounds. A timing failure is a violation of these
time bounds. An irregular timepassage step can model the occurrence of
a timing failure. We remark that a timing failure can actually be either an
upper bound violation (a process or a channel is slower than expected) or a
lower bound violation (a process or a channel is faster than expected).
Obviously, in a regular execution fragment there are no timing failures.
Though we have defined a regular execution
fragment so that it does not contain any of the timing failures,
we remark that for the the scope of this paper we actually need
only that the former type of timing failures (upper bound) does not happen.
That is, for the scope of this paper, we could have defined a
regular step v(t) as one that increases the clock time by an amount t', t' t.
2.2.1. Using MMTAs to describe Clock GTAs
GTAs encode timing restrictions explicitly into the code of the
automata. This provides a lot of flexibility but requires more
complicated code to explicitly handle the
2 Formally, we have that if (s,
v(t), s') is a step then also (s, v(t^{˜}), s'), for any ^{˜}t>
0, is a step. Hence a Clock GTA cannot keep track of the real time.
time and the time bounds. In many situations however we do not need
such a flexibility and we only need to specify simple time bounds
(e.g., an enabled action is executed within " time). The MMTA^{3} model is
a subclass of the GTA model suitable for describing
such simple time bounds. The MMTA does not have timepassage actions but each action is coupled with its time of
execution so that the execution of an MMTA is a (not necessarily finite) sequence = s0, (in1, t1), s1,
(in2, t2),. .. , (inr,
tr), sr,..., where the s’s are states, the in’s
are actions, and the t’s are times in [I^{¿0}.^{ }To specify
the time bounds an MMTA has a fifth component (with respect to the
four components of a GTA) called task partition, which
is an equivalence relation on the locally controlled actions (i.e., internal
and output action). Each equivalence class is called a task of the automaton. A task C having at least one
enabled action is said enabled. Each task C
has a lower bound, lower(C), and an upper bound upper(C),
on the time that can elapse before an enabled action belonging to the
task C is executed. If the task is not enabled then there is no restriction.
There is a standard technique that transforms any
MMTA into a GTA (see [24, Section 23.1]). This technique
can be extended to transform any MMTA into a Clock GTA
(see [5]). In the rest of the paper we will sometimes use MMTAs to describe Clock GTAs and
when using MMTAs we will always use lower(C) =0 and upper(C) = ". The following
lemma [5] holds.
Lemma 1. Consider a regular execution fragment of a Clock GTA described
with the MMTA model, starting from a reachable state s0 and lasting for more than " time. Then (i) any
task C enabled in s0 either has a step or is disabled within "
time, and (ii) any new enabling of C has a
subsequent step or disabling within " time, provided that lasts for more than
" time from the enabling of C.
2.3. Composition of automata
The composition operation allows an automaton
representing a complex system to be constructed by composing automata
representing simpler system components. The most important
characteristic of the composition of automata is that properties of isolated system
components still hold when those isolated components are composed with other components.
The composition identifies actions with the same name in different component
automata. When any component automaton performs a step involving action in, so
do all component automata that have in in their signatures. Since internal
actions of an automaton A are intended to be unobservable by any
other automaton B, automaton
A cannot be composed with automaton B unless the internal actions of A
are disjoint from the actions of B. (Otherwise, A’s performance of an
internal action could force
B
to take a step.) Moreover, A and B cannot be composed unless the sets of output
actions of A and B are disjoint. (Otherwise two automata would have the
control of an
3 The
name MMT derives from Merritt, Modugno, and Tuttle who introduced this
automaton [29].
output
action.) When A and B can be composed we say that they are compatible. The
transitions of the composition are obtained by allowing all the components that
have a particular action t in their signature to participate, simultaneously,
in steps involving t, while all the other components do nothing.
Note that this implies that all the components participate in timepassage
steps, with the same amount of time passing for all of them.
For a formal definition of the composition
operation we refer the reader to [24, Section 23.2.3]. Here we recall the following
theorems.
Theorem 2. The composition of a compatible collection of GTAs is a GTA.
Given the execution = s0, t1,s1,..., of a GTA A obtained by composing a compatible collection {^{A}i}iEI of GTAs, the notation IAi denotes the sequence
obtained from by deleting each pair
t,., s,. for which _{t,. }is
not action of Ai
and by replacing each remaining _{s,. }^{by }(s,.)i, that is, automaton Ai’s piece of s,..
Theorem 3. Let
{^{A}i}iEI be a compatible collection
of GTAs and let A be
the composition of Ai, for all i EI. If is an execution of A, then IAi is an
execution of Ai, for every iEI.
The above theorem is important because it enables
us to claim that properties proven to be true for a particular
automaton A are still true for a bigger automaton obtained by
composing automaton A with other automata. We will make extensive use of this theorem in the rest of
the paper.
Clock GTAs are GTAs; hence, they can be composed
as GTAs are composed. However we point out that the composition of Clock
GTAs does not yield a Clock GTA but a GTA.
3. The distributed setting
In this section we discuss the distributed
setting. We consider a partially synchronous distributed system consisting of n
processes. The distributed system provides a bidirectional
channel for every two processes. Each process is uniquely identified by its identifier i E I, where I is a totally ordered
finite set of n identifiers. The set I is known by all
the processes. Moreover each process of the system has a local clock. Local clocks can run at different speeds, though
in general we expect them to run at the same speed as real time. We assume that a local clock is available also
for channels; though this may seem
somewhat strange, it is just a formal way to express the fact that a channel is able to deliver a given message
within a fixed amount of time, by relying on some timing mechanism (which we
model with the local clock). We use Clock GTAs to model both processes
and channels.
Throughout the rest of the paper we use two
constants, ‘ and d, to represent upper bounds on the time needed to
execute an enabled action and to deliver a message,
respectively.
These time bounds do not necessarily hold for every action and message in every execution; a
violation of these bounds is a timing failure.
3.1. Processes
We allow process stopping failures and recoveries,
and timing failures. To formally model process stops and
recoveries we model process i with a Clock GTA which has a
special state component called Statusi and two input actions Stop_{i}
and Recoveri. The state
variable Statusi reflects
the current condition of process i. The effect of action Stop_{i}
is to set Statusi to stopped, while the effect of Recoveri is to set Statusi to alive. Moreover
when Statusi = stopped, all the locally controlled
actions are not enabled and the input actions have no effect,
except for action Recoveri.
We say that a process i is alive (resp.
stopped) in a given state s if we have s. Statusi = alive (resp. s.Statusi = stopped). We say that a process i is
alive (resp. stopped) in a given execution fragment, if it is
alive (resp. stopped) in all the states of the execution fragment. An automaton
modelling a process is called a process automaton.
Between a failure and a recovery a process does
not lose its state. We remark that PAXOS needs
only a small amount of stable storage (see Section 6.6); however, for simplicity,
we assume that the entire state of a process is stable. We also assume that there
is an upper bound of ‘ on the elapsed clock time if some locally controlled action is enabled. This
time bound can be violated if timing failures happen.
Finally, we provide the following definition of “stable” execution
fragment of a given process automaton. This
definition is used later to define a stable execution of a distributed
system.
Definition 3.1. Given a process automaton PROCESSi, we say that an execution
fragment
of PROCESSi is stable if process i is either stopped or alive in and is regular.
3.2. Channels
We consider unreliable channels that can lose and
duplicate messages. Reordering of messages is allowed, i.e.,
is not considered a failure. Timing failures are also possible. Fig.
1 shows the code of a Clock GTACHANNELi_{;} j, which models the communication channel
from process i to process j; there is one automaton for each possible choice of i and j. Notice that
we allow the possibility that the sender and the receiver are the same process. We denote by h' the
set of messages that can be sent over the channels.
The timepassage actions of CHANNELi_{;}j do not let pass the time beyond t^{
}+ d if a message (m_{;}t^{}),
that is, a message m sent at time t^{},^{ }is
in the channel. Clearly this restriction is on the local time and messages can
also be lost. However if the execution is regular and no messages
are lost then a particular message is delivered in a timely manner.
The following definition of “stable” execution fragment for a channel captures the condition under
which messages are delivered on time.
CHANNELi_{,} _{j}
Signature:
Input: Send(m)i_{,} j, Losei_{,} j, Duplicatei, _{j}
Output: Receive(m)i_{,} _{j}
Timepassage: v(t)
State:
Clock E ,
initially arbitrary
Msgs, a set of elements of .h' × , initially empty
Actions: input Send(m)i, _{j} Eff: add
(m, Clock) to Msgs output Receive(m)i_{,} _{j} Pre:
(m, t) is in Msgs, for some t Eff: remove (m, t)
from Msgs input Losei_{,} _{j} Eff:
remove one element of Msgs 
input Duplicatei, _{j} Eff: let (m, t) be an element of Msgs let
t' such that t6t'6Clock
place
(m,t') into Msgs timepassage
v(t) Pre: Let t'¿0 be such that for all (m, t'') E Msgs Clock + t' 6t'' + d E: Clock := Clock + _{t}'_{} 
Fig. 1. Automaton CHANNELi_{,}j.
Definition 3.2. Given a channelCHANNELi_{,}
j, we say
that an execution fragment of CHANNELi_{,} _{j }is stable if no Losei_{,} _{j }and Duplicate i, j actions occur in and is
regular.
We remark that the above definition requires also
that no Duplicate i, j actions
happen. This is needed for the performance analysis (duplicated
messages may introduce delays in the PAXOS algorithm).
The next lemma follows
from the above discussion.
Lemma 4. In a
stable execution fragment of CHANNELi_{,}j beginning in a reachable
state s and lasting for more than d time, we have that (i) all messages (m, t) that in state s are
in Msgsi, _{j }are delivered by time d, and (ii)
any message sent in is delivered within time d of
the sending, provided that lasts for more than d time from the sending of the message.
3.3.
Distributed systems
A distributed system is the composition of automata modelling channels
and processes. We are interested in modelling bad and good
behaviors of a distributed system;
in order to do so we provide some definitions that characterize the
behavior of a distributed system. The definition of “nice” execution fragment
given later in this section, captures the good behavior of a distributed
system. Informally, a distributed system behaves nicely if there are
no process failures and recoveries, no channel failures and no
irregular steps – remember that an irregular step models a timing failure – and
a majority
of the processes are alive.
Definition 3.3. A communication system
for the set 5 of processes, is the
composition of
channel automataCHANNELi_{,} jfor all possible choices of i,j E 5.
Definition 3.4. A distributed system
is the composition of process automata modeling the set 5 of processes and a communication system for
5.
We define the communication system _{SCHA }to be the communication
system for the set 5 of
all processes.
Next we provide the definition of “stable”
execution fragment for a distributed system exploiting the definition of stable
execution fragment given previously for channels and process automata.
Definition 3.5. Given a distributed
system S, we say that an execution fragment of S is stable if: (i) for all
automata PROCESSi modelling process i, i E S it holds that IPROCESSi is a stable execution
fragment for process i; (ii) for all channels CHANNELi_{,}
_{j }with i,j E S
it holds that ICHANNELi_{,} jis a stable execution
fragment for CHANNELi_{,} j.
Finally we provide the definition of “nice”
execution fragment that captures the conditions under which PAXOS satisfies termination.
Definition 3.6. Given a distributed
system S, we say that an execution fragment of S is nice if is a stable
execution fragment and a majority of the processes are alive in .
The above definition requires a majority of
processes to be alive. As is explained in Section 6.6, any quorum scheme could be used
instead of majorities.
In the rest of the paper, we will often use the
word “system” to mean “distributed system”.
4. The consensus problem
Several different but related agreement problems
have been considered in the literature. All have in common
that processes start the computation with initial values and
at the end of the computation each process must reach a decision. The
variations mostly concern stronger or weaker requirements that the
solution to the problem has to satisfy. The requirement that a solution to the
problem has to satisfy are captured by three properties, usually
called agreement, validity and termination. It is clear that
the definition of the consensus problem must take into account the distributed
setting in
which the problem is considered.
We assume that for each process i there is an
external agent that provides an initial value v by means of an action
Init(v)i. We denote by V the set of
possible initial values and, given a particular execution , we
denote by V the subset of V consisting of
those values actually used as initial values in , that is, those values
provided by Init(v)i
actions executed in . A process outputs a decision v by executing an action Decide(v)i. If a process i executes action Decide(v)i more than once then the output value v must be the
same.
To solve the consensus problem means to give a
distributed algorithm that, for any execution of the system, satisfies
·
Agreement: All the Decide(v) actions in have the same v.
·
Validity: For any Decide(v) action in , v belongs to V.
and, for any admissible
execution , satisfies
·
Termination: If = fly and y is a nice
execution fragment and for each process i alive in y an Init(v)i action occurs in while process i is alive, then
any process i alive
in y, executes a Decide(v)i action in .
The agreement and termination conditions require, as one can expect,
that processes “agree” on a particular value. The validity
condition is needed to relate the output value to the input values (otherwise a
trivial solution, i.e., always output a default value, exists).
5. Failure detector and leader elector
In this section we provide a failure detector
algorithm and then we use it to implement a leader election
algorithm, which, in turn, is used in Section 6 to implement PAXOS. The failure detector and the leader elector we
implement here are both sloppy, meaning that they are
guaranteed to give accurate information on the system only in a stable execution.
However, this is enough for implementing PAXOS.
5.1. A failure detector
In this section we provide an automaton that
detects process failures and recoveries. This automaton satisfies certain
properties that we will need in the rest of the paper. We do not provide a
formal definition of the failure detection problem, however, roughly speaking,
the failure detection problem is the problem of checking which processes are alive and which ones are
stopped.
Fig. 2 shows a Clock GTA, called DETECTOR(z, c)i,
which detects failures. In our setting failures and
recoveries are modeled by means of actions Stop_{i} and Recoveri. These two actions are input
actions of DETECTOR(z, c)i.
Moreover DETECTOR(z, c)i
has InformStopped(j)i
and InformAlive(j)i as
output actions which are executed when, respectively, the stopping and the recovering of
process j are detected.
DETECTOR(z, c)s
Signature:
Input: Receive(m)j_{,} s, Stops,
Recovers
Output: InformStopped(j)s, InformAlive(j)s, Send(m)s_{,} _{j}
Internal: Check(j)s
Timepassage: v(t)
State:
Clock E init.
arbitrary StatusE{alive,stopped} init. alive Alive
E _{2}I_{ }init.
I 
for alljEI: Prevrec(j) E ¿0 init.
arbitrary Lastsend(j) E ¿0 init.
Clock 
Actions:
input Stops
Eff: Status := stopped
output Send(“Alive”)s, _{j}
Pre: Status = alive
Eff: Lastsend(j) := Clock + z
input Receive(“Alive”)j, _{s}
Eff: if Status = alive then
Prevrec(j) := Clock if j E Alive then
Alive := Alive U {j} Lastcheck(j) := Clock
+ c
internal Check(j)s
Pre: Status = alive
jEAlive
Eff: Lastcheck(j) := Clock + c
if Clock¿Prevrec(j) +z +
d then Alive := Alive\{j}
input Recovers
Eff:
Status:=alive
output InformStopped(j)s
Pre: Status= alive
j E Alive
Eff: Lastinform(j) := Clock
+ "
output InformAlive(j)s
Pre: Status= alive
jEAlive
Eff: Lastinform(j) := Clock
+ "
timepassage v(t)
Pre: none
Eff: if Status = alive
then
Let t' be such that
Vj, Clock + t'
<Lastinform(j) Vj, Clock + t'
<Lastsend(j) Vj, Clock + t' <Lastcheck(j)
Clock := Clock + _{t}'_{}
Fig. 2. Automaton DETECTOR for
process s.
Automaton DETECTOR(z, c)s
works by having each process constantly sending “Alive” messages
to each other process and checking that such messages are received from other
processes. It sends at least one “Alive” message in an interval of time of a
fixed length z (i.e., if an “Alive” message is sent at time t
then the next one is sent before
time t + z) and checks for incoming messages at least once in an
interval of time of a fixed length c. Let us denote by _{SDET }the system consisting of
system _{SCHA }and an automaton DETECTOR(z, c)s for each process s E 5.
For simplicity of notation, henceforth we assume
that z = t' and c = t', that is, we use DETECTOR(t', t')s. In practice the choice
of z and c may be different.
Using the strategy used by DETECTOR(t', t')sit is not hard to prove
the following lemmas (for a detailed formal proof we refer the interested
reader to [5]).
Lemma 5. If an execution fragment of _{SDET,
}starting in a reachable state and lasting for more than 3t'
+ 2d time, is stable and process s is
stopped in , then by time 3t' + 2d from
the beginning of , for each process j alive
in , an action InformStopped(s) _{j }is executed and no subsequent InformAlive(s) _{j }action is executed in .
Lemma 6. If an execution fragment of _{SDET,
}starting in a reachable state and lasting
for more than d + 2t' time, is stable and process s is
alive in , then by time d+2t' from
the beginning of ,for
each process j
alive in , an
action InformAlive(s)j is
executed and no subsequent InformStopped(s) _{j }action is executed in .
The strategy used by DETECTORs is a straightforward one. For this reason
it is very easy to implement. However the
failure detector so obtained is not reliable, i.e., it does not give accurate information, in the
presence of failures (Stops,
Loses_{,} j, irregular executions). For example, it may
consider a process stopped just because the “Alive” message
of that process was lost in the channel. Automaton DETECTORs is guaranteed to
provide accurate information on faulty and alive processes only when the system
is stable.
5.2. A leader elector
Electing a leader in an asynchronous distributed
system is a difficult task. An informal argument that explains this
difficulty is that the leader election problem is somewhat similar
to the consensus problem (which, in an asynchronous system subject to failures is
unsolvable [14]) in the sense that to elect a leader all processes must reach
consensus on which one is the leader. It is fairly clear how a
failure detector can be used to elect a leader. Indeed the failure
detector gives information on which processes are alive and
which ones are not alive. This information can be used to elect the current
leader. We use the DETECTOR(t',
t')s automaton to check for the
set of alive processes. Fig. 3 shows automaton LEADERELECTORs which is an MMTA. Remember
that we use MMTAs to describe in a simpler way Clock GTAs. Automaton
LEADERELECTORs interacts with DETECTOR(t',t')s by means of actions
InformStopped(j)s, which inform process s that process j has stopped, and InformAlive(j)s, which inform process s that process j has
recovered. Each process updates its view of the set of alive processes when
these two actions are executed. The process with the biggest identifier in the
set of alive processes
LEADERELECTORi
Signature:
Input: InformStopped(j)i,
InformAlive(j)i, Stop_{i}, Recoveri
Output: Leaderi,
NotLeaderi
State:
Status E {alive_{;}
stopped} initially alive
Pool E _{2}5_{ }initially
{i}
Derived variable: Leader, defined as max of Pool Actions: input Stop_{i} Eff: Status := stopped output Leaderi Pre: Status= alive Eff: none input InformStopped(j)i Eff: if Status = alive then Pool
:= Pool\{j} Tasks and bounds: {Leaderi, NotLeaderi}, bounds [0_{;}
t'] 
input Recoveri Eff: Status := alive output NotLeaderi Pre: Status= alive i = Leader
Eff: none input InformAlive(j)i Eff: if Status = alive
Pool :=PoolU{j} 
Fig. 3. Automaton LEADERELECTOR for process i.
is declared
leader. We denote with _{SLEA }the system consisting of _{SDET }composed with a
LEADERELECTORi automaton for each
process i E 5. Fig.
4 shows _{SLEA; }it also shows SDET, which is a subsystem of
SLEA.
Since DETECTOR(t'_{;} t')i is
not a reliable failure detector, also LEADERELECTORi
is not reliable. Thus, it is possible that processes have
different views of the system so that more than one process considers itself
leader, or the process supposed to be the leader is actually stopped.
However, as the failure detector becomes reliable when the system SDET executes a stable execution fragment (see Lemmas 5
and 6), also the leader elector becomes reliable when system _{SLEA }is stable. Notice that
when _{SLEA
}executes
a stable execution fragment, so does SDET.
Formally, we say that a state s of system _{SLEA, }is a uniqueleader state if there exists an alive process i such that for all alive
processes j it holds that s:Leader _{j }=
i.
Fig. 4. The system SLEA.
In such a case, process i is the leader of
state s. Moreover, we say that an execution of system _{SLEA, }is
a uniqueleader execution if all the states of are
uniqueleader states
with the same leader in all the states.
Next lemma states that in a stable execution fragment, eventually there
is uniqueleader
state.
Lemma 7. If an execution fragment of _{SLEA, }starting
in a reachable state and lasting for more than 4t'
+ 2d, is stable, then by time 4t' + 2d, there
is a state occurrence s such
that in state s and in all the states after s there
is a unique leader. Moreover this unique leader is always the process
with the biggest identier among the processes alive in .
Proof. First notice that the system _{SLEA }consists of
system _{SDET }composed with other automata. Hence by
Theorem 3 we can use any property of _{SDET. }In particular we can
use Lemmas 5 and 6 and thus we have that by time 3t' + 2d each process has a consistent
view of the set of alive and stopped processes. Let i be the leader. Since
is
stable and thus also regular, by Lemma 1, within additional t' time, actions
Leaderj and
NotLeader _{j }are consistently executed for each process j, including
process j = i. The fact that i is the process with the biggest identifier among
the processes alive in
follows directly from
the code of LEADERELECTORi.
We remark that, for many algorithms that rely on
the concept of leader, it is important to provide exactly
one leader. For example when the leader election is used to
generate a new token in a token ring network, it is important that there is
exactly one process (the leader) that generates the new token,
because the network gives the right to send messages to the
owner of the token and two tokens may result in an interference between two
communications. For these algorithms, having two or more leaders
jeopardizes the correctness. Hence the sloppy leader elector provided before is
not suitable. However, for the purpose of this paper, LEADERELECTOR is all we need.
6. The PAXOS algorithm
PAXOS was
devised a very long time ago^{4} but its discovery, due to Lamport, is
very recent
[19].
In this section we describe the PAXOS algorithm, provide an implementation using Clock
GT automata, prove its correctness and analyze its performance. The performance
analysis is given assuming that there are no failures nor recoveries,
and a majority of the processes are alive for a sufficiently long
time. We remark that when no restrictions are imposed on the possible failures, the
algorithm might not terminate.
6.1. Overview
Our description of PAXOS
is modular: we have separated various parts of the overall algorithm;
each piece copes with a particular aspect of the problem. This approach should
make the understanding of the algorithm much easier. The core part of the
algorithm is a module that we call BASICPAXOS;
this piece incorporates the basic ideas on which the
algorithm itself is built. The description of this piece is further subdivided into three components,
namely BPLEADER, BPAGENT and BPSUCCESS.
In BASICPAXOS
processes try to reach a decision by running what we call a “round”. A
process starting a round is the leader of that round. BASICPAXOS guarantees that, no matter
how many leaders start rounds, agreement and validity are not violated. This means
that in any run of the algorithm no two different decisions are ever made and
any decision is equal to some input value. However to have a complete algorithm
that satisfies termination when there are no failures for a sufficiently
long time, we need to augment BASICPAXOS
with another module; we call this module STARTER.
The functionality of STARTER is
to make the current leader start a new round if the previous one is not completed
within some time bound.
Leaders
are elected by using the LEADERELECTOR algorithm
provided in Section 5. We remark that this is possible because the
presence of two or more leaders does not jeopardize agreement or
validity; however, to get termination there must be a unique leader.
^{4}The most accurate information dates it back to the beginning of this
millennium [19].
Fig. 5. PAXOS: process i. Some of the actions shown in the figure will be defined later
in this section.
Thus, our implementation of PAXOS is obtained by composing the following automata: CHANNELi_{;} _{j }for the communication
between processes, DETECTORi and
LEADERELECTORi for the
leader election, BASICPAXOSi and
STARTERi, for every process i_{;}j
5. The resulting system is called SPAX.
Fig. 5 shows the automaton at process i. Notice
that not all of the actions are drawn in the picture: we have drawn
only some of them and we refer to the formal code for all
of the actions. Actions Stop_{i} and Recoveri are input actions of all the automata. The
_{SPAX }automaton at process i interacts with automata at other
processes by sending messages over the channels. Channels are not drawn in the picture.
Fig. 6 shows the messages exchanged by processes i
and j. The automata that send and receive these messages are shown in the
picture. We remark that channels and actions interacting with
channels are not drawn, as well as other actions for the interaction with other
automata.
Fig. 6. BASICPAXOS: Messages.
It is worth to remark that some pieces of the algorithm do need to be
able to measure the passage of the time (DETECTORi, STARTERi and BPSUCCESSi) while others do not.
We will prove (Theorems 9 and 10) that the system _{SPAX
}solves the consensus problem ensuring partial correctness – any output
is guaranteed to be correct, that is, agreement and validity are
satisfied – and (Theorem 17) that _{SPAX }guarantees also termination
when the system executes a nice execution fragment, that is, without failures and recoveries and with
at least a majority of the processes remaining alive.
6.1.1. Roadmap
for the rest of the section
In Section 6.2 we provide automaton BASICPAXOS.
This automaton is responsible for carrying out a round in response to an
external request. We prove that any round satisfies agreement and validity and
we provide a performance analysis for a successful round.
Then in Section 6.3 we provide automaton STARTER which
takes care of the problem of starting new rounds. In Section 6.4 we
prove that the entire system _{SPAX }^{is }correct
and provide a performance analysis. In Section 6.5 we provide some comments about
the number of messages used by the algorithm. Finally Section 6.6 contains some
concluding
remarks.
6.2. Automaton
BASICPAXOS
In this section we present the automaton BASICPAXOS which is the core part of the
PAXOS algorithm. We begin by
providing an overview of how automaton BASICPAXOS
works, then we provide the automaton code along with a detailed
description and finally we prove that it satisfies agreement and validity.
6.2.1.
Overview
The basic idea is to
have processes propose values until one of them is accepted by a majority of the processes; that value is the
final output value. Any process may
propose a value by initiating a round for
that value. The process initiating a round is said to be the leader
of that round while all processes, including the leader itself, are said
to be agents for that round. Informally,
the steps for a round are the following.
(1)
To initiate a round, the leader sends a “Collect”
message to all agents^{5} announcing that it wants to start
a new round and at the same time asking for information about previous rounds in
which agents may have been involved.
(2) An
agent that receives a message sent in Step 1 from the leader of the round, responds
with a “Last” message giving its own information about rounds previously conducted.
With this, the agent makes a kind of commitment for this particular round
that may prevent it from accepting (in Step 4) the value proposed in some other
round. If the agent is already committed for a round with a bigger round number
then it informs the leader of its commitment with an “OldRound” message.
(3) Once
the leader has gathered information about previous rounds from a majority of
agents, it decides, according to some rules, the value to propose for its round
and sends to all agents a “Begin” message announcing the value and asking them to
accept it. In order for the leader to be able to choose a value for the round
it is necessary that initial values be provided. If no initial
value is provided, the leader must wait for an initial
value before proceeding with Step 3. The set of processes from
which the leader gathers information is called the infoquorum
of the round.
(4) An
agent that receives a message from the leader of the round sent in Step 3, responds
with an “Accept” message by accepting the value proposed in the current round,
unless it is committed for a later round and thus must reject the value proposed
in the current round. In the latter case the agent sends an “OldRound” message to the leader
indicating the round for which it is committed.
(5) If the leader gets “Accept” messages from a
majority of agents, then the leader sets its own output value to the value
proposed in the round. At this point the round is successful. The set
of agents that accept the value proposed by the leader is called the acceptingquorum.
Since a successful round implies that the leader of the round reached a
decision, after a successful round the leader still needs to do
something, namely to broadcast the reached decision. Thus, once the leader has
made a decision it broadcasts a “Success” message announcing
the value for which it has decided. An agent that receives a “Success” message
from the leader makes its decision choosing the value of the successful round. We
use also an “Ack” message sent from the agent to the leader, so that the leader
can make
sure that everyone knows the outcome.
Fig.
7 shows: (a) the steps of a successful round r; (b) the responses from an agent
that informs the leader that an higher numbered round r' has been
already initiated; (c) the broadcast of a decision. The parameters
used in the messages will be explained later. Section 6.2.2 contains a description of the
messages.
5 Thus it sends a message also
to itself. This helps in that we do not have to specify different behaviors for a process according to the fact that it is both
leader and agent or just an agent. We just need to specify the
leader behavior and the agent behavior.
Fig. 7. Exchange of messages.
Since different rounds may be carried out
concurrently (several processes may concurrently initiate rounds),
we need to distinguish them. Every round has a unique identifier.
Next we formally define these round identifiers. A round number is a pair (x, i) where x is a nonnegative
integer and i is a process identifier. The set of round numbers
is denoted by . A total order on elements of is defined by (x, i) <(y,j) iffx<y
or, x=y and i<j.
We
say that round r precedes round
r' if r<r'.
If round r precedes round r' then we also say that r is a previous round, with respect to
round r'. We remark that the ordering of rounds is not related to the actual
time the rounds are conducted. It is possible that a roundr'
is started at some point in time and a previous round r, that is, one with r
<r', is started later on.
For each process i, we define a “+i”
operation that given a round number (x,j) and an integer y, returns the round number
(x,j) +i y
= (x + y, i).
Every round in the algorithm is tagged with a unique round number.
Every message sent by the leader or by an agent for a round (with
round number) r E ,
carries the
round number r so that no confusion among messages belonging to
different rounds is possible.
However the most important issue is about the
values that leaders propose for their rounds. Indeed, since the value of a
successful round is the output value of some processes, we must
guarantee that the values of successful rounds are all equal in order to
satisfy the agreement condition of the consensus problem. This is the tricky
part of the algorithm and basically all the difficulties
derive from solving this problem. Consistency is guaranteed by
choosing the values of new rounds exploiting the information
about previous rounds from at least a majority of the agents so that, for any two rounds, there is at
least one process that participated in both rounds.
In more detail, the leader of a round chooses the
value for the round in the following way. In Step 1, the leader
asks for information and in Step 2 an agent responds with the
number of the latest round in which it accepted the value and with the accepted
value or with round number (0,j) and nil if the agent has not yet accepted a value. Once
the leader gets such information from a majority of the agents (which is the infoquorum
of the round), it chooses the value for its round to be equal to the value of
the latest round among all those it has heard from the agents in the
infoquorum or equal to its initial value if all agents in the
infoquorum were not involved in any previous round. Moreover, in
order to keep consistency, if an agent tells the leader of
a round r that the last round in which it accepted a value is round T1, r^{1} <r, then implicitly
the agent commits itself not to accept any value proposed in any other round
T^{11},^{
}T^{1}<T^{11}<T.
Given the above setting, if T^{1}
is the round from which the leader of round T gets the
value for its round, then, when a value for round T has been chosen, any round T11, T^{1} <T ^{11}<T,
cannot be successful; indeed at least a majority of the processes are committed
for round T, which implies that at least a majority of the processes are rejecting
round T^{11}.^{ }This, along with the fact that
infoquorums and acceptingquorums are majorities, implies that
if a round T is successful, then any round with a bigger round number ^{˜}1>T
is for the same value. Indeed the information sent by processes in
the infoquorum of round i ˜ is used to choose the value for the round, but
since infoquorums and acceptingquorums share at least one
process, at least one of the processes in the infoquorum of round T^{1}
is also in the acceptingquorum of round T. Indeed, since the
round is successful, the acceptingquorum is a majority. This implies that
the value of any round ^{˜}i> T must be equal to the value of round
T, which, in turn,
implies agreement.
We remark that instead of majorities for
infoquorums and acceptingquorums, any quorum system can be used.
Indeed the only property that is required is that there be a process in the
intersection of any infoquorum with any acceptingquorum.
Example. Fig. 8 shows how the value of
a round is chosen. In this example we have a network of 5
processes, A, B, C, D, E (where the ordering is the alphabetical one) and
vA, vB denote the initial values of A and B. At some
point process B is the leader and starts round (1,B). It
receives information from A,B,E (the set {A,B,E} is the
Fig. 8. Choosing the values of rounds. Empty boxes denote
that the process is in the infoquorum, and black boxes
denote acceptance. Dotted lines indicate commitments.
infoquorum of this round). Since none of them has been involved in a
previous round, process B is free to choose its initial value VB as the value of the round. However it receives
acceptance only from B, C (the set {B, C} is the acceptingquorum for this round).
Later, process A becomes the leader and starts round (2,A). The infoquorum for
this round is {A, D, E}. Since none of this processes has accepted a value in a
previous round, A is free to choose its initial value for its round. For
round (2,D) the infoquorum is {C,D,E}. This time in the quorum
there is process C that has accepted a value in round (1, B) so
the value of this round must be the same of that of round (1,B).
For round (3,A) the infoquorum is {A,B,E} and since A has accepted the value
of round (2,A) then the value of round (2,A) is chosen for round (3,A). For
round (3,B) the infoquorum is {A, C,D}. In this case there are
three processes that accepted values in previous rounds: process A that has
accepted the value of round (2,A) and processes C,D, that have
accepted the value of round (2,D). Since round (2,D) is the higher
round number, the value for round (3,B) is taken from round (2,D). Round (3,B) is successful.
To end up with a decision value, rounds must be started
until at least one is successful. The basic consensus module BASICPAXOS guarantees that a new round
does not
violate agreement or validity, that is, the value of a new round is
chosen in such a way that if the round is successful, it does not violate
agreement and validity. However, it is necessary to make BASICPAXOS start rounds until one is
successful. We deal with this problem in Section 6.3.
6.2.2. The code
In order to describe automaton BASICPAXOS
for process we provide three automata. One is called BPLEADER and models the “leader”
behavior of the process; another one is called BPAGENT and models the “agent”
behavior of the process; the third one is called BPSUCCESS
and it simply takes care of broadcasting a reached decision. Automaton
BASICPAXOS is the composition of BPLEADER , BPAGENT and BPSUCCESS
.
Figs. 9 and 10 show the code for BPLEADER ,
while Fig. 11 shows the code for BPAGENT .
We remark that these code fragments are written using the MMTA model. Remember that we use
MMTA to describe in a simpler way Clock GT automata. Figs. 12 and 13 show automaton BPSUCCESS
. The purpose of this automaton is simply to broadcast the decision
once it has been reached by the leader of a round. Figs. 6 and 7 describe the
exchange of messages used in a round.
It is worth noticing that the code fragments are
“tuned” to work efficiently when there are no failures. Indeed
messages for a given round are sent only once, that is, no attempt
is made to try to cope with losses of messages and responses are expected to be
received within given time bounds. Other strategies to try to conduct a
successful round even in the presence of some failures could be
used. For example, messages could be sent more than once to cope with the
loss of some messages or a leader could wait more than the
minimum required time before abandoning the current round and
starting a new one – this is actually dealt with in Section 6.3. We have chosen
to send only one message for each step of the round: if the execution is
nice, one message is enough to conduct a successful round. Once a
decision has been made, there is nothing to do but try to send it to
others. Thus once the decision has been made by the leader, the
leader repeatedly sends the decision to the agents until it gets an
acknowledgment. We remark that also in this case, in practice, it is important
to choose appropriate timeouts for the resending of a
message; in our implementation we have chosen to wait the minimum amount of
time required by an agent to respond to a message from the leader; if the
execution is stable this is enough to ensure that only one message
announcing the decision is sent to each agent.
We remark that there is some redundancy that
derives from having separate automata for the leader
behavior and for the broadcasting of the decision. For example, both
automata BPLEADER and BPSUCCESS need to be aware of the
decision, thus both have a Decision variable
(the Decision variable of BPSUCCESS is updated when action RndSuccess
is executed by BPLEADER after
the Decision variable of BPLEADER is set). Having
only one automaton would have eliminated the need of such a duplication. However
we preferred to separate BPLEADER and
BPSUCCESS because they accomplish different tasks.
BPLEADERi
Signature:
Input:
Receive(m)j_{,} i, m E {“Last”,
“Accept”, “OldRound”}
Init(v)i,
NewRoundi, Stop_{i}, Recoveri,
Leaderi, NotLeaderi Internal: Collecti, BeginCast_{i},
GatherLast(m)i, m is a “Last” message
GatherAccept(m)i, m is a “Accept”
message
GatherOldRound(m)i m
is a “OldRound” message Output: Send(m)i_{,} j, m E {“Collect”,
“Begin”}
Gathered(v)i,
Continuei, RndSuccess(v)i
State:
Status
E {alive, stopped} init.
alives
IamLeader,
a boolean init. false
Mode E {collect , gatherlast,
wait ,begincast, gatheraccept,
decided,done} init. done
Init Value E VU nil init.
nil
Decision E V U {nil} init. nil
Derived
Variable:
CurRnd E R init.
(0, i)
HighestRnd
E R init.
(0, i)
Value
E V U {nil} init. nil
ValFromE R init. (0,i)
Info
Quo E _{2}I_{ }init. {}
Accept
Quo E _{2}I_{ }init. {}
InMsgs,
multiset of msgs init. {}
OutMsgs, multiset of msgs init. {}
LeaderAlive, a boolean, true if Status
= alive and
IamLeader = true Actions:
input Stop_{i}
Eff: Status := stopped
input Recoveri
Eff: Status:= alive
input Leaderi
Eff: if Status = alive then
IamLeader := true
input NotLeaderi
Eff: if Status = alive then
IamLeader := false
output Send(m)i_{,} _{j}
Pre: Status = alive mi, j E OutMsgs
E: removemi_{,} jfrom OutMsgs
input Receive(m)j_{,} _{i}
Eff: if Status = alive then
addmj_{,} ito InMsgs
input Init(v)i
Eff: if Status = alive
then
In it Value := v
input NewRoundi
Eff: if LeaderAlive = true
then CurRnd
:= HighestRnd +i 1 HighestRnd := CurRnd Mode := collect
Fig. 9. Automaton BPLEADER for
process i (part 1).
BPLEADER (cont’d)
output Collecti
Pre: LeaderAlive = true
Mode = collect
Eff: ValFrom := (0, i) Info Quo := {} Accept
Quo := {}
Vj put (CurRnd, “Collect”)i_{,} _{j}
in OutMsgs
Mode := gatherlast
internal GatherLast(m)i Pre:
LeaderAlive = true
Mode = gatherlast
m = (r, “Last”,r', ^{v})j, i
m E InMsgs CurRnd =
r
E: remove m from
InMsgs
Info Quo := Info Quo U {j}
if ValFrom<r' and v=nil
then
Value := v ValFrom :=r'
if I Info QuoI >n/2 then
Mode := gathered
output Gathered(Value)
Pre: LeaderAlive = true
Mode = gathered
Eff: if Value = nil and
InitValue = nil
then
Value := Init Value if Value
= nil then
Mode := begincast
else
Mode := wait
internal Continues
Pre: LeaderAlive = true
Mode = wait
Value = nil
Init Value = nil
Eff: Value := Init Value Mode := begincast
internal BeginCast_{i}
Pre: LeaderAlive = true
Mode = begincast
Eff: Vj,
let m be
(CurRnd, “Begin”, Value)i_{,} _{j }put m in OutMsgs
Mode := gatheraccept
internal GatherAccept(m),
Pre: LeaderAlive = true
Mode = gatheraccept m = (T, “Accept”)j, _{i }mE InMsgs
CurRnd = T
E:
remove m from InMsgs
AcceptQuo := AcceptQuo
U {j} if I Accept QuoI >n/2 then
Decision := Value
Mode := decided
output RndSuccess(Decision )j Pre: LeaderAlive
= true
Mode = decided
E: Mode := done
internal GatherOldRound(m), Pre: Status = alive
m =
(T, “OldRound”, ^{T'})j, i mE InMsgs
HighestRnd <r'
E:
remove m from InMsgs HighestRnd :=r'
Tasks and bounds:
{Collecti, Gathered(v)i,
Continues, BeginCast_{i}, RndSuccess(v)i}, bounds [0, t'] {GatherLast(m)i, m E InMsgs, m is a “Last” message}, bounds [0, t'] {GatherAccept(m)i, m E InMsgs, m is a “Accept” message}, bounds [0, t'] {GatherOldRound(m)i, m E InMsgs,
m is a “OldRound” message},
bounds [0, t'] {Send(m)i_{,} _{j,mi, }jE OutMsgs}, bounds
[0, t']
Fig. 10. Automaton BPLEADER for process i (part 2).
BPAGENTs
Signature:
Input: Receive(m)j_{,} s, m E {“Collect”, “Begin”}
Init(v)s,
Stops,
Recovers
Internal: LastAccept(m)s, m
is a “Collect” message Accept(m)s, m is a “Begin” message
Output: Send(m)s_{,} j, m E {“Last”, “Accept”, “OldRound”}
State:
Status
E {alive, stopped} 
init. alive 
Commit E 
init. (0, s) 
LastRE 
init. (0,s) 
InMsgs, multiset of msgs 
init. {} 
LastVEVU{nil} 
init. nil 
OutMsgs, multiset
of msgs 
init. {} 
Actions: 



input Stops
E: Status := stopped
input Recovers
Eff: Status := alive
output Send(m)s_{,} _{j}
Pre: Status = alive
m E OutMsgs
Eff: removems_{,} jfrom OutMsgs
input Receive(m)j_{,} _{s}
Eff: if Status = alive then
add _{mj,s }to InMsgs
input Init(v)s
Eff: if Status = alives then if Last
V= nil then
LastV:= v
Tasks and
bounds:
internal LastAccept(m)s
Pre: Status = alive
m = (r, “Collect”)j_{,} sE InMsgs E: remove m from
InMsgs
if
r Commit then
Commit := r
put (r, “Last”, LastR
, Last V)s, j in OutMsgs
else
put (r,
“OldRound”, Commit)s_{,} _{j }in OutMsgs
internal Accept(m)s
Pre: Status = alive
m = (r, “Begin”,v)j_{,} sE InMsgs E: remove m from
InMsgs
if
r Commit then
put
(r, “Accept”)s, _{j }in InMsgs
LastR:=r, LastV:=v
else
put
(r, “OldRound”, Commit)s, _{j }in OutMsgs
{LastAccept(m)s, m E InMsgs,
m is a “Collect” message},
bounds [0, t'] {Accept(m)s, m E InMsgs, m is a “Begin” message}, bounds [0, t'] {Send(m)s_{,} _{j,ms, }jE OutMsgs }, bounds
[0, t']
BPSUCCESSs
Signature:
Input: Receive(m)j_{,} s, m E {“Ack”, “Success”}
Stops, Recovers, Leaders, NotLeaders, RndSuccess(v)s
Internal: SendSuccesss, Checks
Output: Send(m)s_{,} j, m E {“Ack”, “Success”}
Decide(v)s
Timepassage:
v(t)
State:
Clock E init.
arbitrary For each j E I
StatusE{alive,stopped}
init. alive Acked(j), a
boolean init. false
IamLeader, a boolean init.
false LastSendAck(j) E U {00} init. 00
Decision E V U {nil} init.
nil LastSendSuc(j)E U{00} init.
00
Prevsend E U{nil} init. nil OutAckMsgs(j),
set of msgs init. {}
Last Check E U {00} init. 00 OutSucMsgs(j), set of msgs init. {}
LastSS E U {00} init. 00
Actions:
input Stops
E: Status := stopped input Leaders
Eff: if Status = alive
and
IamLeader = false
then
IamLeader := true
if Decision
= nil
then
LastSS :=
clock + ‘
PrevSend := nil output Send(m)s_{,} _{j}
Pre: Status = alive
ms, j E OutAckMsgs(j)
Eff: OutAckMsgs(j) := {}
LastSendAck(j) := 00 output Send(m)s_{,} _{j}
Pre: Status = alive
ms, j E OutSucMsgs(j)
Eff: OutSucMsgs(j) := {}
LastSendSuc(j) := 00
input Recovers
Eff: Status:= alive
input NotLeaders
Eff: if Status = alive then IamLeader := false LastSS :=
00
LastCheck := 00
For each j E I
LastSendSuc(j) :=
00 input Receive(
(“Ack”) )j, s Eff: if Status =
alive then
Acked(j) := true
input Receive( (“Success”,
v) )j, s Eff: if Status = alive then
Decision := v
put (“Ack”)s_{,} _{j }into OutAckMsgs(j) LastSendAck(j) := Clock + ‘
BPSUCCESS (cont’d)
internal SendSuccess,
Pre: Status = alive
IamLeader = true Decision = nil
PrevSend = nil
j = i, Acked(j)= false
Eff: Vj=i such that Acked(j)= false
put (“Success”, Decision),; _{j}
in OutSucMsgs(j)
LastSendSuc(j) :=
Clock + " PrevSend
:= Clock
LastCheck
:= Clock + (2"+ 2d) + " LastSS := 00
timepassage v(t)
Pre: none
Eff: if Status = alive then
Let t' be such that
Clock
+ t' <Last Check Clock +
t' <LastSS and for each j E 5
Clock +
t' <LastSendAck(j)
Clock +
t' <LastSendSuc(j)
Clock := Clock + _{t}'_{}
Fig. 13. Automaton BPSUCCESS for
process i (part 2).
In addition to the code fragments of BPLEADERi, BPAGENT and BPSUCCESSi, we provide
here
some comments about the messages, the state variables and the actions.
6.2.2.1. Messages. In this paragraph we describe the messages used
for communication between the leader i and the agents of a round.
Every message m is a tuple of elements. The messages are:
(1) “Collect”
messages, m = (r,
“Collect”),_{;} j. This message is sent by the leader of a round
to announce that a new round, with number r, has been started and at the same time to ask for
information about previous rounds.
(2) “Last”
messages, m = (r,
“Last”,r',v)j; i.This message is sent by an agent to respond
to a “Collect” message from the leader. It provides the last round r' in which
the agent has accepted a value, and the value v proposed in that round.
If the agent did not accept any value in previous
rounds, then v is either nil or the initial value of the
agent and r' is (0_{;}j).
(3) “Begin” messages, m = (r_{;} “Begin”_{;}v)i_{;} j.This message is sent by
the leader of round r to announce the value
v of the round and at the same time to ask to accept it.
(4)
“Accept” messages, m = (r_{;} “Accept”)j_{;} i. This message is sent by an agent to respond
to a “Begin” message from the leader. With this message an agent accepts the value proposed in
the current round.
(5) “OldRound”
messages, m = (r_{;}
“OldRound”_{;}r')j_{;} i.This message is sent by an agent to
respond either to a “Collect” or a “Begin” message. It is sent when the agent is
committed to reject round r and it informs the leader about round r', which is the
higher numbered round for which the agent is committed to reject round r.
(6)
“Success” messages, m = (“Success”_{;}v)i_{;} j.This message is sent by the leader to broadcast the decision.
(7) “Ack” messages, m = (“Ack”)j_{;} i. This
message is an acknowledgment, so that the
leader can
be sure that an agent has received the “Success” message.
We
use the kind of a message to indicate any message of that kind. For example the
notation m E {“Collect”, “Begin”} means
that m is either a “Collect” message, that is m = (r_{;} “Collect”) for some r, or a
“Begin” message, that is m = (r_{;}
“Begin”_{;} v) for some r and v.
Automaton BPLEADERi.
Variable Statusi is
used to model process failures and recoveries. Variable IamLeaderi keeps track of whether the process is leader.
Variable Modei is
used like a program counter, to go through the steps of a round. Variable Init
Valuei contains the initial value of the process.
Variable Decisioni contains
the value, if any, decided by process i. Variable CurRndi contains the number of the round for which process
i is currently the leader. Variable HighestRnd_{i} stores
the highest round number seen by process i. Variable Valuei contains the value being proposed in the current round. Variable ValFromi is the round number of
the round from which Valuei has been chosen (recall that a leader sets the value
for its round to be equal to the value of a particular previous round,
which is round ValFromi). Variable Info Quo_{i} contains the set of processes for which a “Last”
message has been received by process i (that
is, the infoquorum). Variable Accept Quo contains
the set of processes for which an “Accept” message has been
received by process i (that is, the acceptingquorum). We remark that in the
original paper by Lamport, there is only one quorum which is fixed in the first exchange of messages
between the leader and the agents, so that only processes in that quorum can accept the value being proposed.
However, there is no need to restrict
the set of processes that can accept the proposed value to the infoquorum of the round. Messages from processes
in the infoquorum are used only to
choose a consistent value for the round, and once this has been done anyone can
accept that value. This improvement
is also suggested in Lamport’s paper [19]. Finally, variables InMsgs_{i} and
OutMsgs_{i} are buffers used for incoming and outcoming messages.
Actions Stops
and Recovers model process failures and
recoveries. Actions Leaders
and NotLeader are used to update IamLeaderi. Actions Send(m)i_{,} and Receive(m)i_{,} _{j }send messages to the
channels and receive messages from the channels. Action Init(v)i is used by an external agent
to set the initial value of process i. Action NewRoundi starts a new round. It sets
the new round number by increasing the highest round number
ever seen. Action Collects
resets to the initial values all the variables that describe the
status of the round being conducted and broadcasts a “Collect” message. Action GatherLast(m)
collects the information sent by agents in response to the leader’s “Collect”
message. This information is the number of the last round accepted by the agent
and the value of that round. Upon receiving these messages,
GatherLast(m) updates, if necessary, variables Value, and
ValFromi. Also
it updates the set of processes which eventually will be the
infoquorum of the current round. Action GatherLast(m) is executed until
information is received from a majority of the processes. When “Last” messages
have been collected from a majority of the processes, the infoquorum is fixed
and GatherLast(m) is no longer enabled. At this point action Gathered(v) is enabled.
If Value, is defined then the value for the round is set,
and action BeginCast_{i}
is enabled. If Value, is not defined (and this is
possible if the leader does not have an initial value and does not
receive any value in “Last” messages) the leader waits for an
initial value before enabling action BeginCast_{i}. When an initial value is provided, action
Continues can be executed and it sets
Value, and enables action BeginCast_{i}. Action
BeginCast _{i }broadcasts
a “Begin” message including the value chosen for the round. Action
GatherAccept(m) gathers the “Accept” messages. If a majority of the processes accept
the value of the current round then the round is successful and GatherAccept_{i} sets
the Decision, variable to the value of the
current round. When variable Decisions has been set, action
RndSuccess(v) is enabled. Action RndSuccess is used to pass the decision to BPSUCCESSi.
Action
GatherOldRound(m) collects messages that inform process i that the round
previously started by i is “old”, in the sense that a round with a higher number has been started. Process i
can update, if necessary, variable HighestRnd_{i}.
Automaton BPAGENTi. Variable Status is used to model process failures and recoveries. Variable LastR,
is the round number of the latest round for which process i has sent an “Accept”
message. Variable LastViis the value for round LastRi. Variable Commits specifies the round for
which process i is committed and thus specifies the set
of rounds that process i must reject, which are all the rounds with round
number less than Commits. We remark that when an agent
commits for a round r and sends to the leader of round r a
“Last” message specifying the latest round r^{1} <r in which it
has accepted the proposed value, it is enough that the agent commits to not
accept the value of any round T^{11 }in between T^{1}
and T. To make the code simpler, when an agent commits for a round T,
it commits to reject any round T^{11 }<T. Finally, variables InMsgs
_{i }and OutMsgs
_{i }are buffers used for
incoming and outcoming messages.
Actions Stops
and Recover, model process failures and recoveries. Actions Send(m)i, _{j }and
Receive(m)i_{,} send messages to the
channels and receive messages from the
channels. Action LastAccept responds to the “Collect” message sent by
the leader by sending a “Last” message that gives information about
the last round in which the agent has been involved. Action Accept responds to
the “Begin” message sent by the leader. The agent accepts the value of the
current round if it is not rejecting the round. In both LastAccept
and Accept actions, if the agent is committed to reject the current
round because of a higher numbered round, then an “OldRound” message is sent
to the leader so that the leader can update the highest round number ever seen.
Action
Init(v) sets to v the value of LastV only if this variable is undefined. With this, the agent sends its initial value in a
“Last” message whenever the agent has not yet accepted the value of any
round.
Automaton BPSUCCESS .
Variable Status is used to model process failures and recoveries.
Variable IamLeader keeps track of whether the
process is leader. Variable Decision stores the decision. Variable
Acked(j) contains a boolean that specifies whether
or not process j has sent an acknowledgment for a “Success” message. Variable
Prevsend records the time of the previous broadcast of the
decision. Variables Last Check , LastSS , and variables LastSendAck(j) ,
LastSendSuc(j) , for j = , are used to impose the time
bounds on enabled actions. Their use should be clear from the code. Variables OutAckMsgs(j) and
OutSucMsgs(j) , for
j = , are buffers for outcoming
“Ack” and “Success” messages, respectively. There are no buffers for incoming
messages because incoming messages are processed immediately, that is, by action Receive(m) ; j.
Actions Stop and Recover model process failures and recoveries. Actions
Leader and NotLeader are used to update IamLeader
. Actions Send(m) ; j and
Receive(m) ; j send messages to the
channels and receive messages from the channels. Action Receive (m)
handles the receipt of “Ack” and “Success” messages. Action RndSuccess simply takes
care of updating the Decision variable and sets a time
bound for the execution of action SendSuccess . Action SendSuccess sends
the “Success” message, along with the value of Decision
to all processes for which there is no acknowledgment. It sets the
time bounds for the resending of the “Success” message and also the time
bounds LastSendSuc(j) for
the actual sending of the messages. Action Check reenable action SendSuccess
after an appropriate time bound. We remark that 2‘ + 2d is the time needed
to send the “Success” message and get back an “Ack” message (see the analysis in the proof of Lemma
11).
We remark that automaton BPSUCCESS needs to be able to measure
the passage of time.
6.2.3. Partial correctness
Let us define the system _{SBPX }to be the
composition of system _{SCHA }and automaton BASICPAXOS for each process E 5 (remember that BASICPAXOS
is the composition of automata BPLEADER ,
BPAGENT and BPSUCCESS ).
In this section we prove the partial correctness of _{SBPX: }we
show that in any execution of the system _{SBPX, }agreement and validity are guaranteed.
For these proofs, we augment the algorithm with a
collection ' of history variables. Each variable in ' is an
array indexed by the round number. For every round number r a
history variable contains some information about round r. In particular the set
' consists
of:
Hleader(r) E 5 U nil, initially nil (the
leader of round r). Hvalue(r) E VU nil, initially
nil (the value for round r).
Hf rom(r) E U nil, initially nil (the
round from which Hvalue(r) is taken). Hinfquo(r), subset of 5, initially { } (the infoquorum of
round r). Haccquo(r), subset of 5, initially
{ } (the acceptingquorum of
round r). Hrej ect(r), subset
of 5, initially { } (processes
committed to reject round r).
The code fragments of automata BPLEADER and
BPAGENT augmented with the history variables
are shown in Figs. 14 and 15. The figures show only the actions that change history variables.
Actions of BPSUCCESS do not change history variables.
Initially, when no round has been started yet, all
the information contained in the history variables is set to the initial values.
All but Hrej ect(r) history variables
of round r are set by the leader of round r, thus if the
round has not been started these variables remain at their
initial values. More formally we have the following lemma.
Lemma 8. In any state of an
execution of _{SBPX, }if Hleader(r) = nil then Hvalue(r) = nil, Hf rom(r)
= nil, Hinfquo(r) = { }, Haccquo(r) = { }.
Proof. By an easy induction. 

Given a round r, Hrej ect(r), is
modified by all the processes that commit themselves to reject round
r, and we know nothing about its value at the time round r is started.
Next we define some key concepts that will be instrumental in the
proofs.
Definition 6.1. In any state of the system _{SBPX, }a
round r is said to be dead if Hreject(r)¿n/2.
That is, a round r is dead if at least n/2 of the
processes are rejecting it. Hence, if a round r is dead, there
cannot be a majority of the processes accepting its value, i.e., round r cannot be
successful.
We denote by _{S }the set {r E  Hleader(r) = nil} of started rounds and by _{V }the set {r E  Hvalue(r) = nil} of rounds for which the value has been chosen. Clearly in any state s
of an execution of _{SBPX, }we have that _{V } S.
Next we formally define the concept of anchored
round which is crucial to the proofs. The idea of anchored
round is borrowed from [21]. Informally a round r is anchored if its value is
consistent with the value chosen in any previous round r'. Consistent
means that either the value of round r is equal to the value of round r' or round
r' is dead. Intuitively, it is clear that if all the rounds are either anchored
or dead,
then agreement is satisfied.
ABPleader (history variables)
input NewRoundi
Eff: if LeaderAlive = true
then CurRnd
:=HighestRnd + 1
·
Hleader(CurRnd) := i HighestRnd := CurRnd Mode := collect
output BeginCasti
Pre: LeaderAlive = true
Mode = begincast
Eff: Vj put (CurRnd, “Begin”, Value)i, _{j}
in OutMsgs
·
Hinfquo(CurRnd) := Info Quo
·
Hf rom(CurRnd):= ValFrom
·
Hvalue(CurRnd):=
Value
Mode := gatheraccept
internal GatherAccept(m), Pre: LeaderAlive
= true
Mode = gatheraccept
m = (r,
“Accept”)j_{,} E InMsgs
CurRnd =r
E:
remove m from InMsgs
Accept Quo := Accept Quo U{j}
if AcceptQuo > n/2 then
Decision := Value
·
Haccquo(CurRnd) := Accept Quo
Mode := decide
Fig. 14. Actions of BPLEADER for process i augmented with history variables. Only the
actions that do change history variables are
shown. Other actions are the same as in BPLEADERi,
i.e. they do not change history
variables. Actions of BPSUCCESS do not change history
variables.
Definition 6.2. A round r E _{V }is
said to be anchored if for every round r' E _{V }such
that r'
<r, either round r' is dead or Hvalue(r')
= Hvalue(r).
Next we prove that _{SBPX }guarantees
agreement, by using a sequence of invariants. The key invariant is
Invariant 6.8 which states that all rounds are either dead or anchored.
The first invariant, Invariant 6.3, captures the fact that when a process sends
a “Last” message in response to a “Collect” message for a round r, then
it commits to
not vote for rounds previous to round r.
ABPagent _{i }(history variables)
internal LastAccept(m)i
Pre: Status=alive
= (r,
“Collect”)j_{,} iE InMsgs
Eff: remove m from InMsgs
if r Commit then Commit := r
·
For all r', LastR<r'<r
·
Hrej ect(r') := Hrej ectt(r' ) U {i}
put (r, “Last”, LastR , LastV)i_{,} _{j}
in OutMsgs else
put (r, “OldRound”, Commit)i, _{j}
in OutMsgs
Fig. 15. Actions of BPAGENT for process i augmented with history variables. Only the
actions that do change history variables are shown. Other actions
are the same as in BPAGENTi, i.e. they do not change history
variables. Actions of BPSUCCESS do not change history variables.
Invariant 6.3. In any state s of an
execution of _{SBPX, }if
message (r, “Last”, r'',v)j_{,} iis in OutMsgs_{j}, then jEHreject(r'), for all r' such that r''<r'<r.
Proof. We prove the invariant by induction on the length
k of the execution c. The base is trivial: if k =0 then = s0, and in the initial state no message is in OutMsgs_{j}. Hence the invariant is
vacuously true. For the inductive step assume that the invariant is
true for = s0lr1s1 ... rksk and consider the execution _{s0r1s1...kskrs. }We need to prove that the invariant
is still true in s. We distinguish two cases.
Case 1: (r, “Last”, r'', _{v}_{)}_{j, }E sk. OutMsgs_{j}. By the inductive hypothesis
we have j E sk. Hreject(r'), for all r' such that T'' <T' <T. Since no process
is ever removed from any Hrej ect set,
we have j E s.Hrej ect(r'), for
all T ' such that r''<T' <T.
Case 2: (T,
“Last”,r'', ^{v})j, iE=sk.OutMsgs_{j}. Since by hypothesis we have (T, “Last”, r'', ^{v})j, iE s.OutMsgs_{j}, it must be that ir =
LastAccept(m)j, with m = (T, “Collect”) and
it must be sk.LastR
_{j }=r''.Then the invariant
follows by the code of LastAccept(m)j which
puts process j into Hreject(r') for all T ' such that r''<T' <T.
The next invariant
states that the commitment made by an agent when sending a “Last”
message is still in effect when the message is in the communication channel. This
should be obvious, but to be precise in the rest of the proof we prove it
formally.
Invariant 6.4. In any
state s of an execution of _{SBPX, }if message (T, “Last”,r'', v)j, iis inCHANNELj_{,}
i,then j E Hreject(T'), for all T ' such that r''<T' <T.
Proof. We prove the invariant by induction on the length
k of the execution c. The base is trivial: if k = 0 then = s0, and in the initial state no messages are in CHANNELj_{,} i. Hence the invariant is vacuously true. For the
inductive step assume that the invariant is true for = s0lr1s1 . . . lrksk and consider the execution s0ir1s1 . . . iikskirs. We need to prove that the invariant is still true in
s. We distinguish two cases.
Case 1: (r, “Last”, r'', _{v}_{)}_{j, }E sk.CHANNELj_{,} . By the inductive hypothesis we have j E sk. Hreject(r'), for all r' such that T'' <T' <T. Since no
process is ever removed from any Hrej ect set, we have j E s.Hrej ect(T'), for all T' such that T'' <T' <T.
Case 2: (T,
“Last”, T'',v)j_{,} iE= sk.CHANNELj_{,}i. Since by hypothesis (T, “Last”, T'',v)j_{,} iEs. OutMsgs_{j},
it must be that ir = Send(m)j_{,}
with m = (T, “Last”, T'',v)j_{,} i.By the precondition of action
Send(m)j_{,} we
have that message (T,
“Last”, T'',v)j_{,} iE sk.OutMsgs_{j}.
By Invariant 6.3 we have that process j Esk.Hreject(T') for all T' such that T'' <T' <T. Since
no process is ever removed from any Hreject set, we have j Es.Hreject(T'), for all T' such that T'' <T' <T.
The next invariant
states that the commitment made by an agent when sending a “Last”
message is still in effect when the message is received by the leader. Again, this should be obvious.
Invariant 6.5. In any state s of
an execution of _{SBPX, }if message (T,
“Last”, T'', v)j_{,} iis in InMsgs_{i}, then j E Hreject(T'),
for all T' such that T'' <T' <T.
Proof. We prove the invariant by induction on the length k of
the execution x. The base is trivial: if k =0 then = s0, and in the initial state no messages are in InMsgs_{i}.
Hence the invariant is vacuously true. For the inductive step assume
that the invariant is true for = s0lr1s1 .
. . rksk and consider the execution s0lr1s1 . . . nkskns. We need to prove that the invariant is still true in
s. We distinguish two cases.
Case 1: (T, “Last”, T'',v)j_{,} iE sk.InMsgs_{i}.
By the inductive hypothesis we have j E sk. Hreject(T'), for all T' such that T'' <T' <T. Since no
process is ever removed from any Hrej ect set, we have j E s.Hrej ect(T'), for all T' such that T'' <T' <T.
Case 2: (T, “Last”, T'',v)j_{,} iE= sk.InMsgs_{i}. Since by hypothesis (T, “Last”, T'',v)j_{,} iEs. InMsgs_{i},
it must be that ir = Receive(m)i_{,} _{j }with
m = (T, “Last”, T'',v)j_{,} i.In order to execute action Receive(m)i_{,} _{j }we
must have (T, “Last”, T'',v)j_{,} iE sk.CHANNELj_{,} i. By Invariant 6.4 we have j E sk.Hreject(T') for all T' such that T'' <T' <T. Since no
process is ever removed from any Hreject set, we have j Es.Hreject(T'), for all T' such that T''
<T' <T.
The following invariant states that the commitment to reject smaller
rounds, made by the agent is still in effect when the leader updates its information
about previous rounds using the agents’ “Last” messages.
Invariant
6.6. In any state s of an execution _{SBPX, }if
process j E Info
Quo_{i}, for some process i, and
CurRnd = T,
then VT' such that s. ValFrom <T' <T, we have that
j E Hreject(T').
Proof. We prove the invariant by induction on the length
k of the execution c. The base is trivial: if k =0 then = s0, and in the initial state no process j is in Info
Quo_{i} for any i. Hence the invariant is vacuously true.
For the inductive step assume that the invariant is true for = s0lr1s1 ... lrksk and consider the execution s0ir1s1 ... iikskirs. We need to prove that the invariant is still true in
s. We distinguish two cases.
Case 1: In state sk, j E Info
Quo_{i}, for some process i, and CurRnd
= r. Then by the inductive hypothesis, in state sk we have that j EHreject(r'), for all r' such that sk.ValFrom <T' <r. Since no
process is ever removed from any Hreject set and, as long as CurRnd
is not changed, variable ValFrom, is
never decreased, then also in state s we have that j E Hrej ect(r'), for all r' such that s. ValFrom
<T' <r.
Case 2: In state sk, it is not true that j E Info Quo_{i}, for some process i,
and CurRndi =
T. Since in state s it holds that j E Info
Quo_{i}, for some process i, and CurRnd
= r, it must be the case that ir = GatherLast(m) with m = (T, “Last”, T'',vj_{,} i.Notice that, by the precondition of GatherLast(m)i, m E InMsgs_{i}.
Hence, by Invariant 6.5 we have that j E Hreject(T'), for all T' such that r''<T' <T. By the code
of the GatherLast(m)i action
we have that ValFromi¿r''. Whence
the invariant is proved.
The following
invariant is basically the previous one stated when the leader has fixed the infoquorum.
Invariant 6.7.
In any state of an execution of _{SBPX, }ifj E Hinfquo(T)
then VT' such that Hf
rom(T)<T'<T,
we have that jEHreject(T').
Proof. We prove the invariant by induction on the length
k of the execution x. The base is trivial: if k =0 then = s0, and in the initial state we have that for every
round T, Hleader(T) = nil and
thus by Lemma 8 there is no process j in Hinfquo(T). Hence the invariant is vacuously
true. For the inductive step assume that the invariant is true for
= s0lr1s1...ksk
and consider the execution s0lr1s1
... lrksklrs. We need to prove that the invariant is
still true in s. We distinguish two cases.
Case 1: In state sk, j E Hinfquo(T). By the inductive hypothesis,
in state sk we have that
j E Hrej ect(T'), for
all T' such that Hf rom(T)<T'
<T. Since no process is ever removed from any Hreject set, then also in state s we
have that j E Hreject(T'), for all T' such that Hf rom(T)<T'<T.
Case 2: In state sk,
j E Hinfquo(T). Since in state s, j EHinfquo(T), it
must be the case that action ir puts j in Hinfquo(T). Thus it must be ir =
BeginCast _{i }for some process
i, and it must be sk.CuTRnd
= T and j E sk.InfoQuo_{i}. Since
action BeginCast_{i} does
not change CurRnd and Info Quo _{i }we
have that s.CuTRnd = T and j E s.InfoQuo_{i}.
By Invariant 6.6 we have that j E Hreject(T') for all T' such that s.ValFrom
<T' <T. By the code of BeginCast _{i }we
have that Hf rom(T)
= s. ValFromi.
We are now ready to prove the main
invariant.
Invariant 6.8. In any state of an execution of _{SBPX, }any
nondead round T E RV is anchored.
Proof. We proceed by induction
on the length k of the execution . The base is trivial. When k =0 we have that
= s0
and in the initial state no round has been started
yet. Thus Hleader(r) = nil and
by Lemma 8 we have that _{V }= { } and thus the
assertion is vacuously true. For the inductive step assume that the assertion
is true for = s0lr1s1]
... lrksk and consider the execution s0lr1s1 ... iikskirs. We need to prove that,
for every possible action ir the assertion is still true in state s. First we
observe that the definition of “dead” round depends only upon the
history variables and that the definition of “anchored” round depends upon the
history variables and the definition of “dead” round. Thus the definition of
“anchored” depends only on the history variables. Hence actions that do not
modify the history variables cannot affect the truth of the assertion. The actions
that change history variables are:
(1)
ir = NewRoundi
(2) ir = BeginCast_{i}
(3)
ir = GatherAccept(m)i
(4)
ir = LastAccept(m)i
Case 1: Assume ir = NewRoundi.
This action sets the history variable Hleader(r), where r is the round number of the round being
started by process i. The new round r does not belong to _{V }since Hvalue(r) is
still undefined. Thus the assertion of the lemma cannot be contradicted by this
action.
Case 2: Assume ir = BeginCast_{i}.
Action ir sets Hvalue(r), Hf rom(r)
and Hinfquo(r) where r =sk.CurRndi. Round r belongs to _{V }in the new state s.
In order to prove that the assertion is still true it suffices to
prove that round r is anchored in state s and any roundr', r' >r is
still anchored in state s. Indeed rounds with round number less than T are
still anchored in state s, since the definition of anchored for a given round involves only
rounds with smaller round numbers.
First we prove that round T is anchored. From the
precondition of BeginCast _{i }we have that Hinfquo(T) contains
more than n/2 processes; indeed variable Modes is equal to begincast only if the cardinality of Info
Quo _{i }is greater than n/2. Using
Invariant 6.7 for each process j in s.Hinfquo(T), we have that for every
roundr', such that s.Hfrom(r)<r'
<T, there are more than n/2 processes in the set Hreject(r'), which means that every round r',
s.Hfrom(T) <r' <T, is dead.
Moreover, by the code of ir we have that s.Hfrom(T) = sk. ValFrom
and s.Hvalue(T) = sk. Values. From the code (see
action GatherLasti)
it is immediate that in any state Values is the value of round ValFromi. In particular we have that sk. Values = sk.Hvalue(sk. ValFromi). Hence we have
s.Hvalue(T) = s.Hvalue(s.Hfrom(r)). Finally
we notice that round Hf rom(r)
is anchored (any round previous to T is still anchored in
state s) and thus we have that any round r' <, is either dead
or such that s.Hvalue(s.Hfrom(T)) = s.Hvalue(r'). Hence
for any round r' < we have that either roundr' is dead or
that s.Hvalue(r) = s.Hvalue(r'). Thus round T is anchored
in state s.
Finally, we need to prove that any nondead roundr',
r' >T that was anchored in sk
is still anchored in s. Since action BeginCast _{i }modifies
only history variables for round T, we only need to prove that in
state s, Hvalue(r') = Hvalue(T).
Let r'' be equal to Hf
rom(r). Sincer' is anchored in state sk we have that sk.Hvalue(r') = sk.Hvalue(r'').
Again because BeginCast _{i }modifies
only history variables for round r, we have that s.Hvalue(r') =
s.Hvalue(r''). But we have proved that round r is anchored in
state s and thus s.Hvalue(r) =
s.Hvalue(r''). Hence s.Hvalue(r') =
s.Hvalue(r).
Case 3: Assume ir =
GatherAccept(m)i.
This action modifies only variable Haccquo, which is not involved in the definition of
anchored. Thus this action cannot make the assertion false.
Case 4: Assume ir = LastAccept(m)i. This action modifies Hinf quo and Hrej ect. Variable Hinf quo is not involved in the
definition of anchored. Action LastAccept(m)i may
put process i in Hrej ect of
some rounds and this, in turn, may make those rounds dead.
However this cannot make false the assertion; indeed if a round r was anchored in
sk it is still anchored when
another round becomes dead.
The next invariant
follows from the previous one and gives a more direct statement about the agreement
property.
Invariant 6.9. In any state of an execution of _{SBPX,
}all the Decision variables that are not nil,
are set to
the same value.
Proof. We prove the invariant by induction on the length
k of the execution c. The base of the induction is trivially true: for k = 0
we have that = s0
and in the initial state all the Decisioni variables
are undefined.
Assume that the assertion is true for = s0lr1s1 ... irksk and consider the execution s0ir1s1 ... irkskirs. We need to prove that, for every possible
action ir the assertion is still true in state s. Clearly the only actions
which can make the assertion false are those that set Decisioni, for some process i. Thus we
only need to consider actions GatherAccept( (r, “Accept”))i and actions RndSuccess(v)i and Receive( (“Success”,
v) )i, j of automaton BPSUCCESSi.
Case 1.
Assume r= GatherAccept((r,“Accept”))i. This action sets Decisioni to Hvalue(r). If all Decisionj, j = i, are undefined then Decisioni is the first decision and
the assertion is still true. Assume there is only one Decisionj already defined. Let Decisionj = Hvalue(r') for some roundr'. By Invariant 6.8, rounds r andr'
are anchored and thus we have that Hvalue(r') = Hvalue(r). Whence Decisioni = Decisionj. If there are some Decisionj, j = i, which are already defined, then by the
inductive hypothesis
they are all equal. Thus, the lemma follows.
Case 2. Assume ir = RndSuccess(v)i. This action sets Decisioni to v. By the code, value
v is equal to the Decisionj of
some other process. The lemma follows by the inductive hypothesis.
Case 3. Assume ir = Receive( (“Success”, v) )i.
This action sets Decisioni to
v. It is easy to see (by the code) that the value sent in a
“Success” message is always the Decision of
some process. Thus we have that Decisioni is equal to Decisionj for some other process j and the
lemma follows by the inductive hypothesis.
Finally we can prove that agreement is
satisfied.
Validity is easier to
prove since the value proposed in any round comes either from a value supplied by an
Init(v)i
action or from a previous round.
Invariant
6.10. In any state of an execution of _{SBPX, }for
any r E RV we have that Hvalue(r) E V.
Proof. We proceed by induction on the length k of the execution
. The base of the induction is trivially true: for k = 0 we have that = s0 and in the initial state all the Hvalue variables are undefined.
Assume
that the assertion is true for = s0lr1s1
::: irksk and consider the execution s0ir1s1 ::: irkskirs. We need to prove that, for every possible
action ir the assertion is still true in state s. Clearly the only actions that
can make the assertion false are those that modify Hvalue. The only action that modifies Hvalue is BeginCast. Thus, assume ir = BeginCast_{i}.
This action sets Hvalue(r) to Valuei. We
need to prove that all the values assigned to Valuei are in the set V.
Variable Valuei is
modified by actions NewRoundi
and GatherLast(m)i.
We can easily take care of action NewRoundi because
it simply sets Valuei to
be InitValuei which
is obviously in V.
Thus we only need to worry about GatherLast(m)i actions. A GatherLast(m)i action sets variable Valuei to the value specified into the “Last” message if
that value is not nil. The value
specified into any “Last” message is either nil or
the value Hvalue(r') of a previous round r'; by the inductive hypothesis we
have that Hvalue(r') belongs to V.
Invariant 6.11. In any
state of an execution of _{SBPX, }all
the Decision variables that are not undefined are set to some value in V.
Proof. A
variable Decision is always set to be equal to
Hvalue(r) for
some r. Thus the invariant follows from Invariant 6.10.
Theorem 10. In any execution of the system _{SBPX;
}validity is satisfied. Proof. Immediate
from Invariant 6.11.
6.2.4. Analysis of SBPX
In this section we analyze the performance of _{SBPX. }Since
termination is not guaranteed by _{SBPX }in this section we
provide a performance analysis (Lemma 14) assuming that
a successful round is conducted. Then in Section 6.4, Theorem 17 provides the
performance analysis of _{SPAX, }which, in a nice
execution fragment, guarantees termination.
Let us begin by making precise the meaning of the expressions “the start
(end) of a
round”.
Definition 6.12. In an execution fragment whose states are all
uniqueleader states with process being the unique leader, the start of
a round is the execution of action NewRound and the end of a round is the execution of action RndSuccess .
A round is successful if
it ends, that is, if the RndSuccess action is executed by the
leader . Moreover we say that a process reaches its
decision when automaton BPSUCCESS sets
its Decision variable. We remark that, in
the case of a leader, the decision is actually reached when the leader knows
that a majority of the processes have accepted the value being
proposed. This happens in action GatherAccept(m) of BPLEADER . However, to be precise in
our proofs, we consider the decision reached when the variable Decision of BPSUCCESS
is set; for
the leader this happens exactly at the end
of a successful round. Notice that the Decide(v) action, which communicates the
decision v of process to the
external environment, is executed within " time from the point in time when process reaches the decision,
provided that the execution is regular
(in a regular execution actions are executed within the expected time bounds).
The following lemma states that once a round has
ended, if the execution is stable, the decision is reached by
all the alive processes within linear (in the number of processes) time.
Lemma 11. If an execution fragment of the system _{SBPX, }starting in a reachable state
s and lasting for more than 3" + 2d time,
is stable and uniqueleader, with process leader, and process
reaches a decision in state s, then by time 3"
+ 2d, every alive process j = has reached a decision, and the leader has Acked(j) = true for every alive process j = .
Proof. First notice that _{SBPX }is the composition of CHANNEL
, j and
other automata. Hence, by Theorem 3 we can
apply Lemma 4. Let / be the alive processes j = such that Acked(j) = false. If / is
empty then the lemma is trivially true. Hence assume / = { }.
By assumption, the action that brings the system into state s is action
RndSuccess (the leader reaches a decision in state s). Hence action
SendSuccess is enabled. By the code of BPSUCCESS ,
action SendSuccess is executed within " time. This action puts a “Success” message for
each process j /
into OutSucMsgs(j) . By the code of BPSUCCESS , each of these messages is
put on CHANNEL _{,j, }i.e.,
action Send((“Success”,
v) ) ,j is executed, within "
time. By Lemma 4 each alive process j / receives
the “Success” message, i.e., executes a Receive( (“Success”, v)) , jaction, within d time. This
action sets Decisionj to v
and puts an “Ack” message into OutAckMsgs( )j. By the code of BPSUCCESSj, this “Ack” message is put on CHANNELj_{,} , i.e., action Send(“Ack”)j_{,}
is executed, within " time, for every process j. By Lemma 4 the leader
receives the “Ack” message, i.e., executes a Receive((“Ack”) _{)j, }action,
within d time, for each process j. This action sets Acked(j) = true.
Summing up the time
bounds we get the lemma.
In the following we are interested in the time
analysis from the start to the end of a successful round. We
consider a uniqueleader execution fragment , with process i leader,
and such that the leader i has started a round by the first state of (that is,
in the
first state of , CurRndi = r for some round
number r).
We remark that in order for the leader to execute
step 3 of a round, i.e., action BeginCast_{i}, it is
necessary that Valuei be
defined. If the leader does not have an initial value and no agent sends a
value in a “Last” message, variable Valuei is not defined. In
this case the leader needs to wait for the execution of the Init(v)i to set a value to propose in the round
(see action Continuei).
Clearly the time analysis depends on the time of occurrence of the
Init(v)i. To deal with this we use
the following definition.
Definition 6.13. Given an execution fragment , we define ^{ti } to be 0, if variable InitValuei is defined in the first
state of ; the time of occurrence of action Init(v)i, if variable InitValuei is undefined in the first
state of and action Init(v)i
is executed in
; oc, if variable InitValuei is
undefined in the first state of and no Init(v)i action is executed in .
Moreover, we define ^{7i } to be max{7" + 2d, ^{ti }
+ 2"}.
Informally, the above definition of 7i gives
the time, counted from the beginning of a round, by which a BeginCast
_{i }action is expected to be executed, assuming that the execution
is stable and the round being conducted is successful. More formally we have the following
lemma.
Lemma 12. Suppose that for an execution fragment of the system _{SBPX,
}starting in a reachable state s in which s:Decision = nil, then it holds that
(i)
is stable;
(ii)
is a uniqueleader execution, with process i leader;
(iii)
lasts for more than ^{7i } ;
(iv)
the action that brings the system into state s is action NewRoundi for some round r;
(v) round r is successful.
Then we have that action Begin Cast_{i} for
round r is executed within time ^{7j }of the beginning of .
Proof. First notice that _{SBPX }is the composition of CHANNELi_{,} _{j }and other automata. Hence,
by Theorem 3 we can apply Lemmas 1 and 4. Since the execution is stable, it is
also regular, and thus by Lemma 1 actions of BPLEADERi
and BPAGENTi are executed within " time and
by Lemma 4 messages are delivered within d time.
Action NewRoundi
enables action Collecti
which is executed in at most " time. This action puts “Collect”
messages, one for each agent j, into OutMsgs_{i}. By
the code of BPLEADERi (see tasks and bounds) each
one of these messages is sent on CHANNELi_{,}j i.e., action Sendi_{,} _{j }is executed
for each of these messages, within " time. By Lemma 4 a
“Collect” message is delivered to each agent j, i.e., action Receivei_{,} _{j }is executed, within
d time. Then it takes " time for an agent to execute action LastAccept _{j
}which
puts a “Last” message in OutMsgs_{j}. By the code of BPAGENT (see tasks and bounds) it
takes additional " time to execute action Sendj_{,} to send the “Last” message
on CHANNELj_{,} .
By Lemma 4, this “Last” message is delivered to the leader, i.e., action Receivej_{,} is executed, within
additional d time. By the code of BPLEADER (see tasks and bounds) each one of these messages is
processed by GatherLast within " time. Action Gathered is executed
within additional " time.
At this point there are two possible cases: ( ) Value
is defined and ( ) Value is not
defined. In case (i), action BeginCast is enabled and is executed within "
time. Summing up the times considered so far we have that
action BeginCast is executed within 7"+2d time from the start of the
round. In case (ii), action Continue is executed within " time of the
execution of action Continue , and thus by time t + ". This action enables
action BeginCast which is executed within additional " time. Hence action
BeginCast is executed by time t +
2". Putting together the two cases we have that action BeginCast is
executed by time max{7" + 2d, t + 2"}.
Hence we have proved
that action BeginCast is executed in by time T .
Next lemma gives a bound for the time that elapses between the execution
of the BeginCast action and the RndSuccess action for a
successful round in a stable execution fragment.
Lemma
13. Suppose that for an execution fragment of the system _{SBPX, }starting in a reachable state s in which s:Decision = nil,
then it holds
that:
(i)
is stable;
(ii)
is a uniqueleader execution, with process leader;
(iii)
lasts for more than 5" + 2d time;
(iv)
the action that brings the system into state s is action BeginCast for
some round r;
(v) round r is successful.
Then
we have that action RndSuccess is performed by time 5"
+ 2d from the beginning of .
Proof. First notice that _{SBPX }is the composition of CHANNEL
, j and
other automata. Hence, by Theorem 3 we can
apply Lemmas 1 and 4. Since the execution is stable, it is also regular, and
thus by Lemma 1 actions of BPLEADER and
BPAGENT are executed within " time and
by Lemma 4 messages are delivered within d time.
Action BeginCast puts “Begin” messages for round r
in OutMsgs . By the code of BPLEADER (see tasks and bounds) each
one of these messages is put on CHANNEL , j by
means of action Send , j in
at most " time. By Lemma 4 a “Begin” message is delivered
to each agent j, i.e., action Receive , jis
executed, within d time. By the code of BPAGENTj (see tasks and bounds) action
Accept_{j} is executed within "
time. This action puts an “Accept” message in OutMsgs_{j}. By the code of BPAGENT _{j }the “Accept” message
is put on CHANNELj_{,} ,
i.e., action Sendj_{,}
for this message is executed, within "
time. By Lemma 4 the message is delivered, i.e., action Receivej_{,} _{i }for
that message is executed, within d time. By the code of BPLEADERi action GatherAccept _{i }is
executed for a majority of the “Accept” messages within additional ‘
time. At this point variable Decisioni is
defined and action RndSuccessi is executed within ‘ time. Summing up all the times we have that the round ends within
5‘ + 2d.
We can now easily
prove a time bound on the time needed to complete a round.
Lemma
14. Suppose that for an execution fragment of the system _{SBPX, }starting
in a
reachable state s in
which s.Decision = nil,
then it holds
that
(i)
is stable;
(ii)
is a uniqueleader execution, with process i leader;
(iii)
lasts for more than ^{7i } + 5‘ + 2d;
(iv)
the action that brings the system into state s is action NewRoundi for some round r;
(v) round r is successful.
Then we
have that action Begin Cast_{i} for round r is
executed within time ^{7i } of
the beginning
of and action RndSuccessi is executed by time ^{7i } + 5‘ + 2d of the beginning of .
Proof. Follows
from Lemmas 12 and 13.
The previous lemma states that in a stable execution a successful round
is conducted within some time bound. However, it is possible that even
if the system executes nicely from some point in time on, no successful round
is conducted and to have a successful round a new round must be
started. We take care of this problem in the next section. We
will use a more refined version of Lemma 14; this refined version replaces
condition (v) with a weaker requirement. This weaker requirement is
enough to prove that the round is successful.
Lemma
15. Suppose that for an execution fragment of _{SBPX, }starting in a
reachable state
s in which s.Decision = nil,
then it holds
that
(i)
is nice;
(ii)
is a uniqueleader execution, with process i leader;
(iii)
lasts for more than ^{7i } + 5‘ + 2d time;
(iv)
the action that brings the system into state s is action NewRoundi for some round r;
(v) there exists a set / c 5 of processes such that every process in / is alive and /
is a majority, for every j E /, s. Commitj r and in state s for
every j E / and
k E 5, CHANNELk_{,} _{j }and InMsgs _{j }do not
contain any “Collect” message belonging to any round r' ¿r.
Then
we have that action BeginCasti is
performed by time ^{7i } and
action RndSuccessi is performed by time ^{7i
} + 5‘ + 2d from the beginning of .
Proof. Process sends a “Collect” message which is
delivered to all the alive voters. All the alive voters, and
thus all the processes in /, respond
with “Last” messages which are delivered to the leader. No process j / can be committed to reject round r. Indeed,
by assumption, process j is not committed to reject round r in state s and process
j cannot commit to reject round r. The latter is due to the fact that in state
s no message that can cause process j to commit to reject
round r is either in InMsgs_{j} nor
in any channel to process j, and in the only leader is , which only sends messages belonging to round r. Since / is a majority, the leader receives at least a
majority of “Last” messages and thus it is able to proceed with the next step
of the
round. The leader sends a “Begin” message which is delivered to all the alive voters. All the alive voters, and thus all the
processes in /, respond
with “Accept” messages since they are not committed to reject round r.
Since / is a majority, the leader
receives at least a majority of “Accept” messages. Therefore given that lasts for enough time round r
is successful.
Since round r is
successful, the lemma follows easily from Lemma 14.
6.3. Automaton SPAX
To reach consensus using _{SBPX, }rounds
must be started by an external agent by means of the NewRound action
that makes process start a new round. In this section we
provide automata STARTER that
start new round. Composing STARTER with
_{SBPX }^{we }obtain SPAX.
The system _{SBPX }guarantees that running
rounds does not violate agreement and validity, even if rounds are
started by many processes. However, since running a new round may prevent a
previous one from succeeding, initiating too many rounds is not a good
idea. The strategy used to initiate rounds is to have a leader election
algorithm and let the leader initiate new rounds until a round is
successful. We exploit the robustness of BASICPAXOS
in order to use the sloppy leader elector provided in Section 5. As long
as the leader elector does not provide exactly one leader, it is possible that
no round is successful, however agreement and validity are always
guaranteed. This means that regardless of termination, in any run of the
algorithm no two different decisions are ever made and any decision
is equal to some input value. Moreover, when the leader elector
provides exactly one leader, if the system _{SBPX }is executing a nice
execution fragment
then a round is successful.
Automaton STARTER takes
care of the problem of starting new rounds. This automaton interacts with LEADERELECTOR by means of the Leader and
NotLeader actions and with BASICPAXOS by
means of the NewRound , Gathered(v) , Continue and RndSuccess(v) actions. Fig.
5, given at the beginning of the section, shows the interaction of the STARTER automaton with the other
automata.
The code of automaton STARTER is shown in Figs. 16 and 17.
Automaton STARTER does the following. Whenever
process becomes leader, the STARTER automaton
starts a new round by means of action NewRound . Moreover the automaton checks
that action BeginCast is executed within the expected time
bound (given by Lemma 14).
If BeginCast is not executed within the expected time bound, then STARTER starts a new round.
Similarly once BeginCast has been executed, the automaton checks that action RndSuccess(v)
is executed within the expected time bound (given by Lemma 14). Again, if such
an action is not executed within the expected time bound, STARTER starts a
new round. We remark that to check for the execution of BeginCast , the
automaton actually checks for the execution of action Gathered(v) .
This is because the expected time of execution of BeginCast depends on whether
an initial value is already available when action Gathered(v) is
executed. If such a value is available when Gathered(v) is executed then
BeginCast is enabled and is expected to be executed within " time of
the execution of Gathered(v) . Otherwise the leader has to wait for the
execution of action Init(v) which enables action Continue and
action BeginCast is expected to be executed within " time of the execution of
Continue .
In addition to the code we provide some comments
about the state variables and the actions. Variables IamLeader
and Status are selfexplanatory. Variable Start
is true when a new round needs to be started. Variable RndSuccess
is true when a decision has been reached. Variables DlineGat and DlineSuc are used to check for the execution of actions Gathered(v) and RndSuccess(v) . They are also
used, together with variable LastNR , to impose time bounds on enabled actions.
Automaton STARTER updates
variable IamLeader according to the input
actions Leader and NotLeader and executes internal and output
actions whenever it is the leader. Variable Start is
used to start a new round and it is set either when a Leader action
changes the leader status IamLeader from false to true, that is, when the process becomes leader or
when the expected time bounds for the execution of actions Gathered(v)
and RndSuccess(v) elapse without the execution of these actions. Variable
RndSuccess is updated by the input action RndSuccess(v) .
Action NewRound starts a new round. Actions CheckGathered and
CheckRndSuccess check, respectively, whether actions Gathered(v) and
RndSuccess(v) are executed within the expected time bounds.
Using an analysis similar to the one done in the proof of Lemma 12 we have that
action Gathered(v) is supposed to be executed within 6" + 2d time of the
start of the round. The time bound for the execution of action
RndSuccess(v) depends on whether the leader has to wait for an Init(v) event.
However by Lemma 13 action RndSuccess(v) is expected to be executed within
5" + 2d time from the time of occurrence of action BeginCast and action BeginCast
is executed either within " time of the
execution of action Gathered(v) , if an initial value is available when this
action is executed, or else within
" time of the execution of action Continue . Hence actions Gathered(v) and
Continue both set a deadline of 6" + 2d for the execution of action RndSuccess(v) . Actions CheckGathered and
CheckRndSuccess start a new round if the above deadlines expire.
6.4. Correctness and analysis of SPAX
Even in a
nice execution fragment a round may not reach success. This is possible when agents are committed to reject the first
round started in the nice execution
STARTERs
Signature:
Input: Leaders, NotLeaders,
Stops, Recovers,
Gathered(v)s,
Continues, RndSuccess(v )s
Internal: CheckGathereds, CheckRndSuccesss
Output: NewRounds
Timepassage:
v(t)
State:
Clock E ER init.
arbitrary DlineSucEER U{00} init.
nil
Status E {alive_{;} stopped} init. alive DlineGat
E ER U {00} init. nil
IamLeader, a boolean init.
false LastNR E ER U {00} init. 00
Start, a boolean init. false RndSuccess,
a boolean init. false
Actions:
input Stops
E: Status := stopped
input Recovers
Eff: Status := alive
input Leaders
Eff: if Status = alive then
if IamLeader = false then IamLeader := true
if RndSuccess = false then
Start := true
DlineGat := 00 DlineSuc := 00 LastNR
:= Clock + t'
input NotLeaders
Eff: if Status = alive then
LastNR := 00
DlineSus := 00
DlineGat := 00
IamLeader := false
output NewRounds Pre: Status= alive IamLeader = true
Start = true
Eff: Start:=false
DlineGat
:= Clock + 6t' + 2d LastNR := 00
input Gathered(v)s
Eff: if Status = alive then
DlineGat := 00
if v = nil then
DlineSuc := Clock + 6t' + 2d
input Continues
Eff: if Status = alive then DlineSuc
:= Clock + 6t' + 2d
STARTERS (cont’d)
internal CheckRndSuccessi Pre: Status = alive
IamLeader = true DlineSuc = nil
Clock >DlineSuc Eff:
DlineSuc:=00
Start := true
LastNR := Clock + t'
timepassage v(t)
Pre: none
Eff: if Status = alive then
Let t' be
such that Clock
+ t' <LastNR
and Clock + t' <DlineGat + t' and Clock + t'<DlineSuc + t' Clock:= Clock + _{t}'_{}
Fig. 17. Automaton STARTER for
process i (part 2).
fragment because they are committed for higher numbered rounds started
before the beginning of the nice execution fragment. However, in
such a case a new round is started and there is nothing that can prevent the
success of the new round. Indeed in the newly started round,
alive processes are not committed for higher numbered rounds since
during the first round they inform the leader of the round number for which
they are committed and the leader, when starting a new round,
always uses a round number greater than any round number ever seen. In this
section we will prove that in a long enough nice execution fragment termination is
guaranteed.
Remember that SPA X is
the system obtained by composing system _{SLEA }with
one automaton BASICPAXOS and
one automaton STARTER for
each process i 5. Since this system contains as a
subsystem the system _{SBPX, }it guarantees agreement and
validity. However, in a long enough nice execution fragment of _{SPAX
}termination is achieved, too.
The following lemma states that in a long enough
nice, uniqueleader execution, the leader reaches a decision.
We recall that 7 = max{7t' + 2d, ^{ti } + 2t'} and that ^{ti } is
the time of
occurrence of action Init(v) in (see Definition 6.15).
Lemma 16.
Suppose that for an execution fragment of _{SPAX, }starting in a
reachable state s in
which s.Decision = nil, then it holds that
(i)
is nice;
(ii)
is a uniqueleader execution, with process i leader;
(iii) lasts for more than 7^{i} + 20t' + 7d time.
Then by time 7^{i} + 20t' +
7d the leader
i has reached a decision.
Proof. First we notice that system _{SPAX }contains as
subsystem _{SBPX; }hence by using Theorem 3, the projection of on the subsystem _{SBPX
}is actually an execution of SBPX and thus Lemmas 14 and
15 are still true in .
For simplicity, in the following we assume that ^{7j
}=0, i.e., that process i has executed an Init(v)i action before . At the end of each case we
consider, we will add ^{7i }to the time bound to take
into account the possibility that process i has to wait for an Init(v)i action. Notice that 7^{i}
=0 implies that7i =0 for any fragment fi of starting at some
state of and ending in the last state of .
Let s' be the first state of such that no
“Collect” message sent by a process k = i
is present in CHANNELk_{;}j nor in InMsgs _{j }for
any j. State s' exists in and its time of occurrence is less or
equal to d + t'. Indeed, since the execution is nice, all the messages
that are in the channels in state s are delivered by time d and messages present
in any InMsgs set are processed within t'
time. Since i is the unique leader, in state s' no messages sent
by a process k = i
is present in any channel nor in any InMsgs set.
Let ' be the fragment of beginning at s'. Since ' is a fragment of , we have that ' is nice,
process i is the unique leader in ' and ^{7i },^{ }=0.
If process i has started a round r' by state s'
and roundr' is successful, then round r' ends by time ^{7i },^{
}+ 5t' + 2d = 5t' + 2d in '. Indeed if the action that brings that system
in state s' is a NewRoundi
action for round r' then by Lemma 14 we have that the round ends by
time ^{7i },^{ }+ 5t'+ 2d = 5t' + 2d. If action NewRoundi for roundr' has been executed before,
roundr' ends even more quickly and the time bound holds anyway.
Since the time of occurrence of s' is less or equal to t' + d we have that
round r' ends by time 6t' + 3d. Considering the possibility that process i has
to wait for an Init(v)i
action we have that round r' ends by time 7^{i} + 6t' + 3d in . Hence the lemma is true in
this case.
Assume that either (a) process i has started a
round r' by state s' but round r is not successful or (b) that
process i has not started any round by state s'. In both cases process i executes
a NewRoundi
action by time ^{7i },^{ }+ 7t' + 2d = 7t' + 2d in ' . Indeed in case (a), by the code of STARTERi, action CheckRndSuccessi is executed within ^{7i },^{ }+
6t' + 2d = 6t' + 2d time and it takes additional t' time to execute action NewRoundi. In case (b), by the
code of BPLEADERi, action NewRoundi is executed within t' time. Let r '' be the round
started by such an action.
Let s'' be the state after the execution of the
NewRoundi action and let '' be the fragment
of starting in s''. Since '' is a fragment of , we have that '' is nice, process
i is the unique leader in '' and ^{7i },,^{ }=0. We notice that
since the time of occurrence of state s' is less or equal to t' + d the time of
occurrence of s'' is less or equal to 8t' + 3d in .
We now
distinguish two possible cases.
Case 1: Roundr'' is
successful. In this case, by Lemma 14 we have that round _{r}''_{
}is successful within ^{7i },,^{ }+ 5t' +
2d = 5t' + 2d time in ''. Since the time of occurrence
of S'' is less or equal to 8t' + 3d, we have that round T'' ends by
time 13t' + 5d in . Considering the
possibility that process i has to wait for an Init(v)i action we have that round T'' ends
by time 7^{i} + 13t' + 5d in . Hence the lemma is true in this case.
Case 2. Round
T'' is not successful. By the code of STARTERi, action NewRoundi is executed within ^{Ti },,^{
}+ 7" + 2d = 6t' + 2d time in ''. Indeed, it takes ^{Ti },,^{
}+ 5' + 2d to execute action CheckRndSuccessi and additional " time to execute action
NewRoundi. Let T''' be the new round
started by i with such an action, let 5''' be the state of the system after the
execution of action NewRoundi
and let ''' be the fragment of '' beginning at 5'''. The time
of occurrence of 5''' is less or equal than 15t' + 5d in .
Clearly ''' is nice, process i is the unique leader in '''. Any alive
process j that rejected round T'' because
of a round ^{˜}i , ^{˜}i>T'', has responded to the “Collect”
message of round T'', with a message (T'',“OldRound”,^{˜}rj_{;} iinforming the leader i about round ^{˜}i.
Since '' is nice all the “OldRound” messages are received before state
5'''. Since action NewRoundi
uses a round number greater than all the ones received in “OldRound” messages,
we have that for any alive process j, s'''.Commit
j<T'''. Let / be
the set of alive processes. In state s''', for every j E / and any k E 5, CHANNELk_{,} _{j }does not contain
any “Collect” message belonging to any round ^{˜}1>T''' nor such a
message is present in any InMsgs _{j }set
(indeed this is true in state s'). Finally, since '' is nice, by
definition of nice execution fragment, we have that / contains
a majority of the processes.
Hence, we can apply Lemma 15 to the execution
fragment '''. By Lemma 15, round T''' is successful within ^{Ti
},,,^{ }+ 5' + 2d = 5' + 2d time from the beginning of '''. Since the
time of occurrence of 5''' is less or equal to 15t' + 5d in , we have that
round T''' ends by time 20t' + 7d in . Considering the possibility that process
i has to wait for an Init(v)i
action we have that round T''' ends by time ^{7j }+ 20t' + 7d in .
Hence the
lemma is true also in this case.
If the execution is stable for enough time, then
the leader election eventually elects a unique leader (Lemma 7). In the
following theorem we consider a nice execution fragment and we let i be the
process eventually elected unique leader. We recall that ti is the time of occurrence of action Init(v)i in and that " and d are constants.
Theorem 17. Let be a nice execution fragment of _{SPAX
}starting in a reachable state and
lasting for more than t^{i} + 35t' + 13d. Then
the leader i executes Decide(v')i by time t^{i} + 32t' + 11d from the beginning of .
Moreover by time t^{i} + 35t' + 13d from the beginning of any alive process j executes Decide(v' )j.
Proof. Since _{SPAX }^{contains }SLEA ^{and }SBPX as subsystems, by Theorem 3 we can use any
property of _{SLEA }^{and }SBPX. Since
the execution fragment is nice (and thus stable), by
Lemma 7 there is a unique leader by time 4t' + 2d. Let 5' be the first
uniqueleader state of and let i be the leader. By Lemma 7 the time of
occurrence of s' before or at time 4'+2d. Let ' be the
fragment of starting in state s'. Since is nice, ' is nice.
By Lemma 16 we have that the leader reaches a
decision by time 7j,
+ 20t' + 7d from the beginning of '. Summing up the times and noticing that ^{7i
},^{ }ti ,
+ 7" + 2d
and that ^{ti }, <t
^{i } we have that the leader
reaches a decision by time ^{ti } +
31t' + 11d. Within
additional t' time action Decide(v')i is executed.
The leader reaches a decision by time ^{ti } + 31t' + 11d. By Lemma 11 we
have that a decision is reached by every alive process j within
additional 3t' + 2d time, that is by time ^{ti } + 34t' + 13d. Within
additional t' time action Decide(v' )j
is executed.
6.5. Messages
It is not difficult to see that in a nice execution, which is an
execution with no failures, the number of
messages spent in a round is linear in the number of processes. Indeed in a successful round the leader broadcasts
two messages and the agents respond to
the leader’s messages. Once the leader reached a decision another broadcast is enough to spread this decision to the agents. It
is easy to see that, if everything goes well, at most 6n messages are sent to
have all the alive processes reach the decision.
However failures may cause the sending of extra
messages. It is not difficult to construct situations where the number of
messages sent is quadratic in the number of processes. For example if
we have that before i becomes the unique leader, all the
processes act as leaders and send messages, even if i becomes the unique leader
and conducts a successful round, there are 61(n^{2}) messages in
the channels which are delivered to the agents which respond to these messages.
Automaton BPSUCCESS
keeps sending messages to processes that do not acknowledge the
“Success” messages. If a process is dead and never recovers, an infinite number
of messages is sent. In a real implementation, clearly the leader should not
send messages to
dead processes.
Finally the automaton DETECTOR sends an infinite number of
messages. However the information provided by this automaton can be
used also by other applications.
6.6. Concluding remarks
The PAXOS algorithm
was devised in [19]. In this section we have provided a new presentation
of the PAXOS algorithm. We conclude this
section with a few remarks.
The first remark concerns the use of majorities
for infoquorums and acceptingquorums. The only property
that is used is that there exists at least one process common to any
infoquorum and any acceptingquorum. Thus any quorum scheme for infoquorums
and acceptingquorums that guarantees the above property can be used.
As pointed out in also [21], the amount of stable
storage needed can be reduced to a very few state variables.
These are the last round started by a leader (which is stored in the CurRnd
variable), the last round in which an agent accepted the value and the value
of that round (variables LastR, Last V), and the round for
which an agent is committed (variable Commit). These
variables are used to keep consistency, that is, to always
propose values that are consistent with previously proposed values, so if they are
lost then consistency might not be preserved. In our setting we assumed that
the entire state of the processes is in stable storage, but
in a practical implementation only the variables described above need to be stable.
A practical implementation of PAXOS should cope with some failures before abandoning a round. For example a message could be sent twice, since duplication is not a problem for the algorithm (it