# When does Paxos work?

a.k.a. what about the impossibility of consensus result?

It would clearly be desirable that, if a client broadcasts a new command to all replicas, that it eventually receives at least one response. This is an example of a $liveness$ property. It requires that if one or more commands have been proposed for a particular slot, that some command is eventually decided for that slot. Unfortunately, the Synod protocol as described does not guarantee this, even in the absence of any failure whatsoever. In fact, failures tend to be good for liveness. If all leaders but one fail, Paxos is guaranteed to terminate.

Consider the following scenario shown above, with two leaders with identifiers $\lambda$ and ${\lambda'}$ such that $\lambda < {\lambda'}$. Both start at the same time, respectively proposing commands $c$ and $c'$ for slot number 1. Suppose there are three acceptors, $\alpha_1$, $\alpha_2$, and $\alpha_3$. In ballot $\langle 0, \lambda \rangle$, leader $\lambda$ is successful in getting $\alpha_1$ and $\alpha_2$ to adopt the ballot, and $\alpha_1$ to accept pvalue $\langle \langle 0, \lambda \rangle, 1, c \rangle$.

Now leader ${\lambda'}$ gets $\alpha_2$ and $\alpha_3$ to adopt ballot $\langle 0, {\lambda'} \rangle$, which has a higher ballot number than ballot $\langle 0, \lambda \rangle$ because $\lambda < {\lambda'}$. Note that neither $\alpha_2$ or $\alpha_3$ accepted any pvalues, so leader $\lambda'$ is free to select any proposal. Leader $\lambda'$ then gets $\alpha_3$ to accept $\langle \langle 0, {\lambda'} \rangle, 1, c' \rangle$.

At this point, acceptors $\alpha_2$ and $\alpha_3$ are unable to accept $\langle \langle 0, \lambda \rangle, 1, c \rangle$ and thus leader $\lambda$ is unable to get a majority of acceptors to accept $\langle \langle 0, \lambda \rangle, 1, c \rangle$. Trying again with a higher ballot, leader $\lambda$ gets $\alpha_1$ and $\alpha_2$ to adopt $\langle 1, \lambda \rangle$. The maximum pvalue accepted by $\alpha_1$ and $\alpha_2$ is $\langle \langle 0, \lambda \rangle, 1, c \rangle$, and thus $\lambda$ must propose $c$. Suppose $\lambda$ gets $\alpha_1$ to accept $\langle \langle 1, \lambda \rangle, 1, c \rangle$. Because acceptors $\alpha_1$ and $\alpha_2$ adopted $\langle 1, \lambda \rangle$, they are unable to accept $\langle \langle 0, {\lambda'} \rangle, 1, c' \rangle$. Trying to make progress, leader ${\lambda'}$ gets $\alpha_2$ and $\alpha_3$ to adopt $\langle 1, {\lambda'} \rangle$, and gets $\alpha_3$ to accept $\langle \langle 1, {\lambda'} \rangle, 1, c' \rangle$.

This ping-pong scenario can be continued indefinitely, with no ballot ever succeeding in choosing a pvalue. This is true even if $c = c'$, that is, the leaders propose the same command. The well-known FLP impossibility result'' demonstrates that in an asynchronous environment that admits crash failures, no consensus protocol can guarantee termination, and the Synod protocol is no exception. The argument does not apply directly if transitions have non-deterministic actions---for example changing state in a randomized manner. However, it can be demonstrated that such protocols cannot guarantee a decision either.

If we could somehow guarantee that some leader would be able to work long enough to get a majority of acceptors to adopt a high ballot and also accept a pvalue, then Paxos would be guaranteed to choose a proposed command. A possible approach could be as follows: when a leader $\lambda$ discovers, through a $\texttt{preempted}$ message, that there is a higher ballot with leader ${\lambda'}$ active, rather than starting a new scout with an even higher ballot number, $\lambda$ starts monitoring $\lambda'$ by pinging it on a regular basis. As long as ${\lambda'}$ responds timely to pings, $\lambda$ waits patiently. Only if ${\lambda'}$ stops responding will $\lambda$ select a higher ballot number and start a scout.

This concept is called $failure$ $detection$, and theoreticians have been interested in the weakest properties failure detection should have in order to support a consensus algorithm that is guaranteed to terminate. In a purely asynchronous environment, it is impossible to determine through pinging or any other method whether a particular leader has crashed or is simply slow. However, under fairly weak assumptions about timing, we can design a version of Paxos that is guaranteed to choose a proposal. In particular, we will assume that both the following are bounded:

• the clock drift of a process, that is, the rate of its clock is within some factor of the rate of real-time;
• the time between when a non-faulty process initiates sending a message, and the message having been received and handled by a non-faulty destination process.

We do not need to assume that we know what those bounds are---only that such bounds exist. From a practical point of view, this seems entirely reasonable. Modern clocks progress certainly within a factor of 2 of real-time. A message between two non-faulty processes is likely delivered and processed within, say, a year. Even if the network was partitioned at the time the sender started sending the message, by the time a year has expired the message is highly likely to have been delivered and processed.

These assumptions can be exploited as follows: we use a scheme similar to the one described above, based on pinging and timeouts, but the value of the timeout interval depends on the ballot number: the higher the competing ballot number, the longer a leader waits before trying to preempt it with a higher ballot number. Eventually the timeout at each of the leaders becomes so high that some correct leader will always be able to get its proposals chosen.

For good performance, one would like the timeout period to be long enough so that a leader can be successful, but short enough so that the ballots of a faulty leader are preempted quickly. This can be achieved with a TCP-like AIMD (Additive Increase, Multiplicative Decrease) approach for choosing timeouts. The leader associates an initial timeout with each ballot. If a ballot gets preempted, the next ballot uses a timeout that is multiplied by some small factor larger than one. With each chosen proposal, this initial timeout is decreased linearly. Eventually the timeout will become too short, and the ballot will be replaced with another even if its leader is non-faulty.

Liveness can be further improved by keeping state on disk. The Paxos protocol can tolerate a minority of its acceptors failing, and all but one of its replicas and leaders failing. If more than that fail, consistency is still guaranteed, but liveness will be violated. A process that suffers from a power failure but can recover from disk is not theoretically considered crashed---it is simply slow for a while. Only a process that suffers a permanent disk failure would be considered crashed.

For completeness, we note that for liveness we also assumed reliable communication. This assumption can be weakened by using a fair links assumption: if a correct process repeatedly sends a message to another correct process, at least one copy is eventually delivered. Reliable transmission of a message can then be implemented by periodic retransmission until an ack is received. In order to prevent overload on the network, the time intervals between retransmissions can grow until the load imposed on the network is negligible. The fair links assumption can be weakened further, but such a discussion is outside the scope of this paper.

As an example of how liveness is achieved, a correct client retransmits its request to replicas until it has received a response. Because there are at least $~f+1$ replicas, at least one of those replicas will not fail and will assign a slot to the request and send a proposal to the $~f+1$ or more leaders. Thus at least one correct leader will try to get a command decided for that slot. Should a competing command get decided, the replica will reassign the request to a new slot and retry. While this may lead to starvation, in the absence of new requests, any outstanding request will eventually get decided in at least one slot.