How do Replicas work?

When a client $\kappa$ wants to execute a command $c = \langle\kappa, \textit{cid}, \textit{op}\rangle$, its stub routine broadcasts a $\langle\textbf{request}, c\rangle$ message to all replicas and waits for a $\langle\textbf{response}, \textit{cid}, \textit{result}\rangle$ message from one of the replicas.

The replicas can be thought of as having a sequence of $slots$. that need to be filled with commands that make up the inputs to the state machine. Each slot is indexed by a $slot$ $number$. Replicas receive requests from clients and assign them to specific slots, creating a sequence of commands. A replica, on receipt of a $\langle\textbf{request}, c\rangle$ message, proposes command $c$ for its lowest unused slot. We call the pair $(s, c)$ a $proposal$ for slot $s$.

Replicas use the Synod protocol to choose a single command for each slot from the proposals they make. For each slot, the Synod protocol runs between a set of processes called the configuration of the slot. The configuration contains the leaders and the acceptors, but not the replicas. Leaders receive proposed commands from replicas and are responsible for choosing a single command for the slot. Thus, in order to tolerate $~f$ crash failures, Paxos needs at least $f + 1$ leaders, always leaving at least $1$ leader to order the commands proposed by replicas. A replica awaits the decision before actually updating its state and computing a response to send back to the client that issued the request.

Usually, the configuration for consecutive slots is the same. Sometimes, such as when a process in the configuration is suspected of having crashed, it is useful to be able to change the configuration. Paxos supports reconfiguration: a client can propose a special reconfiguration command, which is decided in a slot just like any other command. However, if $s$ is the index of the slot in which a new configuration is decided, it does not take effect until slot $s + \mathtt{WINDOW}$. This allows up to $\mathtt{WINDOW}$ slots to have proposals pending—the configuration of later slots may change. It is always possible to add new replicas—this does not require a reconfiguration of the leaders and acceptors.

Replica State

Each replica $\rho$ maintains seven variables:

  • $\rho.\textit{state}$, the replica's copy of the application state, which we will treat as opaque. All replicas start with the same initial application state.
  • $\rho.\textit{slot_in}$, the index of the next slot in which the replica has not yet proposed any command. Initially 1.
  • $\rho.\textit{slot_out}$, the index of the next slot for which it needs to learn a decision before it can update its copy of the application state. Equivalent to the state's version number (i.e., number of updates), and initially 1.
  • $\rho.\textit{requests}$, an initially empty set of requests that the replica has received and are not yet proposed or decided.
  • $\rho.\textit{proposals}$, an initially empty set of proposals that are currently outstanding.
  • $\rho.\textit{decisions}$, another set of proposals that are known to have been decided (also initially empty).
  • $\rho.\textit{leaders}$, the set of leaders in the current configuration. The leaders of the initial configuration are passed as an argument to the replica.

Replica Invariants

For correctness following invariants hold over the collected variables of replicas both before and after state transitions:

  • R1: There are no two different commands decided for the same slot: $~~~\forall s, \rho_1, \rho_2, c_1, c_2: \langle s, c_1 \rangle \in {\rho_1}.\textit{decisions} ~\wedge~ \langle s, c_2 \rangle \in {\rho_2}.\textit{decisions} \Rightarrow c_1 = c_2$
  • R2: All commands up to $\textit{slot_out}$ are in the set of decisions: $~~~\forall \rho, s: 1 \leq s < \rho.\textit{slot_out} \Rightarrow \exists c : \langle s, c \rangle \in \rho.\textit{decisions}$
  • R3: For all replicas $\rho$, $\rho.\textit{state}$ is the result of applying the commands $\langle s, c_s \rangle \in \rho.\textit{decisions}$ to $\textit{initial_state}$ for all $s$ up to $\textit{slot_out}$, in order of slot number;
  • R4: For each $\rho$, the variable $\rho.\textit{slot_out}$ cannot decrease over time.
  • R5: A replica proposes commands only for slots it knows the configuration for: $~~~\forall \rho: \rho.\textit{slot_in} < \rho.\textit{slot_out} + \mathtt{WINDOW}$.

To understand the significance of such invariants, it is useful to consider what would happen if one of the invariants would not hold. If R1 would not hold, replicas could $diverge$, ending up in different states even if they have applied the same number of commands. Also, without R1, the same replica could decide multiple different commands for the same slot, because $\rho_1$ and $\rho_2$ could be the same process. Thus, the application state of that replica would be ambiguous.

Invariants R2-R4 ensure that, for each replica $\rho$, the sequence of the first $\rho.\textit{slot_out}$ commands is recorded and fixed. If any of these invariants were invalidated, a replica could change its history and end up with a different application state. The invariants do not imply that the slots have to be decided in order; they only imply that decided commands have to be applied to the application state in order and that there is no way to roll back.

Invariant R5 guarantees that replicas do not propose commands for slots that have an uncertain configuration. Because a configuration for slot $s$ takes effect at slot $s + \mathtt{WINDOW}$ and all decisions up to $\textit{slot_in} - 1$ are known, configurations up to slot $\rho.\textit{slot_in} + \mathtt{WINDOW} - 1$ are known.

Replica Operational Description

Below you can find the pseudo-code for a Replica:

$\texttt{process} ~ \textit{Replica}(\textit{leaders}, \textit{initial_state})$
  $\texttt{var} ~ \textit{state} := \textit{initial_state}, \textit{slot_in} := 1, \textit{slot_out} := 1$;
  $\texttt{var} ~ \textit{requests} := \emptyset, \textit{proposals} := \emptyset, \textit{decisions} := \emptyset$;

  $\texttt{function} ~ \textit{propose}()$
    $\texttt{while} ~ \textit{slot_in} < \textit{slot_out} + \mathtt{WINDOW} \wedge \exists c: c \in \textit{requests} ~ \texttt{do}$
      $\texttt{if} ~ \exists \textit{op} : \langle \textit{slot_in}-\mathtt{WINDOW} , \langle \cdot, \cdot, \textit{op} \rangle \rangle \in \textit{decisions} \wedge \textbf{isreconfig}(op)~ \texttt{then}$
        $\textit{leaders} := \textit{op}.\textit{leaders}$;
      $\texttt{end if}$
      $\texttt{if} ~\not\exists c': \langle \textit{slot_in}, c' \rangle \in \textit{decisions} ~ \texttt{then}$
        $\textit{requests} := \textit{requests} \backslash \{ c \}$;
        $\textit{proposals} := \textit{proposals} \cup \{ \langle \textit{slot_in}, c \rangle \}$;
        $\forall \lambda \in \textit{leaders}: \textit{send}(\lambda, \langle \textbf{propose}, \textit{slot_in}, c \rangle)$;
      $\texttt{end if}$
      $\textit{slot_in} := \textit{slot_in} + 1$;
    $\texttt{end while}$
  $\texttt{end function}$

  $\texttt{function} ~ \textit{perform}(\langle\kappa, \textit{cid}, \textit{op}\rangle)$
    $\texttt{if} ~(\exists s: s < \textit{slot_out} \wedge \langle s, \langle \kappa, \textit{cid}, \textit{op} \rangle \rangle \in \textit{decisions} )~\vee \textbf{isreconfig}(op) ~ \texttt{then}$
      $\textit{slot_out} := \textit{slot_out} + 1$;
    $\texttt{else}$
      $\langle \textit{next}, \textit{result} \rangle := \textit{op}(\textit{state})$;
      $\texttt{atomic}$
        $\textit{state} := \textit{next}$; $\textit{slot_out} := \textit{slot_out} + 1$;
      $\texttt{end atomic}$
      $\textit{send}(\kappa, \langle \textbf{response}, \textit{cid}, \textit{result} \rangle)$;
    $\texttt{end if}$
  $\texttt{end function}$

  $\texttt{for ever}$
    $\texttt{switch} ~ \textit{receive}()$
      $\texttt{case} ~ \langle \textbf{request}, c \rangle:$
        $\textit{requests} := \textit{requests} \cup \{ c \}$;
      $\texttt{end case}$
      $\texttt{case} ~ \langle \textbf{decision}, s, c \rangle:$
        $\textit{decisions} := \textit{decisions} \cup \{ \langle s, c \rangle \}$;
        $\texttt{while} ~ \exists c' : \langle \textit{slot_out}, c' \rangle \in \textit{decisions} ~ \texttt{do}$
          $\texttt{if} ~ \exists c'': \langle \textit{slot_out}, c'' \rangle \in \textit{proposals} ~ \texttt{then}$
            $\textit{proposals} := \textit{proposals} \backslash \{ \langle \textit{slot_out}, c'' \rangle \}$;
            $\texttt{if} ~ c'' \ne c' ~ \texttt{then}$
              $\textit{requests} := \textit{requests} \cup \{ c'' \}$;
            $\texttt{end if}$
          $\texttt{end if}$
          $\textit{perform}(c')$;
        $\texttt{end while}$
      $\texttt{end case}$
    $\texttt{end switch}$
    $\textit{propose}()$;
  $\texttt{end for}$
$\texttt{end process}$

A replica runs in an infinite loop, receiving messages. Replicas receive two kinds of messages: requests and decisions. When it receives a request for command $c$ from a client, the replica adds the request to set $\textit{requests}$. Next, the replica invokes the function $\textit{propose}()$.

Function $\textit{propose}()$ tries to transfer requests from the set $\textit{requests}$ to $\textit{proposals}$. It uses $\textit{slot_in}$ to look for unused slots within the window of slots with known configurations. For each such slot $s$, it first checks if the configuration for $s$ is different from the prior slot by checking if the decision in slot $s - \mathtt{WINDOW}$ is a reconfiguration command. If so, the function updates the set of leaders for slot $s$. Then the function removes a request $r$ from $\textit{requests}$ and adds proposal $(s, r)$ to the set $\textit{proposals}$. Finally, it sends a $\langle\textbf{propose}, s, c\rangle$ message to all leaders in the configuration of slot $s$.

Decisions may arrive out-of-order and multiple times. For each $\textbf{decision}$ message, the replica adds the decision to the set $\textit{decisions}$. Then, in a loop, it considers which decisions are ready for execution before trying to receive more messages. If there is a decision $c'$ corresponding to the current $\textit{slot_out}$, the replica first checks to see if it has proposed a command $c''$ for that slot. If so, the replica removes $\langle\textit{slot_out}, c''\rangle$ from the set $\textit{proposals}$. If $c'' \ne c'$, that is, the replica proposed a different command for that slot, the replica returns $c''$ to set $\textit{requests}$ so $c''$ is proposed again at a later time. Next, the replica invokes $\textit{perform}(c')$.

The function $\textit{perform}()$ is invoked with the same sequence of commands at all replicas. First, it checks to see if it has already performed the command. Different replicas may end up proposing the same command for different slots, and thus the same command may be decided multiple times. The corresponding operation is evaluated only if the command is new and it is not a reconfiguration request. If so, $\textit{perform}()$ applies the requested operation to the application state. In either case, the function increments $\textit{slot_out}$.

Maintenance of Replica Invariants

Note that $\textit{decisions}$ is ``append-only'' in that there is no code that removes entries from this set. Doing so makes it easier to formulate invariants and reason about the correctness of the code. We will discuss correctness-preserving ways of removing entries that are no longer useful in the context of optimizations later.

It is clear that the code enforces Invariant R4. The variables $\textit{state}$ and $\textit{slot_out}$ are updated atomically in order to ensure that Invariant R3 holds, although in practice it is not actually necessary to perform these updates atomically, as the intermediate state is not externally visible. Since $\textit{slot_out}$ is only advanced if the corresponding decision is in $\textit{decisions}$, it is clear that Invariant R2 holds. A command is proposed for a slot only if that slot is within the current $\mathtt{WINDOW}$, and since replicas execute reconfiguration commands after a $\mathtt{WINDOW}$ of operations it is ensured that Invariant R5 holds.

The real difficulty lies in enforcing Invariant R1. It requires that the set of replicas agree on the order of commands. For each slot, the Paxos protocol $chooses$ a command from among a collection of commands proposed by clients. This is called $consensus$, and in Paxos the subprotocol that implements consensus is called the ``multi-decree Synod'' protocol, or just Synod protocol for short.