I have re-read In Search of an Understandable Consensus Algorithm about once a year for the last three years. Every time I come back to it, a different section becomes the interesting one. This is partly because the paper rewards re-reading, and partly because what I am building at the time changes which parts feel load-bearing. The first time, it was leader election. The second time, log replication. This pass, it was section 5.4 and everything around it.

The thing the paper is actually about

On the first read, I treated Raft as “how distributed consensus works.” On this read, I noticed that a huge portion of the paper is meta — it is not about how the algorithm works so much as why the algorithm is shaped the way it is. The decomposition into leader election, log replication, and safety is not an exposition trick. It is the core engineering claim: a consensus algorithm should be teachable in a single sitting, and Paxos, the paper argues at length, is not.

The interesting consequence is that a lot of the decisions in Raft are not motivated by performance or correctness in the narrow sense. They are motivated by what the authors call understandability — what I think of as a budget for how many things a reader is expected to keep in their head simultaneously. Strong leadership is the cleanest example. You could in principle have followers accept entries from each other, or let any node propose entries directly. But each additional message path is one more thing a reader has to track. So Raft says: one leader, all entries go through it, followers are passive. Less expressive, dramatically more learnable.

Once you see this lens, you start noticing it everywhere. The randomized election timeout is another instance: you could design a cleverer leader election protocol, but “wait a random amount of time, then ask for votes” is trivially explainable and converges fast enough in practice. Membership changes are similar: the original paper used joint consensus, which is more general but harder to explain, and Ongaro’s dissertation later switched to single-server changes, acknowledging that the more general mechanism turned out to be unnecessary in practice. That is a useful revision: the authors themselves changed their minds about which version belonged in the canon.
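The election timeout really is as simple as the quoted phrase suggests. A minimal sketch, assuming the 150–300 ms range the paper uses in its examples (the exact bounds are a deployment choice, not part of the algorithm):

```python
import random

# Bounds taken from the paper's example range (150-300 ms); any range
# comfortably larger than broadcast time and smaller than MTBF works.
ELECTION_TIMEOUT_MS = (150, 300)

def next_election_timeout() -> float:
    """Pick a fresh random timeout for this follower.

    Each server draws independently, so with high probability one
    server times out first, requests votes, and wins before the
    others even become candidates.
    """
    lo, hi = ELECTION_TIMEOUT_MS
    return random.uniform(lo, hi)
```

A follower redraws a fresh timeout every time it hears from a valid leader or candidate, which is what keeps split votes from recurring.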

Things I keep getting wrong

Every time I re-read the paper, I find I had drifted on at least three details:

  • I keep wanting commitIndex to be persistent. It is not. After a crash, a server recomputes commit position from the leader’s AppendEntries. The split between volatile and persistent state in figure 2 is sharper and more deliberate than I remember. currentTerm, votedFor, and the log are persistent. commitIndex, lastApplied, nextIndex[], matchIndex[] are not. There is a real reason for each cut, but I always have to re-derive it.
  • The “no-op at the start of a term” trick. A new leader cannot commit entries from previous terms directly — it has to append a fresh entry under its own term first, and committing that entry implicitly commits everything before it. I forget this every single time, and every single time it bites me when I try to reason about a specific election scenario.
  • lastApplied is per-server, not cluster-wide. Obvious once you say it, but I had a mental model where the cluster had a single “applied” pointer. It does not. Each server applies entries to its own state machine independently, in the same order, but possibly at different real-time moments.
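The figure 2 split between persistent and volatile state is easier to hold onto written down. A sketch in Python dataclasses, with field names transliterated from figure 2 (the grouping is the paper’s; the types are my assumption):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PersistentState:
    # Must be written to stable storage before responding to any RPC.
    current_term: int = 0
    voted_for: Optional[int] = None          # candidate id, or None
    log: list = field(default_factory=list)  # entries of (term, command)

@dataclass
class VolatileState:
    # Safe to lose on a crash: commit_index is relearned from the
    # leader's AppendEntries, last_applied by replaying the log.
    commit_index: int = 0
    last_applied: int = 0

@dataclass
class LeaderVolatileState:
    # Reinitialized after every election win, so never persisted.
    next_index: dict = field(default_factory=dict)   # follower -> next slot to send
    match_index: dict = field(default_factory=dict)  # follower -> highest replicated slot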

What the paper does not say

The paper is light on what happens around the edges of Raft. Snapshotting is in section 7 and is mercifully short. Membership changes are in section 6 and are more subtle than the algorithm proper. Client interaction is in section 8 and is where most production systems end up adding non-trivial code: linearizable read indices, leader leases, client session tracking, idempotent retries. None of this is wrong in the paper — it is just compressed.
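The client-session idea from section 8 is the piece I find easiest to forget, so here is a toy sketch of it. The names and API are mine, not the paper’s; the mechanism (clients tag commands with serial numbers, the state machine caches the last response per session and replays it on retry) is the paper’s:

```python
class DedupingStateMachine:
    """Toy key-value state machine with per-session dedup (hypothetical API).

    Clients attach a monotonically increasing serial number to each
    command. A retried command (same session, same serial) gets the
    cached response instead of being executed twice.
    """
    def __init__(self):
        self.state = {}
        self.last = {}  # session_id -> (serial, response)

    def apply(self, session_id, serial, key, value):
        prev = self.last.get(session_id)
        if prev is not None and prev[0] == serial:
            return prev[1]  # duplicate retry: replay, don't re-execute
        self.state[key] = value
        response = ("ok", key)
        self.last[session_id] = (serial, response)
        return response
```

The subtlety a production system adds on top is session expiry, which is exactly the kind of edge the paper compresses away.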

The other thing the paper does not say is that Raft is not really a “library.” Every production implementation I have looked at (etcd, TiKV, CockroachDB, Consul, RethinkDB) bolts a meaningful amount of code around the core algorithm. Flow control. Batching. Read-index optimizations. Learner nodes. Pre-vote to suppress disruptive servers. The 18 pages are the part that fits in your head; the production-grade implementation is many tens of thousands of lines of additional code.

The TLA+ spec is worth reading

One thing I did not appreciate on earlier reads is how useful the formal TLA+ spec (published alongside the paper, in the appendix of Ongaro’s dissertation) is. The English description in the body is for understanding; the spec is for resolving ambiguity. There were two cases this pass where I thought I understood a corner of the protocol, went to the spec to confirm, and discovered I had been wrong about which variable update happened first. The English description is necessarily ordered linearly; the spec makes the actual dependencies explicit.

If you have not read it, the spec is short — under two pages — and the comments alone are worth the time.

The parts that age

It is also worth noticing what has aged in the paper and what has not. The core algorithm has not aged at all — the figure 2 cheat sheet is still the right cheat sheet a decade later. The performance numbers in section 9 have aged, in the sense that hardware has moved on; the relative comparisons are still useful but the absolute throughput numbers are quaint. The “understandability study” in section 9.1 is the part of the paper most often dismissed as fluff, and I think that is unfair. The study itself is small, but it is unusual for a systems paper to provide any evidence for an understandability claim, and the bar that sets is good for the field.

What has aged less well is the assumption that the cluster is small and fixed. The original paper sketches membership changes but does not engage with the operational reality of a large cluster where servers are added and removed continuously, and where the failure model includes things like a server returning with old state but appearing brand new. Production systems handle this with mechanisms — learner nodes, witness servers, fencing tokens — that postdate the paper. Raft is correct in their presence, but the paper does not teach you to expect them.

What Raft buys you, and what it does not

It is also worth being honest about what consensus actually gives you in exchange for the cost of running it. Raft guarantees that a committed entry will not be lost as long as a majority of servers survive. It guarantees that all servers apply the same entries in the same order. It does not guarantee that your application logic is deterministic, that your state machine is bug-free, or that the entries you replicated are the ones you should have replicated. The replicated log is a foundation, not a solution.

This sounds obvious, but I have seen teams reach for Raft as a way to “make a system reliable,” and then be surprised when it turns out their state machine has races inside the apply path, or when a logically-correct request was rejected because the leader had stepped down between the propose and the apply. Raft is excellent at the problem it solves. The problem it solves is narrower than people remember.

A small reading tip

If you have only read figure 2 — the cheat sheet — the section to read next is 5.4. The election restriction (each voter refuses a candidate whose log is less up-to-date than its own, so a winning candidate’s log is at least as up-to-date as every log in its voting majority) and the commit rule for previous terms are the two places where Raft’s correctness lives. Figure 2 alone does not make them feel load-bearing; section 5.4 is what convinces you that the rest of the protocol does not work without them.
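The up-to-date comparison itself is tiny, which is part of why it is easy to underweight. A direct transcription of section 5.4.1’s definition:

```python
def log_is_up_to_date(cand_last_term, cand_last_index,
                      my_last_term, my_last_index):
    """Section 5.4.1: compare the last entries of two logs.

    A later term wins outright; with equal terms, the longer log is
    more up-to-date. A voter grants its vote only if this returns True
    for the candidate's log against its own.
    """
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index
```

Note the asymmetry with log length: a shorter log with a later last term still wins, which is the case my intuition keeps getting backwards.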

I will probably re-read this paper again next year, and I am sure a different section will become the interesting one. That is, in a way, the highest compliment I can pay it. Most papers I read once and shelve. This is one of a handful I keep finding new things in.