[{"content":"Gossip protocols always seemed too simple to work. A node picks a random peer every second, exchanges some state, and after enough rounds the cluster has converged on everything. There is no leader, no quorum, no coordination. It feels like an algorithm a child would invent.\nThen you actually build one, or operate one, and the simplicity stops feeling naive and starts feeling like the point.\nThe shape of the algorithm\nThe basic anti-entropy loop, in pseudocode:\nevery tick:\n    peer = random_member(members)\n    send my_state to peer\n    receive peer_state from peer\n    merge(my_state, peer_state)\nIf merge is commutative, associative, and idempotent, the cluster converges. The convergence time is O(log N) rounds in the number of members, because each round roughly doubles the population that knows any given fact. A 1000-node cluster converges in about 10 rounds, not 1000.\nFailure detection\nThe other thing gossip handles well, almost as a side-effect, is failure detection. If node A has not heard from node B in several rounds, A starts to suspect B is dead. In SWIM, the dominant protocol in this space, A then asks k random other nodes to also try pinging B before declaring it failed. This indirect-ping trick is what makes the detector robust against single network glitches in a way single-ping detectors are not. The cost is O(1) messages per round per node, and the false-positive rate stays low even under churn.\nWhat gossip is bad at\nI have shipped systems where gossip was the right answer and systems where it was the wrong answer. The wrong cases all wanted strong consistency in the propagated information. Gossip does not give you that. It gives you eventual consistency with a probabilistic bound. 
If you need bounded worst-case propagation time, or strict ordering, or all nodes to act on the same view at the same instant, you want consensus, not gossip — and the right architecture is usually a small consensus layer for the strict things plus a gossip layer for everything else. That is what Consul does, and what most \u0026ldquo;we use gossip\u0026rdquo; production systems actually mean when you look closely.\nThe boring property that makes it production-grade The thing I appreciate most about gossip, after years of running it in production, is that it has almost no operational pathologies. Network partitions heal automatically. A node that goes away and comes back rejoins without any explicit reconciliation. There is no leader to elect, no quorum to maintain, no snapshot/replay/repair tooling. The state simply diffuses. The protocol is robust because it does not depend on any single message getting through — and that is the part people miss when they look at the algorithm and assume it cannot be enough.\n","permalink":"https://devlogbox.win/posts/gossip-protocol-intuition/","summary":"How an algorithm that looks like rumor-spreading ends up being the most boring, most reliable membership layer you can ship.","title":"Building intuition for gossip protocols"},{"content":"I spent a long time being intimidated by CRDTs. The papers are full of lattices and join semilattices and monotonic state, and the impression you walk away with is that you need a graduate course in order theory to use them. You do not.\nThe minimum viable definition A CRDT is a data type with three properties:\nCommutativity — applying operations in any order gives the same result Associativity — grouping does not matter Idempotence — applying the same operation twice has the same effect as once If a data type satisfies these three, replicas can apply operations in any order, drop duplicates, and re-deliver freely, and they will all converge to the same state. That is the whole guarantee. 
Everything else, from the formalism and the lattice theory to the distinction between state-based (CvRDT) and operation-based (CmRDT) designs, is bookkeeping around this idea.\nThe simplest non-trivial example\nA grow-only counter (G-Counter):\nstate: map\u0026lt;replica_id, int\u0026gt;\nincrement(my_id):\n    state[my_id] = state.get(my_id, 0) + 1\nmerge(other):\n    for k in other:\n        state[k] = max(state.get(k, 0), other[k])\nvalue():\n    return sum(state.values())\nThat is it. Each replica owns its own slot. Merging takes the max of each slot, treating a missing slot as zero. The sum of slots is the value. The merge is commutative, associative, and idempotent. You can replicate this across regions, lose a node to a network partition for a week, and when it comes back, the merge will be correct.\nWhere it gets interesting\nThe fun starts when you want operations that are not naturally monotonic. The classic example is \u0026ldquo;remove from a set.\u0026rdquo;\nA G-Set (grow-only set) is trivial: add an element, never remove. But what if you want a real set with removes? You can build a 2P-Set: one G-Set for adds, one G-Set for \u0026ldquo;tombstones.\u0026rdquo; An element is in the set iff it is in adds and not in tombstones. Now you have removes. But you cannot re-add an element after removing it.\nIf you want re-adds, you can use an OR-Set, where each add carries a unique tag. A remove tombstones only the tags it has observed. A concurrent re-add gets a new tag that the remove did not see, so it survives.\nThe pattern is consistent: each new requirement adds a layer of metadata. The metadata is where the cost is. CRDT papers tend to focus on correctness; production CRDT implementations tend to focus on metadata garbage collection.\nWhere you actually find them in production\nCRDTs feel like an academic curiosity until you go looking, at which point they show up almost everywhere collaborative or partition-tolerant systems exist:\nRiak shipped CRDTs as first-class data types years ago: counters, sets, maps, registers. 
The Riak data types are still one of the cleanest API examples around. Redis Enterprise uses CRDTs as its multi-region replication primitive. The single-region Redis you usually run does not, but the geo-distributed product does. Automerge and Yjs are the two CRDT libraries powering most of the recent generation of collaborative editors (Figma uses a custom OT/CRDT hybrid, but most of the smaller competitors land on Yjs). Soundcloud famously used a G-Counter for view counts so different shards could increment without coordination, then sum on read. State-based vs operation-based The two flavors of CRDTs differ in what they ship between replicas. State-based CRDTs (CvRDTs) send their full state and rely on a merge function to converge. Operation-based CRDTs (CmRDTs) send only the operations themselves, but require the delivery layer to guarantee causal broadcast — every replica must see operations in an order consistent with causality.\nIn theory they are equivalent in expressive power. In practice, state-based is easier to deploy because it tolerates duplicate delivery and out-of-order arrival, which is exactly what real networks give you. Op-based is more bandwidth-efficient at steady state but pushes complexity into the messaging layer, which is where most of the production bugs end up living. Most production systems I have looked at are a hybrid: state-based at the boundary between replicas, op-based internally for streaming updates.\nWhat I wish someone had told me The hard part of CRDTs is not the math. It is convincing yourself that the convergent state is the state your application actually wants. A counter that survives partitions but counts double under retries is not useful if you wanted it to be exact. A set that converges to \u0026ldquo;both edits survived\u0026rdquo; is not useful if your application\u0026rsquo;s semantics demand last-write-wins.\nCRDTs do not give you \u0026ldquo;correctness\u0026rdquo; for free. They give you convergence for free. 
You still have to design the data type so that the converged state means what you need.\nA common failure mode The mistake I have watched teams make most often is reaching for CRDTs as a way to avoid thinking about conflict semantics. The pitch sounds appealing — \u0026ldquo;we\u0026rsquo;ll use a CRDT, so conflicts resolve automatically.\u0026rdquo; What this actually means is \u0026ldquo;we will pick a resolution policy that depends on the CRDT we chose, instead of one that depends on what our users want.\u0026rdquo;\nThat is rarely the right tradeoff. The conflict policy is a product decision, not an infrastructure decision. The CRDT should follow from that, not the other way around. When this gets inverted, you end up with systems that \u0026ldquo;converge\u0026rdquo; but feel wrong to users — silently dropped writes, ghost reappearing items, counters that drift slightly over time. None of these are bugs in the CRDT. They are bugs in the choice of CRDT.\nPractical reading order If you want to actually use this stuff, skip Shapiro et al. on the first pass. Read:\nRoh et al., Replicated Abstract Data Types — the intuition, with cleaner notation Bartolomeu et al.\u0026rsquo;s survey — for the landscape of practical CRDTs Then go back to Shapiro for the formal grounding ","permalink":"https://devlogbox.win/posts/crdts-are-easier-than-they-sound/","summary":"The hard part of CRDTs is not the math. It is convincing yourself the math is doing what you want.","title":"CRDTs are easier than they sound"},{"content":"The first time someone explained vector clocks to me, I came away thinking they were a clever trick for ordering writes in a Dynamo-style database. That is true, but it sells them short. Vector clocks are interesting because they force you to be precise about a question you usually avoid: which of these two events knew about which?\nHappens-before, restated Given two events a and b, exactly one of these is true: a → b, b → a, or they are concurrent. 
Scalar Lamport timestamps collapse cases 1 and 2 onto a total order but lose case 3 entirely. If a has timestamp 4 and b has timestamp 7, you cannot tell whether b actually saw a or whether they are concurrent events that happened to land at those numbers.\nVector clocks preserve the partial order exactly. If neither VC(a) \u0026lt; VC(b) nor VC(b) \u0026lt; VC(a), the events are concurrent. That third case is the whole point.\nTwo things that took me too long to internalize Vector clocks tell you that two events are concurrent. They do not tell you what to do about it. The merge policy is a separate decision. Comparison cost is O(N) in replicas. For systems with churn, this is the actual reason people move to dotted version vectors, not theoretical elegance. I keep coming back to vector clocks because they are a small, sharp tool for asking a question most systems pretend does not exist.\n","permalink":"https://devlogbox.win/posts/vector-clocks-again/","summary":"Vector clocks are not a data structure. They are a way of asking the right question about events in a distributed system.","title":"Why I keep coming back to vector clocks"},{"content":"I have re-read In Search of an Understandable Consensus Algorithm about once a year for the last three years. Every time I come back to it, a different section becomes the interesting one. This is partly because the paper rewards re-reading, and partly because what I am building at the time changes which parts feel load-bearing. The first time, it was leader election. The second time, log replication. This pass, it was section 5.4 and everything around it.\nThe thing the paper is actually about On the first read, I treated Raft as \u0026ldquo;how distributed consensus works.\u0026rdquo; On this read, I noticed that a huge portion of the paper is meta — it is not about how the algorithm works so much as why the algorithm is shaped the way it is. 
The decomposition into leader election, log replication, and safety is not an exposition trick. It is the core engineering claim: a consensus algorithm should be teachable in a single sitting, and Paxos, the paper argues at length, is not.\nThe interesting consequence is that a lot of the decisions in Raft are not motivated by performance or correctness in the narrow sense. They are motivated by what the authors call understandability, which I think of as the budget you have for things a reader is expected to keep in their head simultaneously. Strong leadership is the cleanest example. You could in principle have followers also accept entries from each other, or you could let any node propose entries directly. But each additional message path is something a reader has to keep track of. So Raft says: one leader, all entries go through it, followers are passive. Less expressive, dramatically more learnable.\nOnce you see this lens, you start noticing it everywhere. The randomized election timeout is another instance: you could have a more clever leader election protocol, but \u0026ldquo;wait a random amount, then ask for votes\u0026rdquo; is trivially explainable and converges fast enough. The membership change algorithm in the paper uses joint consensus, which is more general but harder to explain; Ongaro\u0026rsquo;s later dissertation leads with single-server changes instead and presents joint consensus as the more complex alternative. That is a useful footnote to the paper: its first author changed his mind about which version belonged in the canon.\nThings I keep getting wrong\nEvery time I re-read the paper, I find I had drifted on at least three details:\nI keep wanting commitIndex to be persistent. It is not. After a crash, a server recomputes commit position from the leader\u0026rsquo;s AppendEntries. The split between volatile and persistent state in figure 2 is sharper and more deliberate than I remember. 
currentTerm, votedFor, and the log are persistent. commitIndex, lastApplied, nextIndex[], matchIndex[] are not. There is a real reason for each cut, but I always have to re-derive it. The \u0026ldquo;no-op at the start of a term\u0026rdquo; trick. A new leader cannot commit entries from previous terms directly — it has to append a fresh entry under its own term first, and committing that entry implicitly commits everything before it. I forget this every single time, and every single time it bites me when I try to reason about a specific election scenario. lastApplied is per-server, not cluster-wide. Obvious once you say it, but I had a mental model where the cluster had a single \u0026ldquo;applied\u0026rdquo; pointer. It does not. Each server applies entries to its own state machine independently, in the same order, but possibly at different real-time moments. What the paper does not say The paper is light on what happens around the edges of Raft. Snapshotting is in section 7 and is mercifully short. Membership changes are in section 6 and are more subtle than the algorithm proper. Client interaction is in section 8 and is where most production systems end up adding non-trivial code: linearizable read indices, leader leases, client session tracking, idempotent retries. None of this is wrong in the paper — it is just compressed.\nThe other thing the paper does not say is that Raft is not really a \u0026ldquo;library.\u0026rdquo; Every production implementation I have looked at (etcd, TiKV, CockroachDB, Consul, RethinkDB) bolts a meaningful amount of code around the core algorithm. Flow control. Batching. Read-index optimizations. Learner nodes. Pre-vote to suppress disruptive servers. The 18 pages are the part that fits in your head; the production-grade implementation is many tens of thousands of lines of additional code.\nThe TLA+ spec is worth reading One thing I did not appreciate on earlier reads is how useful the formal TLA+ spec in the appendix is. 
The English description in the body is for understanding; the spec is for resolving ambiguity. There were two cases this pass where I thought I understood a corner of the protocol, went to the spec to confirm, and discovered I had been wrong about which variable update happened first. The English description is necessarily ordered linearly; the spec makes the actual dependencies explicit.\nIf you have not read it, the spec is short, about 400 lines of TLA+, and the comments alone are worth the time.\nThe parts that age\nIt is also worth noticing what has aged in the paper and what has not. The core algorithm has not aged at all: the figure 2 cheat sheet is still the right cheat sheet a decade later. The performance numbers in section 9 have aged, in the sense that hardware has moved on; the relative comparisons are still useful but the absolute throughput numbers are quaint. The \u0026ldquo;understandability study\u0026rdquo; in section 9.1 is the part of the paper most often dismissed as fluff, and I think that is unfair. The study itself is small, but it is unusual for a systems paper to provide any evidence for an understandability claim, and the bar that sets is good for the field.\nWhat has aged less well is the assumption that the cluster is small and fixed. The original paper sketches membership changes but does not engage with the operational reality of a large cluster where servers are added and removed continuously, and where the failure model includes things like a server returning with old state but appearing brand new. Production systems handle this with mechanisms that postdate the paper: learner nodes, witness servers, fencing tokens. Raft is correct in their presence, but the paper does not teach you to expect them.\nWhat Raft buys you, and what it does not\nIt is also worth being honest about what consensus actually gives you in exchange for the cost of running it. 
Raft guarantees that a committed entry will not be lost as long as a majority of servers survive. It guarantees that all servers apply the same entries in the same order. It does not guarantee that your application logic is deterministic, that your state machine is bug-free, or that the entries you replicated are the ones you should have replicated. The replicated log is a foundation, not a solution.\nThis sounds obvious, but I have seen teams reach for Raft as a way to \u0026ldquo;make a system reliable,\u0026rdquo; and then be surprised when it turns out their state machine has races inside the apply path, or when a logically-correct request was rejected because the leader had stepped down between the propose and the apply. Raft is excellent at the problem it solves. The problem it solves is narrower than people remember.\nA small reading tip If you have only read figure 2 — the cheat sheet — the section to read next is 5.4. The election restriction (a candidate cannot become leader unless its log is at least as up-to-date as a majority of voters\u0026rsquo; logs) and the commit rule for previous terms are the two places where Raft\u0026rsquo;s correctness lives. Figure 2 alone does not make them feel load-bearing; section 5.4 is what convinces you that the rest of the protocol does not work without them.\nI will probably re-read this paper again next year, and I am sure a different section will become the interesting one. That is, in a way, the highest compliment I can pay it. Most papers I read once and shelve. This is one of a handful I keep finding new things in.\n","permalink":"https://devlogbox.win/posts/rereading-raft/","summary":"Things I noticed on the third pass through Ongaro \u0026amp; Ousterhout that I missed the first two times.","title":"Re-reading the Raft paper"}]