I spent a long time being intimidated by CRDTs. The papers are full of lattices and join semilattices and monotonic state, and the impression you walk away with is that you need a graduate course in order theory to use them. You do not.
The minimum viable definition
A CRDT is a data type with three properties:
- Commutativity — applying operations in any order gives the same result
- Associativity — grouping does not matter
- Idempotence — applying the same operation twice has the same effect as once
If a data type satisfies these three, replicas can apply operations in any order, drop duplicates, and re-deliver freely, and they will all converge to the same state. That is the whole guarantee. Everything else — the formalism, the lattice theory, the distinctions between state-based (CvRDT) and operation-based (CmRDT) — is bookkeeping around this idea.
The simplest non-trivial example
A grow-only counter (G-Counter):
state: map<replica_id, int>
increment(my_id):
state[my_id] += 1
merge(other):
for k in other:
state[k] = max(state[k], other[k])
value():
return sum(state.values())
That is it. Each replica owns its own slot. Merging takes the max of each slot. The sum of slots is the value. Commutative, associative, idempotent. You can replicate this across regions, hand a node to a network partition for a week, and when it comes back, the merge will be correct.
Where it gets interesting
The fun starts when you want operations that are not naturally monotonic. The classic example is “remove from a set.”
A G-Set (grow-only set) is trivial — add an element, never remove. But what if you want a real set with removes? You can build a 2P-Set: one G-Set for adds, one G-Set for “tombstones.” An element is in the set iff it is in adds and not in tombstones. Now you have removes. But you cannot re-add an element after removing it.
If you want re-adds, you can use an OR-Set, where each add carries a unique tag. A remove tombstones the specific tags you saw. A concurrent re-add gets a new tag that the remove did not see, so it survives.
The pattern is consistent: each new requirement adds a layer of metadata. The metadata is where the cost is. CRDT papers tend to focus on correctness; production CRDT implementations tend to focus on metadata garbage collection.
Where you actually find them in production
CRDTs feel like an academic curiosity until you go looking, at which point they show up almost everywhere collaborative or partition-tolerant systems exist:
- Riak shipped CRDTs as first-class data types years ago — counters, sets, maps, registers. The Riak data types are still one of the cleanest API examples around.
- Redis Enterprise uses CRDTs as its multi-region replication primitive. The single-region Redis you usually run does not, but the geo-distributed product does.
- Automerge and Yjs are the two CRDT libraries powering most of the recent generation of collaborative editors (Figma uses a custom OT/CRDT hybrid, but most of the smaller competitors land on Yjs).
- Soundcloud famously used a G-Counter for view counts so different shards could increment without coordination, then sum on read.
State-based vs operation-based
The two flavors of CRDTs differ in what they ship between replicas. State-based CRDTs (CvRDTs) send their full state and rely on a merge function to converge. Operation-based CRDTs (CmRDTs) send only the operations themselves, but require the delivery layer to guarantee causal broadcast — every replica must see operations in an order consistent with causality.
In theory they are equivalent in expressive power. In practice, state-based is easier to deploy because it tolerates duplicate delivery and out-of-order arrival, which is exactly what real networks give you. Op-based is more bandwidth-efficient at steady state but pushes complexity into the messaging layer, which is where most of the production bugs end up living. Most production systems I have looked at are a hybrid: state-based at the boundary between replicas, op-based internally for streaming updates.
What I wish someone had told me
The hard part of CRDTs is not the math. It is convincing yourself that the convergent state is the state your application actually wants. A counter that survives partitions but counts double under retries is not useful if you wanted it to be exact. A set that converges to “both edits survived” is not useful if your application’s semantics demand last-write-wins.
CRDTs do not give you “correctness” for free. They give you convergence for free. You still have to design the data type so that the converged state means what you need.
A common failure mode
The mistake I have watched teams make most often is reaching for CRDTs as a way to avoid thinking about conflict semantics. The pitch sounds appealing — “we’ll use a CRDT, so conflicts resolve automatically.” What this actually means is “we will pick a resolution policy that depends on the CRDT we chose, instead of one that depends on what our users want.”
That is rarely the right tradeoff. The conflict policy is a product decision, not an infrastructure decision. The CRDT should follow from that, not the other way around. When this gets inverted, you end up with systems that “converge” but feel wrong to users — silently dropped writes, ghost reappearing items, counters that drift slightly over time. None of these are bugs in the CRDT. They are bugs in the choice of CRDT.
Practical reading order
If you want to actually use this stuff, skip Shapiro et al. on the first pass. Read:
- Roh et al., Replicated Abstract Data Types — the intuition, with cleaner notation
- Bartolomeu et al.’s survey — for the landscape of practical CRDTs
- Then go back to Shapiro for the formal grounding