Tiny & abstracted Raft algorithm implementation - Go version

TinyRaft

Raft leader election isolated from the rest of the algorithm.

Description

TinyRaft can be used as a building block for standard Raft with a K/V database and log-based replication, or for other variations of "weak consensus".

The leader is not required to replicate logs - it just has to support a recovery procedure that makes all replicas consistent. For example, it can use a CRDT or apply erasure coding to entries, and it may use the term number as logical time. The only requirement is to guarantee preservation of entries confirmed by all hosts participating in the consensus.
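
For illustration only, such a recovery hook could look roughly like the following; the interface and names are hypothetical and are not part of TinyRaft's API:

```go
// Hypothetical (NOT TinyRaft's API): one possible shape of the recovery hook
// a leader could implement instead of log replication.
type NodeID uint64

type Recovery interface {
	// OnBecomeLeader is called when this node wins the election for `term`.
	// It must bring all reachable replicas into a consistent state while
	// preserving every entry previously confirmed by all consensus members.
	OnBecomeLeader(term uint64, followers []NodeID) error
}
```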

Supports leader expiration like in NuRaft: https://github.com/eBay/NuRaft/blob/master/docs/leadership_expiration.md

Why?

Raft is a popular consensus algorithm. "Consensus" means making sure that multiple members of a cluster have the same view of some data and its changes over time. Some nodes may fall behind and have an older version of the data, but they should never see diverging changes.

However, Raft achieves consensus in a pretty strict way called log replication:

  1. Raft applies all changes strictly sequentially to the leader's log, sends that log to followers and applies it to the persistent state once a majority of followers confirm it.
  2. If a previous leader becomes a follower and happens to have changes newer than the current leader's, these changes are thrown away. This is safe because Raft never confirmed such changes to clients as "stable".
  3. If a follower's state is too old to be "caught up" using the log, full synchronisation is used and the whole state of the follower is copied from the leader.

That means Raft basically consists of 2 parts:

  • leader election
  • log replication (and state storage)

But in some applications you may want to use leader election without log replication - maybe because you want to replace it with your own state reconciliation algorithm, or maybe because you don't want any persistence at all.

There are other leader election algorithms, such as Bully, but it seems to be partition-prone and thus unpopular. And if you try to modify it to make it more stable, you quickly end up with something very similar to Raft's leader election. :-)

But all Raft implementations I've seen bundle leader election and log replication together, and you can't untie them. This includes:

Some of these libraries don't even allow you to supply a user-defined network layer or an external event loop. Some require a full-featured K/V database underneath to work. Maybe you could trick these libraries into thinking that they always replicate an empty state, but that feels like a hack - these libraries definitely weren't meant to be used that way.

That's where TinyRaft comes to the rescue.

Ideas For Replacing Log Replication

CRDT

Associate Raft's term number and an additional integer version number with each key. Increase the version number on every write. Instead of deleting a key, save a "tombstone" with an increased version number. Garbage-collect deleted keys when all nodes confirm the deletion. During synchronisation between nodes, make the newest (term, version) pair of each key win.
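
A minimal sketch of this merge rule in Go, assuming a plain in-memory map; the types and names are illustrative, not part of TinyRaft:

```go
// Entry is a hypothetical per-key record for the (term, version) scheme.
type Entry struct {
	Term      uint64 // Raft term in which the write happened
	Version   uint64 // per-key counter, incremented on every write
	Value     []byte // ignored for tombstones
	Tombstone bool   // true if the key was deleted
}

// newer reports whether a wins over b: the higher (term, version) pair wins.
func newer(a, b Entry) bool {
	if a.Term != b.Term {
		return a.Term > b.Term
	}
	return a.Version > b.Version
}

// merge applies a remote entry to the local state during synchronisation.
func merge(local map[string]Entry, key string, remote Entry) {
	if cur, ok := local[key]; !ok || newer(remote, cur) {
		local[key] = remote
	}
}
```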

It'll be slightly less strict than log replication, but it's still OK. The only problem is a possible short-lived non-repeatable read for unconfirmed writes:

  • Client 1 requests a write of version 1 of key X during term 1
  • Node A is the leader, it applies the change, and now it has X=(1,1)
  • Node A sends the write to the network, but the network fails and node A goes offline
  • Client 1 doesn't get the confirmation
  • A new leader is elected, now it's node B
  • Client 2 requests a read of key X from node B and gets nothing, as the key is not yet replicated
  • Node A comes back online, joins the leader (B) and synchronizes key X
  • Now node B also has X=(1,1)
  • Client 2 requests a read of key X from node B again and now gets version (1,1)

But it's fine for key/value databases with eventual consistency or for object storage which doesn't require atomic multi-key transactions.

Erasure coding

Split each value into chunks, calculate additional parity chunks using erasure codes (Reed-Solomon or similar), and, instead of replicating the whole value to each node, make each node store only its corresponding chunk.
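
A rough sketch of the chunking step, assuming the github.com/klauspost/reedsolomon library and a 4 data + 2 parity layout (both are just illustrative choices, not something TinyRaft prescribes):

```go
package main

import (
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	// 4 data chunks + 2 parity chunks: any 4 of the 6 shards are enough
	// to reconstruct the original value.
	enc, err := reedsolomon.New(4, 2)
	if err != nil {
		log.Fatal(err)
	}

	value := []byte("a replicated value, split across nodes instead of copied to each")

	// Split the value into 4 data shards (parity shards are allocated too)...
	shards, err := enc.Split(value)
	if err != nil {
		log.Fatal(err)
	}
	// ...and compute the 2 parity shards.
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Each of the 6 nodes would store one shard. Simulate losing two nodes:
	shards[1], shards[4] = nil, nil

	// Reconstruct the lost shards from the surviving four.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}
	ok, err := enc.Verify(shards)
	fmt.Println("all shards consistent after reconstruction:", ok, err)
}
```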

Usage example

A full-featured usage example would require network code.

Examples without a real network layer can be found in tinyraft_test.go.