662 Commits (fb85da92e80d5f9d34ae8827360c2e5650dd31a0)

Author SHA1 Message Date
Yicheng Qin fa96e64b43 Merge pull request #2624 from yichengq/fix-raft-storage
raft: lock storage when compact it
2015-04-03 13:51:06 -07:00
Yicheng Qin 3d32c059dd raft: generate correct json-format status
The current JSON-format string is missing the double quotes around the status field.

Use %q for clarity.
2015-04-03 13:49:46 -07:00
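A minimal sketch of the quoting issue described above. The helper names and the "raftState" field are hypothetical and are not taken from etcd's code. Note that %q emits a Go-syntax quoted string, which coincides with JSON quoting for plain ASCII values such as state names but not for every possible string.

```go
package main

import "fmt"

// statusJSON is a hypothetical helper. It uses %q so that the string
// value is wrapped in double quotes, as JSON requires.
func statusJSON(state string) string {
	return fmt.Sprintf(`{"raftState":%q}`, state)
}

// brokenStatusJSON shows the kind of bug the commit message describes:
// %s writes the bare value, so the result is not valid JSON.
func brokenStatusJSON(state string) string {
	return fmt.Sprintf(`{"raftState":%s}`, state)
}
```

Calling statusJSON("StateLeader") yields a quoted value, while brokenStatusJSON("StateLeader") leaves it bare.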
Yicheng Qin d91ea7f199 raft: fix freeTo fails to free
If freeTo is called with to set to the latest inflight, freeTo
fails to free the slots.
2015-04-03 13:21:26 -07:00
Yicheng Qin c6de464587 raft: lock storage when compact it
etcd now compacts raft storage asynchronously, so appending entries to raft
storage may happen at the same time. Add a lock to fix the bug where
the entries saved in storage may be organized incorrectly.
2015-04-03 11:38:01 -07:00
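The locking pattern described above can be sketched as follows. The entry and store types are hypothetical stand-ins for raft's log entry and in-memory storage; only the idea of guarding both Append and Compact with the same mutex comes from the commit message.

```go
package main

import "sync"

// entry is a hypothetical, simplified log entry.
type entry struct {
	Index uint64
}

// store is a hypothetical in-memory log. ents[0] is a dummy entry
// that marks the compaction point.
type store struct {
	mu   sync.Mutex
	ents []entry
}

func newStore() *store { return &store{ents: make([]entry, 1)} }

// Append adds entries to the end of the log while holding the lock.
func (s *store) Append(ents []entry) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ents = append(s.ents, ents...)
}

// Compact discards all entries before index i; the entry at i becomes
// the new dummy entry. Out-of-range requests are ignored.
func (s *store) Compact(i uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	offset := s.ents[0].Index
	if i <= offset || i > s.ents[len(s.ents)-1].Index {
		return
	}
	// Copy so the compacted prefix can be garbage collected.
	s.ents = append([]entry(nil), s.ents[i-offset:]...)
}

// FirstIndex returns the first index still available after compaction.
func (s *store) FirstIndex() uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.ents[0].Index + 1
}

// LastIndex returns the index of the last stored entry.
func (s *store) LastIndex() uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.ents[len(s.ents)-1].Index
}
```

Because every method takes the same mutex, a compaction running in one goroutine cannot interleave with an append in another.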
Xiang Li 3f867bc6ed raft: node bench matches reality 2015-03-28 14:53:42 -07:00
Xiang Li 05e240b892 *: update protobuf 2015-03-25 10:14:35 -07:00
Ben Darnell c9d507df11 raft: Use raft.Config in MultiNode. 2015-03-24 15:37:13 -04:00
Xiang Li b3fb052ad4 raft: make peers a private field in raft.Config 2015-03-24 11:10:07 -07:00
Xiang Li abddef0f28 raft: make node configurable 2015-03-23 21:20:49 -07:00
Brandon Philips 057978bbc6 raft: design: fixup markdown
Need a space between `1.` for markdown to render as a list.
2015-03-23 14:01:17 -07:00
Xiang Li d9b5b56c82 raft: make raft configurable 2015-03-23 09:55:19 -07:00
Xiang Li a552722f03 Merge pull request #2544 from xiang90/raft-inflight
raft: add flow control for progress
2015-03-20 20:12:31 -07:00
Xiang Li 4a64373225 raft: add flow control for progress
Each progress has an inflights sliding window. When the progress
is in replicate state, inflights controls the sending speed
of the leader.

The leader can have at most maxInflight inflight
messages for each replicate progress. Receiving an appResp moves
the sliding window forward. A heartbeat response frees one
slot if the window is full.
2015-03-20 20:04:33 -07:00
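A sliding window of this kind can be sketched as a fixed-size ring buffer. This is an illustrative sketch, not the code from the commit; the field and method names are assumptions chosen to match the terms used in the message.

```go
package main

// inflights is a hypothetical fixed-size ring buffer holding the last
// log index of each message that is in flight to one follower.
type inflights struct {
	start  int      // position of the oldest inflight in buffer
	count  int      // number of inflights currently in the window
	buffer []uint64 // len(buffer) is the maxInflight limit
}

func newInflights(size int) *inflights {
	return &inflights{buffer: make([]uint64, size)}
}

// full reports whether the leader must stop sending to this follower.
func (in *inflights) full() bool { return in.count == len(in.buffer) }

// add records a newly sent message. Callers must check full() first.
func (in *inflights) add(index uint64) {
	if in.full() {
		panic("cannot add into a full inflights window")
	}
	next := (in.start + in.count) % len(in.buffer)
	in.buffer[next] = index
	in.count++
}

// freeTo releases every inflight whose index is <= to, which is what
// an append response acknowledging index "to" would trigger.
func (in *inflights) freeTo(to uint64) {
	freed, idx := 0, in.start
	for freed < in.count && in.buffer[idx] <= to {
		idx = (idx + 1) % len(in.buffer)
		freed++
	}
	in.count -= freed
	in.start = idx
}

// freeFirstOne releases the oldest inflight, as a heartbeat response
// would when the window is full.
func (in *inflights) freeFirstOne() {
	if in.count > 0 {
		in.freeTo(in.buffer[in.start])
	}
}
```

The loop in freeTo counts freed slots rather than comparing positions, so acknowledging the latest inflight empties the window even when the ring has wrapped around.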
Xiang Li 09a86cb9b9 Merge pull request #2553 from xiang90/raft-design
raft: add progress state machine graph
2015-03-20 19:57:51 -07:00
Xiang Li 86622537a1 raft: add progress state machine graph 2015-03-20 15:28:50 -07:00
Xiang Li 44d9209990 Merge pull request #2548 from xiang90/raft-design
raft: add our very first design.md
2015-03-20 09:07:44 -07:00
Yicheng Qin 6e557c58c7 Merge pull request #2532 from yichengq/342
raft: print out data and time in log
2015-03-20 08:03:23 -07:00
Xiang Li 59d8089295 raft: add our very first design.md 2015-03-19 21:00:47 -07:00
Xiang Li 2adb58f9de raft: move progress to progress.go 2015-03-19 10:05:04 -07:00
Xiang Li 7571b2cde2 raft: limit the size of msgApp
Limit the max size of entries sent per message.
This lowers the cost in the probing state, since the size per message is limited,
and lowers the penalty when next is aggressively decreased to too low a value.
2015-03-18 15:59:30 -07:00
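One way to cap the size of a message is to keep the longest prefix of entries that fits within a byte budget. The sketch below assumes a simplified entry type and measures size by payload length only; the function name and the rule of always keeping at least one entry are assumptions made for illustration.

```go
package main

// entry is a hypothetical, simplified log entry.
type entry struct {
	Index uint64
	Data  []byte
}

// limitSize returns the longest prefix of ents whose combined payload
// size does not exceed maxSize. At least one entry is always returned
// for a non-empty input so that replication can still make progress.
func limitSize(ents []entry, maxSize uint64) []entry {
	if len(ents) == 0 {
		return ents
	}
	size := uint64(len(ents[0].Data))
	limit := 1
	for ; limit < len(ents); limit++ {
		size += uint64(len(ents[limit].Data))
		if size > maxSize {
			break
		}
	}
	return ents[:limit]
}
```

With a small budget, a probe that turns out to be rejected wastes only a few entries of bandwidth.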
Yicheng Qin 0634cf2cfe raft: print out data and time in log
Keep the default log setting consistent with other packages.
2015-03-18 15:49:06 -07:00
Yicheng Qin 7e7bc76038 Merge pull request #2514 from yichengq/340
raft: introduce progress states
2015-03-18 09:40:30 -07:00
Yicheng Qin 67194c0b22 raft: introduce progress states 2015-03-18 08:16:32 -07:00
Xiang Li d17f3a4452 Merge pull request #2519 from bdarnell/multinode-commit
raft: Use the correct commit index when advancing in MultiNode.
2015-03-17 10:31:53 -07:00
Ben Darnell cd1ff78ff3 raft: Elaborate a little more about committed entries in commitReady. 2015-03-17 13:22:36 -04:00
funkygao 0b912c0faf raft: fix godoc about starting a node 2015-03-17 17:35:18 +08:00
Ben Darnell 271d911c32 raft: Use the correct commit index when advancing in MultiNode.
This fixes an issue when restoring from a snapshot and brings
MultiNode closer to Node.
2015-03-16 18:40:51 -04:00
Ben Darnell 5e19adcf70 raft: correctly pass arguments to Logger.Panicf() 2015-03-12 16:15:43 -04:00
Iago López Galeiras e698192e4a rafttest: fix build error
raftLogger is not exported so we can't access it from here. Go back to
using log.
2015-03-12 11:47:13 +01:00
Xiang Li 39731724ff Merge pull request #2485 from yichengq/337
raft: fall back to bad path when unreachable
2015-03-11 14:16:39 -07:00
Yicheng Qin be0bf2a2bd raft: fall back to bad path when unreachable 2015-03-11 13:21:23 -07:00
Xiang Li c643967a41 raft: reply with the commit index when receives a smaller append message
A follower should not reject an append message with a smaller index than its commit
index. Otherwise it will trigger the leader's resending logic, which might have a high cost.
2015-03-10 22:32:36 -07:00
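The rule above can be sketched as a small decision function. The message types, the matches callback, and the simplification that a message carries only the index it extends are all assumptions for illustration; the real message handling is more involved.

```go
package main

// appendMsg is a hypothetical append message. Index is the log index
// that the new entries (omitted here) would follow.
type appendMsg struct {
	Index uint64
}

// appendResp is a hypothetical reply to the leader.
type appendResp struct {
	Index  uint64
	Reject bool
}

// handleAppend sketches the follower side. committed is the follower's
// commit index; matches reports whether the follower's log agrees with
// the leader at the given index.
func handleAppend(committed uint64, m appendMsg, matches func(uint64) bool) appendResp {
	if m.Index < committed {
		// Stale message: everything up to committed is already known
		// to be agreed on, so acknowledge the commit index instead of
		// rejecting and forcing the leader to resend.
		return appendResp{Index: committed}
	}
	if matches(m.Index) {
		return appendResp{Index: m.Index}
	}
	return appendResp{Index: m.Index, Reject: true}
}
```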
Xiang Li a2be25cba4 Merge pull request #2460 from xiang90/raft-logger
raft: introduce logger interface
2015-03-09 08:00:21 -07:00
Xiang Li 97579e2e1d raft: introduce logger interface 2015-03-08 21:36:32 -07:00
Xiang Li 7fe608532a raft: do not reset vote if term is not changed
raft MUST keep the voting information for the same term. reset
should not reset vote if term is not changed.
2015-03-07 22:31:20 -08:00
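A minimal sketch of the rule, assuming a hypothetical raftState struct and a none constant meaning "no vote cast"; the other state a real reset would clear is omitted.

```go
package main

// none is a hypothetical sentinel meaning that no vote has been cast.
const none uint64 = 0

// raftState is a hypothetical subset of the raft state.
type raftState struct {
	Term uint64
	Vote uint64
}

// reset moves the state to the given term. The vote is cleared only
// when the term actually changes, so a node can never vote for two
// different candidates within the same term.
func (r *raftState) reset(term uint64) {
	if r.Term != term {
		r.Term = term
		r.Vote = none
	}
}
```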
Ben Darnell 725c411346 Add ReportUnreachable and ReportSnapshot to MultiNode.
Add ReportSnapshot requirement to doc.go.
2015-03-05 12:39:52 -05:00
Xiang Li 6b9b695167 Merge pull request #2435 from bdarnell/multinode
raft: Introduce MultiNode.
2015-03-04 21:27:20 -08:00
Ben Darnell c824c867ec raft: more doc updates.
Including parallelism of persist and send, cancellation of
ConfChanges, and the risks of two-node clusters.
2015-03-04 15:48:35 -05:00
Ben Darnell 4e74d81bbb raft: Introduce MultiNode.
MultiNode is an alternative to raft.Node that is more efficient
when a node may participate in many consensus groups. It is currently
used in the CockroachDB project; this commit merges the
github.com/cockroachdb/etcd fork back into the mainline.
2015-03-04 15:30:21 -05:00
Ben Darnell 250970cc23 raft: Expand doc.go
Includes more details on the required caller behavior and the safety of
membership changes.

Closes #2397
2015-03-04 13:18:02 -05:00
Yicheng Qin b4b9b9118a rafthttp: report MsgSnap status 2015-03-02 09:38:11 -08:00
Yicheng Qin 09f181f585 raft: log unreachable remote node 2015-03-01 16:47:49 -08:00
Yicheng Qin fbd5c81139 raft: remove shadowing of variables from test 2015-02-28 12:09:33 -08:00
Xiang Li 9b4d52ee73 raft: do not resend snapshot if not necessary
raft relies on the link layer to report the status of the sent snapshot.
If the snapshot is still sending, the replication to that remote peer will
be paused. If the snapshot finishes sending, the replication will begin
optimistically after electionTimeout. If the snapshot fails, raft will
try to resend it.
2015-02-28 11:41:58 -08:00
Xiang Li 2185ac5ac8 raft: cleanup unreachable 2015-02-28 11:35:16 -08:00
Xiang Li 2af33fd494 raft: add reportUnreachable 2015-02-28 10:45:22 -08:00
Xiang Li cbef6ab152 raft: clean up storage 2015-02-28 10:09:07 -08:00
Xiang Li 5ede18be74 raft: separate compact and createsnap in memory storage 2015-02-28 10:08:30 -08:00
Ben Darnell b53dc0826e Only use the EntryFormatter for normal entries.
ConfChange entries also have a Data field but the application-supplied
formatter won't know what to do with them.
2015-02-20 13:51:14 -05:00
Barak Michener 92dca0af0f *: remove shadowing of variables from etcd and add travis test
We've been bitten by this enough times that I wrote a tool so that
it never happens again.
2015-02-17 16:31:42 -05:00
Xiang Li fa66055f66 rafttest: drop isPaused 2015-02-09 18:52:34 -08:00
Xiang Li 085b608de9 rafttest: support node pause 2015-02-09 16:26:43 -08:00
Xiang Li 279b216f9a rafttest: wait for network sending 2015-02-09 15:52:16 -08:00
Xiang Li 65cd0051fe rafttest: add network delay 2015-02-06 15:01:07 -08:00
Xiang Li d423946fa4 rafttest: add network drop 2015-02-06 10:50:55 -08:00
Xiang Li 83edf0d862 rafttest: separate network interface and network 2015-02-03 22:50:27 -08:00
Xiang Li b147a6328d rafttest: add restart and related simple test 2015-02-03 10:08:52 -08:00
Xiang Li d65af21b73 raft: add raft test suite 2015-02-01 14:53:22 -08:00
Xiang Li bff2ccaa22 Merge pull request #2170 from xiang90/remove_log
raft: remove default verbose logging
2015-01-27 15:58:53 -08:00
Xiang Li 553379e82b raft: remove default verbose logging 2015-01-27 15:57:44 -08:00
Ben Darnell 33d2400063 raft: Send any waiting appends after receiving MsgAppResp.
This addresses a problem that comes up in the cockroach tests,
in which the order of messages may lead to deadlocks (due to
the fact that we don't have regular heartbeat timers in most
of our tests).
2015-01-27 17:43:29 -05:00
Xiang Li 276c9540b4 etcdserver: support raft.status 2015-01-26 16:39:33 -08:00
Jonathan Boulle f1ed69e883 *: switch to line comments for copyright
Build tags are not compatible with block comments.
Also adds copyright header to a few places it was missing.
2015-01-26 09:53:30 -08:00
Ben Darnell 8c3a6508e9 raft: Add applied to the newRaft log message. 2015-01-22 12:04:40 -05:00
Ben Darnell 59214978a2 raft: Add applied index as an argument to newRaft and RestartNode. 2015-01-22 11:38:05 -05:00
Ben Darnell cd9d5573d4 raft: make EntryFormatter less clever. 2015-01-21 19:27:26 -05:00
Ben Darnell e73d442e32 raft: Add support for custom formatters in DescribeMessage/DescribeEntry 2015-01-21 14:12:58 -05:00
Xiang Li 003b97a60f raft: public progress struct in raft 2015-01-20 10:26:22 -08:00
Xiang Li b34936b097 raft: add progress into status 2015-01-18 15:23:50 -08:00
Xiang Li 0eaaad0e48 raft: add Status interface
Status returns the current status of raft state machine.
2015-01-16 14:02:04 -08:00
Ben Darnell 2e1c36cdd9 raft: introduce MsgHeartbeatResp.
Now that heartbeats are distinct from MsgApp{,Resp}, the retries
currently performed in stepLeader's MsgAppResp section are only
performed on an actual MsgAppResp (or a new MsgProp). This means
that it may take a long time to recover from a dropped MsgAppResp
in a quiet cluster.

This commit adds a dedicated heartbeat response message. This message
does not convey the follower's current log position because the
MsgHeartbeat does not include the leader's term and index. Upon receipt
of a heartbeat response, the leader may retry the latest MsgApp if it
believes the follower to be behind.
2015-01-14 17:34:10 -05:00
Ben Darnell 9972e62d94 raft: Use <= instead of < for heartbeat ticks.
In code outside the raft package, we cannot call raft.bcastHeartbeat
directly. Instead, to control heartbeats we set heartbeatInterval to 1
and call Tick().
2015-01-14 15:27:32 -05:00
Yicheng Qin 7a2fa39e52 Merge pull request #2012 from andybons/master
raft: add link to the paper raft_paper_test.go refers to
2015-01-06 00:27:47 -08:00
Xiang Li 2a83e350b1 Merge pull request #1992 from xiang90/rm_leader
*: support removing the leader from a 2 members cluster
2015-01-02 14:15:12 -08:00
Xiang Li 35b907ac58 raft: add lastIndex as rejectHint
Add the lastindex of the raft log as reject hint, so the leader can
bypass the greater index probing and decrease the next index directly
to last + 1.
2015-01-01 19:04:07 -08:00
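The effect of the hint on the leader side can be sketched as follows. The progress struct and the method name are assumptions for illustration; hint stands for the follower's last log index carried in the rejection.

```go
package main

// progress is a hypothetical view of what the leader tracks for one
// follower. Next is the index of the next entry to send.
type progress struct {
	Next uint64
}

func minUint64(a, b uint64) uint64 {
	if a < b {
		return a
	}
	return b
}

// maybeDecrTo handles a rejection of the append that extended index
// rejected. Instead of probing backwards one index at a time, Next
// jumps straight to hint+1 when the follower's log is shorter.
// It returns false if the rejection is stale.
func (p *progress) maybeDecrTo(rejected, hint uint64) bool {
	if p.Next-1 != rejected {
		return false // the rejection is for an older probe
	}
	p.Next = minUint64(rejected, hint+1)
	if p.Next < 1 {
		p.Next = 1
	}
	return true
}
```

For a follower whose log ends at index 10, a leader probing at 100 skips the indexes in between and retries from 11.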
Xiang Li 152676f43a *: support removing the leader from a 2 members cluster 2014-12-29 11:34:33 -08:00
Andrew Bonventre 4463f5c4b3 raft: add link to the paper raft_paper_tests.go refers to 2014-12-29 14:17:48 -05:00
Xiang Li fc96a9e4a7 raft: remove unnecessary funcs in raft.go 2014-12-25 17:04:33 -08:00
Xiang Li 2dbdf87f86 raft: add doc for storage 2014-12-22 12:33:14 -08:00
Xiang Li 896bac1f76 raft: flush the commit to fix a race in test 2014-12-18 17:10:37 -08:00
Xiang Li 88767d913d raft: leader waits for the reply of previous message when follower is not in good path.
It is reasonable for the leader to wait for the reply before sending out the next
msgApp or msgSnap to a follower in bad path. Otherwise the leader will send out useless
messages if the previous message is rejected or the previous message is a snapshot.
In the snapshot case especially, the leader is certain to send out a duplicate message
including the snapshot, which is a huge waste.

This commit implements a timeout-based wait mechanism. The timeout for a normal msgApp is
heartbeatTimeout and the timeout for a snapshot is electionTimeout (a snapshot is larger). We
can implement a piggyback mechanism (the application reports the lost msg) in the future
if necessary.
2014-12-18 15:01:50 -08:00
Xiang Li 044e35b814 raft: use newRaft 2014-12-15 11:25:35 -08:00
Xiang Li c586d5012c raft: log term as %d 2014-12-14 10:06:45 -08:00
Xiang Li 2c2e032155 Merge pull request #1908 from bdarnell/error-fixes
raft: remove panic when we see a proposal with no leader.
2014-12-11 13:58:51 -08:00
Ben Darnell b26856b603 raft: add detail to "no leader" log message 2014-12-11 15:07:32 -05:00
Xiang Li 89cba625d6 Merge pull request #1897 from xiang90/raft
raft: get rid of the using of defer in critical path
2014-12-10 21:24:38 -08:00
Yicheng Qin e89cc25c50 Merge pull request #1901 from yichengq/260
rafthttp: batch MsgProp
2014-12-10 21:16:07 -08:00
Yicheng Qin 3867c72c8a raft: support to do multiple proposals in one message 2014-12-10 20:00:59 -08:00
Ben Darnell fa247d09cc raft: remove panic when we see a proposal with no leader.
This panic can never be reached when using raft.Node, because we only
read from propc when there is a leader. However, it is possible to see
this error when using the raft object directly (as in MultiNode),
and in this case it is better to simply drop the proposal (as if we had
sent it to a leader that immediately vanished).

Add an error return to MemoryStorage.Append for consistency.
2014-12-10 17:34:40 -05:00
Xiang Li 96de9776b7 raft: get rid of allocation 2014-12-10 13:41:04 -08:00
Xiang Li e4c0f5c1a8 Merge pull request #1895 from xiang90/snap_nodes
etcd: update conf when apply the confChange entry
2014-12-09 11:45:01 -08:00
Xiang Li a5efbf826d raft: drop nodes in softState 2014-12-09 11:43:52 -08:00
Yicheng Qin 0472ddf05f Merge pull request #1890 from yichengq/259
raft: set raft.Commit too when setting raftLog.committed
2014-12-09 11:28:05 -08:00
Yicheng Qin 4804c45e14 raft: set raft.Commit too when setting raftLog.committed 2014-12-08 22:35:55 -08:00
Yicheng Qin 22dd3b039c Merge pull request #1888 from yichengq/258
raft: increase term to 1 before append initial entries
2014-12-08 22:27:23 -08:00
Yicheng Qin 7317834417 raft: increase term to 1 before append initial entries
Because the term of new raft is 0, it is weird to have term-1 committed
entries in the log.
2014-12-08 22:21:39 -08:00
Xiang Li ba45637ba3 raft: group step funcs 2014-12-08 15:29:54 -08:00
Xiang Li 099f4f10ea raft: one line 2014-12-08 15:28:48 -08:00
Xiang Li 8ead428e76 raft: group getter funcs 2014-12-08 15:24:34 -08:00
Xiang Li f73d059d80 raft: group configuration related funcs 2014-12-08 15:23:21 -08:00
Xiang Li 25313b1210 raft: move poll close to campaign 2014-12-08 15:21:57 -08:00
Xiang Li d52c66ad42 raft: removed unused func 2014-12-08 15:20:43 -08:00
Xiang Li 62ed1de10d raft: refactoring logging 2014-12-08 15:16:02 -08:00
Xiang Li 6cb7f2d9e9 raft: print out log when creating a newraft 2014-12-08 14:37:39 -08:00
Ben Darnell ea4d645a83 raft: Ignore redundant addNode calls.
This avoids clobbering any state when bootstrapping entries are
applied twice.
2014-12-05 17:15:50 -05:00
Ben Darnell 3d91faf85a Pre-apply the bootstrapping ConfChange entries.
This eliminates the need to fake an ApplyConfChange call before Campaign
in tests.

Fixes #1856.
2014-12-05 15:35:39 -05:00
Xiang Li 6409a8bf0d raft: filter out messages from unknown sender.
If we cannot find `m.from` among the current peers in raft and it is a response
message, we should filter it out or raft panics. We are not trying to protect against
malicious peers.

It has to be done in the raft node layer synchronously. We could check
it at the application layer asynchronously, but between the check and
the message going into raft, the raft state machine might make progress and
remove the `m.from` peer.
2014-12-05 11:34:56 -08:00
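The filter can be sketched as a check performed immediately before a message is stepped into the state machine. The message type, the set of message kinds, and the function names are assumptions for illustration.

```go
package main

// msgType is a hypothetical, reduced set of raft message kinds.
type msgType int

const (
	msgApp msgType = iota
	msgAppResp
	msgVote
	msgVoteResp
)

// message is a hypothetical, simplified raft message.
type message struct {
	Type msgType
	From uint64
}

// isResponse reports whether the message is a reply to something this
// node sent earlier.
func isResponse(t msgType) bool {
	return t == msgAppResp || t == msgVoteResp
}

// shouldStep decides whether m may be passed to the state machine.
// Responses from senders that are not current peers are dropped,
// because the state machine has no progress record for them.
func shouldStep(peers map[uint64]struct{}, m message) bool {
	if !isResponse(m.Type) {
		return true
	}
	_, ok := peers[m.From]
	return ok
}
```

Running the check in the same goroutine that steps the message closes the window in which the peer could be removed between check and use.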
Xiang Li 182c30a41a raft: refactor logging at node level 2014-12-04 21:03:06 -08:00
Xiang Li 197e6b1b20 Merge pull request #1858 from vlajos/typofixes-vlajos-20141204
typofixes - https://github.com/vlajos/misspell_fixer
2014-12-04 14:52:27 -08:00
Veres Lajos 3de2ab2c04 *: typofixes
https://github.com/vlajos/misspell_fixer
2014-12-04 22:51:19 +00:00
Xiang Li a47690dd30 Merge pull request #1845 from xiang90/testunstable
raft: add TestUnstableTruncateAndAppend
2014-12-04 11:03:37 -08:00
Xiang Li 4ebd3a0b10 Merge pull request #1852 from xiang90/heartbeat
raft: add msgHeartbeat type
2014-12-04 10:25:46 -08:00
Xiang Li 149389cbfa raft: add msgHeartbeat type 2014-12-04 08:29:31 -08:00
Yicheng Qin e344774c10 Merge pull request #1850 from yichengq/247
raft: return 0 for term of compacted index
2014-12-03 17:23:32 -08:00
Yicheng Qin 34a468de36 raft: return 0 for term of compacted index
It is necessary to make this check because of the following case:

1. memory storage contains ents from index 0 to 50, and unstable has
ents from index 50 to 60.
2. raft receives an incoming snapshot with index 100.
3. raft restores its unstable to 100, but has not applied snapshot on memory storage.
4. raft receives an out-dated MsgApp from index 60.
5. raft finds the term of index 60 to check the match.
6. raft asks memory storage about the term of index 60 after it failed to get
it from unstable.
7. memory storage panics because it knows nothing about index 60.
2014-12-03 17:22:36 -08:00
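The check that avoids the panic in step 7 can be sketched as a range test before any lookup. The raftLog struct below is a hypothetical simplification that tracks only the available index range and a term table.

```go
package main

// raftLog is a hypothetical, simplified log. Indexes in [first, last]
// are available; first-1 is the snapshot index.
type raftLog struct {
	first uint64
	last  uint64
	terms map[uint64]uint64 // term of each known index
}

// term returns the term of index i, or 0 when i is outside the range
// the log can answer for. Returning 0 makes the later match check
// fail cleanly instead of asking storage about an unknown index.
func (l *raftLog) term(i uint64) uint64 {
	// i+1 < first means i is older than the snapshot index, that is,
	// compacted. Written this way to avoid underflow when first is 0.
	if i+1 < l.first || i > l.last {
		return 0
	}
	return l.terms[i]
}
```

In the scenario above, the log has been restored to snapshot index 100, so a query for index 60 falls below the available range and yields 0.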
Xiang Li ddd9cb7345 raft: add TestUnstableTruncateAndAppend 2014-12-03 16:37:19 -08:00
Xiang Li 2caf4f5f22 raft: fix log format in sendAppend 2014-12-03 16:11:44 -08:00
Xiang Li 06a5892a18 raft: more logging 2014-12-03 14:46:24 -08:00
Xiang Li 8074a5b5a4 raft: fix error message format in test 2014-12-03 13:36:47 -08:00
Xiang Li 37ab463e86 raft: add TestUnstableStableTo 2014-12-03 13:26:35 -08:00
Xiang Li 7703d4942c raft: add TestUnstableRestore 2014-12-03 13:03:56 -08:00
Xiang Li be60c88603 Merge pull request #1842 from xiang90/unstable_test
raft: add TestUnstableFirstIndex
2014-12-03 11:50:39 -08:00
Yicheng Qin 63ed202db6 raft: print out term in decimal format 2014-12-03 11:33:51 -08:00
Xiang Li 48f75ca645 raft: add TestUnstableMaybeTerm 2014-12-03 11:30:59 -08:00
Xiang Li 058356d9bd raft: add TestUnstableLastIndex 2014-12-03 11:11:31 -08:00
Xiang Li 98ebfa3468 raft: add TestUnstableFirstIndex 2014-12-03 11:11:11 -08:00
Yicheng Qin 23b32a6cbe Merge pull request #1716 from yichengq/225
raft: panic if loaded commit is out of range
2014-12-02 22:14:12 -08:00
Yicheng Qin 38768e5396 raft: panic if loaded commit is out of range 2014-12-02 22:09:34 -08:00
Xiang Li b3841afcc3 raft: do not restore snapshot if local raft has longer matching history
Raft should not restore the snapshot if it has longer matching history.
Otherwise, restoring the snapshot might remove the matched entries.
2014-12-02 21:34:14 -08:00
Xiang Li 3209fd544b raft: panic on bad slice 2014-12-02 17:48:03 -08:00
Xiang Li 79014556e9 Merge pull request #1831 from xiang90/fix_unstable
raft: fix unstable
2014-12-02 14:43:11 -08:00
Xiang Li 2f5b748a90 raft: clarify that the firstIndex might not be available. 2014-12-02 14:27:52 -08:00
Yicheng Qin 1c7b9317a9 Merge pull request #1833 from yichengq/244
raft: not call stableTo for restored snapshot
2014-12-02 13:20:39 -08:00
Yicheng Qin 551a56fb98 raft: not call stableTo for restored snapshot
Stable has been set when restoring the snapshot in raftlog, so we don't need
to set it after advance.
2014-12-02 13:10:35 -08:00
Xiang Li b7ca56e3c8 raft: move good case of truncateAndAppend to the first place 2014-12-02 13:05:55 -08:00
Xiang Li 3cadaca1a3 Merge pull request #1830 from xiang90/raft_snap_log
raft: log snapshot events
2014-12-02 12:06:15 -08:00
Xiang Li 411063e14f raft: log snapshot events 2014-12-02 11:57:10 -08:00
Xiang Li 788d1e59a2 raft: use index in entry 2014-12-02 10:25:27 -08:00
Xiang Li 51de095d2c raft: logging state change events and events on bad path 2014-12-02 10:08:19 -08:00
Xiang Li 312db7f0f3 raft: fix memory storage
Memory storage should append all entries that have a greater index
than snap.Metadata.Index. We first truncate the old parts of the
incoming entries, then truncate the existing entries in the storage.
Finally, we append the incoming entries to the existing entries.
2014-12-01 16:37:16 -08:00
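The three steps can be sketched as a pure function over slices. This is an illustration under assumptions, not the code from the commit: existing[0] is taken to be a dummy entry at the snapshot index, and the function name is hypothetical.

```go
package main

// entry is a hypothetical, simplified log entry.
type entry struct {
	Index uint64
	Term  uint64
}

// appendEntries merges incoming into existing. existing[0] is assumed
// to be a dummy entry at the snapshot index and is never overwritten.
func appendEntries(existing, incoming []entry) []entry {
	if len(incoming) == 0 {
		return existing
	}
	first := existing[0].Index + 1
	if incoming[len(incoming)-1].Index < first {
		return existing // everything incoming is already compacted
	}
	// Step 1: drop the part of incoming covered by the snapshot.
	if incoming[0].Index < first {
		incoming = incoming[first-incoming[0].Index:]
	}
	// Step 2: drop existing entries that incoming will replace.
	offset := incoming[0].Index - existing[0].Index
	if offset > uint64(len(existing)) {
		panic("missing log entries before the incoming batch")
	}
	merged := append([]entry{}, existing[:offset]...)
	// Step 3: append what is left of incoming.
	return append(merged, incoming...)
}
```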
Xiang Li 19ccdbee18 Merge pull request #1806 from xiang90/no_copy
No copy
2014-12-01 13:15:13 -08:00
Xiang Li 92d4112feb Merge pull request #1809 from xiang90/unstable
raft: stableTo checks term matching
2014-12-01 11:09:40 -08:00
Xiang Li 649176934a raft: add tests for stableTo 2014-12-01 10:54:34 -08:00
Xiang Li 3c0fbe285c raft: stableTo checks term matching
stableTo should only mark the index stable if the term matches. After raft sends out unstable
entries to the application, raft makes progress without waiting for a reply. When the application
calls stableTo to signal that the entries up to "index" are stable, raft might have truncated
some entries before "index" due to a lost leader. raft must verify the (index, term) of stableTo
before marking the entries as stable.
2014-11-28 14:13:07 -08:00
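The (index, term) check can be sketched as follows, assuming a hypothetical unstable struct whose entries slice starts at offset.

```go
package main

// entry is a hypothetical, simplified log entry.
type entry struct {
	Index uint64
	Term  uint64
}

// unstable holds entries not yet persisted. entries[0] has index
// offset.
type unstable struct {
	offset  uint64
	entries []entry
}

// stableTo marks entries up to index i as stable, but only if the
// entry currently at i has term t. A mismatch means the entries the
// application persisted were truncated and replaced in the meantime,
// so the notification is ignored.
func (u *unstable) stableTo(i, t uint64) {
	if i < u.offset || i >= u.offset+uint64(len(u.entries)) {
		return // index is not in the unstable range
	}
	if u.entries[i-u.offset].Term != t {
		return // same index, different entry
	}
	u.entries = u.entries[i+1-u.offset:]
	u.offset = i + 1
}
```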
Xiang Li d214e87aee raft: make unstable.entries immutable; copy the entries at bad path 2014-11-27 19:35:03 -08:00
Xiang Li d244e3bf6e raft: fix node bench 2014-11-26 23:07:35 -08:00
Xiang Li fe0bc4ff36 Merge pull request #1805 from xiang90/fix_raft_b
raft: fix start term
2014-11-26 21:41:38 -08:00
Xiang Li 746c66b466 raft: fix start term 2014-11-26 21:21:13 -08:00
Xiang Li 7929e46dd8 raft: clean up 2014-11-26 15:31:07 -08:00
Xiang Li 8a626257c7 raft: move unstable related function to log_unstable.go 2014-11-26 15:25:24 -08:00