Commit Graph

219 Commits (2bbd26e8e07705b1132a766f4491a26c0a706132)

Author SHA1 Message Date
Ben Darnell 46bb2582fe raft: Call maybeCommit after removing a node.
removeNode reduces the required quorum size, so some pending entries may
be able to commit after it is applied.

Discovered in cockroachdb/cockroach#3642
2016-01-20 11:05:48 -04:00
Hitoshi Mitake 9b2da76796 raft: remove go vet compliants 2015-12-16 13:29:23 +09:00
Xiang Li cc6d98bf89 etcdserver: only send snapshot when the member is active 2015-12-10 16:15:26 -08:00
Xiang Li a8cc1570d0 raft: support quorum check when raft is leader
If quorum check fails, the leader will step down to follower.
2015-11-24 09:36:37 -08:00
Ben Darnell fbeb58d265 raft: no-op instead of panic for Campaigning while leader
We need to be able to force an election (on one node) after creating a
new group (cockroachdb/cockroach#1384), but it is difficult to ensure
that our call to Campaign does not race with an election that may be
started by raft itself. A redundant call to Campaign should be a no-op
instead of a panic. (But the panic in becomeCandidate remains, because
we don't want to update the term or change the committed index in this
case)
2015-11-16 21:44:14 -05:00
Dmitry Smirnov b2f4a5f587 *: fix spelling issues (codespell).
Signed-off-by: Dmitry Smirnov <onlyjob@member.fsf.org>
2015-09-11 10:22:29 +10:00
Xiang Li 6b23a8131f *: test gofmt with -s and fix reported issues 2015-08-21 18:52:16 -07:00
es-chow cc362ccdad raft: set logger to raft so log context such as multinode groupID can be logged 2015-08-12 22:56:00 +08:00
Xiang Li b4022899eb raft: fix panic in send app
sendApp accesses the storage several times. Perviously, we
assume that the storage will not be modified during the read
opeartions. The assumption is not true since the storage can
be compacted between the read operations. If a compaction
causes a read entries error, we should not painc. Instead, we
can simply retry the sendApp logic until succeed.
2015-06-15 14:23:33 -07:00
Ben Darnell c9d507df11 raft: Use raft.Config in MultiNode. 2015-03-24 15:37:13 -04:00
Xiang Li b3fb052ad4 raft: make peers a prviate field in raft.Config 2015-03-24 11:10:07 -07:00
Xiang Li d9b5b56c82 raft: make raft configurable 2015-03-23 09:55:19 -07:00
Xiang Li 4a64373225 raft: add flow control for progress
Each progress has a inflighs sliding window. When the progress
is in replicate state, inflights will control the sending speed
of the leader.

The leader can have at most maxInflight number of inflight
messages for each replicate progress. Receving a appResp moves
forward the sliding window. Heartbeat response free one
slot if the window is full.
2015-03-20 20:04:33 -07:00
Xiang Li 7571b2cde2 raft: limit the size of msgApp
limit the max size of entries sent per message.
Lower the cost at probing state as we limit the size per message;
lower the penalty when aggressively decrease to a too low next.
2015-03-18 15:59:30 -07:00
Yicheng Qin 67194c0b22 raft: introduce progress states 2015-03-18 08:16:32 -07:00
Yicheng Qin be0bf2a2bd raft: fall back to bad path when unreachable 2015-03-11 13:21:23 -07:00
Yicheng Qin fbd5c81139 raft: remove shadowing of variables from test 2015-02-28 12:09:33 -08:00
Xiang Li 2af33fd494 raft: add reportUnreachable 2015-02-28 10:45:22 -08:00
Xiang Li 5ede18be74 raft: separate compact and createsnap in memory storage 2015-02-28 10:08:30 -08:00
Barak Michener 92dca0af0f *: remove shadowing of variables from etcd and add travis test
We've been bitten by this enough times that I wrote a tool so that
it never happens again.
2015-02-17 16:31:42 -05:00
Ben Darnell 33d2400063 raft: Send any waiting appends after receiving MsgAppResp.
This addresses a problem that comes up in the cockroach tests,
in which the order of messages may lead to deadlocks (due to
the fact that we don't have regular heartbeat timers in most
of our tests).
2015-01-27 17:43:29 -05:00
Jonathan Boulle f1ed69e883 *: switch to line comments for copyright
Build tags are not compatible with block comments.
Also adds copyright header to a few places it was missing.
2015-01-26 09:53:30 -08:00
Ben Darnell 59214978a2 raft: Add applied index as an argument to newRaft and RestartNode. 2015-01-22 11:38:05 -05:00
Xiang Li 003b97a60f raft: public progress struct in raft 2015-01-20 10:26:22 -08:00
Ben Darnell 2e1c36cdd9 raft: introduce MsgHeartbeatResp.
Now that heartbeats are distinct from MsgApp{,Resp}, the retries
currently performed in stepLeader's MsgAppResp section are only
performed on an actual MsgAppResp (or a new MsgProp). This means
that it may take a long time to recover from a dropped MsgAppResp
in a quiet cluster.

This commit adds a dedicated heartbeat response message. This message
does not convey the follower's current log position because the
MsgHeartbeat does not include the leaders term and index. Upon receipt
of a heartbeat response, the leader may retry the latest MsgApp if it
believes the follower to be behind.
2015-01-14 17:34:10 -05:00
Xiang Li 35b907ac58 raft: add lastIndex as rejectHint
Add the lastindex of the raft log as reject hint, so the leader can
bypass the greater index probing and decrease the next index directly
to last + 1.
2015-01-01 19:04:07 -08:00
Xiang Li 896bac1f76 raft: flush the commit to fix a race in test 2014-12-18 17:10:37 -08:00
Xiang Li 88767d913d raft: leader waits for the reply of previous message when follower is not in good path.
It is reasonable for the leader to wait for the reply before sending out the next
msgApp or msgSnap for the follower in bad path. Or the leader will send out useless
messages if the previous message is rejected or the previous message is a snapshot.
Especially for the snapshot case, the leader will be 100% to send out duplicate message
including the snapshot, which is a huge waste.

This commit implement a timeout based wait mechanism. The timeout for normal msgApp is a
heartbeatTimeout and the timeout for snapshot is electionTimeout(snapshot is larger). We
can implement a piggyback mechanism(application notifies the msg lost) in the future
if necessary.
2014-12-18 15:01:50 -08:00
Xiang Li 044e35b814 raft: use newRaft 2014-12-15 11:25:35 -08:00
Yicheng Qin 3867c72c8a raft: support to do multiple proposals in one message 2014-12-10 20:00:59 -08:00
Xiang Li 197e6b1b20 Merge pull request #1858 from vlajos/typofixes-vlajos-20141204
typofixes - https://github.com/vlajos/misspell_fixer
2014-12-04 14:52:27 -08:00
Veres Lajos 3de2ab2c04 *: typofixes
https://github.com/vlajos/misspell_fixer
2014-12-04 22:51:19 +00:00
Xiang Li 149389cbfa raft: add msgHeartbeat type 2014-12-04 08:29:31 -08:00
Xiang Li b3841afcc3 raft: do not restore snapshot if local raft has longer matching history
Raft should not restore the snapshot if it has longer matching history.
Or restoring snapshot might remove the matched entries.
2014-12-02 21:34:14 -08:00
Xiang Li 788d1e59a2 raft: use index in entry 2014-12-02 10:25:27 -08:00
Xiang Li 3c0fbe285c raft: stableTo checks term matching
stableTo should only mark the index stable if the term is matched. After raft sends out unstable
entries to application, raft makes progress without waiting for reply. When the appliaction
calls the stableTo to notify the entries up to "index" are stable, raft might have truncated
some entries before "index" due to leader lost. raft must verify the (index,term) of stableTo,
before marking the entries as stable.
2014-11-28 14:13:07 -08:00
Xiang Li 66252c7d62 raft: move all unstable stuff into one struct for future cleanup 2014-11-26 13:36:17 -08:00
Xiang Li 65ad1f6ffd raft: attach Index to Entry in all tests 2014-11-24 17:13:47 -08:00
Ben Darnell 9ddd8ee539 Rename Storage.HardState back to InitialState and include ConfState.
This fixes integration/migration_test.go (and highlights the fact that
we need some more raft-level testing of restoring from snapshots).
2014-11-21 17:22:20 -05:00
Ben Darnell 03c8881e35 Fix TestSlowNodeRestore 2014-11-21 16:40:41 -05:00
Ben Darnell 0d680d0e6b Merge remote-tracking branch 'coreos/master' into merge
* coreos/master:
  rafthttp: fix import
  raft: should not decrease match and next when handling out of order msgAppResp
  Fix migration to allow snapshots to have the right IDs
  add snapshotted integration test
  fix test import loop
  fix import loop, add set to types, and fix comments
  etcdserver: autodetect v0.4 WALs and upgrade them to v0.5 automatically
  wal: add a bench for write entry
  rafthttp: add streaming server and client
  dep: use vendored imports in codegangsta/cli
  dep: bump golang.org/x/net/context

Conflicts:
	etcdserver/server.go
	etcdserver/server_test.go
	migrate/snapshot.go
2014-11-21 15:40:11 -05:00
Xiang Li 063c5c77a0 raft: should not decrease match and next when handling out of order msgAppResp 2014-11-20 17:58:23 -08:00
Ben Darnell b29240baf0 Merge remote-tracking branch 'coreos/master' into merge
* coreos/master:
  scripts: build-docker tag and use ENTRYPOINT
  scripts: build-release add etcd-migrate
  create .godir
  raft: optimistically increase the next if the follower is already matched
  raft: add handleHeartbeat handleHeartbeat commits to the commit index in the message. It never decreases the commit index of the raft state machine.
  rafthttp: send takes raft message instead of bytes
  *: add rafthttp pkg into test list
  raft: include commitIndex in heartbeat
  rafthttp: move server stats in raftHandler to etcdserver
  *: etcdhttp.raftHandler -> rafthttp.RaftHandler
  etcdserver: rename sender.go -> sendhub.go
  *: etcdserver.sender -> rafthttp.Sender

Conflicts:
	raft/log.go
	raft/raft_paper_test.go
2014-11-19 17:05:16 -05:00
Ben Darnell 355ee4f393 raft: Integrate snapshots into the raft.Storage interface.
Compaction is now treated as an implementation detail of Storage
implementations; Node.Compact() and related functionality have been
removed. Ready.Snapshot is now used only for incoming snapshots.

A return value has been added to ApplyConfChange to allow applications
to track the node information that must be stored in the snapshot.

raftpb.Snapshot has been split into Snapshot and SnapshotMetadata, to
allow the full snapshot data to be read from disk only when needed.

raft.Storage has new methods Snapshot, ApplySnapshot, HardState, and
SetHardState. The Snapshot and HardState parameters have been removed
from RestartNode() and will now be loaded from Storage instead.
The only remaining difference between StartNode and RestartNode is that
the former bootstraps an initial list of Peers.
2014-11-19 16:40:26 -05:00
Xiang Li b50f331558 Merge pull request #1744 from xiang90/next
raft: optimistically increase the next if the follower is already matched
2014-11-19 13:21:11 -08:00
Xiang Li 4c1fd07311 raft: optimistically increase the next if the follower is already matched
This is useful since we want to pipeline the appendEntry requests. Without
enabling optimistic increasing, the second pipelining appendEntry request
will include the entries the first one has already sent out. We decrease
the next directly to match if the leader receives a rejection for a matched
follower. This happens if one pipelining request get lost and following ones
arrives at the follower.
2014-11-18 13:41:38 -08:00
Xiang Li bd4cfa2a07 raft: add handleHeartbeat
handleHeartbeat commits to the commit index in the message. It never decreases the
commit index of the raft state machine.
2014-11-18 08:34:06 -08:00
Xiang Li b93d87f17f raft: include commitIndex in heartbeat 2014-11-17 16:19:28 -08:00
Ben Darnell 64d9bcabf1 Add Storage.Term() method and hide the first entry from other methods.
The first entry in the log is a dummy which is used for matchTerm
but may not have an actual payload. This change permits Storage
implementations to treat this term value specially instead of
storing it as a dummy Entry.

Storage.FirstIndex() no longer includes the term-only entry.

This reverses a recent decision to create entry zero as initially
unstable; Storage implementations are now required to make
Term(0) == 0 and the first unstable entry is now index 1.
stableTo(0) is no longer allowed.
2014-11-17 16:54:12 -05:00
Ben Darnell b29c512f50 Merge remote-tracking branch 'coreos/master' into log-storage-interface
* coreos/master: (27 commits)
  pkg/wait: move wait to pkg/wait
  etcdserver: do not add/remove/update local member to/from sender hub
  etcdserver: not record attributes when add member
  raft: add a test for proposeConfChange
  raft: block Stop() on n.done, support idempotency
  raft: add a test for node proposal
  integration: add increase cluster size test
  integration: remove unnecessary t.Testing argument
  raft: stop the node synchronously
  integration: fix test to propagate NewServer errors
  etcdserver: move peer URLs check to config
  etcdserver: ensure initial-advertise-peer-urls match initial-cluster
  raft: add a test for node.Tick
  raft: add comment string for TestNodeStart
  etcdserver: use member instead of node at etcd level
  raft: nodes return sorted ids
  raft: update unstable when calling stableTo with 0
  *: support updating advertise-peer-url Users might want to update the peerurl of the etcd member in several cases. For example, if the IP address of the physical machine etcd running on is changed, user need to update the adversite-pee-rurl accordingly. This commit makes etcd support updating the advertise-peer-url of its members.
  transport: create a tls listener only if the tlsInfo is not empty and the scheme is HTTPS
  etcdserver: use member pointer for all tests
  ...

Conflicts:
	etcdserver/server.go
	raft/log.go
	raft/log_test.go
	raft/node.go
2014-11-13 14:21:09 -05:00
Ben Darnell 76a3de9a33 Require a non-nil Storage parameter in newLog.
Callers must in general have a reference to their Storage objects to
transfer entries from Ready to Storage, so it doesn't make sense to
create a hidden Storage for them.

By explicitly creating Storage objects in tests we can remove a
few casts of raftLog's storage field.
2014-11-12 16:38:50 -05:00
Yicheng Qin 78cbb1512c raft: nodes return sorted ids
This makes raft.softState return the same result when its soft state is
not changed.
2014-11-11 22:58:15 -08:00
Ben Darnell 25b6590547 raft: introduce log storage interface.
This change splits the raftLog.entries array into an in-memory
"unstable" list and a pluggable interface for retrieving entries that
have been persisted to disk. An in-memory implementation of this
interface is provided which behaves the same as the old version;
in a future commit etcdserver could replace the MemoryStorage with
one backed by the WAL.
2014-11-10 17:40:39 -05:00
Ben Darnell 21987c8701 raft: remove raftLog.resetUnstable and resetNextEnts
These methods are no longer used outside of tests and are redundant with
the new stableTo and appliedTo methods.
2014-11-06 17:18:00 -05:00
Jonathan Boulle b99633207c raft: minor cleanup in comments 2014-10-30 10:01:44 -07:00
Jonathan Boulle 0cf0cb3d02 raft: add tests for progress.maybeDecr 2014-10-29 22:57:25 -07:00
Jonathan Boulle 81304b2b7e raft: move test helper function into tests 2014-10-29 14:04:45 -07:00
Yicheng Qin bffe611fe6 raft: add tests based on section 5.2 in raft paper 2014-10-29 10:17:09 -07:00
Xiang Li 507300130b raft: add tests for ignoring heartbeat reply 2014-10-24 11:50:21 -07:00
Yicheng Qin e200d2a8e2 etcdserver/raft: remove msgDenied, removedNodes, shouldStop
The future plan is to do all these in etcdserver level.
2014-10-20 15:13:18 -07:00
Jonathan Boulle 7a4d42166b *: add license header to all source files 2014-10-17 15:41:22 -07:00
Jonathan Boulle fc42bdb904 raft: remove unused compactThreshold 2014-10-16 17:11:10 -07:00
Yicheng Qin 32c38820c1 raft: protobuf messageType 2014-10-13 11:13:43 -07:00
Xiang Li 1eb1020717 raft: fix raft test 2014-10-09 14:42:29 +08:00
Xiang Li 8bbbaa88b2 *: raft related int64 -> uint64 2014-10-09 14:29:21 +08:00
Xiang Li af5b8c6c44 raft: int64 -> uint64 2014-10-09 14:26:43 +08:00
Xiang Li 38af14b0f4 Merge pull request #1260 from coreos/snap_rm
raft: save removed nodes in snapshot
2014-10-09 08:00:33 +08:00
Xiang Li 73f2aaf98f raft: removedSlice -> removedNodes 2014-10-09 06:55:25 +08:00
Xiang Li c67fd14fe8 Merge pull request #1257 from bdarnell/cleanups
Raft: assorted cleanups (golint and go vet)
2014-10-09 05:55:21 +08:00
Ben Darnell d2e858587f Raft: a few more improvements to test messages. 2014-10-08 15:07:11 -04:00
Xiang Li 7b61565c0a raft: save removed nodes in snapshot 2014-10-08 15:33:55 +08:00
Xiang Li 1cd3345e00 raft: address issues with election timeout 2014-10-08 07:41:17 +08:00
Ben Darnell 36558b1924 Raft: fix printf strings found by go vet. 2014-10-07 18:44:06 -04:00
Ben Darnell 3ad0df3722 Raft: fix problems reported by golint. 2014-10-07 18:44:06 -04:00
Xiang Li f65d117462 raft: add a test for randElectionTimeout 2014-10-07 20:34:15 +08:00
Xiang Li d7d6f84f64 raft: rand election timeout 2014-10-07 20:12:49 +08:00
Xiang Li 45e4a8643a raft: add tests for raft.compact 2014-10-07 16:03:11 +08:00
Xiang Li 5587e0d73f raft: compact takes index and nodes parameters
Before this commit, compact always compact log at current appliedindex of raft.
This prevents us from doing non-blocking snapshot since we have to make snapshot
and compact atomically. To prepare for non-blocking snapshot, this commit make
compact supports index and nodes parameters. After completing snapshot, the applier
should call compact with the snapshot index and the nodes at snapshot index to do
a compaction at snapsohot index.
2014-10-07 16:03:11 +08:00
Xiang Li 01ecc60a88 Merge pull request #1203 from coreos/fix_raft
raft: commitIndex=min(leaderCommit, index of last new entry)
2014-10-03 22:24:07 +08:00
Xiang Li 172bd7d096 raft: add test for maybeappend change 2014-10-03 22:21:35 +08:00
Xiang Li 16ba77767e raft: do not decrease nextIndex and send entries for stale reply 2014-10-03 13:41:27 +08:00
Yicheng Qin 8490904f20 Merge pull request #1224 from unihorn/149
raft: msg.Denied -> msg.Reject
2014-10-02 12:24:09 -07:00
Yicheng Qin fff918c672 raft: msg.Denied -> msg.Reject
Change the field name because it has msgDenied already.
2014-10-02 12:22:12 -07:00
Yicheng Qin 182c8316e1 raft: refine comment for doc and removed list tests 2014-10-01 14:57:39 -07:00
Yicheng Qin e4a6c9651a raft: add removed
The usage of removed:
1. tell removed node about its removal explicitly using msgDenied
2. prevent removed node disrupt cluster progress by launching leader election

It is set when apply node removal, or receive msgDenied.
2014-10-01 14:57:38 -07:00
Xiang Li d7b4e44a66 raft: heartbeat is only response for maintaining leader dominance 2014-09-29 16:57:43 -07:00
Xiang Li e26ff32fd8 raft: fix error msg 2014-09-28 21:17:51 -07:00
Xiang Li 51529cc3f2 raft: remove index field in msg AppResp 2014-09-28 21:13:53 -07:00
Xiang Li adefd83855 raft: remove index field in msg voteResp 2014-09-28 21:13:43 -07:00
Xiang Li 86473d8a27 raft: add msg denied field 2014-09-28 21:13:33 -07:00
Yicheng Qin 1ca03d8e9d raft: move logic to separate func 2014-09-24 10:23:44 -07:00
Yicheng Qin b07be74a82 raft: stop tickElection when the node is not in peer list
This prevents the bug like this:
1. a node sends join to a cluster and succeeds
2. it starts with empty peers and waits for sync, but it have not
received anything
3. election timeout passes, and it promotes itself to leader
4. it commits some log entry
5. its log conflicts with the cluster's
2014-09-23 23:15:02 -07:00
Yicheng Qin bc7b0108dc raft: ConfigChange -> ConfChange 2014-09-23 12:02:44 -07:00
Yicheng Qin d92931853e raft: Config -> ConfigChange
Configure -> ProposeConfigChange
AddNode, RemoveNode -> ApplyConfigChange
2014-09-22 23:39:53 -07:00
Yicheng Qin b82d70871f raft: use EntryType in protobuf 2014-09-22 15:44:46 -07:00
Yicheng Qin ff6705b94b raft: add Configure, AddNode, RemoveNode
Configure is used to propose config change. AddNode and RemoveNode is
used to apply cluster change to raft state machine. They are the
basics for dynamic configuration.
2014-09-22 15:43:13 -07:00
Yicheng Qin 023dc7cba2 etcdserver: add SYNC request 2014-09-16 13:42:03 -07:00
Xiang Li f9c12e2053 Merge pull request #1075 from coreos/fix_heartbeat
raft: fix heartbeat
2014-09-15 10:04:12 -07:00
Xiang Li 21d116d3e1 raft: fix heartbeat 2014-09-15 09:58:22 -07:00
Xiang Li e7ea6a374a main: check node id is not noneid 2014-09-14 23:28:11 -07:00