Commit Graph

345 Commits (ba9fd620e89254f60c18b8fd3383fafb65ee183d)

Author SHA1 Message Date
Gyuho Lee e899023f3f
Merge pull request #10640 from shrajfr12/gomodulecompat
Fix module path to have the major version to comply with go modules specification.
2019-05-01 22:46:03 -07:00
Xiang Li 0bc219a91e
Merge pull request #10679 from nvanbenschoten/nvanbenschoten/commitAlloc
raft: Avoid allocation when boxing slice in maybeCommit
2019-04-30 10:55:16 -07:00
shivaramr 9150bf52d6 go modules: Fix module path version to include version number 2019-04-26 15:29:50 -07:00
Nathan VanBenschoten 24f35a9861 raft: avoid allocation of Raft entry due to logging
`raftpb.Entry.String` takes a pointer receiver, so calling it
on a loop variable was causing the variable to escape. Removing
the `.String()` call was enough to avoid the allocation, but
this also avoids a memory copy and prevents similar bugs.

This was responsible for 11.63% of total allocations in an
experiment with https://github.com/nvanbenschoten/raft-toy.
2019-04-26 14:56:31 -04:00
Nathan VanBenschoten 208b8a349c raft: Avoid allocation when boxing slice in maybeCommit
By boxing a heap-allocated slice header instead of the slice
header on the stack, we can avoid an allocation when passing
through the sort.Interface interface.

This was responsible for 26.61% of total allocations in an
experiment with https://github.com/nvanbenschoten/raft-toy.
2019-04-26 00:10:45 -04:00
Jingyi Hu 5088d70d69 raft: leader response to learner MsgReadIndex
Leader should check message sender after receiving MsgReadIndex, even
when raft quorum is 1. The message could be sent from learner node, and
leader should respond.
2019-03-28 16:14:32 -07:00
Sahdev P. Zala 56f1bce161 raft/doc: clarify the case of out of date term
Clarify the doc.

Fixes #10491
2019-03-26 14:00:24 -04:00
shawnli 23731bf9ba raft: set lead to none when becomePreCandidate 2018-12-28 19:57:26 +08:00
Tobias Schottdorf bfaae1ba46 raft: enter ProgressStateReplica immediately after snapshot
When a follower requires a snapshot and the snapshot is sent at the
committed (and last) index in an otherwise idle Raft group, the follower
would previously remain in ProgressStateProbe even though it had been
caught up completely.

In a busy Raft group this wouldn't be an issue since the next round of
MsgApp would update the state, but in an idle group there's nothing
that rectifies the status (since there's nothing to append or update).

The reason this matters is that the state is exposed through
`RaftStatus()`. Concretely, in CockroachDB, we use the Raft status to
make sharding decisions (since it's dangerous to make rapid changes
to a fragile Raft group), and had to work around this problem[1].

[1]: 91b11dae41/pkg/storage/split_delay_helper.go (L138-L141)
2018-12-06 11:09:59 +01:00
Tobias Schottdorf 5c209d66d2 raft: ensure leader is in ProgressStateReplicate
The leader perpetually kept itself in ProgressStateProbe even though of
course it has perfect knowledge of its log. This wasn't usually an issue
because it also doesn't care about its own Progress, but it's better to
keep this data correctly maintained, especially since this is part of
raft.Status and thus becomes visible to applications using the Raft
library.

(Concretely, in CockroachDB we use the Progress to inform log
truncations).
2018-11-23 17:57:36 +01:00
Andrew Werner e4af2be5bb raft: separate MaxCommittedSizePerReady config from MaxSizePerMsg
Prior to this change, MaxSizePerMsg was used both to cap the total byte size of
entries in messages as well as the total byte size of entries passed through
CommittedEntries in the Ready struct. This change adds a new Config parameter
MaxCommittedSizePerReady which defaults to MaxSizePerMsg and contols the second
of above descibed settings.
2018-11-14 09:59:09 -05:00
Tobias Schottdorf ad49c8fd98 raft: fix bug in unbounded log growth prevention mechanism
The previous code was using the proto-generated `Size()` method to
track the size of an incoming proposal at the leader. This includes
the Index and Term, which were mutated after the call to `Size()`
when appending to the log. Additionally, it was not taking into
account that an ignored configuration change would ignore the
original proposal and append an empty entry instead.

As a result, a fully committed Raft group could end up with a non-
zero tracked uncommitted Raft log counter that would eventually hit
the ceiling and drop all future proposals indiscriminately. It would
also immediately imply that proposals exceeding the threshold alone
would get refused (as the "first uncommitted proposal" gets special
treatment and is always allowed in).

Track only the size of the payload actually appended to the Raft log
instead.

For context, see:
https://github.com/cockroachdb/cockroach/issues/31618#issuecomment-431374938
2018-10-22 11:28:39 +02:00
Nathan VanBenschoten 73c20cc1b7 raft: Fix comment on sendHeartbeat 2018-10-14 00:03:43 -04:00
Nathan VanBenschoten f89b06dc6d raft: provide protection against unbounded Raft log growth
The suggested pattern for Raft proposals is that they be retried
periodically until they succeed. This turns out to be an issue
when a leader cannot commit entries because the leader will continue
to append re-proposed entries to its log without committing anything.
This can result in the uncommitted tail of a leader's log growing
without bound until it is able to commit entries.

This change add a safeguard to protect against this case where a
leader's log can grow without bound during loss of quorum scenarios.
It does so by introducing a new, optional ``MaxUncommittedEntriesSize
configuration. This config limits the max aggregate size of uncommitted
entries that may be appended to a leader's log. Once this limit
is exceeded, proposals will begin to return ErrProposalDropped
errors.

See cockroachdb/cockroach#27772
2018-10-13 23:25:05 -04:00
Gyuho Lee bb60f8ab1d raft: change import paths to "go.etcd.io/etcd"
Signed-off-by: Gyuho Lee <leegyuho@amazon.com>
2018-08-28 17:47:52 -07:00
Xiang Li 11dd0b583b
Merge pull request #9982 from bdarnell/pagination
raft: Introduce CommittedEntries pagination
2018-08-11 09:12:46 +08:00
Ben Darnell a9e7c1e11f raft: Make flow control more aggressive
We allow multiple in-flight append messages, but prior to this change
the only way we'd ever send them is if there is a steady stream of new
proposals. Catching up a follower that is far behind would be
unnecessarily slow (this is exacerbated by a quirk of CockroachDB's
use of raft which limits our ability to catch up via snapshot in some
cases).

See cockroachdb/cockroach#27983
2018-08-08 11:10:54 -04:00
Ben Darnell 0a670b7c9b raft: Introduce CommittedEntries pagination
The MaxSizePerMsg setting is now used to limit the size of
Ready.CommittedEntries. This prevents out-of-memory errors if the raft
log has become very large and commits all at once.
2018-08-07 12:54:34 -04:00
Nathan VanBenschoten 0a415cf0d6 raft: dont allocate slice and sort on every commit 2018-07-25 23:42:16 -04:00
Ben Darnell 20422c5b4d raft: Really avoid scanning raft log in becomeLeader
I meant to do this in #9073, but sent the PR before it was finished.
The last log index is known directly; there is no need to fetch any
entries here.
2018-06-26 15:29:51 -04:00
Xiang Li 357308bfcd
Merge pull request #9679 from lorneli/lorneli-raft-dev
raft: describe the purpose of lockedRand
2018-05-26 22:03:18 -07:00
lorneli a083282482 raft: describe the purpose of lockedRand
Struct lockedRand wraps rand.Rand with mutex lock because it's
accessed by multiple raft groups.
2018-05-26 21:59:24 +08:00
Jia Zhan d14b705355 raft: fix a few comments 2018-04-27 11:25:06 -07:00
Gyuho Lee 8aae8c1c9c raft: document disruptive rejoining server, add tests
Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-03-06 09:54:29 -08:00
Gyuho Lee 01db389ea8 raft: document why reuse candidate's term for vote response in stepCandidate
"stepCandidate" should reuse candidate's own term, not term in Message,
because pre-vote is requested with future term.

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-02-21 16:11:01 -08:00
Gyuho Lee 38846c220a raft: use leader's term when candidate becomes follower
`raft.Step` already ensures that when `m.Term > r.Term`,
candidate reverts back to follower with its term being
reset with `m.Term`, thus it's always true that
`m.Term == r.Term` in `stepCandidate`.

This just makes `r.becomeFollower` calls consistent.

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-02-21 16:10:52 -08:00
Gyuho Lee 2b7c12fb12 raft: reuse "last index" in "appendEntry"
No need to call "lastIndex" again.
"append" call already returns "lastIndex".

Signed-off-by: Gyuho Lee <gyuhox@gmail.com>
2018-02-05 21:26:45 -08:00
Xiang Li d54f281b26
Merge pull request #8525 from shuaili87/pre-vote-compatible
raft: fix deadlock during PreVote migration process
2018-01-26 16:34:59 -08:00
Ben Darnell 4e0291ff91 raft: Clarify conditions for granting votes and prevotes.
This includes one theoretical logic change: A node that knows the
leader of the current term will no longer grant votes, even if it has
not yet voted in this term. It also adds a `m.Type == MsgPreVote`
guard on the `m.Term > r.Term` check, which was previously thought to
be incorrect (see #8517) but was actually just unclear.

Closes #8517
Closes #8571
2018-01-23 15:05:11 -05:00
Xiang Li c5532ebbf6
Merge pull request #9067 from absolute8511/optimize-raft-drop
raft: let raft step return error when proposal is dropped to allow fail-fast
2018-01-11 19:54:52 -08:00
Vincent Lee 30ced5b2be raft: let raft step return error when proposal is dropped to allow fail-fast. 2018-01-12 10:16:47 +08:00
Vincent Lee 11fa4f0275 raft: raft learners should be returned after applyConfChange 2018-01-11 17:30:17 +08:00
Xiang Li ed1ff9e952
Merge pull request #9073 from bdarnell/pending-conf-index
raft: Avoid scanning raft log in becomeLeader
2018-01-08 16:37:36 -08:00
Nathan VanBenschoten e6dc57f708 raft: s/leaner/learner/g 2018-01-03 08:16:50 -05:00
Ben Darnell 8d8f3195e4 raft: Avoid scanning raft log in becomeLeader
Scanning the uncommitted portion of the raft log to determine whether
there are any pending config changes can be expensive. In
cockroachdb/cockroach#18601, we've seen that a new leader can spend so
much time scanning its log post-election that it fails to send
its first heartbeats in time to prevent a second election from
starting immediately.

Instead of tracking whether a pending config change exists with a
boolean, this commit tracks the latest log index at which a pending
config change *could* exist. This is a less expensive solution to
the problem, and the impact of false positives should be minimal since
a newly-elected leader should be able to quickly commit the tail of
its log.
2017-12-30 10:13:36 -05:00
siddontang c6f2db2e92 raft: support learner 2017-11-11 10:38:21 +08:00
Xiang 9801fd7297 raft: ensure CheckQuorum is enabled when readonlyoption is lease based 2017-09-17 10:46:12 -07:00
gladiator 58b98c6a14 raft: check leader request when becomeFollower 2017-09-15 08:23:18 +08:00
gladiator 8597361f01 raft: fix Pre-Vote migration 2017-09-09 09:12:39 +08:00
gladiator 3740793b42 raft: reset votes when becomePreCandidate 2017-08-01 22:42:09 +08:00
irfan sharif a92ceeec25 raft: introduce/fix TestNodeWithSmallerTermCanCompleteElection
TestNodeWithSmallerTermCanCompleteElection tests the scenario where a
node that has been partitioned away (and fallen behind) rejoins the
cluster at about the same time the leader node gets partitioned away.
Previously the cluster would come to a standstill when run with PreVote
enabled.

When responding to Msg{Pre,}Vote messages we now include the term from
the message, not the local term. To see why consider the case where a
single node was previously partitioned away and it's local term is now
of date. If we include the local term (recall that for pre-votes we
don't update the local term), the (pre-)campaigning node on the other
end will proceed to ignore the message (it ignores all out of date
messages).
The term in the original message and current local term are the same in
the case of regular votes, but different for pre-votes.

NB: Had to change TestRecvMsgVote to include pb.Message.Term when
sending MsgVote messages. The new sanity checks on MsgVoteResp
(m.Term != 0) would panic with the old test as raft.Term would be equal
to 0 when responding with MsgVoteResp messages.
2017-07-21 02:26:02 -04:00
smetro e461017ac5 raft: add DisableProposalForwarding option
this allows users to disable followers from forwarding proposals to the
leader.
2017-06-21 14:58:28 -07:00
Aaron Lehmann 52613b262b raft: Set the RecentActive flag for newly added nodes
I found that enabling the CheckQuorum flag led to spurious leader
elections when new nodes joined. It looks like in the time between a new
node joining the cluster, and that node first communicating with the
leader, the quorum check could fail because the new node looks inactive.
To solve this, set the RecentActive flag when nodes are first added.
This gives a grace period for the node to communicate before it causes
the quorum check to fail.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
2017-04-27 11:19:29 -07:00
Dylan.Wen 9342647e0c raft: fix read index request for #7331 2017-02-17 09:45:41 +08:00
Xiang Li b940e0d514 Merge pull request #7042 from petermattis/pmattis/resume-after-heartbeat-resp
raft: resume paused followers on receipt of MsgHeartbeatResp
2016-12-27 21:15:53 -08:00
Peter Mattis e625400f1d raft: resume paused followers on receipt of MsgHeartbeatResp
Previously, paused followers were resumed upon sending a MsgHearbeat.

Fixes #7037
2016-12-20 08:22:09 -05:00
Ben Darnell f60a5d6025 raft: Export Progress.IsPaused
CockroachDB would like to use this method for monitoring.
2016-12-04 13:14:08 +08:00
Vincent Lee 4401d88546 raft: add node should reset the pendingConf state
After add node conf proposed twice with the same node id, the pending state is not reset because
the addNode returned without setting the pending state at the second
time and the pending state will always be true unless other conf changed. During this we
can not add any new node because the propose will be ignored since the
pending state is true.
2016-11-17 15:50:13 +08:00
Ben Darnell 2f34547d39 raft: Check promotable() in MsgTimeoutNow handling
If MsgTimeoutNow arrived after a node was removed, the node could start
and win an election, then panic in becomeLeader (see
cockroachdb/cockroach#8535)
2016-11-07 20:02:21 +08:00
Gyu-Ho Lee cb5c92f69b raft: do not attach term to MsgReadIndex
Fix https://github.com/coreos/etcd/issues/6744.

MsgReadIndex, as MsgProp, is to be forwarded to leader.
So we should treat it as local message.
2016-10-28 22:12:25 -07:00