Move all vote handling from the per-state step functions to the
top-level Step(). This wasn't necessary before because MsgVote would
cause us to become a follower, but MsgPreVote needs to be handled
without changing the node's current state.
Some tests were starting nodes with a non-empty log but a term of zero,
which cannot happen in the real world. This was affecting the final term
being tested in TestLeaderElection.
TickQuiesced allows the caller to support "quiesced" Raft groups which
do not perform periodic heartbeats and elections. This is useful in a
system with thousands of Raft groups where these periodic operations can
be overwhelming in an otherwise idle system.
It might seem possible to avoid advancing the logical clock at all in
such Raft groups, but doing so has an interaction with the CheckQuorum
functionality. If a follower is not quiesced while the leader is the
follower can call an election that will fail because the leader's lease
has not expired (electionElapsed < electionTimeout). The next time the
leader sends a heartbeat to this follower the follower will see that the
heartbeat is from a previous term and respond with a MsgAppResp. This in
turn will cause the leader to step down and become a follower even
though there isn't a leader in the group. By allowing the leader's
logical clock to advance via TickQuiesced, the leader won't reject the
election and there will be a smooth transfer of leadership to the
follower.
Grow the inflights buffer as needed instead of preallocating it to its
max size. This avoids preallocating a lot of unnecessary
space (8*MaxInflightMsgs) when using lots of raft groups while still
allowing for a reasonable MaxInflightMsgs configuration.
rand.NewSource creates a 4872 byte object. With a small number of raft
groups in a process this isn't a problem. With 10k raft groups we'd use
46MB for these random sources. The only usage is in
raft.resetRandomizedElectionTimeout which isn't performance critical.
Fixes#6347.
Previously, the checkQuorum flag required an election timeout to
expire before a node could cast its first vote. This change permits
the node to cast a vote at any time when the leader is not known,
including immediately after startup.
Follower has already set its leader ID from
previous append messages from the leader, but
to be consistent, this adds a line to set its
leader id from leader snapshot message.
The Entry struct has misaligned fields that are accessed atomically. The
misalignment is caused by the EntryType enum which the Protocol Buffers
spec forces to be a 32bit int.
Moving the order of the fields without renumbering them in the .proto file
seems to align the go structure without changing the wire format.
The relevant structures are properly aligned, however, there is no comment
highlighting the need to keep it aligned as is present elsewhere in the
codebase.
Adding note to keep alignment, in line with similar comments in the codebase.
And 'maybeSnapshotAbort' does not 'unset'
the pendingSnapshot. 'resetState', which is called after this
metho, is the one that unsets pendingSnapshot. So this changes
the method name.
'MsgBeat' is an internal type to signal the leader, not the message type
that gets sent to its followers. 'MsgHeartbeat' is the type sent to followers.
We recently changed the randomized election timeout from (et, 2*et-1] tp
[et, 2*et-2], where et is user set election timeout.
So 2*et might trigger two elections instead of one. We need to fix the test
code accordingly.
Thanks for Tikv guys for finding this issue. We probably need to randomize
etcd/raft test more.
Those three log statements in node.go have not been using the logger that was passed via `raft.Config`, but instead the default raft logger. This changes it to use the proper logger.
removeNode reduces the required quorum size, so some pending entries may
be able to commit after it is applied.
Discovered in cockroachdb/cockroach#3642
This adds documentation on MessageType. Having clear explanation about
MessageType helps understand raft logic and debug etcd when there is a
message dropping. This is partially for coreos#3806.
We need to be able to force an election (on one node) after creating a
new group (cockroachdb/cockroach#1384), but it is difficult to ensure
that our call to Campaign does not race with an election that may be
started by raft itself. A redundant call to Campaign should be a no-op
instead of a panic. (But the panic in becomeCandidate remains, because
we don't want to update the term or change the committed index in this
case)
This fixes the failure met in semaphore CI:
```
--- FAIL: TestMultiNodeAdvance-2 (0.01s)
multinode_test.go:458: expect Ready after Advance, but there is
no Ready available
```
If I understand correctly, `progress` represents the states of follower. For
me, some comments weren't clear because it was missing the subjects of
`progress`. This adds more clarification on who is doing what. Please let me
know if I misunderstood anything. Thanks,
etcd is going to support incremental snapshot, and we design to let it
send at most one snapshot out at first stage. So when one snapshot is in
flight, snapshot request will return error.
When failing to get snapshot when sending MsgSnap, raft prints out
related log and abort sending this message.
Using Go-style import paths in protos is not idiomatic. Normally, this
detail would be internal to etcd, but the path from which gogoproto
is imported affects downstream consumers (e.g. cockroachdb).
In cockroach, we want to avoid including `$GOPATH/src` in our protoc
include path for various reasons. This patch puts etcd on the same
convention, which allows this for cockroach.
More information: https://github.com/cockroachdb/cockroach/pull/2339#discussion_r38663417
This commit also regenerates all the protos, which seem to have
drifted a tiny bit.
sendApp accesses the storage several times. Perviously, we
assume that the storage will not be modified during the read
opeartions. The assumption is not true since the storage can
be compacted between the read operations. If a compaction
causes a read entries error, we should not painc. Instead, we
can simply retry the sendApp logic until succeed.
ForceGosched() performs bad when GOMAXPROCS>1. When GOMAXPROCS=1, it
could promise that other goroutines run long enough
because it always yield the processor to other goroutines. But it cannot
yield processor to goroutine running on other processors. So when
GOMAXPROCS>1, the yield may finish when goroutine on the other
processor just runs for little time.
Here is a test to confirm the case:
```
package main
import (
"fmt"
"runtime"
"testing"
)
func ForceGosched() {
// possibility enough to sched up to 10 go routines.
for i := 0; i < 10000; i++ {
runtime.Gosched()
}
}
var d int
func loop(c chan struct{}) {
for {
select {
case <-c:
for i := 0; i < 1000; i++ {
fmt.Sprintf("come to time %d", i)
}
d++
}
}
}
func TestLoop(t *testing.T) {
c := make(chan struct{}, 1)
go loop(c)
c <- struct{}{}
ForceGosched()
if d != 1 {
t.Fatal("d is not incremented")
}
}
```
`go test -v -race` runs well, but `GOMAXPROCS=2 go test -v -race` fails.
Change the functionality to waiting for schedule to happen.
The commit > unstable might not true for follower. The leader only need
to ensure the entry is stored on the majority of nodes to commit an
entry. So the minority of the cluster might receive commit > unstable
append request. This is normal.
raft node should set initial prev hard state to empty.
Or it will not send the first hard coded state to application
until the state changes again.
This commit fixs the issue. It introduce a small overhead, that
the same tate might send to application twice when restarting.
But this is fine.
etcd now compact raft storage asynchronously, and append entry to raft
storage may happen at the same time. Add the lock to fix the bug that
the entries saved in storage may be organized in a wrong way.