In this case, we know we are waiting for an action to happen on
storage, so we can busy-wait instead of calling waitSchedule.
The test previously failed on CI with no observed actions.
When a watch stream closes, both watcher.Chan and closec
will be closed.
If watcher.Chan is closed, we should not send out an empty event.
Sending the empty event is wrong and wastes a lot of CPU.
Instead we should just return.
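A minimal sketch of the intended receive loop; Event, serveWatch, and the send callback are hypothetical stand-ins for the real watch-stream types:
```
package storage

// Event is a hypothetical stand-in for the store's watch event type.
type Event struct{ Key string }

// serveWatch forwards events from ch until either ch or closec closes.
// The ok flag distinguishes a closed channel from a real event, so the
// loop returns instead of spinning and forwarding empty events.
func serveWatch(ch <-chan *Event, closec <-chan struct{}, send func(*Event)) {
	for {
		select {
		case ev, ok := <-ch:
			if !ok { // the channel closed together with the stream
				return
			}
			send(ev)
		case <-closec:
			return
		}
	}
}
```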
We should open a real txn for applying txn requests; otherwise the
intermediate state might be observed by readers.
This also fixes #3803. Using the same consistent (raft) index for multiple
independent operations confuses consistentStore.
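As a rough illustration of the visibility problem (a toy model, not etcd's actual storage API), applying every operation of a request under one transaction keeps readers from observing a half-applied request:
```
package storage

import "sync"

// kv is a toy store; the mutex stands in for a real storage txn,
// and the TxnBegin/TxnEnd comments mark where one would live.
type kv struct {
	mu   sync.RWMutex
	data map[string]string
}

// applyTxn applies all ops under one txn, so a concurrent reader can
// never observe the intermediate state between individual ops.
func (s *kv) applyTxn(ops map[string]string) {
	s.mu.Lock()         // TxnBegin
	defer s.mu.Unlock() // TxnEnd
	for k, v := range ops {
		s.data[k] = v
	}
}

// get sees either none or all of a request's ops.
func (s *kv) get(k string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[k]
	return v, ok
}
```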
We have a structure called InternalRaftRequest. Shortening the function
name to processInternalRaftReq seems arbitrary and reduces
readability, so we just use the full name.
This commit fixes an error log caused by the strict reconfig check
option.
Before:
14:21:38 etcd2 | 2015-11-05 14:21:38.870356 E | etcdhttp: got unexpected response error (etcdserver: re-configuration failed due to not enough started members)
After:
13:27:33 etcd2 | 2015-11-05 13:27:33.089364 E | etcdhttp: etcdserver: re-configuration failed due to not enough started members
The error is not unexpected, so the old message was incorrect.
This moves the code that creates the listener and roundTripper for raft
communication into one place, and uses explicit functions to build them.
This prevents possible development errors in the future.
This pairs with remote timeout listeners.
etcd uses a timeout listener, and times out accepted connections
if there is no activity, so idle connections may time out easily.
Because the timeout transport doesn't reuse connections, it prevents
using a timed-out connection.
This fixes the problem that etcd fails to get the version of its peers.
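A minimal sketch of such a transport, with illustrative names and constants rather than etcd's exact pkg/transport code:
```
package transport

import (
	"net"
	"net/http"
	"time"
)

// timeoutConn applies a fresh deadline to every read and write, pairing
// with the remote timeout listener on the other side.
type timeoutConn struct {
	net.Conn
	timeout time.Duration
}

func (c *timeoutConn) Read(b []byte) (int, error) {
	c.Conn.SetReadDeadline(time.Now().Add(c.timeout))
	return c.Conn.Read(b)
}

func (c *timeoutConn) Write(b []byte) (int, error) {
	c.Conn.SetWriteDeadline(time.Now().Add(c.timeout))
	return c.Conn.Write(b)
}

// newTimeoutTransport disables keep-alives, so an idle pooled connection
// that has already hit its deadline is never reused for a new request.
func newTimeoutTransport(timeout time.Duration) *http.Transport {
	return &http.Transport{
		DisableKeepAlives: true,
		Dial: func(network, addr string) (net.Conn, error) {
			conn, err := net.DialTimeout(network, addr, timeout)
			if err != nil {
				return nil, err
			}
			return &timeoutConn{Conn: conn, timeout: timeout}, nil
		},
	}
}
```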
This attempts to decouple the password-related functions, which previously
existed in both the Store and User structs, by splitting them out into a
separate interface, PasswordStore. This means they can be more
easily swapped out during testing.
This also changes the relevant tests to use mock password functions
instead of the bcrypt-backed implementations; as a result, the tests are
much faster.
Before:
```
github.com/coreos/etcd/etcdserver/auth 31.495s
github.com/coreos/etcd/etcdserver/etcdhttp 91.205s
```
After:
```
github.com/coreos/etcd/etcdserver/auth 1.207s
github.com/coreos/etcd/etcdserver/etcdhttp 1.207s
```
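A sketch of the split, with illustrative names; the real interface in etcdserver/auth may differ:
```
package auth

import "golang.org/x/crypto/bcrypt"

// PasswordStore isolates the password functions so tests can substitute
// a cheap implementation for the bcrypt-backed one.
type PasswordStore interface {
	CheckPassword(hashed, plain string) bool
	HashPassword(plain string) (string, error)
}

// bcryptStore is the production implementation.
type bcryptStore struct{}

func (bcryptStore) CheckPassword(hashed, plain string) bool {
	return bcrypt.CompareHashAndPassword([]byte(hashed), []byte(plain)) == nil
}

func (bcryptStore) HashPassword(plain string) (string, error) {
	h, err := bcrypt.GenerateFromPassword([]byte(plain), bcrypt.DefaultCost)
	return string(h), err
}

// fastStore is a test mock: plain comparison, no key derivation.
type fastStore struct{}

func (fastStore) CheckPassword(hashed, plain string) bool { return hashed == plain }

func (fastStore) HashPassword(plain string) (string, error) { return plain, nil }
```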
When a slow follower receives the snapshot sent from the leader, it
should rename the snapshot file to the default KV file path and
restore the KV snapshot.
Tested manually; it works well.
When the snapshot store requests a raft snapshot from the etcdserver apply
loop, it may block on the channel for some time, or wait some time for the
KV store to snapshot. This is unexpected, because the raft state machine
should never be blocked. Even worse, the blocking may lead to deadlock:
1. the raft state machine waits on getting a snapshot from raft memory storage
2. raft memory storage waits for the snapshot store to provide the snapshot
3. the snapshot store requests a raft snapshot from the apply loop
4. the apply loop is applying entries, and waits for the raftNode loop to
finish sending messages
5. the raftNode loop waits for a peer loop in Transport to send out messages
6. the peer loop in Transport waits for the raft state machine to process a message
Fix it by changing getSnap to create the snapshot asynchronously.
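A minimal sketch of the asynchronous shape, with hypothetical names: instead of blocking the raft state machine, getSnap triggers creation and returns an error so raft retries later:
```
package etcdserver

import (
	"errors"
	"sync"
)

// Snapshot is a stand-in for the raft snapshot type.
type Snapshot struct{ Index uint64 }

var errSnapshotTemporarilyUnavailable = errors.New("snapshot is temporarily unavailable")

type snapStore struct {
	mu      sync.Mutex
	snap    *Snapshot
	createc chan struct{} // buffered; drained by the apply loop
}

// getSnap never blocks: it either returns the ready snapshot or asks
// the apply loop to create one and fails fast so the caller retries.
func (s *snapStore) getSnap() (*Snapshot, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.snap != nil {
		return s.snap, nil
	}
	select {
	case s.createc <- struct{}{}: // request creation without blocking
	default: // a creation request is already pending
	}
	return nil, errSnapshotTemporarilyUnavailable
}

// saveSnap is called by the apply loop once the snapshot is ready.
func (s *snapStore) saveSnap(snap *Snapshot) {
	s.mu.Lock()
	s.snap = snap
	s.mu.Unlock()
}
```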
Use snapshotSender to send the v3 snapshot message. It puts the raft snapshot
message and the v3 snapshot into the request body, then sends it to the
target peer. When it receives http.StatusNoContent, it knows the message
has been received and processed successfully.
As the receiver, snapHandler saves the v3 snapshot, then processes the raft
snapshot message, and responds with http.StatusNoContent.
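A sketch of the sender side under these assumptions; the URL path and body framing are illustrative, not the actual wire format:
```
package rafthttp

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// sendSnap posts the raft snapshot message followed by the v3 snapshot
// data in one request body. http.StatusNoContent from the peer means
// both were saved and processed successfully.
func sendSnap(tr http.RoundTripper, peerURL string, msg []byte, snap io.Reader) error {
	body := io.MultiReader(bytes.NewReader(msg), snap)
	req, err := http.NewRequest("POST", peerURL+"/raft/snapshot", body)
	if err != nil {
		return err
	}
	resp, err := (&http.Client{Transport: tr}).Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("unexpected http status %v when sending snapshot", resp.Status)
	}
	return nil
}
```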
rafthttp has different connection requirements for different usages of the
transport, which is hard to satisfy when it is given a single
http.RoundTripper. Now pass into pkg the data needed to build transports,
and let rafthttp build its own transports.
time.AfterFunc() creates its own goroutine and calls the callback
function in that goroutine. It can cause a data race like the problem
fixed in commit de1a16e0f1. This commit also fixes the potential
data races in the tests in etcdserver/server_test.go.
This fixes the problem that a proposal cannot be applied.
When the etcdserver.run loop starts, it expects to get the latest
existing snapshot. It should not attempt to request one, because the loop
itself is the entity that creates snapshots.
The Snapshot function is expected to:
1. know the size of the snapshot before writing data
2. split the snapshot-ready phase and the write-data phase, so we can cut
the snapshot first and write the data later
Update its interface to fit the requirements of etcdserver.
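A sketch of what such an interface could look like; the method names are illustrative, not the final etcdserver API:
```
package storage

import "io"

// Snapshot separates cutting a snapshot from writing its data, and
// exposes the size up front so the caller can announce it first.
type Snapshot interface {
	// Size returns the total number of bytes the snapshot will write.
	Size() int64
	// WriteTo streams the data of the already-cut snapshot; it may be
	// called at any time after the snapshot has been cut.
	WriteTo(w io.Writer) (n int64, err error)
	// Close releases the resources pinned by the cut snapshot.
	Close() error
}
```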
Before this PR, it always prints nil, because the cluster info has not
yet been recovered when it is printed:
```
2015-10-02 14:00:24.353631 I | etcdserver: loaded cluster information
from store: <nil>
```
snapshotStore stores snapshots; it supports getting the latest snapshot
and saving incoming snapshots.
raftStorage supports getting the latest snapshot when v3demo is enabled.
Before this PR, the log was:
```
2015/09/1 13:18:31 etcdmain: client: etcd cluster is unavailable or
misconfigured
```
It was quite hard for people to understand what happened.
Now we print out the exact reason for the failure and explain how to
handle it.
Like commit 6974fc63ed, this commit makes etcdserver forbid
removing a started member if quorum cannot be preserved after the
reconfiguration, when the option -strict-reconfig-check is passed to
etcd. The removal can cause deadlock if unstarted members have wrong
peer URLs.
Enabling the v3 demo may change the underlying data organization
of the v3 store, so we forbid unsetting --experimental-v3demo once it has
been used.
I found some grammatical errors in comments.
This pull request was originally submitted as
https://github.com/coreos/etcd/pull/3513.
I am resubmitting it following the correct guidelines.
The current membership-change functionality of etcd has a
problem that can cause deadlock.
How to reproduce:
1. construct an N-node cluster
2. add N new nodes with etcdctl member add, without starting the new members
What happens:
After the N adds finish, the total number of members becomes 2N and the
quorum becomes N + 1. Membership changes now require at least N + 1
nodes, because Raft treats membership information in its log like any
other ordinary log-append request.
Assume the peer URLs of the added nodes are wrong, because of operator
error or a bug in a wrapper program that launches etcd. In such a
case, both adding and removing members is impossible because the
quorum isn't preserved. Of course, ordinary requests cannot be
served either. The cluster appears deadlocked.
Of course, the best practice for adding new nodes is to add them one at
a time, letting each start before adding the next. However, the effect
of this problem is serious enough that preventing it forcibly seems
worthwhile.
Solution:
This patch lets etcd forbid adding a new node if the operation changes
the quorum and the new quorum is larger than the number of running
nodes; see the sketch below. The checking logic is activated when etcd
is launched with the newly added option -strict-reconfig-check. If the
option isn't passed, the default reconfiguration behavior is kept.
Fixes https://github.com/coreos/etcd/issues/3477
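A minimal sketch of the add-path check, with illustrative names:
```
package etcdserver

// Member is a minimal stand-in for the cluster member record.
type Member struct {
	ID      uint64
	Started bool // has the member ever contacted the cluster?
}

// canAddMember allows the reconfiguration only if the started members
// can still form a quorum of the enlarged cluster.
func canAddMember(members []Member) bool {
	started := 0
	for _, m := range members {
		if m.Started {
			started++
		}
	}
	newQuorum := (len(members)+1)/2 + 1 // quorum after adding one member
	return started >= newQuorum
}
```
For example, starting from a healthy 3-node cluster, the third consecutive add of an unstarted member is rejected: the cluster would have 6 members and a quorum of 4, but only 3 of them can vote.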
Using Go-style import paths in protos is not idiomatic. Normally, this
detail would be internal to etcd, but the path from which gogoproto
is imported affects downstream consumers (e.g. cockroachdb).
In cockroach, we want to avoid including `$GOPATH/src` in our protoc
include path for various reasons. This patch puts etcd on the same
convention, which makes that possible for cockroach.
More information: https://github.com/cockroachdb/cockroach/pull/2339#discussion_r38663417
This commit also regenerates all the protos, which seem to have
drifted a tiny bit.
It sets a 10s timeout for the public GetClusterFromRemotePeers.
This helps the following cases work well in high-latency scenarios:
1. a proxy syncing members from the cluster
2. a newly joined member syncing members from the cluster
Besides the 10s request timeout, the request is also controlled by the
dial timeout and the read connection timeout.
It identifies request timeout errors possibly caused by a lost connection,
and prints a better log for users to understand.
It handles two cases:
1. the leader cannot connect to a majority of the cluster
2. the connection between a follower and the leader is down for a while,
and proposals are lost
log format:
```
20:04:19 etcd3 | 2015-08-25 20:04:19.368126 E | etcdhttp: etcdserver:
request timed out, possibly due to connection lost
20:04:19 etcd3 | 2015-08-25 20:04:19.368227 E | etcdhttp: etcdserver:
request timed out, possibly due to connection lost
```
The request can still time out, because we have set the dial timeout and
read/write timeout. This increases the timeout expectation from 1s to 5s,
but it makes etcd workable in a globally-deployed cluster.
Before this PR, the timeout caused by leader election returned:
```
14:45:37 etcd2 | 2015-08-12 14:45:37.786349 E | etcdhttp: got unexpected
response error (etcdserver: request timed out)
```
After this PR:
```
15:52:54 etcd1 | 2015-08-12 15:52:54.389523 E | etcdhttp: etcdserver:
request timed out, possibly due to leader down
```
It uses the heartbeat interval and election timeout to estimate the
expected request timeout.
This PR helps etcd survive in high round-trip-time environments,
e.g., a globally-deployed cluster.
It uses the heartbeat interval and election timeout to estimate the
commit timeout for internal requests.
This PR helps etcd survive in high round-trip-time environments,
e.g., a globally-deployed cluster.
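A sketch of the estimation; the multipliers are illustrative rather than etcd's exact constants:
```
package etcdserver

import "time"

// requestTimeout estimates how long a request may reasonably take: a
// few heartbeat rounds to replicate and commit, plus up to two election
// timeouts in case the leader is lost along the way.
func requestTimeout(heartbeat, election time.Duration) time.Duration {
	return 5*heartbeat + 2*election
}
```
Under this illustrative formula, a 100ms heartbeat and a 1s election timeout give 2.5s; slower, globally-deployed settings scale the timeout up proportionally.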
Follow the simple rule in the atomic package:
"On both ARM and x86-32, it is the caller's responsibility to arrange
for 64-bit alignment of 64-bit words accessed atomically. The first word
in a global variable or in an allocated struct or slice can be relied
upon to be 64-bit aligned."
Tested on a system with /proc/cpuinfo reporting:
processor : 0
model name : ARMv7 Processor rev 1 (v7l)
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3
tls vfpv4 idiva idivt vfpd32 lpae evtstrm
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc0d
CPU revision : 1
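A sketch of how the rule is applied; the struct and field names are illustrative:
```
package stats

import "sync/atomic"

// ServerStats places its atomically-accessed 64-bit counters first, so
// they are 64-bit aligned even on ARM and x86-32, per the rule above.
type ServerStats struct {
	recvAppendRequestCnt uint64 // accessed with sync/atomic; must stay first
	sendAppendRequestCnt uint64

	name string // non-atomic fields come after the 64-bit words
}

func (s *ServerStats) recvAppendReq() {
	atomic.AddUint64(&s.recvAppendRequestCnt, 1)
}
```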
This behavior accelerates the first-time leader election, so the
cluster can elect its leader quickly. Technically, it helps reduce the
`electionMs - heartbeatMs` wait time for the first leader election.
Main usage:
1. Quick start for a local cluster that sets a somewhat longer
election timeout
2. Quick start for a global cluster, which sets the election timeout to
its maximum of 50s
A PUT on the endpoint sets GlobalDebugLevel to the level given in the
JSON body. The action overwrites the original log level set by the
user. We need to write docs to warn about this.
Before this PR, the test could fail like this:
```
--- FAIL: TestUpdateMember-2 (0.00s)
server_test.go:950: action =
[{ApplyConfChange:ConfChangeUpdateNode []}
{ProposeConfChange:ConfChangeUpdateNode []}], want
[{ProposeConfChange:ConfChangeUpdateNode []}
{ApplyConfChange:ConfChangeUpdateNode []}]
```
This fixes the test by recording the proposal event in time.
There is a possibility of updating the attributes of a removed member in a
reasonable workflow:
1. member A starts
2. the leader receives the proposal to remove member A
3. member A sends the proposal to update its attributes to the leader
4. the leader commits both proposals
So etcdserver should allow updating the attributes of a removed member.
ForceGosched() performs badly when GOMAXPROCS>1. When GOMAXPROCS=1, it
can guarantee that other goroutines run long enough,
because it always yields the processor to other goroutines. But it cannot
yield the processor to a goroutine running on another processor. So when
GOMAXPROCS>1, the yield may finish while the goroutine on the other
processor has run for only a little time.
Here is a test to confirm the case:
```
package main

import (
	"fmt"
	"runtime"
	"testing"
)

func ForceGosched() {
	// presumably enough to let up to 10 goroutines get scheduled
	for i := 0; i < 10000; i++ {
		runtime.Gosched()
	}
}

var d int

func loop(c chan struct{}) {
	for {
		select {
		case <-c:
			for i := 0; i < 1000; i++ {
				fmt.Sprintf("come to time %d", i)
			}
			d++
		}
	}
}

func TestLoop(t *testing.T) {
	c := make(chan struct{}, 1)
	go loop(c)
	c <- struct{}{}
	ForceGosched()
	if d != 1 {
		t.Fatal("d is not incremented")
	}
}
```
`go test -v -race` runs well, but `GOMAXPROCS=2 go test -v -race` fails.
Change the functionality to wait for the schedule to happen.
add godep for speakeasy and auth entry parsing
add security_user to client
add role to client
add role commands
add auth support to etcdclient and etcdctl(member/user)
add enable/disable to etcdctl
better error messages, read/write/readwrite
Bump go-etcd to include codec changes, add new dependency
verify the error for revoke/add if nothing changed, remove security-merging prefix
When it waits for the apply to be done, it should stop the loop if it
receives a stop signal.
This helps print out panic information. Before this PR, if a panic
happened while the server loop was applying entries, the server loop
would wait forever for the raft loop to stop.
Fixes #2841.
From a Prometheus developer:
```
the recommended way for etcd as an open source project and under
consideration of its size would be etcd_<subsystem>_<name>.
```
We made the naming change accordingly.
This reverts commit f8ce5996b0.
etcd no longer resolves TCP addresses passed in through flags,
so there is no need to compare hostname and IP slices anymore.
(for more details: a3892221ee)
Conflicts:
etcdserver/cluster.go
etcdserver/config.go
pkg/netutil/netutil.go
pkg/netutil/netutil_test.go
etcdhttp will check the cluster version and update its
capability version periodically.
Any new handler added after 2.0 needs to be wrapped by the capability
handler to ensure it is not accessible until the rolling upgrade has finished.
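A sketch of such a wrapper, with illustrative names:
```
package etcdhttp

import (
	"net/http"
	"sync"
)

type capability string

var (
	capMu      sync.RWMutex
	enabledMap = map[capability]bool{}
)

// updateCapability is invoked by the periodic cluster-version check.
func updateCapability(c capability, enabled bool) {
	capMu.Lock()
	enabledMap[c] = enabled
	capMu.Unlock()
}

// capabilityHandler guards a post-2.0 handler: requests are rejected
// until the rolling upgrade finishes and the capability is enabled.
func capabilityHandler(c capability, h http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		capMu.RLock()
		enabled := enabledMap[c]
		capMu.RUnlock()
		if !enabled {
			http.Error(w, "cluster is not capable of accessing this endpoint yet", http.StatusBadRequest)
			return
		}
		h.ServeHTTP(w, r)
	})
}
```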
After this PR, only the cluster's interface, Cluster, is exposed, which
makes the code much cleaner and keeps external packages from relying on
the cluster struct in the future.
The PR extracts types.Cluster from etcdserver.Cluster. types.Cluster
is used for flag parsing and etcdserver config.
There is no need to expose etcdserver.Cluster publicly, since it contains
lots of etcdserver internal details and methods. This is the first step
toward that.
1. Persist the cluster version change through raft. When a member is restarted, it can recover
the previously known decided cluster version.
2. When there is a new leader, it is forced to do a version check immediately. This helps
update the first cluster version quickly.
We store cluster-related keys under StoreAdminPrefix for
historical reasons: the previous API was called admin. But now
the admin name is gone, and `cluster` is a clearer and more correct
name.
Cluster version is the min major.minor of all members in
the etcd cluster. The cluster version is set to the min version
that an etcd member is compatible with when the cluster first bootstraps.
During a rolling upgrade, the cluster version is updated
automatically.
For example:
```
Cluster [a:1, b:1 ,c:1] -> clusterVersion 1
update a -> 2, b -> 2
after a detection
Cluster [a:2, b:2 ,c:1] -> clusterVersion 1, since c is still 1
update c -> 2
after a detection
Cluster [a:2, b:2 ,c:2] -> clusterVersion 2
```
The API/raft component can use clusterVersion to determine whether
it can accept a client request or a raft RPC.
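A sketch of the min-version computation, assuming the coreos/go-semver package; names are illustrative:
```
package etcdserver

import "github.com/coreos/go-semver/semver"

// decideClusterVersion returns the minimum major.minor across the
// versions reported by all members, or nil while any member has not
// reported yet and the version cannot be decided.
func decideClusterVersion(vers []*semver.Version) *semver.Version {
	var cv *semver.Version
	for _, v := range vers {
		if v == nil {
			return nil
		}
		w := &semver.Version{Major: v.Major, Minor: v.Minor}
		if cv == nil || w.LessThan(*cv) {
			cv = w
		}
	}
	return cv
}
```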
We chose polling rather than pushing, since we want to use the same
logic for cluster version detection and (TODO) cluster version checking.
Before a member actually joins an etcd cluster, it should check the version
of the cluster. Push does not work, since the other members cannot push
version info to it before it actually joins. Moreover, we do not want our
raft RPC system (which does the heartbeat pushing) to coordinate the cluster version.
Originally, etcd stops only when a pipeline message finds that the member
itself has been removed. After this PR, stream dial has this functionality too.
This helps etcd stop fast: it no longer needs to wait for the stream to
break and fall back to pipeline, then wait an election timeout to send out
a message that detects self removal.
Add remotes to rafthttp, which help newly joined members catch up with the
progress of the cluster. It supports basic message sending to remotes, and
has no stream connection, for simplicity. Remotes will not be used
after the latest peers have been added into rafthttp.
Subcommits:
decouple root and security enable/disable
create root role
prefix matching
godep: bump go-etcd to include credentials
add godep for speakeasy and auth entry parsing
appropriate errors for security enable/disable
WIP adding to etcd/client all the security client methods
add guest access
minor ui return tweaks
revert client changes
respond to comments, log more security operations
fix major ensure() bug, add better UX
block recursive access
fix some boneheaded mistakes
fix integration test
last comments
fix up security_api.md
philips nits
fix docs
It is more reasonable to initialize the variable before passing it as an
argument.
This fixes a bug where etcdserver may panic on server stats when processing
a message from the rafthttp streamReader before server stats is initialized
in server.Start().
It exposes metrics for the file descriptor limit and the number of file
descriptors in use.
Moreover, it prints a warning when more than 80% of the fd limit has been used:
```
2015/04/08 01:26:19 etcdserver: 80% of the file descriptor limit is open
[open = 969, limit = 1024]
```
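On Linux, the two numbers behind the metric and the warning can be obtained roughly like this (an illustrative sketch, not etcd's exact code):
```
package metrics

import (
	"fmt"
	"io/ioutil"
	"syscall"
)

// fdUsage reads the fd limit via getrlimit and counts the entries of
// the process's /proc fd directory to get the fds currently in use.
func fdUsage() (used, limit uint64, err error) {
	var rlim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err != nil {
		return 0, 0, err
	}
	fds, err := ioutil.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, 0, err
	}
	return uint64(len(fds)), rlim.Cur, nil
}

// checkFDLimit warns once usage crosses 80% of the limit.
func checkFDLimit() {
	used, limit, err := fdUsage()
	if err != nil {
		return
	}
	if used > limit*8/10 {
		fmt.Printf("etcdserver: 80%% of the file descriptor limit is open [open = %d, limit = %d]\n", used, limit)
	}
}
```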
Stop the raftNode goroutine before stopping the server goroutine, so
server.Stop now stops everything underneath it cleanly. This fixes
the problem that the previous round's lock on the WAL may not be released
when etcd is restarted.
It is safe to repair an unexpectedEOF error in the last WAL file: raft
will not send out messages before the entry is successfully committed
into the WAL. Thus we can safely truncate the last entry in the WAL
to repair it.
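A sketch of the repair; decodeRecord and the offset bookkeeping are hypothetical stand-ins for the real WAL decoder:
```
package wal

import (
	"io"
	"os"
)

// repair decodes records from the last WAL file until it hits a torn
// record, then truncates the file at the last offset that decoded
// cleanly. The torn entry was never acknowledged by raft, so dropping
// it loses nothing.
func repair(path string, decodeRecord func(r io.Reader) (n int64, err error)) error {
	f, err := os.OpenFile(path, os.O_RDWR, 0600)
	if err != nil {
		return err
	}
	defer f.Close()
	var goodOffset int64
	for {
		n, err := decodeRecord(f)
		if err == io.ErrUnexpectedEOF {
			return f.Truncate(goodOffset)
		}
		if err == io.EOF {
			return nil // the file is intact
		}
		if err != nil {
			return err
		}
		goodOffset += n
	}
}
```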