If the node is stopped, then Status can hang forever because there is no
event loop to answer. So, just return an empty status to avoid deadlocks.
Fixes #6855
Signed-off-by: Alexander Morozov <lk4d4math@gmail.com>
The create header revision is the current etcd revision. For watches with
rev=0, the next revision is hdr.rev+1. For watches with rev=n, the next
revision should be n.
Fixes TestDoubleBarrier timeouts.
The single watcher / group watcher distinction limited and
complicated watcher coalescing more than necessary. Reworked:
Each server watcher is represented by a WatchBroadcast, each
client "Watcher" attaches to some WatchBroadcast. WatchBroadcasts
hold all WatchBroadcast instances for a range. WatchRanges holds
all WatchBroadcasts for the proxy.
A WatchProxyStream represents a grpc watch stream between the proxy and
a client. When a client requests a new watcher through its grpc stream,
the WatchProxyStream allocates a Watcher and WatchRanges assigns it to
some WatchBroadcast based on its range.
Coalescing is done by WatchBroadcasts when it receives an update
notification from a WatchBroadcast.
Supports leader failure detection so watches on a bad member
can migrate to other members. Coincidentally, fixes #6303.
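A rough sketch of the resulting ownership hierarchy (type and field
names are illustrative, not the exact proxy code):

type watchRanges struct { // one per proxy
	bcasts map[string]*watchBroadcasts // one entry per watched range
}

type watchBroadcasts struct { // all broadcasts for a single range
	bcasts map[*watchBroadcast]struct{}
}

type watchBroadcast struct { // one watcher on the etcd server
	receivers map[*watcher]struct{} // coalesced client watchers
}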
This changes the two immutable defaults into constants, which allows
packages embedding etcd to import them as consts! If they are variables,
then a const declaration using them fails with "const initializer foo is
not a constant".
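A minimal illustration (hypothetical names; a sketch, not the actual
identifiers):

// in etcd:
const DefaultRequestTimeout = 5 * time.Second // a true constant

// in an embedding package:
const myTimeout = etcdserver.DefaultRequestTimeout // compiles
// if DefaultRequestTimeout were declared with var, the line above would
// fail with "const initializer etcdserver.DefaultRequestTimeout is not a constant"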
Also remove all legacy sub-dependency packages from glide.yaml.
They were added when we migrated from godep; glide handles
them automatically with the glide.lock file.
The 'glide vc --no-tests' flag removes the 'testify/assert' deps
in the v2 client. Until we deprecate the v2 tests, just copy the
necessary files as a workaround.
Also remove the '--skip-tests' flag in case we add dependencies
in test files.
If we want to add an endpoint with a lease, we need this option.
For example:
resp, err := cli.Grant(context.TODO(), 5)
if err != nil {
	log.Fatal(err)
}
err = r.Update(context.TODO(), serviceName, naming.Update{Op: naming.Add, Addr: exposedAddr}, clientv3.WithLease(resp.ID))
if err != nil {
	log.Fatal(err)
}
This exists to prevent sending so many requests that the applier
falls behind Raft's accepted proposals.
Based on recent benchmarks, etcd was able to process high workloads
(2 million writes with 1K concurrent clients).
The limit 1000 is too conservative to test those high workloads.
The functional tester sometimes experiences timeouts during the compaction phase. I changed the timeout calculation to be based on the number of entries created and deleted.
Fixes #6805
If MsgTimeoutNow arrived after a node was removed, the node could start
and win an election, then panic in becomeLeader (see
cockroachdb/cockroach#8535).
If the txn comparison block makes claims about a key's current
state, then it may say a key has been updated. Future range/txn
operations may expect this update to eventually be propagated through
the cluster and show up in serialized requests. To avoid spinning
forever on txn/serialized range loops, invalidate the comparison keys.
Getting lease and key info through raw RPCs rarely experiences errors such as EOF. This is considered a failure and causes the tester to clean up.
However, these are just transient problems caused by temporary connection issues and should not be considered testing failures, so we add retry logic for transient failures.
Fixes #6754
Address https://github.com/coreos/etcd/issues/6754.
In case there are network errors or unexpected EOF errors
in TimeToLive http requests to the leader, we translate that into
ErrNoLeader and expect the client to retry its request.
Couldn't find watcher group from rid on server stream close, leading to
the watcher group sending on a closed channel.
Also got rid of send closing the watcher stream if the buffer is full;
that could lead to a send after close while broadcasting to all receivers.
Instead, if a send times out, the server stream is canceled.
Fixes #6739
The checkers and stressers should be composable without special cases; this
patch tries to address that while refactoring out some old cruft.
Namely,
* Single stresser/checker for a tester; built from composition
* Composite stresser via comma-separated list of stressers
* Split stressers into separate files
* Removed v2 only flags and special cases
* Rate limiter shared among key stresser and leases stresser
* Composite checker is now concurrent
* Stresser can return a Checker to check its invariants
* Each lease checker only operates on a single lease stresser
These really belong in tester code; the stressers and
checkers are higher order operations that are orchestrated
by the tester. They're not really cluster primitives.
The current tester doesn't clean up if any of the failure injections or recoveries fail. If the tester fails to recover a dead node, it hangs in the next round: it keeps waiting for the cluster to become healthy, which is impossible while a node is down. To fix this, we now always clean up if any error happens during a round, so that the cluster is healthy for the next round.
Fixes #6743
The lease stresser now generates short-lived leases that expire before invariant checking.
This addition verifies that expired leases are indeed being deleted on the server side.
Increasing the lease TTL ensures that leases don't expire during the hash stabilization period.
I observed that it can take a long time for the etcd cluster to become stable.
This allows the client to spot if the cluster ID changes, which
would indicate that the cluster has been rebuilt and watches may be
out of sync.
Helps work around #6652.
The watch's etcd server is shutdown to keep the watch in a retry state as
keys are put and compacted on the cluster. When the server restarts,
there is a window where the compact hasn't been applied which may cause
the watch to receive all events instead of only a compaction error.
Fixes #6535
Was overcounting the number of expected closing messages; the resuming
list may have nil entries. Also the full client wasn't closing the watcher
client, only canceling its context, so client closes weren't joining with
the watcher shutdown.
Fixes #6605
Using Printf will try to parse the string and replace special
characters. In the migrate code, we want to just output the raw
JSON string of client.Node.
For example,
Printf("%\\") => %!\(MISSING)
Print("%\\") => %\
Thus, we should use Print instead.
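A minimal runnable illustration of the difference:

package main

import "fmt"

func main() {
	fmt.Printf("%\\") // prints %!\(MISSING): '%' starts a verb and gets mangled
	fmt.Println()
	fmt.Print("%\\") // prints %\ verbatim; no verb parsing
}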
Move all vote handling from the per-state step functions to the
top-level Step(). This wasn't necessary before because MsgVote would
cause us to become a follower, but MsgPreVote needs to be handled
without changing the node's current state.
Some tests were starting nodes with a non-empty log but a term of zero,
which cannot happen in the real world. This was affecting the final term
being tested in TestLeaderElection.
Move the checker logic from tester to cluster so that stressers and checkers can be initialized at the same time.
This is useful because some checkers depend on stressers.
The simpleBalancer.Get() blocks grpc.Invoke() even when Invoke() is called
with the FailFast option. Therefore, any request with the FailFast
option currently doesn't actually fail fast; it gets blocked when there
is no endpoint available.
The Get() method needs to respect the BlockingWait option when it picks
an endpoint address from the list, and fail immediately when fail-fast is
requested and no endpoint is available.
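A sketch of the intended Get() behavior (simplified; the error name and
the upc notification channel are illustrative):

func (b *simpleBalancer) Get(ctx context.Context, opts grpc.BalancerGetOptions) (grpc.Address, func(), error) {
	b.mu.RLock()
	addr := b.pinAddr
	b.mu.RUnlock()
	if addr == "" {
		if !opts.BlockingWait {
			// fail-fast request: error out instead of blocking forever
			return grpc.Address{}, func() {}, ErrNoEndpointAvailable
		}
		// blocking request: wait for an endpoint to be pinned
		select {
		case <-b.upc: // closed once some endpoint is up
		case <-ctx.Done():
			return grpc.Address{}, func() {}, ctx.Err()
		}
		b.mu.RLock()
		addr = b.pinAddr
		b.mu.RUnlock()
	}
	return grpc.Address{Addr: addr}, func() {}, nil
}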
The cost of bcrypt password checking is quite high (almost 100ms on a
modern machine), so executing it in the apply loop is
problematic. This commit moves the checking mechanism to the API
layer. The password check is validated in an OCC-like way,
similar to the auth check of serializable get.
This commit also removes a unit test of the Authenticate RPC from
auth/store_test.go, because the RPC now accepts an auth request
unconditionally and delegates the checking functionality to
authStore.CheckPassword() (so a unit test for CheckPassword() is
added). The combination of the two functionalities is covered by
e2e tests (e.g. TestCtlV3AuthWriteKey).
Fixes https://github.com/coreos/etcd/issues/6530
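A sketch of the API-layer check (simplified; helper names are
illustrative, not the exact authStore code):

func (as *authStore) CheckPassword(username, password string) (uint64, error) {
	user := as.getUser(username) // read-only lookup, no apply loop involved
	if user == nil {
		return 0, ErrAuthFailed
	}
	// the ~100ms bcrypt comparison now runs in the API layer
	if bcrypt.CompareHashAndPassword(user.Password, []byte(password)) != nil {
		return 0, ErrAuthFailed
	}
	// the returned revision lets callers revalidate later, OCC style
	return as.Revision(), nil
}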
This commit adds a new option --check-key to endpoint health command
for health checking with a custom key. It is mainly for avoiding
permission problems.
This commit adds a new option -failure-wrapper to etcd-tester. The
option receives the path of a script used for enabling/disabling
external fault injectors. The script is called with the option "enable"
when it needs to be enabled (when failure.Inject() is called) and
with "disabled" in the opposite case (when failure.Recover() is called).
When running test suites for a client locally I'm getting spammed by log
lines such as:
etcdserver: apply entries took too long [14.226771ms for 1 entries]
The comments in #6278 mention there were future plans of increasing the
threshold for logging these warnings, but it hadn't been done yet.
The previous implementation watches a single key, so there's no way
to have separate hosts associated with separate keys for a single
grpc target. Instead, accept all keys on a prefix.
Also fixes the first Next() to read current name data from etcd instead
of waiting for the next event on a synced watcher.
Try:
./etcdctl put foo bar
./etcdctl del foo
./etcdctl compact 3
restart etcd
./etcdctl get foo
mvcc: required revision has been compacted
The error is unexpected when ranging over the head revision.
Internally, we incorrectly set the current revision smaller than the
compacted revision when we remove all keys around the compacted revision.
This commit fixes the issue by recovering the current revision to at
least the compacted revision.
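The fix amounts to clamping the restored revision (a sketch of the
restore path):

// if all keys around the compaction point were removed, currentRev
// must still not fall below the compacted revision
if s.currentRev.main < s.compactMainRev {
	s.currentRev.main = s.compactMainRev
}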
Updates TestKVPutError. Change the quota to work with systems
that have a 64 KiB page size. Increase the db sync wait time to
one second. Also, add some comments for the hard coded value.
Signed-off-by: Geoff Levand <geoff@infradead.org>
Use the system page size to set the test quota size. Also, change
a comment related to setting the node quota to be more clear.
Signed-off-by: Geoff Levand <geoff@infradead.org>
Rework the over-quota test to be more realistic. Take into
consideration that the system page size will be different across
platforms.
Signed-off-by: Geoff Levand <geoff@infradead.org>
We just need a small chunk of data to test put, so to be
consistent across platforms use a fixed size of 64 bytes.
Signed-off-by: Geoff Levand <geoff@infradead.org>
On slower or heavily loaded platforms running the integration pass in
parallel results in test timeout errors.
Rename the integration_pass function to integration_e2e_pass, and add two
new functions integration_pass and e2e_pass.
Signed-off-by: Geoff Levand <geoff@infradead.org>
If we promote the lessor before we finish applying all
entries from the last term, we might incorrectly renew
already revoked leases.
Here is an example:
- Term 1: revoke lease A accepted by raft
- Old leader failed, new election happened
- Term 2: promote
- Term 2: keep alive A succeeds. A now has a 10 second TTL
- Term 2: revoke lease A from Term 1 gets committed and applied
- Term 2: the lease A with 10 second TTL is revoked
To solve this, the new leader MUST apply all entries from the old term
before promoting its lessor to start accepting renew requests.
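A sketch of the ordering constraint (helper names are illustrative):

// on becoming leader: do not promote the lessor until entries
// committed in earlier terms have been applied
for s.getAppliedIndex() < s.getCommittedIndex() {
	select {
	case <-s.applyWait: // signaled after each batch of applies
	case <-s.stopping:
		return
	}
}
s.lessor.Promote(s.Cfg.electionTimeout())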
The initial revision was being updated in the substream goroutine defer;
this was racing with the resume path fetching the initial revision when
the substream closes during resume. Instead, update the initial revision
whenever the substream processes a new watch response. Since the substream
cannot receive a watch response while it is resuming, the write to the
initial revision is ordered to always happen after the resume read.
Fixes #6586
Some fixes related to release_pass:
o Create the output directory ./bin if it does not exist.
o Define the GOARCH variable if it is not defined.
o Simplify the race detection test.
o Download the release archive based on GOARCH.
o If the release file is not found, return success. This will allow the tests
to continue.
Signed-off-by: Geoff Levand <geoff@infradead.org>
This commit adds a new option --failures to etcd-tester. The option
receives a comma-delimited argument like this:
"default,failpoints". The given arguments are interpreted as names of
failures and they are injected into an etcd cluster. Available failures
are default (default scenario in etcd-tester) and failpoints. If no
args are passed to the option (--failures=""), no failures are
injected during testing.
When a non-leader etcd server receives a LeaseTimeToLive request for a nonexistent lease, it responds with a nil resp and a nil error. The invoking function parses the nil resp, which results in a segmentation fault.
I fix the bug by making sure the lease-not-found error is returned, so that the invoking function parses the error message instead.
Fixes #6537
This commit decouples stresser from the tester of
functional-tester. To do so, this commit adds a new option
--stresser to etcd-tester. The option accepts two types of stresser:
"default" and "nop". If the option is "default", etcd-tester stresses
its etcd cluster with the existing stresser. If the option is "nop",
etcd-tester does nothing for stressing.
Partially fixes https://github.com/coreos/etcd/issues/6446
When 1000 leases expire at the same time, etcd takes more than 5 seconds to clean them up. This means that even after the leases have expired, keys associated with them are still accessible. I increased the deletion throughput by parallelizing the lease deletion process.
TestWatchReconnRequest occasionally triggers elections because it spins on
dropping connections, eating up CPU. In case there's an election, submit an
l-read to wait for the cluster to settle down.
Fixes #6314
All outstanding goroutines now go into the etcdserver waitgroup. Goroutines are
shut down with a "stopping" channel, which is closed when the run() goroutine
shuts down. The done channel will only close once the waitgroup is totally cleared.
TickQuiesced allows the caller to support "quiesced" Raft groups which
do not perform periodic heartbeats and elections. This is useful in a
system with thousands of Raft groups where these periodic operations can
be overwhelming in an otherwise idle system.
It might seem possible to avoid advancing the logical clock at all in
such Raft groups, but doing so has an interaction with the CheckQuorum
functionality. If a follower is not quiesced while the leader is, the
follower can call an election that will fail because the leader's lease
has not expired (electionElapsed < electionTimeout). The next time the
leader sends a heartbeat to this follower the follower will see that the
heartbeat is from a previous term and respond with a MsgAppResp. This in
turn will cause the leader to step down and become a follower even
though there isn't a leader in the group. By allowing the leader's
logical clock to advance via TickQuiesced, the leader won't reject the
election and there will be a smooth transfer of leadership to the
follower.
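The method itself is tiny (a sketch; the exact receiver in the raft
package may differ):

// TickQuiesced advances only the election clock, skipping heartbeat
// and election processing for quiesced groups
func (rn *RawNode) TickQuiesced() {
	rn.raft.electionElapsed++
}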
Added support to the v2 API to fix an issue (#6433) where, if there was a semicolon
with fields after it, the API would return an "invalid Content-type" message even
if the content type was actually correct.
This commit improves the printing of the role get command for prefix
permissions. If a range permission corresponds to a prefix permission,
it is explicitly printed for the user. Below is an example of the new
printing:
$ ETCDCTL_API=3 bin/etcdctl --user root:p role get r1
Role r1
KV Read:
[/dir/, /dir0) (prefix /dir/)
[k1, k5)
KV Write:
[/dir/, /dir0) (prefix /dir/)
[k1, k5)
This commit adds a new option --prefix to "role grant-permission"
command. If the option is passed, the command interprets the key as a
prefix of range permission.
Example of usage:
$ ETCDCTL_API=3 bin/etcdctl --user root:p role grant-permission --prefix r1 readwrite /dir/
Role r1 updated
$ ETCDCTL_API=3 bin/etcdctl --user root:p role get r1
Role r1
KV Read:
[/dir/, /dir0)
[k1, k5)
KV Write:
[/dir/, /dir0)
[k1, k5)
$ ETCDCTL_API=3 bin/etcdctl --user u1:p put /dir/key val
OK
The timeout is too short. It might take more than 10ms to send a
request over a blocking chan (when the buffer is full). Changing the timeout
to 1 second fixes this issue.
The format "http://localhost:1234" causes the existing port parser to fail. Add new logic to parse the host name first, then extract the port.
Fixes #6409
Test 15, counting from zero, in TestGetMergedPerms
in etcd/auth/range_perm_cache_test.go was
incorrectly trying to assert that [a, b) merged with [b, "")
should be [a, b). Added a test specifically for
this. This patch fixes the incorrect larger test
and the bugs in the code that it was hiding.
Fixes #6359
From the [systemd NetworkTarget description](https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/):
```
[...]since the shutdown ordering of units in systemd is the reverse of the startup ordering, any unit that is order After=network.target can be sure that it is stopped before the network is shut down if the system is powered off. This allows services to cleanly terminate connections before going down, instead of abruptly losing connectivity for ongoing connections, leaving them in an undefined state.[...]
```
1. Under the PUT example, the put command was written in capitals, which
gives the error below:
Error: unknown command "PUT" for "etcdctl"
Hence corrected the same.
2. The lease ID is written with 0x to denote hex, but since it's an
example, copy-pasting the command gives the error below:
Error: bad lease ID (strconv.ParseInt: parsing "0x1234abcd": invalid
syntax), expecting ID in Hex
Hence modified it to a valid sample value so that a user new to
etcd does not get confused.
3. The command ./etcdctl range foo does not work and gives the error
below:
Error: unknown command "range" for "etcdctl"
Hence corrected the same.
#6372
Grow the inflights buffer as needed instead of preallocating it to its
max size. This avoids preallocating a lot of unnecessary
space (8*MaxInflightMsgs) when using lots of raft groups while still
allowing for a reasonable MaxInflightMsgs configuration.
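A sketch of the lazy growth (doubling, capped at the configured max):

func (in *inflights) growBuf() {
	newSize := len(in.buffer) * 2
	if newSize == 0 {
		newSize = 1
	} else if newSize > in.size { // in.size is the MaxInflightMsgs cap
		newSize = in.size
	}
	newBuffer := make([]uint64, newSize)
	copy(newBuffer, in.buffer)
	in.buffer = newBuffer
}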
Since c597d591b5 the release script uses
acbuild instead of actool, so purge all the references and have the
release script check for acbuild's presence instead.
rand.NewSource creates a 4872 byte object. With a small number of raft
groups in a process this isn't a problem. With 10k raft groups we'd use
46MB for these random sources. The only usage is in
raft.resetRandomizedElectionTimeout which isn't performance critical.
Fixes #6347.
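One way to share a single source across all groups (a sketch):

// one process-wide, mutex-guarded source instead of a 4872-byte
// rand.NewSource per raft group
var globalRand = &lockedRand{rand: rand.New(rand.NewSource(time.Now().UnixNano()))}

type lockedRand struct {
	mu   sync.Mutex
	rand *rand.Rand
}

func (r *lockedRand) Intn(n int) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.rand.Intn(n)
}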
We do not wait for the cancellation from the actual etcd server,
but generate it on the proxy side. The rule is to return the
latest rev that the watcher has seen. This should be good
enough for most use cases if not all.
Do not restart the killed member immediately.
The member will advance its election timeout after restart,
so it will have a better chance to become the leader again.
If a user upgrades etcd from 2.3.x to 3.0 and shuts down the
cluster immediately without triggering any new backend writes,
then the consistent index in the backend will be zero.
The user cannot restart etcdserver due to today's strict index
match checking. We now have to loosen this a bit for this case.
When using the embed functionality, you can't call the Server.Stop()
function until StartEtcd returns, which can block until there is a call
to Server.Stop() in error situations. Since we have a catch-22, the
ReadyNotify() can be called manually by the user if they wish to wait
for the server startup, or in parallel with a timeout if they wish to
cancel it after some time.
Chzz pointed out that this is also more consistent with the
etcdserver.Start() behaviour.
purpleidea pointed out that this is actually more correct too, because
we can now register the stop interrupt handler before we block on
startup.
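The resulting usage pattern looks like this (a sketch):

e, err := embed.StartEtcd(cfg)
if err != nil {
	log.Fatal(err)
}
defer e.Close()
select {
case <-e.Server.ReadyNotify():
	log.Printf("Server is ready!")
case <-time.After(60 * time.Second):
	e.Server.Stop() // trigger a shutdown
	log.Printf("Server took too long to start!")
}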
After winning an election or obtaining a lock, we
auto-append a slash after the provided key prefix.
This avoids the previous deadlock due to waiting
on the wrong key.
Fixes #6278
Previously, the checkQuorum flag required an election timeout to
expire before a node could cast its first vote. This change permits
the node to cast a vote at any time when the leader is not known,
including immediately after startup.
The get func was calling path's Join and Clean methods, which are already
called in the internalGet(nodePath) func. Hence the functions were being
called twice unnecessarily.
#6295
Windows requires this lock to be released before the directory is
renamed. But on unix-like operating systems, releasing the lock and
trying to reacquire it immediately can be flaky if a process is forked
around the same time. The file descriptors are marked as close-on-exec
by the Go runtime, but there is a window between the fork and exec where
another process will be holding the lock.
Whenever the WAL is opened for writes, it should write zeroes to its tail
starting from the first zero record. Otherwise, if there are entries past
the first zero record due to a torn write, any new writes that overlap the
old entries will lead to a garbage record on the tail and cause a CRC
mismatch.
Was incorrectly trimming the trailing '.' from the target; this in turn
caused the etcd server to accept any SRV record with an IP target
instead of only targets with A records.
kv.commit updates the consistent index in the backend. When
executing in parallel with apply, it might grab the tx lock
after apply updates the consistent index and before apply
starts to execute the operation. If the server dies right
after kv.commit, the consistent index is updated but the operation
is not executed. If we restart the etcd server, etcd will skip
the operation. :(
There are a few other places that we need to take care of,
but let us fix this first.
Was able to get 2s wait times with 500 concurrent requests on a fast machine;
a slower machine could possibly see similar delays with a single connection.
Fixes #6220
Builds already vendor through cmd/ so there's no reason to set the GOPATH; it
was also breaking gofail builds. For builds that need to override GOPATH, also
include the old GOPATH as a fallback for dependencies outside cmd/vendor/.
Moved the code for preparing and sorting advertised peer URLs and
sorting of peer URLs so it runs only when strict verification needs to
be done. This avoids the processing when strict verification is not
required, as in the VerifyJoinExisting function.
#6165
If the test calls clock.Advance() after the compactor checks clock.Now()
but before the compactor calls clock.After(), the compactor will wait
forever on clock.After() expecting the lost clock.Advance().
Reproduced failure by putting a Sleep() in the clock.Now() continue path.
Fixes #6060 (again)
Add the bin-dir option to the command line, so the e2e tests can
run with an existing binary. For example (run the command under the e2e
directory):
go test -v -timeout 10m -bin-dir /usr/bin -cpu 1,2,4
Signed-off-by: Yiqiao Pu <ypu@redhat.com>
In test situations, it's useful to create smaller than usual WAL files
to test rotation and to avoid the overhead of preallocation on old-style
filesystems that don't handle it efficiently. This commit changes
segmentSizeBytes to an exported variable so that tests can override it
from an init() function.
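A test can then shrink segments from an init() function (a sketch):

func init() {
	// use tiny 64 KB segments so rotation is exercised quickly
	wal.SegmentSizeBytes = 64 * 1024
}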
1. fix failure case counting
2. match ErrClientConnClosing in stresser
3. longer timeout for set-health-key
4. fixed range for range/delete stresser
5. remove Limit in RangeRequest
The previous logic is wrong: with a history like Put(foo, bar, lease1)
and Put(foo, bar, lease2), we would end up attaching foo to both lease 1 and
lease 2. Similar things can happen for detach, by clearing the lease of a key.
Now we fix this by starting to attach leases at the end of the recovery.
We use a map to keep the last lease attachment state.
Otherwise the http stream remains open and keeps receiving raft messages.
This can lead to "raft: stopped" log spam on closing an embedded server.
Fixes #5981
Current v2 auth API doesn't propagate its error code. This commit adds
utility functions for parsing error messages and getting detail of v2
auth errors.
Fixes https://github.com/coreos/etcd/issues/5894
If a server isn't serving txn requests from a client, the server
doesn't need the result of range requests in the txn.
This is a follow-up to
https://github.com/coreos/etcd/pull/5689
SdNotify() now returns 2 values, sent and err. So startEtcdOrProxyV2()
needs to check the 2 return values correctly. As the 2 values are
independent of each other, error checking needs to be slightly updated
too.
SdNotifyNoSocket, which was previously provided by go-systemd, does not
exist any more. In that case (false, nil) will be returned instead.
This commit introduces revision of authStore. The revision number
represents a version of authStore that is incremented by updating auth
related information.
The revision is required for avoiding TOCTOU problems. Currently there
are two types of the TOCTOU problems in v3 auth.
The first one is in ordinary linearizable requests, with a sequence like
below:
1. Request from client CA is processed in follower FA. FA looks up the
username (call it U) for the request from a token of the request. At
this time, the request is authorized correctly.
2. Another request from client CB is processed in follower FB. CB
is for changing U's password.
3. FB forwards the request from CB to the leader before FA's request. Now U's
password is updated and the request from CA should be rejected.
4. However, the request from CA is processed by the leader because
authentication is already done in FA.
For avoiding the above sequence, this commit lets
etcdserverpb.RequestHeader have a member revision. The member is
initialized during authentication by followers and checked in a
leader. If the revision in RequestHeader is lower than the leader's
authStore revision, it means a sequence like above happened. In such a
case, the state machine returns auth.ErrAuthRevisionObsolete. The
error code lets nodes retry their requests.
The second one, a case of serializable range and txn, is more
subtle, because these requests are processed directly in followers. The
TOCTOU problem can be caused by a sequence like below:
1. Serializable request from client CA is processed in follower FA. At
first, FA looks up the username (call it U) and its permission
before actual access to KV.
2. Another request from client CB is processed in follower FB and
forwarded to the leader. The cluster including FA now commits a log
entry of the request from CB. Assume the request changed the
permission or password of U.
3. Now the serializable request from CA accesses KV. Even if
the access was allowed at the point of 1, it can now be invalid
because of the change introduced in 2.
For avoiding the above sequence, this commit lets the functions of
serializable requests (EtcdServer.Range() and EtcdServer.Txn())
compare the revision in the request header with the latest revision of
authStore after the actual access. If the saved revision is lower than
the latest one, it means the permission can be changed. Although it
would introduce false positives (e.g. changing other user's password),
it prevents the TOCTOU problem. This idea is an implementation of
Anthony's comment:
https://github.com/coreos/etcd/pull/5739#issuecomment-228128254
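A sketch of the post-access check (the field and error names follow the
description above and are illustrative):

// after serving a serializable Range/Txn locally, verify that auth
// state did not change while the request was in flight
if hdr.AuthRevision < s.AuthStore().Revision() {
	// permissions may have changed since authentication; the client retries
	return nil, auth.ErrAuthRevisionObsolete
}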
Otherwise GOARCH=386 PASSES="build integration" ./test fails on amd64
because the e2e tests can't find the binaries. Added a BINDIR option
for writing the build output to somewhere else, in case it's needed.
If etcdctl accesses the cluster before all members are published, it
will get an "unsupported protocol scheme" error. To fix, wait for both
the capabilities and published message.
Fixes #5824
The follower has already set its leader ID from
previous append messages from the leader, but
to be consistent, this adds a line to set its
leader ID from the leader's snapshot message.
Most fields accessed with sync/atomic functions are 64bit aligned, but a couple
are not. This makes comments out of date and therefore misleading.
Affected fields reordered, comments scrubbed and updated.
The Entry struct has misaligned fields that are accessed atomically. The
misalignment is caused by the EntryType enum which the Protocol Buffers
spec forces to be a 32bit int.
Moving the order of the fields without renumbering them in the .proto file
seems to align the go structure without changing the wire format.
The relevant structures are properly aligned, however, there is no comment
highlighting the need to keep it aligned as is present elsewhere in the
codebase.
Adding note to keep alignment, in line with similar comments in the codebase.
This commit adds --user for auth in benchmarks. Its purpose is to
measure the overhead of v3 API authentication. Of course, the given
user must be granted permission on target keys before benchmarking.
Example of a case with no authentication:
% ./benchmark range k1
bench with linearizable range
10000 / 10000 Booooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%2m10s
Summary:
Total: 130.1850 secs.
Slowest: 0.4071 secs.
Fastest: 0.0064 secs.
Average: 0.0130 secs.
Stddev: 0.0079 secs.
Requests/sec: 76.8138
Response time histogram:
0.006 [1] |
0.046 [9990] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.087 [3] |
0.127 [0] |
0.167 [3] |
0.207 [2] |
0.247 [0] |
0.287 [0] |
0.327 [0] |
0.367 [0] |
0.407 [1] |
Latency distribution:
10% in 0.0076 secs.
25% in 0.0086 secs.
50% in 0.0113 secs.
75% in 0.0146 secs.
90% in 0.0209 secs.
95% in 0.0272 secs.
99% in 0.0344 secs.
Example of a case with authentication:
% ./benchmark --user=u1:p range k1
bench with linearizable range
10000 / 10000 Booooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%2m11s
Summary:
Total: 131.4923 secs.
Slowest: 0.1637 secs.
Fastest: 0.0065 secs.
Average: 0.0131 secs.
Stddev: 0.0070 secs.
Requests/sec: 76.0501
Response time histogram:
0.006 [1] |
0.022 [9075] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.038 [875] |∎∎∎
0.054 [36] |
0.069 [5] |
0.085 [1] |
0.101 [1] |
0.117 [0] |
0.132 [0] |
0.148 [5] |
0.164 [1] |
Latency distribution:
10% in 0.0076 secs.
25% in 0.0087 secs.
50% in 0.0114 secs.
75% in 0.0150 secs.
90% in 0.0215 secs.
95% in 0.0272 secs.
99% in 0.0347 secs.
It seems that the current auth mechanism does not introduce visible overhead.
This commit lets etcdserver skip needless log entry applying. If the
result of applying a log entry isn't required by the node (the client that
issued the request isn't talking to this node) and the operation has no
side effects, applying can be skipped.
This helps reduce disk I/O on followers and is useful for
clusters that process many serializable gets.
The old behavior would retry set and delete even if there was an error. This
could lead to the client returning an error for deleting twice, instead
of returning an error for an indeterminate state.
Fixes #5832
Before
2016-07-01 14:57:50.927170 I | api: enabled capabilities for version 3.0.0
After
2016-07-01 14:57:50.927170 I | api: enabled capabilities for version 3.0
Now users can filter events by type. The API is also extensible.
It might make sense for the proxy to filter out events based on
more expensive/customized filters.
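With clientv3 this surfaces as watch options, e.g. (a sketch, assuming an
initialized client `cli`):

// watch "foo" but filter out PUT events on the server side
ch := cli.Watch(context.Background(), "foo", clientv3.WithFilterPut())
for resp := range ch {
	for _, ev := range resp.Events {
		fmt.Printf("%s %q : %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
	}
}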
Bootstrap another machine and use the [boom HTTP benchmark tool][boom] to send requests to each etcd member. Check the [benchmark hacking guide][hack-benchmark] for detailed instructions.
Bootstrap another machine and use the [hey HTTP benchmark tool][hey] to send requests to each etcd member. Check the [benchmark hacking guide][hack-benchmark] for detailed instructions.
## Performance
@@ -48,5 +48,5 @@ Bootstrap another machine and use the [boom HTTP benchmark tool][boom] to send r
Bootstrap another machine, outside of the etcd cluster, and run the [`boom` HTTP benchmark tool](https://github.com/rakyll/boom) with a connection reuse patch to send requests to each etcd cluster member. See the [benchmark instructions](../../hack/benchmark/) for the patch and the steps to reproduce our procedures.
Bootstrap another machine, outside of the etcd cluster, and run the [`hey` HTTP benchmark tool](https://github.com/rakyll/hey) with a connection reuse patch to send requests to each etcd cluster member. See the [benchmark instructions](../../hack/benchmark/) for the patch and the steps to reproduce our procedures.
The performance is calculated through results of 100 benchmark rounds.
@@ -66,4 +66,4 @@ The performance is calulated through results of 100 benchmark rounds.
- Write QPS to cluster leaders seems to be increased by a small margin. This is because the main loop and entry apply loops were decoupled in the etcd raft logic, eliminating several blocks between them.
- Write QPS to all members seems to be increased by a significant margin, because followers now receive the latest commit index sooner, and commit proposals more quickly.
@@ -24,7 +24,7 @@ Also, we use 3 etcd 2.1.0 alpha-stage members to form cluster to get base perfor
## Testing
Bootstrap another machine and use the [boom HTTP benchmark tool][boom] to send requests to each etcd member. Check the [benchmark hacking guide][hack-benchmark] for detailed instructions.
Bootstrap another machine and use the [hey HTTP benchmark tool][hey] to send requests to each etcd member. Check the [benchmark hacking guide][hack-benchmark] for detailed instructions.
## Performance
@@ -66,7 +66,7 @@ Bootstrap another machine and use the [boom HTTP benchmark tool][boom] to send r
- write QPS to all servers is increased by 30~80% because follower could receive latest commit index earlier and commit proposals faster.
| LeaseGrant | LeaseGrantRequest | LeaseGrantResponse | LeaseGrant creates a lease which expires if the server does not receive a keepAlive within a given time to live period. All keys attached to the lease will be expired and deleted if the lease expires. Each expired key generates a delete event in the event history. |
| LeaseRevoke | LeaseRevokeRequest | LeaseRevokeResponse | LeaseRevoke revokes a lease. All keys attached to the lease will expire and be deleted. |
| LeaseKeepAlive | LeaseKeepAliveRequest | LeaseKeepAliveResponse | LeaseKeepAlive keeps the lease alive by streaming keep alive requests from the client to the server and streaming keep alive responses from the server to the client. |
| key | key is the first key to delete in the range. | bytes |
| range_end | range_end is the key following the last key to delete for the range [key, range_end). If range_end is not given, the range is defined to contain only the key argument. If range_end is '\0', the range is all keys greater than or equal to the key argument. | bytes |
| range_end | range_end is the key following the last key to delete for the range [key, range_end). If range_end is not given, the range is defined to contain only the key argument. If range_end is one bit larger than the given key, then the range is all the all keys with the prefix (the given key). If range_end is '\0', the range is all keys greater than or equal to the key argument. | bytes |
| prev_kv | If prev_kv is set, etcd gets the previous key-value pairs before deleting it. The previous key-value pairs will be returned in the delete response. | bool |
| serializable | serializable sets the range request to use serializable member-local reads. Range requests are linearizable by default; linearizable requests have higher latency and lower throughput than serializable requests but reflect the current consensus of the cluster. For better performance, in exchange for possible stale reads, a serializable range request is served locally without needing to reach consensus with other nodes in the cluster. | bool |
| keys_only | keys_only when set returns only the keys and not the values. | bool |
| count_only | count_only when set returns only the count of the keys in the range. | bool |
| min_mod_revision | min_mod_revision is the lower bound for returned key mod revisions; all keys with lesser mod revisions will be filtered away. | int64 |
| max_mod_revision | max_mod_revision is the upper bound for returned key mod revisions; all keys with greater mod revisions will be filtered away. | int64 |
| min_create_revision | min_create_revision is the lower bound for returned key create revisions; all keys with lesser create revisions will be filtered away. | int64 |
| max_create_revision | max_create_revision is the upper bound for returned key create revisions; all keys with greater create revisions will be filtered away. | int64 |
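For example, the "one bit larger" prefix end can be computed like this (a sketch mirroring the clientv3 helper):

```go
// prefixRangeEnd returns the range_end that makes [prefix, range_end)
// cover exactly the keys starting with prefix.
func prefixRangeEnd(prefix []byte) []byte {
	end := make([]byte, len(prefix))
	copy(end, prefix)
	for i := len(end) - 1; i >= 0; i-- {
		if end[i] < 0xff {
			end[i]++
			return end[:i+1]
		}
	}
	// prefix is all 0xff bytes; '\0' means "to the end of the keyspace"
	return []byte{0}
}
```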
@@ -736,9 +762,10 @@ From google paxosdb paper: Our implementation hinges around a powerful primitive
| Field | Description | Type |
| ----- | ----------- | ---- |
| key | key is the key to register for watching. | bytes |
| range_end | range_end is the end of the range [key, range_end) to watch. If range_end is not given, only the key argument is watched. If range_end is equal to '\0', all keys greater than or equal to the key argument are watched. | bytes |
| range_end | range_end is the end of the range [key, range_end) to watch. If range_end is not given, only the key argument is watched. If range_end is equal to '\0', all keys greater than or equal to the key argument are watched. If the range_end is one bit larger than the given key, then all keys with the prefix (the given key) will be watched. | bytes |
| start_revision | start_revision is an optional revision to watch from (inclusive). No start_revision is "now". | int64 |
| progress_notify | progress_notify is set so that the etcd server will periodically send a WatchResponse with no events to the new watcher if there are no recent events. It is useful when clients wish to recover a disconnected watcher starting from a recent known revision. The etcd server may decide how often it will send notifications based on current load. | bool |
| filters | filters filter the events at server side before it sends back to the watcher (NOPUT filters out put events; NODELETE filters out delete events). | (slice of) FilterType |
| prev_kv | If prev_kv is set, created watcher gets the previous KV before the event happens. If the previous KV is already compacted, nothing will be returned. | bool |
@@ -798,6 +825,22 @@ From google paxosdb paper: Our implementation hinges around a powerful primitive
"summary":"Put puts the given key into the key-value store.\nA put request increments the revision of the key-value store\nand generates one event in the event history.",
@@ -951,7 +978,8 @@
"enum":[
"EQUAL",
"GREATER",
"LESS"
"LESS",
"NOT_EQUAL"
],
"default":"EQUAL"
},
@@ -993,6 +1021,15 @@
],
"default":"KEY"
},
"WatchCreateRequestFilterType":{
"type":"string",
"enum":[
"NOPUT",
"NODELETE"
],
"default":"NOPUT",
"description":"- NOPUT: filter out put event.\n - NODELETE: filter out delete event."
},
"authpbPermission":{
"type":"object",
"properties":{
@@ -1482,7 +1519,7 @@
"range_end":{
"type":"string",
"format":"byte",
"description":"range_end is the key following the last key to delete for the range [key, range_end).\nIf range_end is not given, the range is defined to contain only the key argument.\nIf range_end is '\\0', the range is all keys greater than or equal to the key argument."
"description":"range_end is the key following the last key to delete for the range [key, range_end).\nIf range_end is not given, the range is defined to contain only the key argument.\nIf range_end is one bit larger than the given key, then the range is all\nthe all keys with the prefix (the given key).\nIf range_end is '\\0', the range is all keys greater than or equal to the key argument."
}
}
},
@@ -1605,6 +1642,52 @@
}
}
},
"etcdserverpbLeaseTimeToLiveRequest":{
"type":"object",
"properties":{
"ID":{
"type":"string",
"format":"int64",
"description":"ID is the lease ID for the lease."
},
"keys":{
"type":"boolean",
"format":"boolean",
"description":"keys is true to query all the keys attached to this lease."
}
}
},
"etcdserverpbLeaseTimeToLiveResponse":{
"type":"object",
"properties":{
"ID":{
"type":"string",
"format":"int64",
"description":"ID is the lease ID from the keep alive request."
},
"TTL":{
"type":"string",
"format":"int64",
"description":"TTL is the remaining TTL in seconds for the lease; the lease will expire in under TTL+1 seconds."
},
"grantedTTL":{
"type":"string",
"format":"int64",
"description":"GrantedTTL is the initial granted time in seconds upon lease creation/renewal."
},
"header":{
"$ref":"#/definitions/etcdserverpbResponseHeader"
},
"keys":{
"type":"array",
"items":{
"type":"string",
"format":"byte"
},
"description":"Keys is the list of keys attached to this lease."
}
}
},
"etcdserverpbMember":{
"type":"object",
"properties":{
@@ -1783,6 +1866,26 @@
"format":"int64",
"description":"limit is a limit on the number of keys returned for the request."
},
"max_create_revision":{
"type":"string",
"format":"int64",
"description":"max_create_revision is the upper bound for returned key create revisions; all keys with\ngreater create revisions will be filtered away."
},
"max_mod_revision":{
"type":"string",
"format":"int64",
"description":"max_mod_revision is the upper bound for returned key mod revisions; all keys with\ngreater mod revisions will be filtered away."
},
"min_create_revision":{
"type":"string",
"format":"int64",
"description":"min_create_revision is the lower bound for returned key create revisions; all keys with\nlesser create trevisions will be filtered away."
},
"min_mod_revision":{
"type":"string",
"format":"int64",
"description":"min_mod_revision is the lower bound for returned key mod revisions; all keys with\nlesser mod revisions will be filtered away."
"description":"filters filter the events at server side before it sends back to the watcher."
},
"key":{
"type":"string",
"format":"byte",
@@ -2022,7 +2132,7 @@
"range_end":{
"type":"string",
"format":"byte",
"description":"range_end is the end of the range [key, range_end) to watch. If range_end is not given,\nonly the key argument is watched. If range_end is equal to '\\0', all keys greater than\nor equal to the key argument are watched."
"description":"range_end is the end of the range [key, range_end) to watch. If range_end is not given,\nonly the key argument is watched. If range_end is equal to '\\0', all keys greater than\nor equal to the key argument are watched.\nIf the range_end is one bit larger than the given key,\nthen all keys with the prefix (the given key) will be watched."
etcd provides a gRPC resolver to support an alternative name system that fetches endpoints from etcd for discovering gRPC services. The underlying mechanism is based on watching updates to keys prefixed with the service name.
## Using etcd discovery with go-grpc
The etcd client provides a gRPC resolver for resolving gRPC endpoints with an etcd backend. The resolver is initialized with an etcd client and given a target for resolution:
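A sketch of that initialization (the endpoint URL and service target are example values):

```go
import (
	"github.com/coreos/etcd/clientv3"
	etcdnaming "github.com/coreos/etcd/clientv3/naming"
	"google.golang.org/grpc"
)

cli, err := clientv3.NewFromURL("http://localhost:2379")
if err != nil {
	// handle error
}
r := &etcdnaming.GRPCResolver{Client: cli}
b := grpc.RoundRobin(r)
conn, err := grpc.Dial("my-service", grpc.WithBalancer(b))
```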
The etcd resolver treats all keys under the prefix of the resolution target following a "/" (e.g., "my-service/") with JSON-encoded go-grpc `naming.Update` values as potential service endpoints. Endpoints are added to the service by creating new keys and removed from the service by deleting keys.
### Adding an endpoint
New endpoints can be added to the service through `etcdctl`:
```sh
ETCDCTL_API=3 etcdctl put my-service/1.2.3.4 '{"Addr":"1.2.3.4","Metadata":"..."}'
```
The etcd client's `GRPCResolver.Update` method can also register new endpoints with a key matching the `Addr`:
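A sketch, assuming an initialized clientv3 client `cli` and the go-grpc `naming` package:

```go
r := &etcdnaming.GRPCResolver{Client: cli}
err := r.Update(context.TODO(), "my-service", naming.Update{Op: naming.Add, Addr: "1.2.3.4", Metadata: "..."})
```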
Registering an endpoint with a lease ensures that if the host can't maintain a keepalive heartbeat (e.g., its machine fails), it will be removed from the service:
```sh
lease=`ETCDCTL_API=3 etcdctl lease grant 5| cut -f2 -d' '`
ETCDCTL_API=3 etcdctl put --lease=$lease my-service/1.2.3.4 '{"Addr":"1.2.3.4","Metadata":"..."}'
@@ -4,28 +4,51 @@ Users mostly interact with etcd by putting or getting the value of a key. This s
By default, etcdctl talks to the etcd server with the v2 API for backward compatibility. For etcdctl to speak to etcd using the v3 API, the API version must be set to version 3 via the `ETCDCTL_API` environment variable.
``` bash
```bash
export ETCDCTL_API=3
```
## Find versions
etcdctl version and Server API version can be useful in finding the appropriate commands to be used for performing various operations on etcd.
Here is the command to find the versions:
```bash
$ etcdctl version
etcdctl version: 3.1.0-alpha.0+git
API version: 3.1
```
## Write a key
Applications store keys into the etcd cluster by writing to keys. Every stored key is replicated to all etcd cluster members through the Raft protocol to achieve consistency and reliability.
Here is the command to set the value of key `foo` to `bar`:
``` bash
```bash
$ etcdctl put foo bar
OK
```
Also, a key can be set for a specified interval of time by attaching a lease to it.
Here is the command to set the value of key `foo1` to `bar1` for 10s:
```bash
$ etcdctl put foo1 bar1 --lease=1234abcd
OK
```
Note: The lease ID `1234abcd` in the above command refers to the ID returned on creating the lease of 10s. This ID can then be attached to the key.
## Read keys
Applications can read values of keys from an etcd cluster. Queries may read a single key, or a range of keys.
Suppose the etcd cluster has stored the following keys:
```
```bash
foo = bar
foo1 = bar1
foo3 = bar3
@@ -39,6 +62,21 @@ foo
bar
```
Here is the command to read the value of key `foo` in hex format:
```bash
$ etcdctl get foo --hex
\x66\x6f\x6f # Key
\x62\x61\x72 # Value
```
Here is the command to read only the value of key `foo`:
```bash
$ etcdctl get foo --print-value-only
bar
```
Here is the command to range over the keys from `foo` to `foo9`:
```bash
@@ -51,6 +89,16 @@ foo3
bar3
```
Here is the command to range over the keys from `foo` to `foo9` limiting the number of results to 2:
```bash
$ etcdctl get foo foo9 --limit 2
foo
bar
foo1
bar1
```
## Read past version of keys
Applications may want to read superseded versions of a key. For example, an application may wish to roll back to an old configuration by accessing an earlier version of a key. Alternatively, an application may want a consistent view over multiple keys through multiple requests by accessing key history.
@@ -58,11 +106,11 @@ Since every modification to the etcd cluster key-value store increments the glob
Suppose an etcd cluster already has the following keys:
``` bash
$ etcdctl put foo bar # revision = 2
$ etcdctl put foo1 bar1 # revision = 3
$ etcdctl put foo bar_new # revision = 4
$ etcdctl put foo1 bar1_new # revision = 5
```bash
foo = bar # revision = 2
foo1 = bar1 # revision = 3
foo = bar_new # revision = 4
foo1 = bar1_new # revision = 5
```
Here is an example of accessing the past versions of keys:
@@ -93,10 +141,46 @@ bar
$ etcdctl get --rev=1 foo foo9 # access the versions of keys at revision 1
```
## Read keys which are greater than or equal to the byte value of the specified key
Applications may want to read keys which are greater than or equal to the byte value of the specified key.
Suppose an etcd cluster already has the following keys:
```bash
a = 123
b = 456
z = 789
```
Here is the command to read keys which are greater than or equal to the byte value of key `b`:
```bash
$ etcdctl get --from-key b
b
456
z
789
```
## Delete keys
Applications can delete a key or a range of keys from an etcd cluster.
Suppose an etcd cluster already has the following keys:
```bash
foo = bar
foo1 = bar1
foo3 = bar3
zoo = val
zoo1 = val1
zoo2 = val2
a = 123
b = 456
z = 789
```
Here is the command to delete key `foo`:
```bash
@@ -111,6 +195,29 @@ $ etcdctl del foo foo9
2 # two keys are deleted
```
Here is the command to delete key `zoo` with the deleted key value pair returned:
```bash
$ etcdctl del --prev-kv zoo
1 # one key is deleted
zoo # deleted key
val # the value of the deleted key
```
Here is the command to delete keys having prefix as `zoo`:
```bash
$ etcdctl del --prefix zoo
2 # two keys are deleted
```
Here is the command to delete keys which are greater than or equal to the byte value of key `b`:
```bash
$ etcdctl del --from-key b
2 # two keys are deleted
```
## Watch key changes
Applications can watch on a key or a range of keys to monitor for any updates.
@@ -118,38 +225,86 @@ Applications can watch on a key or a range of keys to monitor for any updates.
Here is the command to watch on key `foo`:
```bash
$ etcdctl watch foo
# in another terminal: etcdctl put foo bar
PUT
foo
bar
```
Here is the command to watch on key `foo` in hex format:
```bash
$ etcdctl watch foo --hex
# in another terminal: etcdctl put foo bar
PUT
\x66\x6f\x6f # Key
\x62\x61\x72 # Value
```
Here is the command to watch on a range key from `foo` to `foo9`:
```bash
$ etcdctl watch foo foo9
# in another terminal: etcdctl put foo bar
PUT
foo
bar
# in another terminal: etcdctl put foo1 bar1
PUT
foo1
bar1
```
Here is the command to watch on keys having prefix `foo`:
```bash
$ etcdctl watch --prefix foo
# in another terminal: etcdctl put foo bar
PUT
foo
bar
# in another terminal: etcdctl put fooz1 barz1
PUT
fooz1
barz1
```
Here is the command to watch on multiple keys `foo` and `zoo`:
```bash
$ etcdctl watch -i
$ watch foo
$ watch zoo
# in another terminal: etcdctl put foo bar
PUT
foo
bar
# in another terminal: etcdctl put zoo val
PUT
zoo
val
```
## Watch historical changes of keys
Applications may want to watch for historical changes of keys in etcd. For example, an application may wish to receive all the modifications of a key; if the application stays connected to etcd, then `watch` is good enough. However, if the application or etcd fails, a change may happen during the failure, and the application will not receive the update in real time. To guarantee the update is delivered, the application must be able to watch for historical changes to keys. To do this, an application can specify a historical revision on a watch, just like reading past version of keys.
Suppose we finished the following sequence of operations:
``` bash
etcdctl put foo bar # revision = 2
etcdctl put foo1 bar1 # revision = 3
etcdctl put foo bar_new # revision = 4
etcdctl put foo1 bar1_new # revision = 5
```bash
$ etcdctl put foo bar # revision = 2
OK
$ etcdctl put foo1 bar1 # revision = 3
OK
$ etcdctl put foo bar_new # revision = 4
OK
$ etcdctl put foo1 bar1_new # revision = 5
OK
```
Here is an example to watch the historical changes:
```bash
# watch for changes on key `foo` since revision 2
$ etcdctl watch --rev=2 foo
@@ -159,7 +314,9 @@ bar
PUT
foo
bar_new
```
```bash
# watch for changes on key `foo` since revision 3
$ etcdctl watch --rev=3 foo
PUT
@@ -167,6 +324,19 @@ foo
bar_new
```
Here is an example to watch only from the last historical change:
```bash
# watch for changes on key `foo` and return last revision value along with modified value
$ etcdctl watch --prev-kv foo
# in another terminal: etcdctl put foo bar_latest
PUT
foo # key
bar_new # last value of foo key before modification
foo # key
bar_latest # value of foo key after modification
```
## Compacted revisions
As we mentioned, etcd keeps revisions so that applications can read past versions of keys. However, to avoid accumulating an unbounded amount of history, it is important to compact past revisions. After compacting, etcd removes historical revisions, releasing resources for future use. All superseded data with revisions before the compacted revision will be unavailable.
@@ -182,13 +352,20 @@ $ etcdctl get --rev=4 foo
Error: rpc error: code = 11 desc = etcdserver: mvcc: required revision has been compacted
```
Note: The current revision of the etcd server can be found by using the get command on any key (existent or non-existent) in json format. An example is shown below for mykey, which does not exist in the etcd server:
Applications can grant leases for keys from an etcd cluster. When a key is attached to a lease, its lifetime is bound to the lease's lifetime which in turn is governed by a time-to-live (TTL). Each lease has a minimum time-to-live (TTL) value specified by the application at grant time. The lease's actual TTL value is at least the minimum TTL and is chosen by the etcd cluster. Once a lease's TTL elapses, the lease expires and all attached keys are deleted.
Here is the command to grant a lease:
```
```bash
# grant a lease with 10 second TTL
$ etcdctl lease grant 10
lease 32695410dcc0ca06 granted with TTL(10s)
@@ -204,7 +381,7 @@ Applications revoke leases by lease ID. Revoking a lease deletes all of its atta
Suppose we finished the following sequence of operations:
```
```bash
$ etcdctl lease grant 10
lease 32695410dcc0ca06 granted with TTL(10s)
$ etcdctl put --lease=32695410dcc0ca06 foo bar
@@ -213,7 +390,7 @@ OK
Here is the command to revoke the same lease:
```
```bash
$ etcdctl lease revoke 32695410dcc0ca06
lease 32695410dcc0ca06 revoked
@@ -227,17 +404,54 @@ Applications can keep a lease alive by refreshing its TTL so it does not expire.
Suppose we finished the following sequence of operations:
```
```bash
$ etcdctl lease grant 10
lease 32695410dcc0ca06 granted with TTL(10s)
```
Here is the command to keep the same lease alive:
```
$ etcdctl lease keep-alive 32695410dcc0ca0
lease 32695410dcc0ca0 keepalived with TTL(100)
lease 32695410dcc0ca0 keepalived with TTL(100)
lease 32695410dcc0ca0 keepalived with TTL(100)
```bash
$ etcdctl lease keep-alive 32695410dcc0ca06
lease 32695410dcc0ca06 keepalived with TTL(100)
lease 32695410dcc0ca06 keepalived with TTL(100)
lease 32695410dcc0ca06 keepalived with TTL(100)
...
```
## Get lease information
Applications may want to know information about a lease, so that they can renew it, or check whether the lease still exists or has expired. Applications may also want to know the keys to which a particular lease is attached.
Suppose we finished the following sequence of operations:
```bash
# grant a lease with 500 second TTL
$ etcdctl lease grant 500
lease 694d5765fc71500b granted with TTL(500s)
# attach key zoo1 to lease 694d5765fc71500b
$ etcdctl put zoo1 val1 --lease=694d5765fc71500b
OK
# attach key zoo2 to lease 694d5765fc71500b
$ etcdctl put zoo2 val2 --lease=694d5765fc71500b
OK
```
Here is the command to get information about the lease:
```bash
$ etcdctl lease timetolive 694d5765fc71500b
lease 694d5765fc71500b granted with TTL(500s), remaining(258s)
```
Here is the command to get information about the lease along with the keys attached with the lease:
A Procfile is provided to easily set up a local multi-member cluster. Start a multi-member cluster with a few commands:
A `Procfile` at the base of this git repo is provided to easily set up a local multi-member cluster. To start a multi-member cluster go to the root of an etcd source tree and run:
```
# install the goreman program to control Procfile-based applications.
@@ -37,7 +37,7 @@ $ goreman -f Procfile start
...
```
The started members listen on `localhost:12379`, `localhost:22379`, and `localhost:32379` for client requests respectively.
The started members listen on `localhost:2379`, `localhost:22379`, and `localhost:32379` for client requests respectively.
To interact with the started cluster by using etcdctl:
@@ -49,12 +49,12 @@ $ etcdctl --write-out=table --endpoints=localhost:12379 member list
@@ -31,8 +31,8 @@ All releases version numbers follow the format of [semantic versioning 2.0.0](ht
## Write release note
- Write introduction for the new release. For example, what major bug we fix, what new features we introduce or what performance improvement we make.
- Write changelog for the last release. ChangeLog should be straightforward and easy to understand for the end-user.
- Put `[GH XXXX]` at the head of the change line to reference the Pull Request that introduces the change. Moreover, add a link on it to jump to the Pull Request.
- Find PRs with `release-note` label and explain them in `NEWS` file, as a straightforward summary of changes for end-users.
## Tag version
@@ -47,7 +47,7 @@ All releases version numbers follow the format of [semantic versioning 2.0.0](ht
## Build release binaries and images
- Ensure `actool` is available, or install it through `go get github.com/appc/spec/actool`.
@@ -14,17 +14,21 @@ The easiest way to get started using etcd as a distributed key-value store is to
- [Interacting with etcd][interacting]
- [API references][api_ref]
- [gRPC gateway][api_grpc_gateway]
- [gRPC naming and discovery][grpc_naming]
- [Embedding etcd][embed_etcd]
- [Experimental features and APIs][experimental]
## Operating etcd clusters
Administrators who need to create reliable and scalable key-value stores for the developers they support should begin with a [cluster on multiple machines][clustering].
- [Setting up etcd clusters][clustering]
- [Setting up etcd gateways][gateway]
- [Setting up etcd gRPC proxy (pre-alpha)][grpc_proxy]
This document defines the various terms used in etcd documentation, command line and source code.
## Alarm
The etcd server raises an alarm whenever the cluster needs operator intervention to remain reliable.
## Authentication
Authentication manages user access permissions for etcd resources.
## Client
A client connects to the etcd cluster to issue service requests such as fetching key-value pairs, writing data, or watching for updates.
## Cluster
Cluster consists of several members. The node in each member follows raft consensus protocol to replicate logs. Cluster receives proposals from members, commits them and applies them to the local store.
## Compaction
Compaction discards all etcd event history and superseded keys prior to a given revision. It is used to reclaim storage space in the etcd backend database.
## Election
The etcd cluster holds elections among its members to choose a leader as part of the raft consensus protocol.
## Endpoint
A URL pointing to an etcd service or resource.
## Key
A user-defined identifier for storing and retrieving user-defined values in etcd.
## Key range
A set of keys containing either an individual key, a lexical interval for all x such that a < x <= b, or all keys greater than a given key.
## Member
Member is an instance of etcd. It hosts a node, and provides service to clients.
## Node
Node is an instance of raft state machine. It has a unique identification, and records other nodes' progress internally when it is the leader.
| Name | Description | Type |
|------|-------------|------|
| peer_sent_bytes_total | The total number of bytes sent to the peer with ID `To`. | Counter(To) |
| peer_received_bytes_total | The total number of bytes received from the peer with ID `From`. | Counter(From) |
| peer_sent_failures_total | The total number of send failures from the peer with ID `To`. | Counter(To) |
| peer_received_failures_total | The total number of receive failures from the peer with ID `From`. | Counter(From) |
| peer_round_trip_time_seconds | Round-Trip-Time histogram between peers. | Histogram(To) |
| client_grpc_sent_bytes_total | The total number of bytes sent to grpc clients. | Counter |
| client_grpc_received_bytes_total| The total number of bytes received from grpc clients. | Counter |
All these metrics are prefixed with `etcd_network_`.
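To spot-check these counters on a running member (assuming the default client port of 2379), query the member's Prometheus `/metrics` endpoint directly:
```sh
$ curl -s http://localhost:2379/metrics | grep etcd_network_
```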
### gRPC requests
These metrics describe the requests served by a specific etcd member: total received requests, total failed requests, and processing latency. They are useful for tracking user-generated traffic hitting the etcd cluster.
If the cluster needs encrypted communication but does not require authenticated connections, etcd can be configured to automatically generate its keys. On initialization, each member creates its own set of keys based on its advertised IP addresses and hosts.
On each machine, etcd would be started with these flags:
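```sh
# illustrative addresses; --auto-tls secures client connections and
# --peer-auto-tls secures peer connections with self-generated certificates
$ etcd --name infra0 \
  --auto-tls \
  --advertise-client-urls=https://10.0.1.10:2379 \
  --listen-client-urls=https://10.0.1.10:2379 \
  --peer-auto-tls \
  --initial-advertise-peer-urls=https://10.0.1.10:2380 \
  --listen-peer-urls=https://10.0.1.10:2380
```
The exact URL flags depend on the deployment; the key additions are `--auto-tls` and `--peer-auto-tls`.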
In a number of cases, the IPs of the cluster peers may not be known ahead of time. This is common when utilizing cloud providers or when the network uses DHCP. In these cases, rather than specifying a static configuration, use an existing etcd cluster to bootstrap a new one. This process is called "discovery".
There are two methods that can be used for discovery:
### etcd discovery
To better understand the design of the discovery service protocol, we suggest reading the discovery service protocol [documentation][discovery-proto].
#### Lifetime of a discovery URL
A discovery URL identifies a unique etcd cluster. Instead of reusing an existing discovery URL, each etcd instance shares a new discovery URL to bootstrap the new cluster.
Moreover, discovery URLs should ONLY be used for the initial bootstrapping of a cluster. To change cluster membership after the cluster is already running, see the [runtime reconfiguration][runtime-conf] guide.
#### Custom etcd discovery service
Discovery uses an existing cluster to bootstrap itself. If using a private etcd cluster, create a URL like so:
```
$ curl -X PUT https://myetcd.local/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83/_config/size -d value=3
```
**Each member must have a different name flag specified or else discovery will fail due to duplicated names. `Hostname` or `machine-id` can be a good choice.**
Now we start etcd with the relevant flags for each member:
```
$ etcd --name infra2 \
  ...
  --listen-peer-urls http://10.0.1.12:2380
```
### Gateway
etcd gateway is a simple TCP proxy that forwards network data to the etcd cluster. Please read [gateway guide] for more information.
### Proxy
When the `--proxy` flag is set, etcd runs in [proxy mode][proxy]. This proxy mode only supports the etcd v2 API; there are no plans to support the v3 API. Instead, for v3 API support, there will be a new proxy with enhanced features following the etcd 3.0 release.
Production clusters which refer to peers by DNS name known to the local resolver must mount the [host's DNS configuration](https://coreos.com/kubernetes/docs/latest/kubelet-wrapper.html#customizing-rkt-options).
## Docker
In order to expose the etcd API to clients outside of Docker host, use the host IP address of the container. Please see [`docker inspect`](https://docs.docker.com/engine/reference/commandline/inspect) for more detail on how to get the IP address. Alternatively, specify `--net=host` flag to `docker run` command to skip placing the container inside of a separate network stack.
To provision a 3 node etcd cluster on bare-metal, you might find the examples in the [baremetal repo](https://github.com/coreos/coreos-baremetal/tree/master/examples) useful.
etcd gateway is a simple TCP proxy that forwards network data to the etcd cluster. The gateway is stateless and transparent; it neither inspects client requests nor interferes with cluster responses.
The gateway supports multiple etcd server endpoints. When the gateway starts, it randomly picks one etcd server endpoint and forwards all requests to that endpoint. This endpoint serves all requests until the gateway detects a network failure. If the gateway detects an endpoint failure, it will switch to a different endpoint, if available, to hide failures from its clients. Other retry policies, such as weighted round-robin, may be supported in the future.
## When to use etcd gateway
Every application that accesses etcd must first have the address of an etcd cluster client endpoint. If multiple applications on the same server access the same etcd cluster, every application still needs to know the advertised client endpoints of the etcd cluster. If the etcd cluster is reconfigured to have different endpoints, every application may also need to update its endpoint list. This wide-scale reconfiguration is both tedious and error prone.
etcd gateway solves this problem by serving as a stable local endpoint. A typical etcd gateway configuration has
each machine running a gateway listening on a local address and every etcd application connecting to its local gateway. The upshot is only the gateway needs to update its endpoints instead of updating each and every application.
In summary, to automatically propagate cluster endpoint changes, the etcd gateway runs on every machine serving multiple applications that access the same etcd cluster.
## When not to use etcd gateway
- Improving performance
The gateway is not designed for improving etcd cluster performance. It does not provide caching, watch coalescing or batching. The etcd team is developing a caching proxy designed for improving cluster scalability.
- Running on a cluster management system
Advanced cluster management systems like Kubernetes natively support service discovery. Applications can access an etcd cluster with a DNS name or a virtual IP address managed by the system. For example, kube-proxy is equivalent to etcd gateway.
## Start etcd gateway
Consider an etcd cluster with the following static endpoints:
|Name|Address|Hostname|
|------|---------|------------------|
|infra0|10.0.1.10|infra0.example.com|
|infra1|10.0.1.11|infra1.example.com|
|infra2|10.0.1.12|infra2.example.com|
Start the etcd gateway to use these static endpoints with the command:
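```bash
$ etcd gateway start --endpoints=infra0.example.com:2379,infra1.example.com:2379,infra2.example.com:2379
```
This is a minimal invocation using the hostnames from the table above; the gateway's local listen address is left at its default here.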
*This is a pre-alpha feature; we are looking for early feedback.*
The gRPC proxy is a stateless etcd reverse proxy operating at the gRPC layer (L7). The proxy is designed to reduce the total processing load on the core etcd cluster. For horizontal scalability, it coalesces watch and lease API requests. To protect the cluster against abusive clients, it caches key range requests.
The gRPC proxy supports multiple etcd server endpoints. When the proxy starts, it randomly picks one etcd server endpoint to use. This endpoint serves all requests until the proxy detects an endpoint failure. If the gRPC proxy detects an endpoint failure, it switches to a different endpoint, if available, to hide failures from its clients. Other retry policies, such as weighted round-robin, may be supported in the future.
## Scalable watch API
The gRPC proxy coalesces multiple client watchers (`c-watchers`) on the same key or range into a single watcher (`s-watcher`) connected to an etcd server. The proxy broadcasts all events from the `s-watcher` to its `c-watchers`.
Assuming N clients watch the same key, one gRPC proxy can reduce the watch load on the etcd server from N to 1. Users can deploy multiple gRPC proxies to further distribute server load.
In the following example, three clients watch on key A. The gRPC proxy coalesces the three watchers, creating a single watcher attached to the etcd server.
```
+-------------+
| etcd server |
+------+------+
^ watch key A (s-watcher)
|
+-------+-----+
| gRPC proxy | <-------+
| | |
++-----+------+ |watch key A (c-watcher)
watch key A ^ ^ watch key A |
(c-watcher) | | (c-watcher) |
+-------+-+ ++--------+ +----+----+
| client | | client | | client |
| | | | | |
+---------+ +---------+ +---------+
```
### Limitations
To effectively coalesce multiple client watchers into a single watcher, the gRPC proxy coalesces new `c-watchers` into an existing `s-watcher` when possible. This coalesced `s-watcher` may be out of sync with the etcd server due to network delays or buffered undelivered events. When the watch revision is unspecified, the gRPC proxy does not guarantee that the `c-watcher` will start watching from the most recent store revision. For example, if a client watches from an etcd server with revision 1000, that watcher will begin at revision 1000; if a client watches from the gRPC proxy, it may begin watching from revision 990.
Similar limitations apply to cancellation. When the watcher is cancelled, the etcd server’s revision may be greater than the cancellation response revision.
These two limitations should not cause problems for most use cases. In the future, there may be additional options to force the watcher to bypass the gRPC proxy for more accurate revision responses.
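If an application needs to begin watching from an exact revision even when going through the proxy, it can pass an explicit start revision. A minimal Go sketch, assuming a connected `clientv3.Client` named `cli` and a revision taken from a prior response header:

```go
// Watch key "A" starting from a known revision rather than "now",
// so the proxy cannot silently start the watch at an older revision.
wch := cli.Watch(context.TODO(), "A", clientv3.WithRev(1000))
for wresp := range wch {
	for _, ev := range wresp.Events {
		fmt.Printf("%s %q -> %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
	}
}
```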
## Scalable lease API
TODO
## Abusive clients protection
The gRPC proxy caches responses for requests when doing so does not break consistency requirements. This can protect the etcd server from abusive clients issuing requests in tight loops.
## Start etcd gRPC proxy
Consider an etcd cluster with the following static endpoints:
|Name|Address|Hostname|
|------|---------|------------------|
|infra0|10.0.1.10|infra0.example.com|
|infra1|10.0.1.11|infra1.example.com|
|infra2|10.0.1.12|infra2.example.com|
Start the etcd gRPC proxy to use these static endpoints with the command:
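```bash
$ etcd grpc-proxy start --endpoints=infra0.example.com:2379,infra1.example.com:2379,infra2.example.com:2379 --listen-addr=127.0.0.1:23790
```
This is a minimal invocation; the listen address shown is illustrative, and clients would be pointed at `127.0.0.1:23790` instead of the cluster endpoints.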
The space quota in `etcd` ensures the cluster operates in a reliable fashion. Without a space quota, `etcd` may suffer from poor performance if the keyspace grows excessively large, or it may simply run out of storage space, leading to unpredictable cluster behavior. If the keyspace's backend database for any member exceeds the space quota, `etcd` raises a cluster-wide alarm that puts the cluster into a maintenance mode which only accepts key reads and deletes. Only after freeing enough space in the keyspace, defragmenting the backend database, and clearing the space quota alarm can the cluster resume normal operation.
By default, `etcd` sets a conservative space quota suitable for most applications, but it may be configured on the command line, in bytes:
```sh
# set a very small 16MB quota
$ etcd --quota-backend-bytes=$((16*1024*1024))
```
The space quota can be triggered with a loop:
```sh
# fill keyspace
$ while [ 1 ]; do dd if=/dev/urandom bs=1024 count=1024 | ETCDCTL_API=3 etcdctl put key || break; done
...
Error: rpc error: code = 8 desc = etcdserver: mvcc: database space exceeded
# confirm quota space is exceeded
$ ETCDCTL_API=3 etcdctl --write-out=table endpoint status
```
To avoid this problem, etcd provides an option `-strict-reconfig-check`. If this option is passed to etcd, etcd rejects reconfiguration requests if the number of started members would be less than a quorum of the reconfigured cluster.
It is recommended to enable this option. However, it is disabled by default to maintain backward compatibility.
| Architecture | Operating System | Status | Maintainers |
|--------------|------------------|--------|-------------|
| amd64 | Darwin | Experimental | etcd maintainers |
| amd64 | Linux | Stable | etcd maintainers |
| amd64 | Windows | Experimental | |
| arm64 | Linux | Experimental | @glevand |
| arm | Linux | Unstable | |
| 386 | Linux | Unstable | |
* etcd-maintainers are listed in https://github.com/coreos/etcd/blob/master/MAINTAINERS.
Experimental platforms appear to work in practice and have some platform specific code in etcd, but do not fully conform to the stable support policy. Unstable platforms have been lightly tested, even less so than experimental platforms. Unlisted architecture and operating system pairs are currently unsupported; caveat emptor.
### Supporting a new platform
For etcd to officially support a new platform as stable, a few requirements are necessary to ensure acceptable quality:
1. An "official" maintainer for the platform with clear motivation; someone must be responsible for taking care of the platform.
2. Set up CI for build; etcd must compile.
3. Set up CI for running unit tests; etcd must pass simple tests.
4. Set up CI (TravisCI, SemaphoreCI or Jenkins) for running integration tests; etcd must pass intensive tests.
5. (Optional) Set up a functional testing cluster; an etcd cluster should survive stress testing.
### 32-bit and other unsupported systems
etcd has known issues on 32-bit systems due to a bug in the Go runtime. See the [Go issue][go-issue] and [atomic package][go-atomic] for more information.
To avoid inadvertently running a possibly unstable etcd server, `etcd` on unstable or unsupported architectures will print a warning message and immediately exit if the environment variable `ETCD_UNSUPPORTED_ARCH` is not set to the target architecture.
Currently only the amd64 architecture is officially supported by `etcd`.
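For example, to acknowledge the risk and start a 32-bit build anyway (illustrative; the value must name the target architecture):
```sh
$ ETCD_UNSUPPORTED_ARCH=386 ./etcd
```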
If the etcd leader serves a large number of concurrent client requests, it may delay processing follower peer requests due to network congestion. This manifests as send buffer error messages on the follower nodes:
```
dropped MsgProp to 247ae21ff9436b2d since streamMsg's sending buffer is full
dropped MsgAppResp to 247ae21ff9436b2d since streamMsg's sending buffer is full
```
These errors may be resolved by prioritizing etcd's peer traffic over its client traffic. On Linux, peer traffic can be prioritized by using the traffic control mechanism:
```
tc qdisc add dev eth0 root handle 1: prio bands 3
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip sport 2380 0xffff flowid 1:1
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dport 2380 0xffff flowid 1:1
tc filter add dev eth0 parent 1: protocol ip prio 2 u32 match ip sport 2379 0xffff flowid 1:1
tc filter add dev eth0 parent 1: protocol ip prio 2 u32 match ip dport 2379 0xffff flowid 1:1
```
To recover from such scenarios, etcd provides functionality to backup and restore the datastore and recreate the cluster without data loss.
#### Backing up the datastore
**Note:** Windows users must stop etcd before running the backup command.
The first step of the recovery is to back up the data directory and, if stored separately, the wal directory on a functioning etcd node. To do this, use the `etcdctl backup` command, passing in the original data (and wal) directory used by etcd. For example:
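```sh
$ etcdctl backup \
  --data-dir %data_dir% \
  --backup-dir %backup_data_dir%
```
Substitute the member's actual data directory for `%data_dir%` and a destination path for `%backup_data_dir%`; if a separate wal directory is used, back it up as well.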
Now that the node is running successfully, [change its advertised peer URLs][update-a-member], as the `--force-new-cluster` option has set the peer URL to the default listening on localhost.
You can then add more nodes to the cluster and restore resiliency. See the [add a new member][add-a-member] guide for more details.
**Note:** If you are trying to restore your cluster using old failed etcd nodes, please make sure you have stopped old etcd instances and removed their old data directories specified by the data-dir configuration parameter.
See [etcdctl][etcdctl] for a simple command line client.
The easiest way to get etcd is to use one of the pre-built release binaries which are available for OSX, Linux, Windows, AppC (ACI), and Docker. Instructions for using these binaries are on the [GitHub releases page][github-release].
For those wanting to try the very latest version, you can [build the latest version of etcd][dl-build] from the `master` branch.
You will first need [*Go*](https://golang.org/) installed on your machine (version 1.6+ is required).
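A typical from-source build looks like the following, assuming `git` and Go are installed and using the repo's stock `./build` script:
```sh
$ git clone https://github.com/coreos/etcd.git
$ cd etcd
$ ./build
$ ./bin/etcd --version
```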
All development occurs on `master`, including new features and bug fixes.
Bug fixes are first targeted at `master` and subsequently ported to release branches, as described in the [branch management][branch-management] guide.
This document defines a high level roadmap for etcd development.
The dates below should not be considered authoritative, but rather indicative of the projected timeline of the project. The [milestones defined in GitHub](https://github.com/coreos/etcd/milestones) represent the most up-to-date and issue-for-issue plans.
etcd 3.0 is our current stable branch. The roadmap below outlines new features that will be added to etcd, and while subject to change, defines what the future stable branch will look like.
### etcd 3.0 (April)
- v3 API ([see also the issue tag](https://github.com/coreos/etcd/issues?utf8=%E2%9C%93&q=label%3Aarea/v3api))
- Leases
- Binary protocol
- Support a large number of watchers
- Failure guarantees documented
- Simple v3 client (golang)
- Locking
- Better disk backend
- Improved write throughput
- Support larger datasets and histories
- Simpler disaster recovery UX
- Integrated with Kubernetes
- Mirroring
### etcd 3.1 (2016-Oct)
- Stable L4 gateway
- Experimental support for scalable proxy
- Automatic leadership transfer for the rolling upgrade
The etcd client optionally exposes RPC metrics through [go-grpc-prometheus](https://github.com/grpc-ecosystem/go-grpc-prometheus). See the [examples](https://github.com/coreos/etcd/blob/master/clientv3/example_metrics_test.go).
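As an illustration, the interceptors can be wired in when constructing the client. This is a sketch only, assuming a clientv3 version whose `Config` accepts gRPC dial options:

```go
import (
	"github.com/coreos/etcd/clientv3"
	grpcprom "github.com/grpc-ecosystem/go-grpc-prometheus"
	"google.golang.org/grpc"
)

// newInstrumentedClient is a hypothetical helper: it dials the cluster with
// the go-grpc-prometheus client interceptors so RPC metrics are recorded.
func newInstrumentedClient() (*clientv3.Client, error) {
	return clientv3.New(clientv3.Config{
		Endpoints: []string{"localhost:2379"},
		DialOptions: []grpc.DialOption{
			grpc.WithUnaryInterceptor(grpcprom.UnaryClientInterceptor),
			grpc.WithStreamInterceptor(grpcprom.StreamClientInterceptor),
		},
	})
}
```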
## Examples
More code examples can be found at [GoDoc](https://godoc.org/github.com/coreos/etcd/clientv3).