status.FromError can return (nil, false). We handle the return values
in most places in the code, but not everywhere. This fixes the remaining places.
Fixes #9117
Setting only latency options is a pain since every fault must
be disabled on the command line. Instead, by default start
as a standard bridge without any fault injection.
This commit adds a new option --txn-ops to `benchmark mvcc put`. The
number given with this option is the number of keys written in a
single transaction. It is useful for checking the effect of
batching.
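
A minimal sketch of the batching idea, not the benchmark's actual code. The helper batchKeys and the parameter name txnOps are hypothetical; each returned group stands for the puts that would go into one transaction.

```go
package main

import "fmt"

// batchKeys splits keys into groups of at most txnOps keys.
// Each group stands for the puts of one transaction (hypothetical helper).
func batchKeys(keys []string, txnOps int) [][]string {
	if txnOps < 1 {
		txnOps = 1 // assumption: fall back to one put per transaction
	}
	var batches [][]string
	for len(keys) > 0 {
		n := txnOps
		if n > len(keys) {
			n = len(keys)
		}
		batches = append(batches, keys[:n])
		keys = keys[n:]
	}
	return batches
}

func main() {
	keys := []string{"k1", "k2", "k3", "k4", "k5"}
	for i, b := range batchKeys(keys, 2) {
		fmt.Printf("txn %d: %v\n", i, b)
	}
}
```

With five keys and a group size of 2, the last group holds the single leftover key.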
Persistent data should be configured on the agent side.
There is no need to specify the data-dir on the tester side.
Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
The election runner can deadlock in the atomic release().
Suppose the election runner has two clients, A and B.
If A is the leader and B is a follower, B obtains the lock
for release() and waits for A to close(nextc), which signals that
the next round is ready. However, A can only close(nextc) if it
obtains the lock for release(); hence the deadlock.
This PR removes the atomicity of validate() and release() in global.go
and gives the responsibility of locking to each runner.
FIXES #7891
etcd tester runs etcd runner as a separate binary.
It sends SIGSTOP to the runner when the tester wants to stop stressing.
It sends SIGCONT to the runner when the tester wants to start stressing.
When the tester needs to clean up, it sends SIGINT to the runner.
FIXES #7026
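
The signal protocol above can be sketched as follows. This is an illustration under assumptions, not the tester's code: the runner type and its method names are hypothetical, "sleep 5" stands in for the real runner binary, and the example is Unix-only because it relies on SIGSTOP and SIGCONT.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// runner wraps a child process that the tester controls with signals.
// The type and its method names are hypothetical.
type runner struct {
	cmd *exec.Cmd
}

// startRunner launches the runner binary as a separate process.
func startRunner(name string, args ...string) (*runner, error) {
	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return &runner{cmd: cmd}, nil
}

// pause stops stressing by suspending the process with SIGSTOP.
func (r *runner) pause() error { return r.cmd.Process.Signal(syscall.SIGSTOP) }

// resume restarts stressing by continuing the process with SIGCONT.
func (r *runner) resume() error { return r.cmd.Process.Signal(syscall.SIGCONT) }

// cleanup asks the process to exit with SIGINT and then reaps it.
func (r *runner) cleanup() error {
	if err := r.cmd.Process.Signal(syscall.SIGINT); err != nil {
		return err
	}
	// Wait returns a non-nil error for a signal-terminated child;
	// that is expected here, so it is ignored.
	_ = r.cmd.Wait()
	return nil
}

func main() {
	// "sleep 5" stands in for the real runner binary (assumption).
	r, err := startRunner("sleep", "5")
	if err != nil {
		fmt.Println("start:", err)
		return
	}
	fmt.Println("pause:", r.pause())
	fmt.Println("resume:", r.resume())
	fmt.Println("cleanup:", r.cleanup())
}
```

The process is resumed before SIGINT is sent because a stopped process does not act on SIGINT until it is continued.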
The current benchmark picks the destinations of RPCs at random.
However, this yields divergent benchmark results because
RPCs other than serializable range must be forwarded to the leader node
when a follower node receives them. This commit adds a new flag
--target-leader to avoid the problem. If the flag is passed,
benchmark always picks an endpoint of the leader node.
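
A sketch of how a leader endpoint could be selected, assuming each member reports its own ID and the ID of the leader it knows about. The memberStatus type, its fields, and the endpoints are hypothetical stand-ins, not the client API.

```go
package main

import "fmt"

// memberStatus is a hypothetical, trimmed-down view of a status reply:
// the ID of the member that answered and the ID of the leader it reports.
type memberStatus struct {
	Endpoint string
	MemberID uint64
	LeaderID uint64
}

// leaderEndpoint returns the endpoint of the member that reports itself
// as the leader. A LeaderID of 0 is treated as "no leader known".
func leaderEndpoint(statuses []memberStatus) (string, bool) {
	for _, s := range statuses {
		if s.LeaderID != 0 && s.MemberID == s.LeaderID {
			return s.Endpoint, true
		}
	}
	return "", false
}

func main() {
	statuses := []memberStatus{
		{Endpoint: "127.0.0.1:2379", MemberID: 1, LeaderID: 2},
		{Endpoint: "127.0.0.1:22379", MemberID: 2, LeaderID: 2},
		{Endpoint: "127.0.0.1:32379", MemberID: 3, LeaderID: 2},
	}
	ep, ok := leaderEndpoint(statuses)
	fmt.Println(ep, ok) // prints "127.0.0.1:22379 true"
}
```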
Pure Snapshot isolation would permit read conflicts. Change the name
from Snapshot to SerializableSnapshot to reflect that it will also
reject read conflicts.
etcd panicked, so the defrag response just blocked for "days"
because the actual 'v3rpc' path never returned.
We should catch this earlier.
ref. https://github.com/coreos/etcd/issues/7526
Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
gosimple reports suggestions while running the test script, so this PR fixes the gosimple S1019 check.
raft/node_test.go:456:40: should use make([]raftpb.Entry, 1) instead (S1019)
raft/node_test.go:457:49: should use make([]raftpb.Entry, 1) instead (S1019)
raft/node_test.go:458:43: should use make([]raftpb.Message, 1) instead (S1019)
See https://github.com/dominikh/go-tools/blob/master/cmd/gosimple/README.md#checks for more information.
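
For illustration, the two spellings that S1019 distinguishes behave identically. The entry type below is a hypothetical stand-in for raftpb.Entry so that the sketch needs no external packages.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for raftpb.Entry.
type entry struct {
	Index uint64
}

// redundantCap spells out a capacity equal to the length.
// This is the form that gosimple flags as S1019.
func redundantCap() []entry {
	return make([]entry, 1, 1)
}

// simplified omits the capacity; Go then uses the length as the capacity.
func simplified() []entry {
	return make([]entry, 1)
}

func main() {
	a, b := redundantCap(), simplified()
	fmt.Println(len(a), cap(a), len(b), cap(b)) // prints "1 1 1 1"
}
```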
The current benchmark doesn't have an option for configuring the dial timeout
of gRPC. This commit adds --dial-timeout for that purpose. It is useful
for stopping benchmarks that hang for a long time.
The functional tester sometimes times out during the compaction phase. The timeout is now calculated based on the number of entries created and deleted.
FIX #6805
Getting lease and key info through raw RPCs on rare occasions fails with errors such as EOF. This is treated as a failure and causes the tester to clean up.
However, these are just transient problems caused by temporary connection issues, which should not be considered testing failures. So we add retry logic for transient failures.
FIX #6754
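
A minimal sketch of such retry logic, not the tester's implementation. Treating only io.EOF as transient, the attempt count, and the helper names retryTransient, isTransient and flaky are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"time"
)

// isTransient reports whether err looks like a temporary connection problem.
// Treating only io.EOF as transient is an assumption made for this sketch.
func isTransient(err error) bool {
	return errors.Is(err, io.EOF)
}

// retryTransient calls fn up to attempts times, sleeping wait between tries.
// It returns immediately on success or on a non-transient error.
func retryTransient(attempts int, wait time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil || !isTransient(err) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(wait)
		}
	}
	return err
}

// flaky returns a function that fails with io.EOF for its first n calls
// and succeeds afterwards. It simulates a briefly broken connection.
func flaky(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return io.EOF
		}
		return nil
	}
}

func main() {
	fmt.Println(retryTransient(3, 10*time.Millisecond, flaky(2))) // prints "<nil>"
	fmt.Println(retryTransient(3, 10*time.Millisecond, flaky(5))) // prints "EOF"
}
```

Only when the retries are exhausted, or the error is not transient, does the caller see a failure and fall back to cleanup.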
The checkers and stressers should be composable without special cases; this
patch tries to address that while refactoring out some old cruft.
Namely,
* Single stresser/checker for a tester; built from composition
* Composite stresser via comma-separated list of stressers
* Split stressers into separate files
* Removed v2 only flags and special cases
* Rate limiter shared among key stresser and leases stresser
* Composite checker is now concurrent
* Stresser can return a Checker to check its invariants
* Each lease checker only operates on a single lease stresser
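
The composition described in the list can be sketched roughly as below. The interfaces, type names, and the stresser names "keys" and "lease" are hypothetical; the real tester's types may differ.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Checker verifies an invariant after stressing (hypothetical interface).
type Checker interface {
	Check() error
}

// Stresser generates load and may hand back a Checker for its own invariants.
type Stresser interface {
	Stress() error
	Checker() Checker // nil when there is nothing to check
}

// namedStresser is a placeholder that does no real work.
type namedStresser struct{ name string }

func (s namedStresser) Stress() error    { return nil }
func (s namedStresser) Checker() Checker { return okChecker{} }

// okChecker always passes.
type okChecker struct{}

func (okChecker) Check() error { return nil }

// compositeStresser runs its parts in order and combines their checkers.
type compositeStresser []Stresser

func (cs compositeStresser) Stress() error {
	for _, s := range cs {
		if err := s.Stress(); err != nil {
			return err
		}
	}
	return nil
}

func (cs compositeStresser) Checker() Checker {
	var cc compositeChecker
	for _, s := range cs {
		if c := s.Checker(); c != nil {
			cc = append(cc, c)
		}
	}
	return cc
}

// compositeChecker runs all checkers concurrently and returns the first
// error in list order, or nil if every checker passed.
type compositeChecker []Checker

func (cc compositeChecker) Check() error {
	errs := make([]error, len(cc))
	var wg sync.WaitGroup
	for i, c := range cc {
		wg.Add(1)
		go func(i int, c Checker) {
			defer wg.Done()
			errs[i] = c.Check() // each goroutine writes only its own slot
		}(i, c)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

// newStresser builds a single stresser from a comma-separated list of names.
func newStresser(list string) (Stresser, error) {
	var cs compositeStresser
	for _, name := range strings.Split(list, ",") {
		if name == "" {
			return nil, fmt.Errorf("empty stresser name in %q", list)
		}
		cs = append(cs, namedStresser{name: name})
	}
	return cs, nil
}

func main() {
	s, err := newStresser("keys,lease")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(s.Stress(), s.Checker().Check())
}
```

The tester then only ever holds one Stresser and one Checker, regardless of how many are listed.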
These really belong in tester code; the stressers and
checkers are higher order operations that are orchestrated
by the tester. They're not really cluster primitives.
The current tester does not clean up if any of the failure injection/recovery steps fails. If the tester fails to recover a dead node, it hangs in the next round because it keeps waiting until the cluster becomes healthy, which is impossible since a node is down. To fix this issue, we always clean up if any error happens during a round so that the cluster will be healthy for the next round.
FIX #6743
The lease stresser now generates short-lived leases that will expire before invariant checking.
This addition verifies that the expired leases are indeed being deleted on the server side.
Increasing the lease TTL ensures that leases don't expire during the hash stabilization period.
I observed that it can take a long time for the etcd cluster to become stable.
I moved the checker logic from tester to cluster so that stressers and checkers can be initialized at the same time.
This is useful because some checkers depend on stressers.
This commit adds a new option -failure-wrapper to etcd-tester. The
option receives the path of a script that is used for enabling/disabling
external fault injectors. The script is called with an option "enable"
when injection needs to be enabled (when failure.Inject() is called) and
called with "disabled" in the opposite case (when failure.Recover() is
called).
This commit adds a new option --failures to etcd-tester. The option
receives a comma-delimited argument like this:
"default,failpoints". The given arguments are interpreted as names of
failures and they are injected into an etcd cluster. Available failures
are default (default scenario in etcd-tester) and failpoints. If no
args are passed to the option (--failures=""), no failures are
injected during testing.
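
A sketch of how such a flag value could be parsed, assuming only the two failure names mentioned above are valid. The helper parseFailures is hypothetical. Note that strings.Split on an empty string yields one empty element, so the empty case needs its own branch.

```go
package main

import (
	"fmt"
	"strings"
)

// parseFailures turns the value of --failures into a list of failure names.
// An empty value means that no failures are injected.
func parseFailures(arg string) ([]string, error) {
	if arg == "" {
		return nil, nil // strings.Split("", ",") would return [""], so stop here
	}
	var names []string
	for _, name := range strings.Split(arg, ",") {
		switch name {
		case "default", "failpoints":
			names = append(names, name)
		default:
			return nil, fmt.Errorf("unknown failure %q", name)
		}
	}
	return names, nil
}

func main() {
	for _, arg := range []string{"default,failpoints", "", "bogus"} {
		names, err := parseFailures(arg)
		fmt.Printf("%q -> %v, err=%v\n", arg, names, err)
	}
}
```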
This commit decouples the stresser from the tester of
functional-tester. To do so, this commit adds a new option
--stresser to etcd-tester. The option accepts two types of stresser:
"default" and "nop". If the option is "default", etcd-tester stresses
its etcd cluster with the existing stresser. If the option is "nop",
etcd-tester does nothing for stressing.
Partially fixes https://github.com/coreos/etcd/issues/6446
The format "http://localhost:1234" causes the existing port parser to fail. Add new logic that parses the host name first and then extracts the port.
Fixes #6409
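
One way to parse the host first and then extract the port, using only the Go standard library. The helper name portOf is hypothetical, and this is a sketch rather than the parser from the patch.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// portOf extracts the port from an endpoint such as "localhost:1234" or
// "http://localhost:1234". The helper name is hypothetical.
func portOf(endpoint string) (string, error) {
	hostport := endpoint
	// Only run the URL parser when a scheme is present; url.Parse rejects
	// a bare "localhost:1234" because of the colon in the first segment.
	if strings.Contains(endpoint, "://") {
		u, err := url.Parse(endpoint)
		if err != nil {
			return "", err
		}
		hostport = u.Host
	}
	_, port, err := net.SplitHostPort(hostport)
	return port, err
}

func main() {
	for _, ep := range []string{"localhost:1234", "http://localhost:1234"} {
		port, err := portOf(ep)
		fmt.Println(ep, port, err)
	}
}
```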
1. fix failure case counting
2. match ErrClientConnClosing in stresser
3. longer timeout for set-health-key
4. fixed range for range/delete stresser
5. remove Limit in RangeRequest
This commit adds --user for auth in benchmarks. Its purpose is
measuring the overhead of authentication in the v3 API. Of course, the given
user must be granted permission on the target keys before benchmarking.
Example of a case with no authentication:
% ./benchmark range k1
bench with linearizable range
10000 / 10000 Booooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%2m10s
Summary:
Total: 130.1850 secs.
Slowest: 0.4071 secs.
Fastest: 0.0064 secs.
Average: 0.0130 secs.
Stddev: 0.0079 secs.
Requests/sec: 76.8138
Response time histogram:
0.006 [1] |
0.046 [9990] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.087 [3] |
0.127 [0] |
0.167 [3] |
0.207 [2] |
0.247 [0] |
0.287 [0] |
0.327 [0] |
0.367 [0] |
0.407 [1] |
Latency distribution:
10% in 0.0076 secs.
25% in 0.0086 secs.
50% in 0.0113 secs.
75% in 0.0146 secs.
90% in 0.0209 secs.
95% in 0.0272 secs.
99% in 0.0344 secs.
Example of a case with authentication:
% ./benchmark --user=u1:p range k1
bench with linearizable range
10000 / 10000 Booooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo! 100.00%2m11s
Summary:
Total: 131.4923 secs.
Slowest: 0.1637 secs.
Fastest: 0.0065 secs.
Average: 0.0131 secs.
Stddev: 0.0070 secs.
Requests/sec: 76.0501
Response time histogram:
0.006 [1] |
0.022 [9075] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.038 [875] |∎∎∎
0.054 [36] |
0.069 [5] |
0.085 [1] |
0.101 [1] |
0.117 [0] |
0.132 [0] |
0.148 [5] |
0.164 [1] |
Latency distribution:
10% in 0.0076 secs.
25% in 0.0087 secs.
50% in 0.0114 secs.
75% in 0.0150 secs.
90% in 0.0215 secs.
95% in 0.0272 secs.
99% in 0.0347 secs.
It seems that the current auth mechanism does not introduce visible overhead.
Fix https://github.com/coreos/etcd/issues/5573.
Currently the stresser starts at the same time as the cluster.
If the stresser is launched too early, all stressers
exit with the error 'etcdserver: not capable', which
means the cluster is not ready yet. This adds additional
error checking, so the stresser can retry.