Found 1 known vulnerability.
Vulnerability #1: GO-2022-1144
An attacker can cause excessive memory growth in a Go server
accepting HTTP/2 requests. HTTP/2 server connections contain a
cache of HTTP header keys sent by the client. While the total
number of entries in this cache is capped, an attacker sending
very large keys can cause the server to allocate approximately
64 MiB per open connection.
Call stacks in your code:
Error: tools/etcd-dump-metrics/main.go:158:5: go.etcd.io/etcd/v3/tools/etcd-dump-metrics.main calls go.etcd.io/etcd/server/v3/embed.StartEtcd, which eventually calls golang.org/x/net/http2.Server.ServeConn
Found in: golang.org/x/net/http2@v0.2.0
Fixed in: golang.org/x/net/http2@v0.4.0
More info: https://pkg.go.dev/vuln/GO-2022-1144
Error: Process completed with exit code 3.
Signed-off-by: Benjamin Wang <wachao@vmware.com>
$ govulncheck ./...
govulncheck is an experimental tool. Share feedback at https://go.dev/s/govulncheck-feedback.
Scanning for dependencies with known vulnerabilities...
Found 1 known vulnerability.
Vulnerability #1: GO-2022-1144
An attacker can cause excessive memory growth in a Go server
accepting HTTP/2 requests. HTTP/2 server connections contain a
cache of HTTP header keys sent by the client. While the total
number of entries in this cache is capped, an attacker sending
very large keys can cause the server to allocate approximately
64 MiB per open connection.
Call stacks in your code:
tools/etcd-dump-metrics/main.go:159:31: go.etcd.io/etcd/v3/tools/etcd-dump-metrics.main$4 calls go.etcd.io/etcd/server/v3/embed.StartEtcd, which eventually calls golang.org/x/net/http2.ConfigureServer$1
Found in: golang.org/x/net/http2@v0.2.0
Fixed in: golang.org/x/net/http2@v1.19.4
More info: https://pkg.go.dev/vuln/GO-2022-1144
Vulnerability #2: GO-2022-1144
An attacker can cause excessive memory growth in a Go server
accepting HTTP/2 requests. HTTP/2 server connections contain a
cache of HTTP header keys sent by the client. While the total
number of entries in this cache is capped, an attacker sending
very large keys can cause the server to allocate approximately
64 MiB per open connection.
Call stacks in your code:
contrib/lock/storage/storage.go:106:28: go.etcd.io/etcd/v3/contrib/lock/storage.main calls net/http.ListenAndServe
contrib/raftexample/httpapi.go:113:31: go.etcd.io/etcd/v3/contrib/raftexample.serveHTTPKVAPI$1 calls net/http.Server.ListenAndServe
tools/etcd-dump-metrics/main.go:159:31: go.etcd.io/etcd/v3/tools/etcd-dump-metrics.main$4 calls go.etcd.io/etcd/server/v3/embed.StartEtcd, which eventually calls net/http.Serve
tools/etcd-dump-metrics/main.go:159:31: go.etcd.io/etcd/v3/tools/etcd-dump-metrics.main$4 calls go.etcd.io/etcd/server/v3/embed.StartEtcd, which eventually calls net/http.Server.Serve
Found in: net/http@go1.19.3
Fixed in: net/http@go1.19.4
More info: https://pkg.go.dev/vuln/GO-2022-1144
Signed-off-by: Benjamin Wang <wachao@vmware.com>
This shortens operation history and avoids having too many failed requests.
Failed requests are problematic as too many of them can cause linearizability
verification complexity to become exponential.
Signed-off-by: Marek Siarkowicz <siarkowicz@google.com>
Executed the commands below:
1. Removed go.etcd.io/raft/v3 => ../raft;
2. go get go.etcd.io/raft/v3@eaa6808e1f7ab2247c13778250f70520b0527ff1;
3. go mod tidy
Signed-off-by: Benjamin Wang <wachao@vmware.com>
The change made in https://github.com/etcd-io/etcd/pull/14824 fixed
the test instead of the product code, which isn't correct. Now that
the product code is fixed in this PR, we can revert the change from
that PR.
Signed-off-by: Benjamin Wang <wachao@vmware.com>
Comments fixed as per goword in go _test package files that
shell function go_srcs_in_module lists as per changes on #14827
Helps in #14827
Signed-off-by: Bhargav Ravuri <bhargav.ravuri@infracloud.io>
Comments fixed as per goword in go test files that shell
function go_srcs_in_module lists as per changes on #14827
Helps in #14827
Signed-off-by: Bhargav Ravuri <bhargav.ravuri@infracloud.io>
If the corrupted member has been elected as leader, the memberID in the alert
response won't be the corrupted one. It will be a smaller follower ID, since
raftCluster.Members always sorts by ID. We should check the leader
ID to decide which memberID to use.
Fixes: #14823
Signed-off-by: Wei Fu <fuweid89@gmail.com>
This changes the builds to always add -trimpath, which removes specific
build-time paths from the binary (such as the current directory).
This improves build reproducibility by making the final binary independent of
the specific build path.
Lastly, when stripping debug symbols, also add -w to strip DWARF symbols,
which aren't needed in that case either.
Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com>
1. Fixed the test failures caused by the recent test framework refactoring;
2. renamed the file to promote_experimental_flag_test.go.
Signed-off-by: Benjamin Wang <wachao@vmware.com>
--warning-unary-request-duration is a duplicate of --experimental-warning-unary-request-duration.
--experimental-warning-unary-request-duration will be removed in v3.7.
fixes https://github.com/etcd-io/etcd/issues/13783
Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
ExpectProcess and ExpectFunc now take the exit code of the process into
account, not just the matching of the tty output.
This also refactors the many tests that were previously succeeding on
matching an output from a failing cmd execution.
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
The etcdctl and etcdutl built with `-tags cov` mode will show go-test result
after each execution, like
```
...
PASS
coverage: 0.0% of statements in ./...
```
Since PASS is not a real command, the `source completion` command will
fail with a command-not-found error. There is no easy way to disable
the output of (*testing.M).Run. Therefore, this patch uses the build tag !cov
to disable these cases when coverage is enabled.
Fixes: #14694
Signed-off-by: Wei Fu <fuweid89@gmail.com>
When e2e test cases specify the DataDirPath and there is more than
one member in the cluster, we need to create a subdirectory for each
member. Otherwise all members share the same directory, which
leads to conflicts.
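A minimal sketch of the idea, not the actual e2e framework code. The
function name and the "member-<i>" directory naming are hypothetical and
only chosen for illustration:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// memberDataDirs returns one data directory per member. With a single
// member the base path is used as-is; with several members each one gets
// its own subdirectory so that they never share files.
// The "member-<i>" naming is an assumption made for this sketch.
func memberDataDirs(base string, clusterSize int) []string {
	dirs := make([]string, 0, clusterSize)
	for i := 0; i < clusterSize; i++ {
		if clusterSize == 1 {
			dirs = append(dirs, base)
			continue
		}
		dirs = append(dirs, filepath.Join(base, fmt.Sprintf("member-%d", i)))
	}
	return dirs
}

func main() {
	for _, d := range memberDataDirs("/tmp/etcd-e2e", 3) {
		fmt.Println(d)
	}
}
```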
Signed-off-by: Benjamin Wang <wachao@vmware.com>
This fix avoids the assumption of knowing the current version of the
binary. We can query the binary with the version flag to get the actual
version of the given binary we upgrade and downgrade to. The
respectively reported versions should match what is returned by the
version endpoint.
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
Check the values of myKey and myRev first in the Unlock method to prevent calling Unlock without Lock, because doing so may cause the value of pfx to be deleted by mistake.
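The guard can be sketched roughly as below. This is an in-memory stand-in,
not the real client code: the `mutex` type, its `store` map and the error
value are hypothetical; only the field names myKey, myRev and pfx come
from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// errLockReleased is a hypothetical error for this sketch.
var errLockReleased = errors.New("mutex: lock is not held")

// mutex is a hypothetical, in-memory stand-in for a lock that is backed
// by keys under a common prefix.
type mutex struct {
	pfx   string
	myKey string
	myRev int64
	store map[string]string
}

// Lock records the key and revision owned by this holder.
func (m *mutex) Lock(rev int64) {
	m.myKey = fmt.Sprintf("%s%x", m.pfx, rev)
	m.myRev = rev
	m.store[m.myKey] = ""
}

// Unlock refuses to delete anything unless Lock has set myKey and myRev.
func (m *mutex) Unlock() error {
	if m.myKey == "" || m.myRev <= 0 {
		return errLockReleased
	}
	delete(m.store, m.myKey)
	m.myKey, m.myRev = "", -1
	return nil
}

func main() {
	m := &mutex{pfx: "/my-lock/", store: map[string]string{"/my-lock/": "unrelated"}}
	fmt.Println(m.Unlock()) // rejected: Lock was never called
	m.Lock(7)
	fmt.Println(m.Unlock()) // succeeds and removes only the owned key
}
```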
Signed-off-by: chenyahui <cyhone@qq.com>
After moving `ClusterVersion` and related constants into e2e packages,
some e2e test cases are broken, so we need to update them to use the
correct definitions.
Signed-off-by: Benjamin Wang <wachao@vmware.com>
ClusterContext is used by "e2e" or "integration" to extend the
ClusterConfig. The common test cases shouldn't care about what
data is encoded or included; instead "e2e" or "integration"
framework should decode or parse it separately.
Signed-off-by: Benjamin Wang <wachao@vmware.com>
In the TestDowngradeUpgradeCluster case, the brand-new cluster is using
simple-config-changer, which means that entries have been committed
before leader election and these entries will be applied when etcdserver
starts to receive apply-requests. The simple-config-changer will mark
the `confState` dirty and the storage backend precommit hook will update
the `confState`.
For the new cluster, the storage version is nil at the beginning. And
it will be v3.5 if the `confState` record has been committed. And it
will be >v3.5 if the `storageVersion` record has been committed.
When the new cluster is ready, the leader will set init cluster version
with v3.6.x. And then it will trigger the `monitorStorageVersion` to
update the `storageVersion` to v3.6.x. If the `confState` record has
been updated before cluster version update, we will get storageVersion
record.
If the storage backend doesn't commit in time, the
`monitorStorageVersion` won't update the version because of `cannot
detect storage schema version: missing confstate information`.
If we then file the downgrade request before the next round of
`monitorStorageVersion` (which runs every 4 seconds), the cluster version will be
v3.5.0, which is equal to the `UnsafeDetectSchemaVersion`'s result,
and we won't see `The server is ready to downgrade`.
It is easy to reproduce the issue if you use cpuset or taskset to limit
the run to two CPUs.
So we should wait for the new cluster's storage to be ready before sending
the downgrade request.
Fixes: #14540
Signed-off-by: Wei Fu <fuweid89@gmail.com>
We define two common `WithAuth` functions, for e2e and integration
tests respectively. They call `e2e.WithAuth` and
`integration.WithAuth` respectively.
Signed-off-by: Benjamin Wang <wachao@vmware.com>
It doesn't make sense to always pass an AuthConfig parameter for
test cases which do not enable auth at all. So refactor the
Client interface method so that it accepts a variadic `ClientOption`
parameter.
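For illustration, the functional-options pattern behind such a variadic
parameter looks roughly like this. It is a sketch under assumptions: the
`clientConfig` and `authConfig` types, the `newClientConfig` helper and the
`WithAuth` signature are hypothetical and may differ from the test
framework.

```go
package main

import "fmt"

// authConfig and clientConfig are hypothetical types for this sketch.
type authConfig struct {
	Username string
	Password string
}

type clientConfig struct {
	auth *authConfig // nil means auth is not enabled
}

// ClientOption mutates the client configuration.
type ClientOption func(*clientConfig)

// WithAuth enables auth with the given credentials (assumed signature).
func WithAuth(user, pass string) ClientOption {
	return func(c *clientConfig) {
		c.auth = &authConfig{Username: user, Password: pass}
	}
}

// newClientConfig applies the options in order on top of the defaults.
func newClientConfig(opts ...ClientOption) *clientConfig {
	cfg := &clientConfig{}
	for _, opt := range opts {
		opt(cfg)
	}
	return cfg
}

func main() {
	plain := newClientConfig() // no auth argument needed at all
	authed := newClientConfig(WithAuth("root", "rootPass"))
	fmt.Println(plain.auth == nil, authed.auth.Username)
}
```

Test cases that never enable auth simply call the method without options.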
Signed-off-by: Benjamin Wang <wachao@vmware.com>
Check the client count before creating the ephemeral key, and do not
create the key if there are already too many clients. Check the
count again after creating the key: if the total number of kvs is bigger
than the expected count, check the rev of the current key
and act based on it. If its rev is among
the first ${count}, it's a valid client; otherwise, it should
fail.
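The post-creation check can be sketched as a pure function. This is an
illustration only: the `admitted` helper and its inputs are hypothetical,
and the real code reads the keys and revisions from etcd.

```go
package main

import (
	"fmt"
	"sort"
)

// admitted reports whether the key created at myRev is among the first
// `count` keys when all keys are ordered by create revision.
// The function is a hypothetical helper for this sketch.
func admitted(createRevs []int64, myRev int64, count int) bool {
	sorted := append([]int64(nil), createRevs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	for i, rev := range sorted {
		if i >= count {
			break
		}
		if rev == myRev {
			return true
		}
	}
	return false
}

func main() {
	revs := []int64{5, 3, 9} // three clients raced for two slots
	fmt.Println(admitted(revs, 3, 2), admitted(revs, 5, 2), admitted(revs, 9, 2))
}
```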
Signed-off-by: Benjamin Wang <wachao@vmware.com>
Rebased this PR. There was no response from the original author,
so Benjamin (ahrtr@) continues to work on this PR.
Signed-off-by: Vitalii Levitskii <vitalii@uber.com>
Signed-off-by: Benjamin Wang <wachao@vmware.com>
Since the http2 spec defines the receive window's size and the max size of
a frame in the stream, the underlying layer (the gRPC client) can pre-read data
from the server even if the application layer hasn't read it yet.
The initialized cluster has a 20KiB snapshot, which can be pre-read
by the underlying layer. We should increase the snapshot's size; otherwise
io.Copy may not return the canceled or timeout error.
Fixes: #14477
Signed-off-by: Wei Fu <fuweid89@gmail.com>
submitConcurrentWatch uses a 3s sleep to wait for all the watch connections
to be ready. When the number of connections increases, e.g. to 1000, 3s is not
enough and the test case becomes flaky.
This commit spawns a curl process and checks for an output line containing
`created":true}}` to make sure that the connection has been initialized
and is ready to receive events. This makes testing the subsequent
range request reliable.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
github.com/golang-jwt/jwt adds go mod support starting from 4.0.0,
and it's backwards-compatible with existing v3.x.y tags.
Signed-off-by: Benjamin Wang <wachao@vmware.com>
Problem: both SIGQUIT_ETCD_AND_REMOVE_DATA_AND_STOP_AGENT and test.sh
will attempt to stop agents and remove directories.
Solution: since test.sh creates directories and starts test, it should be
responsible for cleanup.
See https://github.com/etcd-io/etcd/issues/14384
Signed-off-by: Bogdan Kanivets <bkanivets@apple.com>
Due to a duplicate call of clientConfigFromCmd, the move-leader command
would fail with "conflicting environment variable is shadowed by corresponding command-line flag",
even in scenarios where no command-line flag was supplied.
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
The alarm list is the only exception that doesn't move consistent_index
forward. The reproduction steps are as simple as,
```
etcd --snapshot-count=5 &
for i in {1..6}; do etcdctl alarm list; done
kill -9 <etcd_pid>
etcd
```
Signed-off-by: Benjamin Wang <wachao@vmware.com>
When we can't reach quorum, we were waiting forever and never sending
the systemd notify message. As a result, systemd would eventually time out
and restart the etcd process, which would likely leave the unhealthy cluster
in an even worse state.
Improves #13785
Signed-off-by: Nicolai Moore <niconorsk@gmail.com>
Notes:
1. compactPhysical in ctlCtx and withQuota aren't used at all, they are dead code.
2. quotaBackendBytes in ctlCtx isn't used either. Instead, users (test cases) set the QuotaBackendBytes directly.
Signed-off-by: Benjamin Wang <wachao@vmware.com>
There are two cases: when interrupted by users, forcibly kill
all processes; otherwise, gracefully terminate all processes.
Signed-off-by: Benjamin Wang <wachao@vmware.com>
The proxy must wait for etcd to be running, but the current
implementation hard-codes the waiting time as 5 seconds. The improvement
is to dynamically check whether etcd is running, and start the
proxy once the etcd port is reachable.
Signed-off-by: Benjamin Wang <wachao@vmware.com>
To verify that the distributed tracing feature is correctly set up, this PR adds
an integration test for this feature.
In the process of writing the test, I discovered a goroutine leak due to
the TraceProvider not being closed. This PR fixes this issue as well.
Signed-off-by: Yingrong Zhao <yingrong.zhao@gmail.com>
This should aid in debugging test flakes, especially in tests where the process is restarted very often and thus changes its pid.
Now it's a lot easier to grep for different members, also when different tests fail at the same time.
The test TestDowngradeUpgradeClusterOf3 as mentioned in #13167 is a good example for that.
Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
etcdctl/ctlv3: migrate cheggaaa/pb.v1 to cheggaaa/pb/v3
This commit also changes the format of the progress bar, from using a
custom progress bar to the default provided by the library.
Old behaviour:
./benchmarkv1 put
0 / 10000 B ! 0.00%
3987 / 10000 Boooooooooooooom ! 39.87%
10000 / 10000 Boooooooooooooooooooooooooooooooooooooooooooo! 100.00% 1s
New behaviour:
./benchmark put
6536 / 10000 [----------------------->________________] 65.36% 7053 p/s
10000 / 10000 [---------------------------------------] 100.00% 7581 p/s
Signed-off-by: Mikel Olasagasti Uranga <mikel@olasagasti.info>
We have already defined all the constant etcd versions in the
centralized place api/version/version.go. So we should replace all
the versions with the centralized definitions.
Problem: TestLeaseGrantAndList is flaky because lists won't match at the end.
Test uses revision to verify that all members got leases. But checking for revision isn't enough.
Solution: use size of the list to stop polling.
When clients have no permission to perform an operation,
the apply may fail. We should also move consistent_index forward
in this case, otherwise consistent_index may be smaller than the
snapshot index.
Downstream users of etcd experience build issues when using dependencies
which require more recent (incompatible) versions of opentelemetry. This
commit upgrades the dependencies so that downstream users stop
experiencing these issues.
When a user sets up a Mirror with a restricted user that doesn't have
access to the `foo` path, we will fail to get the most recent revision
due to permissions issues.
With this change, when a prefix is provided we will get the initial
revision from the prefix rather than /foo. This allows restricted users
to set up sync.
When etcdserver receives a LeaseRenew request, it may still be in
the process of handling the LeaseGrantRequest on exactly the same
leaseID. Accordingly it may return TTL=0 to the client due to the
leaseID-not-found error. So the leader should wait for the appliedID
to be available before processing client requests.
Currently there are a handful of tests within etcd that silently fail
because LeakDetection will skip the test before it manages to hit this
check.
Here we move the check to the beginning of the process to highlight
these cases earlier, and to avoid them accidentally presenting as leaks.
We don't consistently reach the same etcd server during the lifetime of
a test and in some cases, this means that this test will flake if an
etcd server was slow to update its state and the test hits the outdated
server.
Here we switch to using an `Eventually` case which will wait up to a
second for the expected result before failing, with a 10ms gap between
invocations.
```
[tests(dani/leasefix)] $ gotestsum -- ./common -tags integration -count 100 -timeout 15m -run TestLeaseGrantAndList
✓ common (2m26.71s)
DONE 1600 tests in 147.258s
```
I think the strict (not-equal) comparison was too restrictive when expressed with 1s granularity.
```
logger.go:130: 2022-04-03T22:15:15.242+0200 WARN m1 leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk {"member": "m1", "to": "cb785755eb80ac1", "heartbeat-interval": "10ms", "expected-duration": "20ms", "exceeded-duration": "24.666613ms"}
logger.go:130: 2022-04-03T22:15:15.262+0200 INFO m-1 published local member to cluster through raft {"member": "m-1", "local-member-id": "e2dd9f523aa7be87", "local-member-attributes": "{Name:m-1 ClientURLs:[unix://127.0.0.1:2196386040]}", "cluster-id": "b4b8e7e41c23c8b5", "publish-timeout": "5.2s"}
v3_lease_test.go:415: Expected lease ttl (4m58s) to be greather than (4m58s)
```
The code now ensures that each test runs in its own directory as opposed to the shared os.tempdir.
```
$ (cd tests && env go test -timeout=15m --race go.etcd.io/etcd/tests/v3/integration/clientv3/examples -run ExampleAuth)
2022/04/03 10:24:59 Running tests (examples): ...
2022/04/03 10:24:59 the function can be called only in the test context. Was integration.BeforeTest() called ?
2022/04/03 10:24:59 2022-04-03T10:24:59.462+0200 INFO m0 LISTEN GRPC {"member": "m0", "grpcAddr": "localhost:m0", "m.Name": "m0"}
```
Almost none of the tests checked the returned value; they just assumed WaitLeader succeeded.
```
maintenance_test.go:277: Waiting for leader...
logger.go:130: 2022-04-03T08:01:09.914+0200 INFO m0 cluster version differs from storage version. {"member": "m0", "cluster-version": "3.6.0", "storage-version": "3.5.0"}
logger.go:130: 2022-04-03T08:01:09.915+0200 WARN m0 leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk {"member": "m0", "to": "2acc3d3b521981", "heartbeat-interval": "10ms", "expected-duration": "20ms", "exceeded-duration": "103.756219ms"}
logger.go:130: 2022-04-03T08:01:09.916+0200 INFO m0 updated storage version {"member": "m0", "new-storage-version": "3.6.0"}
...
logger.go:130: 2022-04-03T08:01:09.926+0200 INFO grpc [[roundrobin] roundrobinPicker: Build called with info: {map[0xc002630ac0:{{unix:localhost:m0 localhost <nil> 0 <nil>}} 0xc002630af0:{{unix:localhost:m1 localhost <nil> 0 <nil>}} 0xc002630b20:{{unix:localhost:m2 localhost <nil> 0 <nil>}}]}]
logger.go:130: 2022-04-03T08:01:09.926+0200 WARN m0 apply request took too long {"member": "m0", "took": "114.661766ms", "expected-duration": "100ms", "prefix": "", "request": "header:<ID:12658633312866157316 > cluster_version_set:<ver:\"3.6.0\" > ", "response": ""}
logger.go:130: 2022-04-03T08:01:09.927+0200 INFO m0 cluster version is updated {"member": "m0", "cluster-version": "3.6"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m2.raft 9f96af25a04e2ec3 [logterm: 2, index: 8, vote: 9903a56eaf96afac] ignored MsgVote from 2acc3d3b521981 [logterm: 2, index: 8] at term 2: lease is not expired (remaining ticks: 10) {"member": "m2"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft 9903a56eaf96afac [logterm: 2, index: 8, vote: 9903a56eaf96afac] ignored MsgVote from 2acc3d3b521981 [logterm: 2, index: 8] at term 2: lease is not expired (remaining ticks: 5) {"member": "m0"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft 9903a56eaf96afac [term: 2] received a MsgAppResp message with higher term from 2acc3d3b521981 [term: 3] {"member": "m0"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft 9903a56eaf96afac became follower at term 3 {"member": "m0"}
logger.go:130: 2022-04-03T08:01:09.955+0200 INFO m0.raft raft.node: 9903a56eaf96afac lost leader 9903a56eaf96afac at term 3 {"member": "m0"}
maintenance_test.go:279: Leader established.
```
Tmp
```
% (cd client/v3 && env go test -short -timeout=3m --race ./...)
--- FAIL: TestAuthTokenBundleNoOverwrite (0.00s)
client_test.go:210: listen unix /var/folders/t1/3m8z9xz93t9c3vpt7zyzjm6w00374n/T/TestAuthTokenBundleNoOverwrite3197524989/001/etcd-auth-test:0: bind: invalid argument
FAIL
FAIL go.etcd.io/etcd/client/v3 4.270s
```
The reason was that the path exceeded 108 chars, which is too long for a unix socket path.
As a mitigation, we first change the working directory to the tempDir, so that the socket path is 'local' (relative).