Before this PR, the log is
```
2015/09/1 13:18:31 etcdmain: client: etcd cluster is unavailable or
misconfigured
```
It is quite hard for people to understand what happens.
Now we print out the exact reason for the failure, and explains the way
to handle it.
Current membership changing functionality of etcd seems to have a
problem which can cause deadlock.
How to produce:
1. construct N node cluster
2. add N new nodes with etcdctl member add, without starting the new members
What happens:
After finishing add N nodes, a total number of the cluster becomes 2 *
N and a quorum number of the cluster becomes N + 1. It means
membership change requires at least N + 1 nodes because Raft treats
membership information in its log like other ordinal log append
requests.
Assume the peer URLs of the added nodes are wrong because of miss
operation or bugs in wrapping program which launch etcd. In such a
case, both of adding and removing members are impossible because the
quorum isn't preserved. Of course ordinal requests cannot be
served. The cluster would seem to be deadlock.
Of course, the best practice of adding new nodes is adding one node
and let the node start one by one. However, the effect of this problem
is so serious. I think preventing the problem forcibly would be
valuable.
Solution:
This patch lets etcd forbid adding a new node if the operation changes
quorum and the number of changed quorum is larger than a number of
running nodes. If etcd is launched with a newly added option
-strict-reconfig-check, the checking logic is activated. If the option
isn't passed, default behavior of reconfig is kept.
Fixes https://github.com/coreos/etcd/issues/3477
We always want to use GOMAXPROCS() as the way go parses it. When in go1.4, we
want to expose GOMAXPROCS value, so we set GOMAXPROCS explicitly as the
way go 1.4 does and print it out.
But it becomes a problem when go 1.5 changes the way to set GOMAXPROCS.
Fix the problem by stop setting GOMAXPROCS and get its value directly.
Due to this change, it sets default GOMAXPROCS to the
number of CPUs available when compiling in go 1.5, which matches how go 1.5 works:
https://docs.google.com/document/d/1At2Ls5_fhJQ59kDK2DFVhFu3g5mATSXqqV5QrxinasI/edit
This is a behavior change in etcd 2.2.
It uses heartbeat interval and election timeout to estimate the
expected request timeout.
This PR helps etcd survive under high roundtrip-time environment,
e.g., globally-deployed cluster.
This solves the problem that etcd may fatal because its critical path
cannot get file descriptor resource when the number of clients is too
big. The PR lets the client listener close client connections
immediately after they are accepted when
the file descriptor usage in the process reaches some pre-set limit, so
it ensures that the internal critical path could always get file
descriptor when it needs.
When there are tons to clients connecting to the server, the original
behavior is like this:
```
2015/08/4 16:42:08 etcdserver: cannot monitor file descriptor usage
(open /proc/self/fd: too many open files)
2015/08/4 16:42:33 etcdserver: failed to purge snap file open
default2.etcd/member/snap: too many open files
[halted]
```
Current behavior is like this:
```
2015/08/6 19:05:25 transport: accept error: closing connection,
exceed file descriptor usage limitation (fd limit=874)
2015/08/6 19:05:25 transport: accept error: closing connection,
exceed file descriptor usage limitation (fd limit=874)
2015/08/6 19:05:26 transport: accept error: closing connection,
exceed file descriptor usage limitation (fd limit=874)
2015/08/6 19:05:27 transport: accept error: closing connection,
exceed file descriptor usage limitation (fd limit=874)
2015/08/6 19:05:28 transport: accept error: closing connection,
exceed file descriptor usage limitation (fd limit=874)
2015/08/6 19:05:28 etcdserver: 80% of the file descriptor limit is
used [used = 873, limit = 1024]
```
It is available at linux system today because pkg/runtime only has linux
support.
At present it prints the whole usage and flags, which cause the exact
error message is hidden two screens above.
Fixes#3141
Signed-off-by: Guohua Ouyang <gouyang@redhat.com>
If the user sets TLS info, this implies that he wants to listen on TLS.
If etcd finds that urls to listen is still HTTP schema, it prints out
warning to notify user about possible wrong setting.
etcd shows an odd message on configuration error like this (partially):
```
... discovery or bootstrap flags are setChoose one of ...
^^^^^^^^^
```
This commit fixes the message format problem.
This PR set maxIdleConnsPerHost to 128 to let proxy handle 128 concurrent
requests in long term smoothly.
If the number of concurrent requests is bigger than this value,
proxy needs to create one new connection when handling each request in
the delta, which is bad because the creation consumes resource and may
eat up your ephemeral port.
After this PR, only cluster's interface Cluster is exposed, which makes
code much cleaner. And it avoids external packages to rely on cluster
struct in the future.
The PR extracts types.Cluster from etcdserver.Cluster. types.Cluster
is used for flag parsing and etcdserver config.
There is no need to expose etcdserver.Cluster public, which contains
lots of etcdserver internal details and methods. This is the first step
for it.
Before this PR, people can set listen-client-urls without setting
advertise-client-urls, and leaves advertise-client-urls as default
localhost value. The client libraries which sync the cluster info
fetch wrong advertise-client-urls and cannot connect to the cluster.
This PR avoids this case and provides better UX.
On the other hand, this change is safe because people always want to set
advertise-client-urls if listen-client-urls is set. The default localhost
advertise url cannot be accessed from the outside, and should always be
set except that etcd is bootstrapped with no flag.
We start to resolve host into tcp addrs since we generate
tcp based initial-cluster during srv discovery. However it
creates problems around tls and cluster verification. The
srv discovery only needs to use resolved the tcp addr to
find the local node. It does not have to resolve everything
and use the resolved addrs.
This fixes#2488 and #2226
Currently this doesn't work if a user wants to try out a single machine
cluster but change the name for whatever reason. This is because the
name is always "default" and the
```
./bin/etcd -name 'baz'
```
This solves our problem on CoreOS where the default is `ETCD_NAME=%m`.
etcd does not provide enough flexibility to configure server SSL and
client authentication separately. When configuring server SSL the
`--ca-file` flag is required to trust self-signed SSL certificates
used to service client requests.
The `--ca-file` has the side effect of enabling client cert
authentication. This can be surprising for those looking to simply
secure communication between an etcd server and client.
Resolve this issue by introducing four new flags:
--client-cert-auth
--peer-client-cert-auth
--trusted-ca-file
--peer-trusted-ca-file
These new flags will allow etcd to support a more explicit SSL
configuration for both etcd clients and peers.
Example usage:
Start etcd with server SSL and no client cert authentication:
etcd -name etcd0 \
--advertise-client-urls https://etcd0.example.com:2379 \
--cert-file etcd0.example.com.crt \
--key-file etcd0.example.com.key \
--trusted-ca-file ca.crt
Start etcd with server SSL and enable client cert authentication:
etcd -name etcd0 \
--advertise-client-urls https://etcd0.example.com:2379 \
--cert-file etcd0.example.com.crt \
--key-file etcd0.example.com.key \
--trusted-ca-file ca.crt \
--client-cert-auth
Start etcd with server SSL and client cert authentication for both
peer and client endpoints:
etcd -name etcd0 \
--advertise-client-urls https://etcd0.example.com:2379 \
--cert-file etcd0.example.com.crt \
--key-file etcd0.example.com.key \
--trusted-ca-file ca.crt \
--client-cert-auth \
--peer-cert-file etcd0.example.com.crt \
--peer-key-file etcd0.example.com.key \
--peer-trusted-ca-file ca.crt \
--peer-client-cert-auth
This change is backwards compatible with etcd versions 2.0.0+. The
current behavior of the `--ca-file` flag is preserved.
Fixes#2499.
Support IPv6 address for ETCD_ADDR and ETCD_PEER_ADDR
pkg/flags: Support IPv6 address for ETCD_ADDR and ETCD_PEER_ADDR
pkg/flags: tests for IPv6 addr and bind-addr flags
pkg/flags: IPAddressPort.Host: do not enclose IPv6 address in square brackets
pkg/flags: set default bind address to [::] instead of 0.0.0.0
pkg/flags: we don't need fmt any more
also, one minor fix: net.JoinHostPort takes string as a port value
pkg/flags: fix ipv6 tests
pkg/flags: test both IPv4 and IPv6 addresses in TestIPAddressPortString
etcdmain: test: use [::] instead of 0.0.0.0
Dial timeout is set shorter because
1. etcd is supposed to work in good environment, and the new value is long
enough
2. shorter dial timeout makes dial fail faster, which is good for
performance
The functionality in pkg/osutil ensures that all interrupt handlers finish
and the process kills itself with the proper signal.
Test for interrupt handling added.
The server shutsdown gracefully by stopping on interrupt (Issue #2277.)