From faad828c5154b871ec2b2a4f5e6d9e3dd4b715dc Mon Sep 17 00:00:00 2001
From: Anthony Romano
Date: Tue, 28 Mar 2017 14:43:40 -0700
Subject: [PATCH] Documentation: add disk latency leader loss question to FAQ

---
 Documentation/faq.md | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/Documentation/faq.md b/Documentation/faq.md
index dd1c1c08b..fe7bea60a 100644
--- a/Documentation/faq.md
+++ b/Documentation/faq.md
@@ -78,6 +78,10 @@ On the other hand, if the downed member is removed from cluster membership first
 
 etcd sets `strict-reconfig-check` in order to reject reconfiguration requests that would cause quorum loss. Abandoning quorum is really risky (especially when the cluster is already unhealthy). Although it may be tempting to disable quorum checking if there's quorum loss to add a new member, this could lead to full fledged cluster inconsistency. For many applications, this will make the problem even worse ("disk geometry corruption" being a candidate for most terrifying).
 
+### Why does etcd lose its leader from disk latency spikes?
+
+This is intentional; disk latency is part of leader liveness. Suppose the cluster leader takes a minute to fsync a raft log update to disk, but the etcd cluster has a one second election timeout. Even though the leader can process network messages within the election interval (e.g., send heartbeats), it's effectively unavailable because it can't commit any new proposals; it's waiting on the slow disk. If the cluster frequently loses its leader due to disk latencies, try [tuning][tuning] the disk settings or etcd time parameters.
+
 ### Performance
 
 #### How should I benchmark etcd?
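
Reviewer note: the reasoning in the new FAQ answer can be illustrated with a toy model. The sketch below is not etcd code; the function name `leader_is_effectively_live` and the 5 ms "fast disk" figure are assumptions made purely for illustration. It treats a leader as live only if it can persist a raft log entry within one election timeout, which is the framing the FAQ answer uses.

```python
def leader_is_effectively_live(fsync_latency_ms: float,
                               election_timeout_ms: float) -> bool:
    """Hypothetical model of the FAQ's argument, not etcd's implementation.

    A leader that cannot fsync a log entry within the election timeout
    cannot commit new proposals in that window, so it is treated as
    unavailable even if it could still send heartbeats.
    """
    return fsync_latency_ms < election_timeout_ms


# The FAQ's scenario: a one-minute fsync against a one-second election timeout.
slow_disk = leader_is_effectively_live(60_000, 1_000)

# An assumed fast disk (5 ms fsync) against the same timeout.
fast_disk = leader_is_effectively_live(5, 1_000)

print(slow_disk, fast_disk)
```

Under this model, raising the election timeout or reducing fsync latency are the two ways to bring a leader back within the liveness bound, which matches the FAQ's suggestion to tune either the disk settings or the etcd time parameters.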