Merge pull request #7622 from heyitsanthony/faq-disk-leader

Documentation: add disk latency leader loss question to FAQ
2017-03-28 19:18:50 -07:00 · 2017-03-28 19:18:50 -07:00 · 36735d52a4
parent eafab47f05 faad828c51
commit 36735d52a4
1 changed file with 4 additions and 0 deletions
--- a/Documentation/faq.md
+++ b/Documentation/faq.md
@@ -78,6 +78,10 @@ On the other hand, if the downed member is removed from cluster membership first

 etcd sets `strict-reconfig-check` in order to reject reconfiguration requests that would cause quorum loss. Abandoning quorum is really risky (especially when the cluster is already unhealthy). Although it may be tempting to disable quorum checking if there's quorum loss to add a new member, this could lead to full fledged cluster inconsistency. For many applications, this will make the problem even worse ("disk geometry corruption" being a candidate for most terrifying).

+### Why does etcd lose its leader from disk latency spikes?
+
+This is intentional; disk latency is part of leader liveness. Suppose the cluster leader takes a minute to fsync a raft log update to disk, but the etcd cluster has a one second election timeout. Even though the leader can process network messages within the election interval (e.g., send heartbeats), it's effectively unavailable because it can't commit any new proposals; it's waiting on the slow disk. If the cluster frequently loses its leader due to disk latencies, try [tuning][tuning] the disk settings or etcd time parameters.
+
 ### Performance

 #### How should I benchmark etcd?