Documentation: Tweak etcdMembersDown to reduce false positives

Before this change, during a reboot in which etcd recovers quickly (e.g. within 1 min), the etcdMembersDown alert tended to fire even though etcd was fully healthy again, because the 3-minute `rate()` window can take more than 3 minutes to average back down below the 0.01 threshold. This change reduces the likelihood of such a false positive by using a shorter (1 min) failure-rate window, which averages down below the threshold far more quickly (within 1 min). The `for` clause of the alert should ensure that the alert still fires if the poor conditions are sustained for an unreasonably long overall time (3 min).
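
The effect of the window length can be illustrated with a rough sketch. It assumes a deliberately simplified model of `rate()`: the number of failures inside the window divided by the window length, ignoring Prometheus's extrapolation and scrape-interval effects. The function name `seconds_until_below`, the 60-second outage, and the failure rate of 1 per second are hypothetical values chosen for illustration, not measurements from a real cluster.

```python
def seconds_until_below(window_s, outage_s=60.0, failures_per_s=1.0,
                        threshold=0.01, step_s=1.0):
    """Seconds after recovery until the windowed failure rate drops below threshold.

    Simplified model (an assumption, not Prometheus's exact algorithm):
    rate = failures that fall inside the trailing window / window length.
    """
    elapsed = 0.0
    while True:
        # Portion of the outage that still overlaps the trailing window.
        overlap = max(0.0, min(outage_s, window_s - elapsed))
        rate = failures_per_s * overlap / window_s
        if rate < threshold:
            return elapsed
        elapsed += step_s


if __name__ == "__main__":
    # Hypothetical 60 s outage producing 1 send failure per second.
    print(seconds_until_below(180.0))  # 3m window -> 179.0
    print(seconds_until_below(60.0))   # 1m window -> 60.0
```

Under these assumptions the 3m window keeps the expression above 0.01 for roughly three minutes after recovery, which is long enough to collide with a 3-minute `for` clause, while the 1m window clears in about one minute.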
2020-07-10 14:33:53 -04:00
parent 07461ecc8c
commit 2aa5684ada
1 changed file with 1 addition and 1 deletion
--- a/Documentation/etcd-mixin/mixin.libsonnet
+++ b/Documentation/etcd-mixin/mixin.libsonnet
@@ -15,7 +15,7 @@
                sum by (job) (up{%(etcd_selector)s} == bool 0)
              or
                count by (job,endpoint) (
-                  sum by (job,endpoint,To) (rate(etcd_network_peer_sent_failures_total{%(etcd_selector)s}[3m])) > 0.01
+                  sum by (job,endpoint,To) (rate(etcd_network_peer_sent_failures_total{%(etcd_selector)s}[1m])) > 0.01
                )
              )
              > 0