Documentation: Tweak etcdMembersDown to reduce false positives

Before this change, during a reboot in which etcd recovers quickly (e.g. 1 min),
the etcdMembersDown alert tends to fire even when etcd is fully healthy because
the averaging function can take more than 3 minutes to average back down below
the 0.01 threshold.

This change reduces the likelihood of such a false positive by using a shorter
(1 min) failure rate window, which averages back down below the threshold far
more quickly (within 1 min). The `for` clause of the alert should ensure that
the alert still fires if the poor conditions are sustained for an unreasonable
overall time (3 min).
release-3.5
Dan Mace 2020-07-10 14:33:53 -04:00
parent 07461ecc8c
commit 2aa5684ada
1 changed file with 1 addition and 1 deletion

@@ -15,7 +15,7 @@
sum by (job) (up{%(etcd_selector)s} == bool 0)
or
count by (job,endpoint) (
- sum by (job,endpoint,To) (rate(etcd_network_peer_sent_failures_total{%(etcd_selector)s}[3m])) > 0.01
+ sum by (job,endpoint,To) (rate(etcd_network_peer_sent_failures_total{%(etcd_selector)s}[1m])) > 0.01
)
)
> 0
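
The effect of the window change described in the commit message can be illustrated with a rough Python model. This is a sketch under stated assumptions, not how Prometheus evaluates `rate()`: it treats the rate as "failures inside the window divided by the window length", ignores scrape intervals and extrapolation, and assumes a hypothetical steady 1 failure per second during a 60 s outage. The function names and the `for_s` value (taken from the 3 min figure in the message) are illustrative only.

```python
def windowed_rate(t, window_s, outage_start, outage_end, failures_per_s):
    """Approximate rate(counter[window]) at time t (seconds).

    Simplification: failures that fall inside [t - window_s, t] divided by
    the window length. Real Prometheus rate() also extrapolates between samples.
    """
    overlap = max(0.0, min(t, outage_end) - max(t - window_s, outage_start))
    return failures_per_s * overlap / window_s


def seconds_above_threshold(window_s, outage_s=60, failures_per_s=1.0,
                            threshold=0.01, horizon_s=600):
    """Count 1-second evaluation steps at which the rate exceeds the threshold.

    The outage is assumed to start at t=0; the rate rises and then falls once,
    so this count is also the length of the continuous 'pending' period.
    """
    return sum(
        1
        for t in range(horizon_s)
        if windowed_rate(t, window_s, 0, outage_s, failures_per_s) > threshold
    )


def alert_fires(window_s, outage_s=60, for_s=180):
    """The alert fires only if the condition holds for at least `for_s` seconds."""
    return seconds_above_threshold(window_s, outage_s) >= for_s


if __name__ == "__main__":
    for window in (180, 60):
        print(window, seconds_above_threshold(window), alert_fires(window))
```

Under these assumptions, a 60 s outage keeps the 3m-window expression above 0.01 for 237 s, which outlasts a 3 min `for` clause, while the 1m-window expression stays above it for only 119 s and never fires. A sustained outage (for example `outage_s=300`) still fires with the 1m window.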