Documentation: Tweak etcdMembersDown to reduce false positives
Before this change, during a reboot in which etcd recovers quickly (e.g. within 1 min), the etcdMembersDown alert tends to fire even though etcd is fully healthy, because the 3-minute rate window can take more than 3 minutes to average back down below the 0.01 threshold. This change reduces the chance of such a false positive by using a shorter (1 min) failure-rate window, which tends to average back below the threshold far more quickly (within 1 min). The `for` clause of the alert should ensure that the alert still fires if the poor conditions are sustained for an unreasonable overall time (3 min).
release-3.5
parent 07461ecc8c
commit 2aa5684ada
@@ -15,7 +15,7 @@
             sum by (job) (up{%(etcd_selector)s} == bool 0)
           or
             count by (job,endpoint) (
-              sum by (job,endpoint,To) (rate(etcd_network_peer_sent_failures_total{%(etcd_selector)s}[3m])) > 0.01
+              sum by (job,endpoint,To) (rate(etcd_network_peer_sent_failures_total{%(etcd_selector)s}[1m])) > 0.01
             )
           )
           > 0
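
The effect of shrinking the rate window from 3m to 1m can be illustrated with a rough simulation. This is only a sketch under stated assumptions: it models the rate as the plain average increase of the failure counter over a trailing window (ignoring the extrapolation Prometheus applies in `rate()`), and it assumes a hypothetical 60-second outage producing 1 peer send failure per second. The function names and all parameter values are illustrative and are not part of the commit.

```python
def windowed_rate(t, window, outage_start, outage_end, failures_per_sec):
    """Average per-second increase of the failure counter over [t - window, t].

    Idealized model: failures accrue at a constant rate during the outage
    and not at all outside it.
    """
    lo = max(t - window, outage_start)
    hi = min(t, outage_end)
    overlap = max(0.0, hi - lo)  # seconds of outage still inside the window
    return failures_per_sec * overlap / window


def seconds_until_below(window, threshold=0.01, outage=60.0,
                        failures_per_sec=1.0, step=1.0):
    """Seconds after recovery until the windowed rate drops below threshold."""
    t = outage  # the outage spans [0, outage]; recovery happens at t = outage
    while windowed_rate(t, window, 0.0, outage, failures_per_sec) >= threshold:
        t += step
    return t - outage


if __name__ == "__main__":
    for window in (180.0, 60.0):  # the old [3m] window and the new [1m] window
        delay = seconds_until_below(window)
        print(f"{int(window)}s window: below 0.01 about {delay:.0f}s after recovery")
```

Under these assumptions the 3-minute window keeps the expression above the 0.01 threshold for roughly three minutes after etcd has recovered, while the 1-minute window clears in about one minute, which is consistent with the timings quoted in the commit message.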