From 2aa5684ada24882449ec6ba0f797bc8a2b0d273e Mon Sep 17 00:00:00 2001
From: Dan Mace
Date: Fri, 10 Jul 2020 14:33:53 -0400
Subject: [PATCH] Documentation: Tweak etcdMembersDown to reduce false positives

Before this change, during a reboot in which etcd recovers quickly
(e.g. within 1 min), the etcdMembersDown alert tended to fire even
though etcd was fully healthy, because the 3 min rate window can take
more than 3 minutes to average back down below the 0.01 threshold.

This change reduces the likelihood of such a false positive by using a
shorter (1 min) failure rate window, which tends to average down below
the threshold far more quickly (within 1 min). The `for` clause of the
alert should ensure that the alert still fires if the poor conditions
are sustained for an unreasonable overall time (3 min).
---
 Documentation/etcd-mixin/mixin.libsonnet | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/etcd-mixin/mixin.libsonnet b/Documentation/etcd-mixin/mixin.libsonnet
index 66595a270..fc799f2ab 100644
--- a/Documentation/etcd-mixin/mixin.libsonnet
+++ b/Documentation/etcd-mixin/mixin.libsonnet
@@ -15,7 +15,7 @@
                 sum by (job) (up{%(etcd_selector)s} == bool 0)
               or
                 count by (job,endpoint) (
-                  sum by (job,endpoint,To) (rate(etcd_network_peer_sent_failures_total{%(etcd_selector)s}[3m])) > 0.01
+                  sum by (job,endpoint,To) (rate(etcd_network_peer_sent_failures_total{%(etcd_selector)s}[1m])) > 0.01
                 )
               )
               > 0