From 2aa5684ada24882449ec6ba0f797bc8a2b0d273e Mon Sep 17 00:00:00 2001
From: Dan Mace <ironcladlou@gmail.com>
Date: Fri, 10 Jul 2020 14:33:53 -0400
Subject: [PATCH] Documentation: Tweak etcdMembersDown to reduce false
 positives

Before this change, during a reboot in which etcd recovers quickly (e.g. 1 min),
the etcdMembersDown alert tends to fire even when etcd is fully healthy because
the rate of etcd_network_peer_sent_failures_total, taken over a 3 min window,
can take more than 3 minutes to average back down below the 0.01 threshold.

This change tries to reduce the possibility of a false positive by considering a
shorter (1 min) failure rate window, which tends to average down below the
threshold far more quickly (within 1 min). The `for` clause of the alert should
ensure that the alert still fires if the poor conditions are sustained for an
unreasonable overall time (3 min).
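
The effect of the window size can be illustrated with a rough Python sketch. It
is not how Prometheus computes rate() (there is no scraping or extrapolation
here), and the 60 s outage, the 1 failure/s rate, and all function names are
assumptions made up for the illustration:

```python
def windowed_rate(t, window, outage_start, outage_end, failures_per_sec):
    """Idealized rate(): failures seen in [t - window, t], divided by window."""
    overlap = max(0.0, min(t, outage_end) - max(t - window, outage_start))
    return failures_per_sec * overlap / window


def seconds_above_threshold(window, outage_end=60, failures_per_sec=1.0,
                            threshold=0.01, horizon=600):
    """Count whole seconds in [0, horizon) where the rate exceeds threshold.

    For a single outage starting at t=0 these seconds form one contiguous
    run, so the count can be compared directly with the `for` duration.
    """
    return sum(
        1
        for t in range(horizon)
        if windowed_rate(t, window, 0, outage_end, failures_per_sec) > threshold
    )


FOR_CLAUSE = 180  # assumed `for` duration of the alert (3 min), in seconds

for window in (180, 60):
    above = seconds_above_threshold(window)
    print(f"[{window}s] above threshold for {above}s, fires: {above >= FOR_CLAUSE}")
```

Under these assumptions the 3 min window keeps the expression above the
threshold for longer than the `for` duration after a 1 min outage, while the
1 min window drops back below it well before the `for` duration elapses.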
---
 Documentation/etcd-mixin/mixin.libsonnet | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/etcd-mixin/mixin.libsonnet b/Documentation/etcd-mixin/mixin.libsonnet
index 66595a270..fc799f2ab 100644
--- a/Documentation/etcd-mixin/mixin.libsonnet
+++ b/Documentation/etcd-mixin/mixin.libsonnet
@@ -15,7 +15,7 @@
                 sum by (job) (up{%(etcd_selector)s} == bool 0)
               or
                 count by (job,endpoint) (
-                  sum by (job,endpoint,To) (rate(etcd_network_peer_sent_failures_total{%(etcd_selector)s}[3m])) > 0.01
+                  sum by (job,endpoint,To) (rate(etcd_network_peer_sent_failures_total{%(etcd_selector)s}[1m])) > 0.01
                 )
               )
               > 0