Before this change, the default window for the etcdMembersDown network failure
rate function had recently been changed to 1 minute. While this helps detect an
etcd recovery more quickly, it depends on scrape intervals of <= 15s to collect
sufficient data points for the rate function (rate() needs at least two samples
inside the window, and a 1m window holds barely that many at a 30s interval). In
practice, an interval of >= 30s is more typical, which makes the rate function
less accurate.
This patch increases the window to 2m, a compromise between the original value
of 3m and the 1m window introduced with 2aa5684, which should accommodate more
typical scrape intervals.
To offset the window change and to further improve the chance that the alert
will only fire when etcd is truly dead, this patch changes the `for` clause from
3m to 10m (see the sketch after the following list). The rationale is as follows:
1. There can be significant variance in durations following a reboot before etcd
is scraped and detected as available.
2. A conservative trigger like 10m seems less likely to produce a false alarm in
the face of such variance.
3. In this alerting situation, if the outage is real, an additional 7 minutes of
delay before (for example) paging somebody is unlikely to have a significant
impact on the overall response.
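For illustration, the failure-rate arm of the rule would then look roughly like
this (a sketch following the mixin's expression shape; the label matchers and
surrounding aggregation are illustrative):

    - alert: etcdMembersDown
      expr: |
        count without (To) (
          sum without (instance) (
            rate(etcd_network_peer_sent_failures_total{job=~".*etcd.*"}[2m])
          ) > 0.01
        ) > 0
      for: 10m
      labels:
        severity: critical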
* etcd-mixin: Reformulate alerting rules to use `without` rather than `by`
With aggregations using `by`, all additional target labels that a user
might have configured are aggregated away. However, those target
labels are useful, e.g. for alert routing. With this commit, nothing
should change for vanilla job/instance target labels, but users who have
configured more target labels can now still make use of them.
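A sketch of the difference, assuming a user has configured an extra target
label such as `cluster` (hypothetical here):

    # `by` keeps only the listed labels, so a user-configured `cluster`
    # label is aggregated away:
    sum by (job) (up{job=~".*etcd.*"} == bool 0)

    # `without` drops only the listed labels; `cluster` (and any other
    # extra target label) survives and can be used for alert routing:
    sum without (instance) (up{job=~".*etcd.*"} == bool 0)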
Signed-off-by: beorn7 <beorn@grafana.com>
* etcd-mixin: Parametrize instance labels to aggregate away
Signed-off-by: beorn7 <beorn@grafana.com>
Before this change, during a reboot in which etcd recovers quickly (e.g. 1 min),
the etcdMembersDown alert tends to fire even when etcd is fully healthy, because
the averaging function can take more than 3 minutes to fall back below the 0.01
threshold.
This change reduces the possibility of such a false positive by using a shorter
(1 min) failure rate window, which averages back down below the threshold far
more quickly (within 1 min). The `for` clause of the alert still ensures that
the alert fires if the poor conditions are sustained for an unreasonable overall
time (3 min).
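A sketch of the idea, using the peer failure-rate term from the alert (label
matchers illustrative):

    # Before: a 3m window keeps a short outage above the threshold for up
    # to 3 minutes after recovery, so the alert fires on a healthy reboot:
    rate(etcd_network_peer_sent_failures_total{job=~".*etcd.*"}[3m]) > 0.01

    # After: a 1m window falls back below 0.01 within about a minute,
    # while `for: 3m` still catches sustained failures:
    rate(etcd_network_peer_sent_failures_total{job=~".*etcd.*"}[1m]) > 0.01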
A cluster with three members could see three leader changes during a
healthy rolling reboot, and we don't want to alert on that. Raising the
threshold to 4 reduces false alarms for clusters with three or fewer
members, and that's probably most clusters. It will also slightly
increase the risk of false negatives, but if the cluster is struggling
with high latency, it seems likely that it would quickly pass the new
threshold too.
The hard-coded threshold means that we are still likely to get
false positives during rolling reboots of clusters with four or more
members. Ideally we'd scale this with the cluster size, but I'm not
sure how to do that. Three members is the minimum size for high
availability, so reducing false positives for that case seems worth
addressing even if we leave larger clusters largely unchanged.
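A sketch of the thresholded expression (label matchers illustrative):

    # Three changes can occur during a healthy rolling reboot of a
    # 3-member cluster, so only alert on 4 or more within the window:
    increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m]) >= 4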
Also manually bring etcd3_alert.rules up to speed, since it seems to
have been passed over by 16fc8a2b4b (Documentation/op-guide:
Re-generate alert rules and dashboard from mixin, 2020-04-07, #11768).
This change makes the etcd package compatible with the existing Go
ecosystem for module versioning.
Used this tool to update package imports:
https://github.com/KSubedi/gomove
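For example, gomove rewrites import paths to include the major-version
suffix required by Go modules (a sketch; the clientv3 path is illustrative):

    // before: import path without a major-version suffix
    import "go.etcd.io/etcd/clientv3"

    // after: the /v3 suffix matches the versioned module path
    import "go.etcd.io/etcd/v3/clientv3"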
* Documentation: enhance description of lock and lease
* Documentation: an executable implementation of fencing
* docs: api guarantees
clean up lease grammar slightly
* docs: learning/lock/README.md improve grammar
Co-Authored-By: Steven E. Harris <seh@panix.com>
* docs: learning: improve locks and leases grammar
Co-authored-by: Brandon Philips <brandon@ifup.org>
Co-authored-by: Steven E. Harris <seh@panix.com>
These changes start at etcdctl's auth.go and stub out everything down into the internal raft. The .proto files were changed and regenerated so that the local version builds successfully.
The `etcdHighNumberOfLeaderChanges` alert had a copy-and-paste
error when it was converted from docs to mixin in #10244 - we moved
from "increase over 15m > 3" to "rate over 15m > 3", which is not
the same (rate is measured per second, so it should have been
"rate over 15m > (3 / 60 / 15)"). As part of fixing that, we need to
capture the case where Prometheus starts up, or where a new etcd
cluster is first scraped while already showing a high leader-change
count - i.e. if you start a new etcd cluster and at the moment
Prometheus first scrapes it you are already at 5 leader changes, we
should fire on that transition.
This alert is also now more responsive, so if you get a quick
burst of 3 leader changes we'll alert within 5m rather than 15m.
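The fixed rule is roughly along these lines (a sketch of the approach; the
exact expression in the mixin may differ):

    # `increase` over the subquery counts leader changes per 15m window;
    # the `or 0*absent(...)` arm makes a newly appearing series start
    # from 0, so a cluster first scraped with a high count still fires.
    - alert: etcdHighNumberOfLeaderChanges
      expr: |
        increase((max without (instance) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"})
          or 0*absent(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m]) >= 4
      for: 5m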
Etcd-backup-restore is a collection of components to back up and restore etcd. It features periodic full and incremental backups, automated restore, and validation of the etcd data directory, with support for multiple cloud providers.