Commit Graph

1378 Commits (8c44d25f2a38be42f3e9005104cf2e3b688036ce)

Author SHA1 Message Date
Dan Mace cd3df73944 Documentation: Further improve etcdMembersDown alert
The default window for the etcdMembersDown network failure rate function was
recently changed to 1 minute. While this helps detect an etcd
recovery more quickly, it depends on scrape intervals of <= 15s to collect
sufficient data points for the rate function. In practice, an interval of >= 30s
is more typical, which causes the rate function to be less accurate.

This patch increases the window to 2m, which is a compromise between the
original value of 3m and the 1m change introduced with 2aa5684, and should
accommodate more typical scrape intervals.

To offset the window change and to further improve the chance that the alert
will only fire when etcd is truly dead, this patch changes the `for` clause from
3m to 10m. The rationale is as follows:

1. There can be significant variance in durations following a reboot before etcd
is scraped and detected as available.

2. A conservative trigger like 10m seems less likely to produce a false alarm in
the face of such variance.

3. In this alerting situation, if the outage is real, it seems unlikely that an
additional 7 minutes of delay before (for example) paging somebody will make a
significant impact on the overall response.
2020-07-31 09:26:46 -04:00
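The scrape-interval reasoning in cd3df73944 can be checked with a little arithmetic. The sketch below is illustrative only: `samples_in_window` is a hypothetical helper, and it uses integer division as a lower bound on how many scrapes fall inside a range window. The only outside fact relied on is that Prometheus `rate()` needs at least two samples in the window.

```python
def samples_in_window(window_s: int, scrape_interval_s: int) -> int:
    """Lower bound on the number of scrapes that land inside a range window."""
    return window_s // scrape_interval_s


# rate() cannot compute a slope from fewer than two samples.
MIN_SAMPLES_FOR_RATE = 2

# Windows discussed in the commit message: 1m, 2m (new), 3m (original).
for window_s in (60, 120, 180):
    for interval_s in (15, 30):
        n = samples_in_window(window_s, interval_s)
        margin = n - MIN_SAMPLES_FOR_RATE
        print(f"window={window_s}s interval={interval_s}s "
              f"-> at least {n} samples (margin {margin})")

# The `for` clause change from 3m to 10m adds this much delay before firing.
extra_delay_min = 10 - 3
print(f"extra delay before the alert fires: {extra_delay_min} minutes")
```

With a 30s scrape interval, a 1m window leaves no margin above the two-sample minimum, while a 2m window gives the same sample count that a 1m window gets at 15s.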
Boqin Qin 9006d8d4f9
Documentation/learning/lock/client: Add defer Unlock (#11802) 2020-07-26 11:22:19 -07:00
Björn Rabenstein c9a5889915
Documentation/etcd-mixin: Reformulate alerting rules to use `without` rather than `by` (#12122)
* etcd-mixin: Reformulate alerting rules to use `without` rather than `by`

With aggregations using `by`, all additional target labels that a user
might have configured, are aggregated away. However, those target
labels are useful for e.g. alert routing. With this commit, nothing
should change for vanilla job/instance target labels, but whoever has
more target labels can now still make use of them.

Signed-off-by: beorn7 <beorn@grafana.com>

* etcd-mixin: Parametrize instance labels to aggregate away

Signed-off-by: beorn7 <beorn@grafana.com>
2020-07-23 16:02:26 -07:00
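The difference between `by` and `without` described in c9a5889915 can be illustrated with a small simulation. This is a sketch of the label semantics only, not of how Prometheus implements aggregation. The helper names and the `team` target label are hypothetical.

```python
from collections import defaultdict


def aggregate_by(series, keep):
    """Like `sum by (...)`: the result keeps only the listed labels."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep))
        out[key] += value
    return dict(out)


def aggregate_without(series, drop):
    """Like `sum without (...)`: the result keeps every label except the listed ones."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[key] += value
    return dict(out)


# Two etcd members that carry an extra, user-configured target label ("team").
series = [
    ({"job": "etcd", "instance": "10.0.0.1:2379", "team": "storage"}, 1.0),
    ({"job": "etcd", "instance": "10.0.0.2:2379", "team": "storage"}, 2.0),
]

by_job = aggregate_by(series, keep={"job"})
without_instance = aggregate_without(series, drop={"instance"})

print("by (job):          ", by_job)
print("without (instance):", without_instance)
```

In this example the `by (job)` result loses the `team` label, so an alert built on it could not be routed by team. The `without (instance)` result sums the same values but still carries `team`.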
Sahdev Zala ef866a6d8b
Merge pull request #11943 from mitake/bcrypt-in-api
auth, etcdserver: hash password in the API layer
2020-07-20 10:52:24 -04:00
Hitoshi Mitake 2c41d9960b Documentation: describe the change of WAL entries related to auth 2020-07-14 00:15:19 +09:00
Hitoshi Mitake 5a3da48cdf auth, etcdserver: hash password in the API layer 2020-07-14 00:15:19 +09:00
Dan Mace 2aa5684ada Documentation: Tweak etcdMembersDown to reduce false negatives
Before this change, during a reboot in which etcd recovers quickly (e.g. 1 min),
the etcdMembersDown alert tends to fire even when etcd is fully healthy because
the averaging function can take more than 3 minutes to average back down below
the 0.01 threshold.

This change tries to reduce the possibility of a false negative by considering a
shorter (1 min) failure rate window which tends to average down below the
threshold far more quickly (within 1 min). The `for` clause of the alert should
ensure that the alert still fires if the poor conditions are sustained for an
unreasonable overall time (3 min).
2020-07-13 08:58:21 -04:00
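The effect of the window length in 2aa5684ada can be sketched with a toy model. The assumptions are loud ones: a hypothetical failure counter that grows by one per second during a 60-second outage, a simplified rate computed as the counter difference divided by the window (no scrape sampling or extrapolation), and a 15-second evaluation step. Only the 0.01 threshold and the 1m and 3m windows come from the commit message.

```python
THRESHOLD = 0.01   # failures per second, from the commit message
OUTAGE_S = 60      # hypothetical: etcd is unreachable for one minute
STEP_S = 15        # hypothetical evaluation step


def failures_total(t):
    """Hypothetical counter: one failure per second while 0 <= t < OUTAGE_S."""
    return min(max(t, 0), OUTAGE_S)


def windowed_rate(t, window_s):
    """Simplified rate: counter growth over the window divided by its length."""
    return (failures_total(t) - failures_total(t - window_s)) / window_s


def seconds_until_clear(window_s):
    """Time after recovery until the windowed rate drops below THRESHOLD."""
    t = OUTAGE_S
    while windowed_rate(t, window_s) >= THRESHOLD:
        t += STEP_S
    return t - OUTAGE_S


for window_s in (60, 180):
    print(f"{window_s}s window clears {seconds_until_clear(window_s)}s after recovery")
```

In this model the rate stays above the threshold until the whole outage has slid out of the window, so the shorter window clears sooner after etcd recovers.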
W. Trevor King 4160b8396d Documentation/op-guide: Drop old alert_rules
Frederic says [1]:

> Side note, we can probably remove the old alerting syntax rules,
> Prometheus has removed this syntax >2.5 years ago.

[1]: https://github.com/etcd-io/etcd/pull/12080#issuecomment-649982787
2020-07-08 09:37:34 -07:00
Sam Batschelet 429826b467
Merge pull request #12080 from wking/raise-etcd-leader-changes-to-four
Documentation/etcd-mixin: Raise etcdHighNumberOfLeaderChanges threshold to 4
2020-07-08 08:37:50 -04:00
Hitoshi Mitake e582d7dc80 Documentation: refine the description about password strength 2020-06-29 23:40:44 +09:00
W. Trevor King 0c5cffc60b Documentation/etcd-mixin: Raise etcdHighNumberOfLeaderChanges threshold to 4
A cluster with three members could see three leader changes during a
healthy rolling reboot, and we don't want to alert on that.  Growing
to 4 reduces false-alarms for clusters with three or fewer members,
and that's probably most clusters.  It will also slightly increase the
risk of false-negatives, but if the cluster is struggling with high
latency, it seems likely that it would quickly pass the new threshold
too.

The hard-coded threshold means that we are still likely to get
false-positives during rolling reboots of clusters with four or more
members.  Ideally we'd scale this with the cluster size, or something,
but I'm not sure how to do that.  Three members is the minimum size
for high availability, so reducing false positives for that case seems
worth addressing even if we leave larger clusters largely unchanged.

Also manually catch etcd3_alert.rules up to speed, since it seems to
have been passed over by 16fc8a2b4b (Documentation/op-guide:
Re-generate alert rules and dashboard from mixin, 2020-04-07, #11768).
2020-06-25 15:38:15 -07:00
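The threshold reasoning in 0c5cffc60b can be written out as a small check. This is a sketch under two assumptions: a healthy rolling reboot causes at most one leader change per member (the commit message states this for three members), and the rule fires when the count is greater than or equal to the threshold. Both function names are hypothetical.

```python
def max_leader_changes_in_rolling_reboot(members: int) -> int:
    """Assumed worst case for a healthy rolling reboot: one change per member."""
    return members


def alert_fires(leader_changes: int, threshold: int) -> bool:
    """Assumption: the rule compares the observed count with >= threshold."""
    return leader_changes >= threshold


OLD_THRESHOLD = 3
NEW_THRESHOLD = 4

for members in (1, 3, 5):
    changes = max_leader_changes_in_rolling_reboot(members)
    print(f"{members} members: old rule fires={alert_fires(changes, OLD_THRESHOLD)}, "
          f"new rule fires={alert_fires(changes, NEW_THRESHOLD)}")
```

Under these assumptions a three-member cluster no longer trips the alert during a rolling reboot, while a five-member cluster still can, which matches the limitation the commit message acknowledges.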
Xiang Li beb5614aad
doc: add TLS related warnings (#12060) 2020-06-23 21:07:36 -07:00
CFC4N 10cdabe721 CHANGELOG: update for https://github.com/etcd-io/etcd/pull/11980 , https://github.com/etcd-io/etcd/pull/11986 , https://github.com/etcd-io/etcd/pull/11987 . 2020-06-21 00:00:41 +08:00
Hitoshi Mitake c13415c581 Documentation: note on data encryption 2020-06-15 00:29:32 +09:00
hwdef e3014072ba Documentation: fix broken links 2020-06-12 09:51:33 +08:00
Ankur Gargi b0d2edfc68
Documentation: Added Recover cluster from minority failure (#11988) 2020-06-10 14:36:44 -07:00
shawwang f3deba09b4 CHANGELOG: update 3.2 changelog and 3.3 upgrade document for #11691 2020-06-09 11:39:46 +08:00
Hitoshi Mitake 356f647866 Documentation: note on the policy of insecure by default 2020-05-05 22:44:24 +09:00
Brandon Philips 1044a8b07c
Merge pull request #11768 from brancz/uid
Use UID instead of ID in Grafana dashboard
2020-04-29 05:35:06 -07:00
Brandon Philips d88d765ba4 Documentation, CHANGELOG: use new go.etcd.io/etcd/v3 pkg
Use the new package path in the docs and announce it in the CHANGELOG
2020-04-28 22:02:19 +00:00
Brandon Philips 96cce208c2 go.mod: use go.etcd.io/etcd/v3 versioning
This change makes the etcd package compatible with the existing Go
ecosystem for module versioning.

Used this tool to update package imports:
  https://github.com/KSubedi/gomove
2020-04-28 00:57:35 +00:00
Hitoshi Mitake 2369cb3678
Documentation: note on password strength (#11796) 2020-04-22 15:50:29 -07:00
Frederic Branczyk 16fc8a2b4b
Documentation/op-guide: Re-generate alert rules and dashboard from mixin 2020-04-07 18:15:02 +02:00
Frederic Branczyk 2c4877064e
Documentation/etcd-mixin: Use etcd_mvcc_db_total_size_in_bytes metric 2020-04-07 18:14:23 +02:00
Frederic Branczyk 68c5f6066f
Documentation/etcd-mixin: Set unique UID for Grafana dashboard 2020-04-07 18:13:41 +02:00
tangcong a18cbc8a22
Documentation: add note for #11689 (#11759) 2020-04-06 10:16:47 -07:00
jingyih 0344b70906 *: make MemberList linearizable
- Add linearizable field to etcdserverpb.MemberListRequest.
- Change behavior of clientv3 MemberList API. Now it is served with
linearizable guarantee.
2020-03-25 20:16:20 -07:00
yoyinzyc d8b9b54348 etcdserver: add downgrade rpc proto api. 2020-03-20 17:37:26 -07:00
Hitoshi Mitake 6d1982efe8
Merge pull request #11659 from wswcfan/add-auth-revision-status
etcdserver: add auth revision to AuthStatus to improve observability and testability
2020-03-09 23:08:55 +09:00
Xiang Li e0ff5ca318
RFC Documentation: enhance description of lock and lease (#11490)
* Documentation: enhance description of lock and lease

* Documentation: an executable implementation of fencing

* docs: api guarantees

cleanup lease grammar slightly

* docs: learning/lock/README.md improve grammar

Co-Authored-By: Steven E. Harris <seh@panix.com>

* docs: learning: improve locks and leases grammar

Co-authored-by: Brandon Philips <brandon@ifup.org>
Co-authored-by: Steven E. Harris <seh@panix.com>
2020-03-05 10:31:47 -08:00
shawwang 15eeb2c4ae etcdserver: add auth revision to AuthStatus to improve observability and testability 2020-03-04 22:37:24 +08:00
shawwang c6fce8c320 Documentation: generate *.swagger.json using latest protoc-gen-swagger 2020-03-04 22:36:13 +08:00
Jacky Wu 4e5314e9b5
doc: remove outdated introduction video link. (#11601)
It's easy to find etcd introduction video, and the introduction video
from the rfc doc is outdated, so removing this link.

Fixes 11591.
2020-02-07 20:49:05 -08:00
Vern Burton 071e70cdc4
*: add a new API and command for checking auth status (#11536)
These changes start at etcdctl, under auth.go, and stub out everything down into the internal raft. The .proto files were changed and regenerated so that the local version builds successfully.
2020-02-05 19:27:42 -08:00
Luc Perkins b9d00aae7c
Documentation: add section headings to integrations doc (#11573)
Signed-off-by: lucperkins <lucperkins@gmail.com>
2020-01-31 17:02:08 -08:00
Sahdev Zala 3898452b54
Merge pull request #11412 from lucperkins/lperkins/docs-restructuring-v2
Restructure documentation source files
2020-01-27 18:15:56 -05:00
Jingyi Hu 342c2464ae Documentation: specify starting revision (#11559) 2020-01-27 10:18:27 -08:00
Hitoshi Mitake 23810ea285 Documentation: unify the explanation of isolation level and consistency (#11474) 2020-01-27 10:17:38 -08:00
lucperkins 1be2f4b8e2 Documentation: Restructure directory to accommodate new site generation system
Signed-off-by: lucperkins <lucperkins@gmail.com>
2020-01-21 14:29:54 -08:00
Sahdev P. Zala 0cfadaaaeb doc: update required go version for master
changelog and readme are already updated.
2020-01-16 16:05:46 -05:00
Gabi Davar 8223006a97
Documentation: added v3.4.x metrics docs 2019-12-15 14:13:36 +02:00
Clayton Coleman 322c38e169 Documentation/etcd-mixin: Fix etcdHighNumberOfLeaderChanges (#11448)
The `etcdHighNumberOfLeaderChanges` alert had a copy and paste
error when it was converted from docs to mixin in 10244 - we moved
from "increase over 15m > 3" to "rate over 15m > 3" which is not
the same (rate is measured per second, so it should have been
"rate over 15m > (3 / 60 / 15)").  As part of fixing that, we
need to capture when prometheus starts or when new etcd clusters
are captured with a high leader change - i.e. if you start a new
etcd cluster and at the moment prometheus first scrapes you are
already at 5 leader changes, we should fire on that transition.

This alert is also now more responsive, so if you get a quick
burst of 3 leader changes we'll alert within 5m rather than 15m.
2019-12-13 16:00:11 -08:00
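The unit mismatch described in 322c38e169 is easy to verify. The sketch below only reproduces the arithmetic from the commit message: an increase of 3 over 15 minutes, expressed as a per-second rate. The function name is hypothetical.

```python
WINDOW_S = 15 * 60  # the 15m window from the commit message, in seconds


def increase_to_rate(increase: float, window_s: int = WINDOW_S) -> float:
    """Convert a total increase over the window into a per-second rate."""
    return increase / window_s


# "increase over 15m > 3" corresponds to this per-second rate threshold.
equivalent_rate = increase_to_rate(3)
print(f"equivalent per-second threshold: {equivalent_rate:.6f}")

# The mistaken "rate over 15m > 3" would instead require this many
# leader changes inside the same 15-minute window.
changes_needed = 3 * WINDOW_S
print(f"leader changes needed for rate > 3: more than {changes_needed}")
```

The first value equals the `3 / 60 / 15` given in the commit message. The second shows why the copied rule could effectively never fire.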
Tamas Geschitz a0571166bc
feat: changed ETCD manager URL
It now points to our domain instead of the Github page.
2019-11-19 22:17:29 +01:00
Sahdev P. Zala d185a54cb4 doc: update file ref path
Update the adopters file path.
2019-10-17 20:34:24 -04:00
Sahdev P. Zala d73e04efd9 doc: move production users to a standard ADOPTERS file
The details of production users fit better in the standard
ADOPTERS file as used by many other CNCF projects like
CoreDNS, containerd etc.
2019-10-17 18:36:28 -04:00
宇慕 f62ea1ceca *: promote the boltdb-freelistType from experimental to official and set default type to hashmap 2019-10-17 15:40:38 +08:00
Sahdev P. Zala 9002c1951f doc: add lease time
The current lease time is short and can lead to a confusing timeout
error, as explained in the related issue.

Fixes #9726
2019-10-13 16:38:28 -04:00
Swapnil Mhamane e5aecf8678 Documentation: Add gardener/etcd-backup-restore to the tools list
Etcd-backup-restore is a collection of components to back up and restore etcd. It features periodic full and incremental backups, automated restore, and validation of the etcd data directory, with support for multiple cloud providers.
2019-10-10 21:18:41 +05:30
Jingyi Hu 20acacdea5 doc: clarify metrics flag 2019-09-24 15:27:46 -07:00
Sahdev Zala 93ae5d2f5b
Merge pull request #11095 from KeepCaim/master
Documentation:fix clerical error
2019-08-30 09:54:38 -04:00