Documentation: Lowercase etcd in postmortem

Marek Siarkowicz 2022-04-22 11:41:03 +02:00
parent b97f28c908
commit e9bc382c82
1 changed file with 15 additions and 15 deletions


@@ -15,14 +15,14 @@
 ## Background
-Etcd v3 state is preserved on disk in two forms write ahead log (WAL) and database state (DB).
-Etcd v3.5 also still maintains v2 state, however it's deprecated and not relevant to the issue in this postmortem.
+etcd v3 state is preserved on disk in two forms write ahead log (WAL) and database state (DB).
+etcd v3.5 also still maintains v2 state, however it's deprecated and not relevant to the issue in this postmortem.
 WAL stores history of changes for etcd state and database represents state at one point.
 To know which point of history database is representing, it stores consistent index (CI).
 It's a special metadata field that points to last entry in WAL that it has seen.
-When Etcd is updating database state, it replays entries from WAL and updates the consistent index to point to new entry.
+When etcd is updating database state, it replays entries from WAL and updates the consistent index to point to new entry.
 This operation is required to be [atomic](https://en.wikipedia.org/wiki/Atomic_commit).
 A partial fail would mean that database and WAL would no longer match, so some entries would be either skipped (if only CI is updated) or executed twice (if only changes are applied).
 This is especially important for distributed system like etcd, where there are multiple cluster members, each applying the WAL entries to their database.
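Note on the hunk above: the requirement that changes and the consistent index move together can be illustrated with a small Go sketch. This is a toy model, not etcd's apply code; `Entry`, `DB`, and `Apply` are hypothetical names introduced only for illustration, and the "commit" is simulated by swapping in a staged copy.

```go
package main

import "fmt"

// Entry is a hypothetical WAL entry: an index plus a delta to add to a key.
type Entry struct {
	Index uint64
	Key   string
	Delta int
}

// DB is a toy database: user data plus the consistent index (CI).
type DB struct {
	Data            map[string]int
	ConsistentIndex uint64
}

// Apply replays WAL entries newer than the CI. Changes and the new CI are
// staged on a copy and published together, so both become visible or neither.
func (db *DB) Apply(wal []Entry) {
	staged := make(map[string]int, len(db.Data))
	for k, v := range db.Data {
		staged[k] = v
	}
	ci := db.ConsistentIndex
	for _, e := range wal {
		if e.Index <= ci {
			continue // already reflected in the database
		}
		staged[e.Key] += e.Delta
		ci = e.Index
	}
	// "Commit": data and CI are swapped in as one unit.
	db.Data, db.ConsistentIndex = staged, ci
}

func main() {
	db := &DB{Data: map[string]int{}}
	wal := []Entry{{1, "a", 1}, {2, "a", 1}}
	db.Apply(wal)
	db.Apply(wal) // replaying is harmless because the CI matches the data
	fmt.Println(db.Data["a"], db.ConsistentIndex)
}
```

In this model, replaying the same WAL twice leaves the data unchanged because the stored CI always describes exactly what the data contains. If only one of the two were published, the second replay would either skip or double-apply entries, which is the failure mode the paragraph describes.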
@@ -37,9 +37,9 @@ As part of transaction commit process, a database hook would read the value of c
 Problem is that in memory value of consistent index is shared, and there might be other in flight transactions apart from serial WAL apply flow.
 So if we imagine scenario:
-1. Etcd server starts an apply workflow, and it just sets a new consistent index value.
+1. etcd server starts an apply workflow, and it just sets a new consistent index value.
 2. The periodic commit is triggered, and it executes the backend hook and saves consistent index from apply workflow.
-3. Etcd server finished an apply workflow, saves new changes and saves same value of consistent index again.
+3. etcd server finished an apply workflow, saves new changes and saves same value of consistent index again.
 Between second and third point there is a very small window where consistent index is increased without applying any entries from WAL.
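Note on the hunk above: the three-step scenario can be replayed deterministically in a small Go sketch. It is a simplified model under stated assumptions, not etcd's backend; `server`, `backend`, `commit`, and `crashAfterPeriodicCommit` are hypothetical names, and the "disk" is just a pair of fields.

```go
package main

import "fmt"

// backend models persisted state: applied entry indexes and the stored CI.
type backend struct {
	pending  []uint64 // entries written to the open transaction
	diskData []uint64 // entries that reached disk
	diskCI   uint64   // consistent index that reached disk
}

// server holds the shared in-memory consistent index read by the commit hook.
type server struct {
	memCI uint64
	be    backend
}

// commit flushes pending entries; the hook then saves the shared in-memory CI.
func (s *server) commit() {
	s.be.diskData = append(s.be.diskData, s.be.pending...)
	s.be.pending = nil
	s.be.diskCI = s.memCI // hook: reads whatever value is in memory right now
}

// crashAfterPeriodicCommit runs steps 1 and 2 of the scenario and stops
// before step 3, returning what a restarted member would find on disk.
func crashAfterPeriodicCommit() (diskCI uint64, entriesOnDisk int) {
	s := server{memCI: 4, be: backend{diskData: []uint64{1, 2, 3, 4}, diskCI: 4}}

	s.memCI = 5 // step 1: apply workflow sets the new consistent index
	s.commit()  // step 2: periodic commit saves CI=5 with no entry 5 on disk

	// Step 3 (never reached): s.be.pending = append(s.be.pending, 5); s.commit()
	return s.be.diskCI, len(s.be.diskData)
}

func main() {
	ci, n := crashAfterPeriodicCommit()
	fmt.Printf("disk CI=%d, entries on disk=%d\n", ci, n)
}
```

In this model a crash between steps 2 and 3 leaves a stored CI of 5 with only four entries on disk, so a replay from the WAL would skip entry 5 on this member while other members apply it.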
@@ -88,9 +88,9 @@ Main impact comes from loosing user trust into etcd reliability.
 ### What went wrong
 * No users enable data corruption detection as it is still an experimental feature introduced in v3.3. All reported cases where detected manually making it almost impossible to reproduce.
-* Etcd has functional tests designed to detect such problems, however they are unmaintained, flaky and are missing crucial scenarios.
-* Etcd v3.5 release was not qualified as well as previous ones. Older maintainers run manual qualification process that is no longer known or executed.
-* Etcd apply code is so complicated that fixing the data inconsistency took almost 2 weeks and multiple tries (). Fix needed to be so complicated that we needed to develop automatic validation for it (https://github.com/etcd-io/etcd/pull/13885).
+* etcd has functional tests designed to detect such problems, however they are unmaintained, flaky and are missing crucial scenarios.
+* etcd v3.5 release was not qualified as well as previous ones. Older maintainers run manual qualification process that is no longer known or executed.
+* etcd apply code is so complicated that fixing the data inconsistency took almost 2 weeks and multiple tries (). Fix needed to be so complicated that we needed to develop automatic validation for it (https://github.com/etcd-io/etcd/pull/13885).
 * When fixing the main data inconsistency we have found multiple other edge cases that could lead to data corruption (https://github.com/etcd-io/etcd/issues/13514, https://github.com/etcd-io/etcd/issues/13922, https://github.com/etcd-io/etcd/issues/13937).
 ### Where we got lucky
@@ -115,14 +115,14 @@ To reflect this action items should have assigned priority:
 | Action Item | Type | Priority | Bug |
 |-------------------------------------------------------------------------------------|----------|----------|----------------------------------------------|
-| Etcd testing can reproduce historical data inconsistency issues | Prevent | P0 | |
-| Etcd detects data corruption by default | Detect | P0 | |
-| Etcd testing is high quality, easy to maintain and expand | Prevent | P1 | https://github.com/etcd-io/etcd/issues/13637 |
-| Etcd apply code should be easy to understand and validate correctness | Prevent | P1 | |
+| etcd testing can reproduce historical data inconsistency issues | Prevent | P0 | |
+| etcd detects data corruption by default | Detect | P0 | |
+| etcd testing is high quality, easy to maintain and expand | Prevent | P1 | https://github.com/etcd-io/etcd/issues/13637 |
+| etcd apply code should be easy to understand and validate correctness | Prevent | P1 | |
 | Critical etcd features are not abandoned when contributors move on | Prevent | P1 | https://github.com/etcd-io/etcd/issues/13775 |
-| Etcd is continuously qualified with failure injection | Prevent | P1 | |
-| Etcd can reliably detect data corruption | Detect | P1 | |
-| Etcd can imminently detect and recover from data corruption (implement Merkle root) | Mitigate | P2 | https://github.com/etcd-io/etcd/issues/13839 |
+| etcd is continuously qualified with failure injection | Prevent | P1 | |
+| etcd can reliably detect data corruption | Detect | P1 | |
+| etcd can imminently detect and recover from data corruption (implement Merkle root) | Mitigate | P2 | https://github.com/etcd-io/etcd/issues/13839 |
 ## Timeline