Documentation: Update postmortem based on feedback from @ptabor

Marek Siarkowicz 2022-04-22 18:26:31 +02:00
parent e9bc382c82
commit 7fe1bf52d6
1 changed file with 20 additions and 16 deletions


| | |
|---------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Summary | Code refactor in v3.5.0 resulted in the consistent index not being saved atomically. An independent crash could lead to committed transactions not being reflected on all the members. |
| Impact | No user reported problems in production, as triggering the issue required frequent crashes; however, the issue was critical enough to motivate a public statement. The main impact comes from losing user trust in etcd reliability. |
## Background
So if we imagine the following scenario:
2. The periodic commit is triggered, and it executes the backend hook and saves the consistent index from the apply workflow.
3. The etcd server finishes the apply workflow, saves the new changes, and saves the same value of the consistent index again.
Between the second and third point there is a very small window where the consistent index is increased without applying the corresponding entry from the WAL.
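To make the window concrete, here is a deliberately simplified Go sketch of that interleaving (the names `backend`, `periodicCommit`, `persistedIndex`, and the batching model are illustrative, not the actual etcd code):

```go
package main

import "fmt"

// backend is a simplified stand-in for etcd's bbolt-backed storage, which
// batches writes and commits them periodically together with the consistent index.
type backend struct {
	consistentIndex uint64   // in-memory index, bumped by the apply workflow
	persistedIndex  uint64   // index value that made it to disk
	persistedData   []uint64 // entries whose changes made it to disk
	pendingData     []uint64 // staged changes that are not yet committed
}

// periodicCommit models the timed backend commit hook: it flushes the staged
// changes together with whatever consistent index is current at that moment.
func (b *backend) periodicCommit() {
	b.persistedIndex = b.consistentIndex
	b.persistedData = append(b.persistedData, b.pendingData...)
	b.pendingData = nil
}

func main() {
	b := &backend{consistentIndex: 41, persistedIndex: 41}

	// Apply workflow for entry 42: the in-memory consistent index is bumped
	// first, before the entry's changes are staged in the backend.
	b.consistentIndex = 42

	// The periodic commit fires inside the window and persists the new index,
	// even though entry 42's changes were never staged.
	b.periodicCommit()

	// A crash at this point is the problem: on restart, recovery sees
	// persistedIndex == 42 and skips replaying entry 42 from the WAL, so its
	// changes are lost on this member while other members still apply them.
	fmt.Println("persisted index:", b.persistedIndex) // 42
	fmt.Println("persisted data:", b.persistedData)   // [] – entry 42 is missing
}
```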
## Trigger
If etcd crashed after the consistent index was saved, but before the apply workflow finished, then when recovering the data etcd would skip executing the changes from the failed apply workflow, assuming they had already been executed.
This matches the issue reports and the code used to reproduce the issue, where the trigger was etcd crashing under high request load.
etcd v3.5.0 was released with a bug (https://github.com/etcd-io/etcd/pull/13505) that could cause etcd to crash; it was fixed in v3.5.1.
Apart from that, all reports described etcd running under high memory pressure, causing it to go out of memory from time to time.
The reproduction ran etcd under high stress and randomly killed one of the members using the SIGKILL signal (immediate, non-recoverable process death).
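A minimal Go sketch of that reproduction loop, assuming the member processes and their PIDs are managed externally; the placeholder PIDs, the 10-second period, and the `killRandomMember` helper are illustrative, not the maintainers' actual harness:

```go
package main

import (
	"log"
	"math/rand"
	"syscall"
	"time"
)

// killRandomMember sends SIGKILL (immediate, non-recoverable process death)
// to one randomly chosen etcd member process.
func killRandomMember(pids []int) {
	pid := pids[rand.Intn(len(pids))]
	if err := syscall.Kill(pid, syscall.SIGKILL); err != nil {
		log.Printf("kill %d: %v", pid, err)
		return
	}
	log.Printf("sent SIGKILL to member pid %d", pid)
}

func main() {
	memberPIDs := []int{11001, 11002, 11003} // placeholder PIDs of the 3 members

	// While a separate load generator keeps the cluster under high write
	// traffic, crash a random member on a fixed period and let an external
	// supervisor restart it; repeat until the members' hashes diverge.
	for i := 0; i < 100; i++ {
		time.Sleep(10 * time.Second)
		killRandomMember(memberPIDs)
	}
}
```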
## Detection
### What went well
* Multiple maintainers were able to work effectively on reproducing and fixing the issue. As they are in different timezones, there was always someone working on the issue.
* While fixing the main data inconsistency, we found multiple other edge cases that could lead to data corruption (https://github.com/etcd-io/etcd/issues/13514, https://github.com/etcd-io/etcd/issues/13922, https://github.com/etcd-io/etcd/issues/13937).
### What went wrong
* No users enable data corruption detection, as it is still an experimental feature introduced in v3.3. All reported cases were detected manually, making them almost impossible to reproduce.
* etcd has functional tests designed to detect such problems; however, they are unmaintained, flaky, and missing crucial scenarios.
* The etcd v3.5 release was not qualified as comprehensively as previous ones. Older maintainers ran a manual qualification process that is no longer known or executed.
* The etcd apply code is so complicated that fixing the data inconsistency took almost 2 weeks and multiple tries. The fix was complicated enough that we needed to develop automatic validation for it (https://github.com/etcd-io/etcd/pull/13885).
* etcd v3.5 was recommended for production without enough insight into production adoption. The production-ready recommendation was based on some internal feedback... the intent was to get diverse usage, but users hold off until someone else discovers issues.
### Where we got lucky
To reflect this, action items have an assigned priority:
* P1 - Important for the long-term success of the project. Blocks the v3.6 release.
* P2 - Stretch goals that would be nice to have for v3.6, but should not be blocking.
| Action Item | Type | Priority | Bug |
|-------------------------------------------------------------------------------------|----------|----------|-------------------------------------------------|
| etcd testing can reproduce historical data inconsistency issues | Prevent | P0 | |
| etcd detects data corruption by default | Detect | P0 | |
| etcd testing is high quality, easy to maintain and expand | Prevent | P1 | https://github.com/etcd-io/etcd/issues/13637 |
| etcd apply code should be easy to understand and validate correctness | Prevent | P1 | |
| Critical etcd features are not abandoned when contributors move on | Prevent | P1 | https://github.com/etcd-io/etcd/issues/13775 |
| etcd is continuously qualified with failure injection | Prevent | P1 | |
| etcd can reliably detect data corruption (hash is linearizable) | Detect | P1 | |
| etcd checks consistency of snapshots sent between leader and followers | Detect | P1 | https://github.com/etcd-io/etcd/issues/13973 |
| etcd procedures for recovery from data inconsistency are documented and tested       | Mitigate | P1       |                                                  |
| etcd can immediately detect and recover from data corruption (implement Merkle root) | Mitigate | P2       | https://github.com/etcd-io/etcd/issues/13839     |
## Timeline