# v3.5 data inconsistency postmortem
| | |
|---------|------------|
| Authors | serathius@ |
| Date | 2022-04-20 |
| Status | draft |
## Summary
| | |
|---------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Summary | A code refactor resulted in the consistent index not being saved atomically, allowing a crash to cause inconsistency with other cluster members. |
| Impact | No user reported problems in production, as triggering the issue required frequent crashes; however, the issue was critical enough to motivate a public statement. The main impact comes from losing user trust in etcd's reliability. |
## Background
Etcd v3 state is preserved on disk in two forms: a write-ahead log (WAL) and the database state (DB).
Etcd v3.5 also still maintains v2 state, however it is deprecated and not relevant to the issue in this postmortem.
The WAL stores the history of changes to etcd state, while the database represents the state at one point in time.
To know which point of history the database represents, it stores the consistent index (CI),
a special metadata field that points to the last WAL entry the database has seen.
When etcd updates the database state, it replays entries from the WAL and updates the consistent index to point to the new entry.
This operation is required to be [atomic](https://en.wikipedia.org/wiki/Atomic_commit).
A partial failure would mean that the database and WAL no longer match, so some entries would be either skipped (if only the CI was updated) or executed twice (if only the changes were applied).
This is especially important for a distributed system like etcd, where there are multiple cluster members, each applying the WAL entries to its own database.
Correctness of the system depends on the assumption that every member of the cluster, while replaying WAL entries, will reach the same state.
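
For illustration, the sketch below shows what the atomicity requirement means in practice, assuming a bbolt backend with simplified bucket and key names (not etcd's actual schema): the change carried by a WAL entry and the updated consistent index are written in a single transaction, so after a crash either both are on disk or neither is.

```go
// Minimal sketch, not etcd's actual code: apply one WAL entry and persist the
// consistent index in the same bbolt transaction so they cannot diverge.
package main

import (
	"encoding/binary"
	"log"

	bolt "go.etcd.io/bbolt"
)

func applyEntry(db *bolt.DB, index uint64, key, value []byte) error {
	return db.Update(func(tx *bolt.Tx) error {
		kv, err := tx.CreateBucketIfNotExists([]byte("key"))
		if err != nil {
			return err
		}
		// 1. Apply the change carried by the WAL entry.
		if err := kv.Put(key, value); err != nil {
			return err
		}
		meta, err := tx.CreateBucketIfNotExists([]byte("meta"))
		if err != nil {
			return err
		}
		// 2. Record the entry's WAL index in the same transaction.
		ci := make([]byte, 8)
		binary.BigEndian.PutUint64(ci, index)
		return meta.Put([]byte("consistent_index"), ci)
	})
}

func main() {
	db, err := bolt.Open("member.db", 0o600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := applyEntry(db, 42, []byte("foo"), []byte("bar")); err != nil {
		log.Fatal(err)
	}
}
```
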
## Root cause
To simplify managing the consistent index, etcd introduced backend hooks in https://github.com/etcd-io/etcd/pull/12855.
The goal was to ensure that the consistent index is always updated, by automatically triggering the update during commit.
The implementation was as follows: before applying the WAL entries, etcd updated the in-memory value of the consistent index.
As part of the transaction commit process, a database hook would read the value of the consistent index and store it in the database.
The problem is that the in-memory value of the consistent index is shared, and there might be other in-flight transactions apart from the serial WAL apply flow.
Consider the following scenario:
1. The etcd server starts an apply workflow and sets the new consistent index value in memory.
2. The periodic commit is triggered; it executes the backend hook and saves the consistent index set by the apply workflow.
3. The etcd server finishes the apply workflow, saves the new changes, and saves the same value of the consistent index again.
Between the second and third step there is a very small window where the consistent index is increased without any entries from the WAL having been applied.
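
The toy program below makes that ordering concrete. It is a hypothetical sketch, not etcd's actual code: a map stands in for the bbolt backend, and the "backend hook" is the line in `commit` that copies the shared in-memory index to disk.

```go
// Self-contained illustration of the race described above (hypothetical types).
package main

import (
	"fmt"
	"sync/atomic"
)

// store represents what has been durably committed.
type store struct {
	consistentIndex uint64
	data            map[string]string
}

// backend represents the in-memory side: a shared consistent index and
// buffered writes that are flushed on commit.
type backend struct {
	inMemoryCI atomic.Uint64
	pending    map[string]string
	disk       store
}

// commit flushes buffered writes and, via the "backend hook", persists the
// current in-memory consistent index, whatever it happens to be.
func (b *backend) commit() {
	for k, v := range b.pending {
		b.disk.data[k] = v
	}
	b.pending = map[string]string{}
	b.disk.consistentIndex = b.inMemoryCI.Load() // backend hook
}

func main() {
	b := &backend{pending: map[string]string{}, disk: store{data: map[string]string{}}}

	// 1. The apply workflow starts: it bumps the in-memory CI first.
	b.inMemoryCI.Store(42)

	// 2. The periodic commit fires before the apply has buffered its write:
	//    CI=42 is persisted although entry 42 was never applied.
	b.commit()

	// If the process is SIGKILLed here, recovery sees CI=42 and skips entry 42.
	fmt.Printf("persisted CI=%d, persisted data=%v\n", b.disk.consistentIndex, b.disk.data)

	// 3. Normally the apply finishes: it writes the change and commits again.
	b.pending["foo"] = "bar"
	b.commit()
}
```
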
## Trigger
If etcd crashed after the consistent index was saved, but before the apply workflow finished, it would lead to data inconsistency.
When recovering the data, etcd would skip executing the changes from the failed apply workflow, assuming they had already been executed.
This matches the issue reports and the code used to reproduce the issue, where the trigger was etcd crashing under high request load.
The crashes in the reports were caused by etcd running under high memory pressure, causing it to go out of memory from time to time.
The reproduction ran etcd under high stress and randomly killed one of the members using the SIGKILL signal (immediate, non-recoverable process death).
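
As a rough sketch of that reproduction loop (hypothetical member PIDs and timings, not the actual functional test code), a helper like the following repeatedly SIGKILLs a random member while a separate load generator keeps writing:

```go
// Hypothetical reproduction helper: the real setup comes from etcd's
// functional tests; PIDs and sleep intervals here are placeholders.
package main

import (
	"log"
	"math/rand"
	"syscall"
	"time"
)

func main() {
	memberPIDs := []int{1001, 1002, 1003} // placeholder PIDs of local etcd members

	for i := 0; i < 10; i++ {
		time.Sleep(time.Duration(5+rand.Intn(10)) * time.Second)
		pid := memberPIDs[rand.Intn(len(memberPIDs))]
		// SIGKILL cannot be caught, so the member dies mid-apply with some probability.
		if err := syscall.Kill(pid, syscall.SIGKILL); err != nil {
			log.Printf("kill %d: %v", pid, err)
		}
		// The harness is then expected to restart the member and later compare
		// HashKV across members to detect divergence.
	}
}
```
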
## Detection
For a single-member cluster the inconsistency is completely undetectable.
There is no mechanism or tool for verifying that the database state matches the WAL.
In a cluster with multiple members, the member that crashed will be missing the changes from the failed apply workflow.
This means it will have a different database state and will return a different hash via the `HashKV` gRPC call.
There is an automatic mechanism to detect data inconsistency.
It can be executed during etcd start via `--experimental-initial-corrupt-check` and periodically via `--experimental-corrupt-check-time`.
Both checks, however, have a flaw: they depend on the `HashKV` gRPC method, which might fail, causing the check to pass.
In a multi-member etcd cluster, each member can run with different performance and be at a different stage of applying the WAL.
Comparing database hashes between multiple etcd members requires all hashes to be calculated at the same change.
This is done by requesting the hash for the same `revision` (version of the key-value store).
However, it will not work if the provided revision is not available on the members.
This can happen on very slow members, or in cases where corruption has led revision numbers to diverge.
This means that for this issue, the corrupt check is only reliable during etcd start, just after etcd crashes.
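
For illustration, a minimal sketch of that comparison using the client's `HashKV` RPC is shown below (local endpoints assumed; this is not the built-in corrupt check). If a member cannot serve the requested revision, the call fails and the comparison is inconclusive, which is exactly the flaw described above.

```go
// Sketch: ask every member for its KV hash at the same revision and compare.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoints := []string{"127.0.0.1:2379", "127.0.0.1:22379", "127.0.0.1:32379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Pick a revision to compare at; here, the current revision returned by a read.
	resp, err := cli.Get(ctx, "health-check-key")
	if err != nil {
		log.Fatal(err)
	}
	rev := resp.Header.Revision

	hashes := map[string]uint32{}
	for _, ep := range endpoints {
		h, err := cli.Maintenance.HashKV(ctx, ep, rev)
		if err != nil {
			// For example, the revision was compacted away or the member lags behind.
			log.Printf("HashKV %s at rev %d failed: %v", ep, rev, err)
			continue
		}
		hashes[ep] = h.Hash
	}
	// Differing hashes at the same revision indicate divergence between members.
	log.Printf("hashes at revision %d: %v", rev, hashes)
}
```
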
## Impact
We are not aware of any cases of users reporting data corruption in a production environment.
However, the issue was critical enough to motivate a public statement.
The main impact comes from losing user trust in etcd's reliability.
## Lessons learned
### What went well
* Multiple maintainers were able to work effectively on reproducing and fixing the issue. As they are in different timezones, there was always someone working on the issue.
### What went wrong
* No users enable data corruption detection, as it is still an experimental feature (introduced in v3.3). All reported cases were detected manually, making the issue almost impossible to reproduce.
* Etcd has functional tests designed to detect such problems, however they are unmaintained, flaky, and missing crucial scenarios.
* The etcd v3.5 release was not qualified as well as previous ones. Older maintainers ran a manual qualification process that is no longer known or executed.
* Etcd apply code is so complicated that fixing the data inconsistency took almost 2 weeks and multiple tries (). The fix needed to be so complicated that we needed to develop automatic validation for it (https://github.com/etcd-io/etcd/pull/13885).
* When fixing the main data inconsistency, we found multiple other edge cases that could lead to data corruption (https://github.com/etcd-io/etcd/issues/13514, https://github.com/etcd-io/etcd/issues/13922, https://github.com/etcd-io/etcd/issues/13937).
### Where we got lucky
* We reproduced the issue using the etcd functional tests only because of an unusual partition setup on a workstation. Functional tests store etcd data under `/tmp`, which is usually mounted as an in-memory filesystem. The problem was reproduced only because one of the maintainers has `/tmp` mounted on a standard disk.
## Action items
Action items should directly address the items listed in lessons learned.
We should double down on things that went well, fix things that went wrong, and stop depending on luck.
Actions fall under three types, and we should have at least one item per type. Types:
* Prevent - Prevent similar issues from occurring. In this case, what testing we should introduce to find data inconsistency issues before release, so that we avoid publishing a broken release.
* Detect - Be more effective in detecting when similar issues occur. In this case, improve the mechanism for detecting data inconsistency so that users are automatically informed.
* Mitigate - Reduce time to recovery for users. In this case, how we ensure that users are able to quickly fix data inconsistency.
Actions should not be restricted to fixing the immediate issues; they should also propose long-term strategic improvements.
To reflect this, action items have an assigned priority:
* P0 - Critical for the reliability of the v3.5 release. Should be prioritized over all other work and backported to v3.5.
* P1 - Important for long term success of the project. Blocks v3.6 release.
* P2 - Stretch goals that would be nice to have for v3.6, however should not be blocking.
| Action Item | Type | Priority | Bug |
|-------------------------------------------------------------------------------------|----------|----------|----------------------------------------------|
| Etcd testing can reproduce historical data inconsistency issues | Prevent | P0 | |
| Etcd detects data corruption by default | Detect | P0 | |
| Etcd testing is high quality, easy to maintain and expand | Prevent | P1 | https://github.com/etcd-io/etcd/issues/13637 |
| Etcd apply code should be easy to understand and validate correctness | Prevent | P1 | |
| Critical etcd features are not abandoned when contributors move on | Prevent | P1 | https://github.com/etcd-io/etcd/issues/13775 |
| Etcd is continuously qualified with failure injection | Prevent | P1 | |
| Etcd can reliably detect data corruption | Detect | P1 | |
| Etcd can immediately detect and recover from data corruption (implement Merkle root) | Mitigate | P2 | https://github.com/etcd-io/etcd/issues/13839 |
## Timeline
| Date | Event |
|------------|-----------------------------------------------------------------------------------------------------------------------|
| 2021-05-08 | Pull request that caused data corruption was merged - https://github.com/etcd-io/etcd/pull/12855 |
| 2021-06-16 | Release v3.5.0 with data corruption was published - https://github.com/etcd-io/etcd/releases/tag/v3.5.0 |
| 2021-12-01 | Report of data corruption - https://github.com/etcd-io/etcd/issues/13514 |
| 2022-01-28 | Report of data corruption - https://github.com/etcd-io/etcd/issues/13654 |
| 2022-03-08 | Report of data corruption - https://github.com/etcd-io/etcd/issues/13766 |
| 2022-03-25 | Corruption confirmed by one of the maintainers - https://github.com/etcd-io/etcd/issues/13766#issuecomment-1078897588 |
| | Statement about the corruption was sent to etcd-dev@googlegroups.com and dev@kubernetes.io |
| | Release v3.5.3 with fix was published - https://github.com/etcd-io/etcd/releases/tag/v3.5.3 |