diff --git a/Documentation/contributor-guide/storage.md b/Documentation/contributor-guide/storage.md
new file mode 100644
index 000000000..39a3d2a66
--- /dev/null
+++ b/Documentation/contributor-guide/storage.md
@@ -0,0 +1,749 @@
+# Etcd persistent storage files
+
+Last updated: etcd-v3.5 (2023-03-15)
+
+
+## Purpose
+
+This document explains the etcd persistent storage format: file naming, file content, and the tools that allow developers to inspect the files. Going forward, the document should be extended as the storage model changes. It is targeted at etcd developers, to help with their data recovery needs.
+
+
+## Prerequisites
+
+The following articles provide helpful background information for this document:
+
+* Etcd data model overview: https://etcd.io/docs/v3.4/learning/data_model
+* Raft overview: http://web.stanford.edu/~ouster/cgi-bin/papers/Raft-atc14.pdf (especially the "5.3 Log replication" section).
+
+
+## Overview
+
+### Long-lived files
+
+
File name | +High level purpose | +
./member/snap/db | +bbolt b+tree that stores all the applied data, membership and authorization information, and metadata. It records the index of the last applied WAL entry ("consistent_index"). + | +
+./member/snap/0000000000000002-0000000000049425.snap +./member/snap/0000000000000002-0000000000061ace.snap + |
+ Periodic snapshots of legacy v2 store, containing:
+
|
+
/member/snap/000000000007a178.snap.db |
+ A complete bbolt snapshot downloaded from the etcd leader if the replica was lagging too much.
+ +Has the same type of content as (./member/snap/db) file. + +The file is used in 2 scenarios: +
|
+
+./member/wal/000000000000000f-00000000000b38c7.wal +./member/wal/000000000000000e-00000000000a7fe3.wal +./member/wal/000000000000000d-000000000009c70c.wal + |
+ Raft’s Write Ahead Logs, containing recent transactions accepted by Raft, periodic snapshots or CRC records. + +The most recent `--max-wals` files (5 in this example) are preserved. Each of these files is about `64*10^6` bytes. A file is cut once it exceeds this hardcoded size, so files might slightly exceed it (which is why the preallocated `0.tmp` does not offer full protection against running out of disk space). + +If snapshots are too infrequent, more than `--max-wals` files can remain, because file-system level locks protect the files from being deleted too early. + | +
./member/wal/0.tmp (or .../1.tmp)+ |
+ Preallocated space for the next write ahead log file. + +Used to avoid Raft being stuck by a lack of WAL logs capacity without the possibility to raise an alarm. + | +
File + | +High level purpose + | +
./member/snap/ \ +0000000000000002-000000000007a178.snap.broken \ + + | +Snapshot files are renamed as ‘broken’ when they cannot be loaded:
+ +TODO: etcdserver/api/snap/snapshotter.go:148 + +The attempt to load the newest file happens when etcd is being started: + +TODO: etcdserver/server.go:428 \ +Or during backup/migrate commands of etcdctl. + |
+
/member/snap/tmp071677638 (random suffix) + | +Temporary (bbolt) file created on a replica in response to the leader's msgSnap request, that is, the leader's demand that the replica recover its storage from the given snapshot.
+
+After successful (complete) retrieval of content the file is renamed to: \
+/member/snap/ |
+
/member/snap/db.tmp.071677638 (random suffix) + | +A temporary file that contains a copy of the backend content (/member/snap/db), during the process of defragmentation. After the successful process the file is renamed to /member/snap/db, replacing the original backend.
+ + On etcd server startup these files get pruned. + |
+
bucket + | +key + | +Exemplar value + | +description + | +
alarm + | +rpcpb.Alarm:
+{MemberID, Alarm: NONE|NOSPACE|CORRUPT}
+ |
+ nil + | +Indicates problems have been diagnosed in one of the members. + | +
auth + | +"authRevision" + | +""(empty) or \
+BigEndian.PutUint64
+ |
+ Any change of Roles or Users increments this field on transaction commit.
+ +The value is used only for optimistic locking during the authorization process. + |
+
authRoles + | +[roleName] as string + | +authpb.Role marshalled
+ |
+ + | +
authUsers + | +[userName] as string + | +authpb.User marshalled
+ |
+ + | +
cluster + | +"clusterVersion" + | +"3.5.0" (string) + | +minor version of consensus-agreed common storage version. + | +
“downgrade” + | +Json: { \
+“target-version”: "3.4.0" \
+“enabled”: true/false
+ +} + |
+ Persists the intent configured by the most recent Downgrade RPC request.
++Since v3.5 + |
+ |
key + | +[revisionId]
+encoded using + +bytesToRev{main,sub} + +The key-value deletes are marshalled with ‘t’ at the end (as a “Tombstone”) + |
+ mvccpb.KeyValue marshalled proto (key, create_rev, mod_rev, version, value, lease id )
+ |
+ + | +
lease + | ++ | +leasepb.Lease marshalled proto (ID, TTL, RemainingTTL)
+ |
+ Note: LeaseCheckpoint updates only RemainingTTL; TTL keeps the value from the original Grant.
+ +Note 2: TTLs are persisted in seconds, relative to an undefined ‘now’. As a consequence, a crash-looping server does not release leases. + |
+
members + | +[memberId] in hex as string: \ + \ +"8e9e05c52164694d" + | +json as string serialized Member structure: \ +{"id":10276657743932975437, \ +"peerURLs":[ \ + "http://localhost:2380"], \ +"name":"default", \ +"clientURLs": \ + ["http://localhost:2379"]} + | +Agreed cluster membership information. + | +
members \ +_removed + | +[memberId] in hex as string: \ + \ +"8e9e05c52164694d" + | +[]byte("removed")
+ |
+ Ids of all removed members. \
+ \
+Used to validate that a removed member is never added again under the same id.
+ +The field is currently (3.4) read from store V2 and never from V3. See https://github.com/etcd-io/etcd/pull/12820 + |
+
meta + | +"consistent_index" + | +uint64 bytes (BigEndian) + | +Represents the offset of the last applied WAL entry to the bolt DB storage. + | +
"scheduledCompactRev" + | +bytesToRev{main,sub} encoded. (17 bytes: 8-byte main, `_` separator, 8-byte sub) + | +Used to reinitialize compaction if a crash happened after a compaction request. + | +|
"finishedCompactRev" + | +bytesToRev{main,sub} encoded. (17 bytes: 8-byte main, `_` separator, 8-byte sub) + | +Revision at which the store was most recently successfully compacted (https://github.com/etcd-io/etcd/blob/ae7862e8bc8007eb396099db4e0e04ac026c8df5/server/mvcc/kvstore_compaction.go#L54) + | +|
+ | +“confState” + | ++ | +Since etcd 3.5 + | +
+ | +“term” + | ++ | +Since etcd 3.5 + | +
+ | +“storage-version” + | ++ | ++ | +
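Several values in the bucket table are raw binary and are hard to read in a hex dump. The Python sketch below decodes three of them: the big-endian `consistent_index`, a revision key from the `key` bucket, and a `members` key. It assumes the revision layout used by `mvcc/revision.go` in etcd 3.5 (8-byte big-endian main, a `_` separator byte, 8-byte big-endian sub, plus a trailing `t` for tombstones). That layout and all helper names are assumptions for illustration, not an etcd API.

```python
import struct

# Assumed revision key layout: 8-byte main + b"_" + 8-byte sub (+ b"t" if deleted).
REV_BYTES_LEN = 17


def decode_uint64_be(raw):
    """Decode an 8-byte big-endian unsigned integer (e.g. consistent_index)."""
    (value,) = struct.unpack(">Q", raw)
    return value


def decode_revision_key(raw):
    """Split a revision key into (main, sub, is_tombstone)."""
    if len(raw) not in (REV_BYTES_LEN, REV_BYTES_LEN + 1):
        raise ValueError("unexpected revision key length: %d" % len(raw))
    if raw[8:9] != b"_":
        raise ValueError("missing '_' separator between main and sub")
    is_tombstone = len(raw) == REV_BYTES_LEN + 1
    if is_tombstone and raw[17:18] != b"t":
        raise ValueError("unexpected trailing byte, expected tombstone marker 't'")
    return decode_uint64_be(raw[0:8]), decode_uint64_be(raw[9:17]), is_tombstone


def member_id_from_key(key):
    """Convert a hex member key such as "8e9e05c52164694d" to the numeric id."""
    return int(key, 16)


if __name__ == "__main__":
    print(decode_uint64_be(bytes.fromhex("0000000000049425")))
    print(decode_revision_key(bytes.fromhex("0000000000000005") + b"_" + bytes(8)))
    print(member_id_from_key("8e9e05c52164694d"))
```

The last call links the hex key in the `members` bucket to the decimal `id` stored in the JSON value of the same row.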
+Store v2 content:
+    * /0/members/8e9e05c52164694d/attributes -> "{\"name\":\"default\",\"clientURLs\":[\"http://localhost:2379\"]}"
+ * /0/members/8e9e05c52164694d/RaftAttributes -> "{\"peerURLs\":[\"http://localhost:2380\"]}"
+* Storage version: /0/version-> 3.5.0
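The paths listed above can be read out of a decoded store v2 JSON tree by walking its `Path`, `Value` and `Children` fields. The Python sketch below shows one way to do that. It assumes the node shape of the example JSON in this document (leaf nodes have `Children` set to null); the helper names and the small `SAMPLE_ROOT` fixture are hypothetical.

```python
import json


def iter_store_v2_leaves(node):
    """Yield (path, value) for every leaf node (Children is null), depth-first."""
    children = node.get("Children")
    if children is None:
        yield node["Path"], node["Value"]
        return
    for name in sorted(children):
        yield from iter_store_v2_leaves(children[name])


def list_members(root):
    """Map member id -> merged 'attributes' and 'RaftAttributes' JSON values."""
    members = {}
    for path, value in iter_store_v2_leaves(root):
        parts = path.strip("/").split("/")
        if len(parts) == 4 and parts[:2] == ["0", "members"]:
            members.setdefault(parts[2], {}).update(json.loads(value))
    return members


def _leaf(path, value):
    return {"Path": path, "Value": value, "Children": None}


def _dir(path, children):
    return {"Path": path, "Value": "", "Children": children}


# Reduced fixture mirroring the example content of a *.snap file.
MEMBER = "/0/members/8e9e05c52164694d"
SAMPLE_ROOT = _dir("/", {
    "0": _dir("/0", {
        "members": _dir("/0/members", {
            "8e9e05c52164694d": _dir(MEMBER, {
                "attributes": _leaf(
                    MEMBER + "/attributes",
                    '{"name":"default","clientURLs":["http://localhost:2379"]}'),
                "RaftAttributes": _leaf(
                    MEMBER + "/RaftAttributes",
                    '{"peerURLs":["http://localhost:2380"]}'),
            }),
        }),
        "version": _leaf("/0/version", "3.5.0"),
    }),
    "1": _dir("/1", {}),
})

if __name__ == "__main__":
    for path, value in iter_store_v2_leaves(SAMPLE_ROOT):
        print(path, "->", value)
    print(list_members(SAMPLE_ROOT))
```

For a real snapshot, the same functions would be applied to the `Root` object of the decoded JSON, for example `json.load(f)["Root"]`.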
+
+
+### Tools
+
+
+#### protoc
+
+The following command prints the decoded file content when executed from the etcd root directory:
+
+
+```
+# Decode a v2 snapshot file with protoc; run from the etcd repository root.
+cat default.etcd/member/snap/0000000000000002-0000000000049425.snap |
+  protoc --decode=snappb.snapshot \
+    server/etcdserver/api/snap/snappb/snap.proto \
+    -I $(go list -f '{{.Dir}}' github.com/gogo/protobuf/proto)/.. \
+    -I . \
+    -I $(go list -m -f '{{.Dir}}' github.com/gogo/protobuf)/protobuf
+```
+
+
+Analogously, you can extract the `data` field and decode it as [`raftpb.Snapshot`](https://github.com/etcd-io/etcd/blob/ad5b30297a43daeb5ce7311fa606ce4c1f16618f/raft/raftpb/raft.proto#L31).
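When `protoc` is not at hand, the `data` field can also be pulled out with a small protobuf wire-format reader. The Python sketch below is a generic reader for top-level varint and length-delimited fields, not the method used by etcd itself. It assumes that `snap.proto` declares `crc` as field 1 and `data` as field 2; check the `.proto` file before relying on those numbers. The function names are hypothetical.

```python
def read_varint(buf, pos):
    """Decode a protobuf varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_top_level_fields(buf):
    """Return {field_number: value} for varint and length-delimited fields."""
    fields = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint (e.g. uint32)
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (bytes or nested message)
            length, pos = read_varint(buf, pos)
            if pos + length > len(buf):
                raise ValueError("truncated length-delimited field")
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError("unsupported wire type %d" % wire_type)
        fields[number] = value
    return fields


if __name__ == "__main__":
    # Hand-built message: field 1 = 150 (varint), field 2 = b"abc" (bytes).
    print(parse_top_level_fields(b"\x08\x96\x01\x12\x03abc"))
```

Under the stated assumption, `parse_top_level_fields(open(path, "rb").read()).get(2)` would return the serialized snapshot bytes, which can be fed to the same reader to look at the nested fields.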
+
+
+### Example JSON-serialized store v2 content in etcd 3.4 `*.snap` files
+
+
+```
+{
+ "Root":{
+ "Path":"/",
+ "CreatedIndex":0,
+ "ModifiedIndex":0,
+ "ExpireTime":"0001-01-01T00:00:00Z",
+ "Value":"",
+ "Children":{
+ "0":{
+ "Path":"/0",
+ "CreatedIndex":0,
+ "ModifiedIndex":0,
+ "ExpireTime":"0001-01-01T00:00:00Z",
+ "Value":"",
+ "Children":{
+ "members":{
+ "Path":"/0/members",
+ "CreatedIndex":1,
+ "ModifiedIndex":1,
+ "ExpireTime":"0001-01-01T00:00:00Z",
+ "Value":"",
+ "Children":{
+ "8e9e05c52164694d":{
+ "Path":"/0/members/8e9e05c52164694d",
+ "CreatedIndex":1,
+ "ModifiedIndex":1,
+ "ExpireTime":"0001-01-01T00:00:00Z",
+ "Value":"",
+ "Children":{
+ "attributes":{
+ "Path":"/0/members/8e9e05c52164694d/attributes",
+ "CreatedIndex":2,
+ "ModifiedIndex":2,
+ "ExpireTime":"0001-01-01T00:00:00Z",
+ "Value":"{\"name\":\"default\",\"clientURLs\":[\"http://localhost:2379\"]}",
+ "Children":null
+ },
+ "RaftAttributes":{
+ "Path":"/0/members/8e9e05c52164694d/RaftAttributes",
+ "CreatedIndex":1,
+ "ModifiedIndex":1,
+ "ExpireTime":"0001-01-01T00:00:00Z",
+ "Value":"{\"peerURLs\":[\"http://localhost:2380\"]}",
+ "Children":null
+ }
+ }
+ }
+ }
+ },
+ "version":{
+ "Path":"/0/version",
+ "CreatedIndex":3,
+ "ModifiedIndex":3,
+ "ExpireTime":"0001-01-01T00:00:00Z",
+ "Value":"3.5.0",
+ "Children":null
+ }
+ }
+ },
+ "1":{
+ "Path":"/1",
+ "CreatedIndex":0,
+ "ModifiedIndex":0,
+ "ExpireTime":"0001-01-01T00:00:00Z",
+ "Value":"",
+ "Children":{
+
+
+ }
+ }
+ }
+ },
+ "WatcherHub":{
+ "EventHistory":{
+ "Queue":{
+ "Events":[
+ {
+ "action":"create",
+ "node":{
+ "key":"/0/members/8e9e05c52164694d/RaftAttributes",
+ "value":"{\"peerURLs\":[\"http://localhost:2380\"]}",
+ "modifiedIndex":1,
+ "createdIndex":1
+ }
+ },
+ {
+ "action":"set",
+ "node":{
+ "key":"/0/members/8e9e05c52164694d/attributes",
+ "value":"{\"name\":\"default\",\"clientURLs\":[\"http://localhost:2379\"]}",
+ "modifiedIndex":2,
+ "createdIndex":2
+ }
+ },
+ {
+ "action":"set",
+ "node":{
+ "key":"/0/version",
+ "value":"3.5.0",
+ "modifiedIndex":3,
+ "createdIndex":3
+ }
+ }
+ ]
+ }
+ }
+ }
+}
+```
+
+
+## Notes
+
+[^1]:
+ The metadata pages at the beginning of the bbolt file are modified in-place.
+
+[^2]:
+
+    Inconsistent, as the majority of unsigned integers are written big-endian.
+
+[^3]:
+    The initial (index:0) snapshot at the beginning of the WAL is not associated with any `*.snap` file. Also, old `*.snap` files (or WAL files) might get purged.
+
+
+## Changes
+
+This section is reserved for describing changes to the file formats introduced
+between different etcd versions.