From 6a3e8d4bdeb879beb5c8a251b65e93e033cfa7a9 Mon Sep 17 00:00:00 2001 From: Xiang Li Date: Thu, 10 Mar 2016 12:17:34 -0800 Subject: [PATCH] doc: add v3 data model --- Documentation/api_v3.md | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/Documentation/api_v3.md b/Documentation/api_v3.md index 01308834e..21bc61c9b 100644 --- a/Documentation/api_v3.md +++ b/Documentation/api_v3.md @@ -2,6 +2,28 @@ TODO: API doc +## Data Model + +etcd is designed to reliably store infrequently updated data and provide reliable watch queries. etcd exposes previous versions of key-value pairs to support inexpensive snapshots and watch history events (“time travel queries”). A persistent, multi-version, concurrency-control data model is a good fit for these use cases. + +etcd stores data in a multiversion [persistent][persistent-ds] key-value store. The persistent key-value store preserves the previous version of a key-value pair when its value is superseded with new data. The key-value store is effectively immutable; its operations do not update the structure in-place, but instead always generates a new updated structure. All past versions of keys are still accessible and watchable after modification. To prevent the data store from growing indefinitely over time from maintaining old versions, the store may be compacted to shed the oldest versions of superseded data. + +### Logical View + +The store’s logical view is a flat binary key space. The key space has a lexically sorted index on byte string keys so range queries are inexpensive. + +The key space maintains multiple revisions. Each atomic mutative operation (e.g., a transaction operation may contain multiple operations) creates a new revision on the key space. All data held by previous revisions remains unchanged. Old versions of key can still be accessed through previous revisions. Likewise, revisions are indexed as well; ranging over revisions with watchers is efficient. If the store is compacted to recover space, revisions before the compact revision will be removed. + +A key’s lifetime spans a generation. Each key may have one or multiple generations. Creating a key increments the generation of that key, starting at 1 if the key never existed. Deleting a key generates a key tombstone, concluding the key’s current generation. Each modification of a key creates a new version of the key. Once a compaction happens, any generation ended before the given revision will be removed and values set before the compaction revision except the latest one will be removed. + +### Physical View + +etcd stores the physical data as key-value pairs in a persistent [b+tree][b+tree]. Each revision of the store’s state only contains the delta from its previous revision to be efficient. A single revision may correspond to multiple keys in the tree. + +The key of key-value pair is a 3-tuple (major, sub, type). Major is the store revision holding the key. Sub differentiates among keys within the same revision. Type is an optional suffix for special value (e.g., `t` if the value contains a tombstone). The value of the key-value pair contains the modification from previous revision, thus one delta from previous revision. The b+tree is ordered by key in lexical byte-order. Ranged lookups over revision deltas are fast; this enables quickly finding modifications from one specific revision to another. Compaction removes out-of-date keys-value pairs. + +etcd also keeps a secondary in-memory [btree][btree] index to speed up range queries over keys. The keys in the btree index are the keys of the store exposed to user. The value is a pointer to the modification of the persistent b+tree. Compaction removes dead pointers. + ## KV API Guarantees etcd is a consistent and durable key value store with mini-transaction(TODO: link to txn doc when we have it) support. The key value store is exposed through the KV APIs. etcd tries to ensure the strongest consistency and durability guarantees for a distributed system. This specification enumerates the KV API guarantees made by etcd. @@ -61,6 +83,9 @@ etcd does not ensure linearizability for watch operations. Users are expected to etcd ensures linearizability for all other operations by default. Linearizability comes with a cost, however, because linearized requests must go through the Raft consensus process. To obtain lower latencies and higher throughput for read requests, clients can configure a request’s consistency mode to `serializable`, which may access stale data with respect to quorum, but removes the performance penalty of linearized accesses' reliance on live consensus. +[persistent-ds]: [https://en.wikipedia.org/wiki/Persistent_data_structure] +[btree]: [https://en.wikipedia.org/wiki/B-tree] +[b+tree]: [https://en.wikipedia.org/wiki/B%2B_tree] [seq_consistency]: [https://en.wikipedia.org/wiki/Consistency_model#Sequential_consistency] [strict_consistency]: [https://en.wikipedia.org/wiki/Consistency_model#Strict_consistency] [serializable_isolation]: [https://en.wikipedia.org/wiki/Isolation_(database_systems)#Serializable]