RFC Documentation: enhance description of lock and lease (#11490)

* Documentation: enhance description of lock and lease

* Documentation: an executable implementation of fencing

* docs: api guarantees

cleanup lease grammar slightly

* docs: learning/lock/README.md improve grammar

Co-Authored-By: Steven E. Harris <seh@panix.com>

* docs: learning: improve locks and leases grammar

Co-authored-by: Brandon Philips <brandon@ifup.org>
Co-authored-by: Steven E. Harris <seh@panix.com>
release-3.5
Xiang Li 2020-03-05 10:31:47 -08:00 committed by GitHub
parent 6f850a65a1
commit e0ff5ca318
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
7 changed files with 410 additions and 1 deletions

View File

@ -14,6 +14,10 @@ etcd is a consistent and durable key value store with [mini-transaction][txn] su
* delete
* Combination (read-modify-write) APIs
* txn
* Lease APIs
* grant
* revoke
* put (attaching a lease to a key)
### etcd specific definitions
@ -49,6 +53,15 @@ etcd does not ensure linearizability for watch operations. Users are expected to
etcd ensures linearizability for all other operations by default. Linearizability comes with a cost, however, because linearized requests must go through the Raft consensus process. To obtain lower latencies and higher throughput for read requests, clients can configure a requests consistency mode to `serializable`, which may access stale data with respect to quorum, but removes the performance penalty of linearized accesses' reliance on live consensus.
### Granting, attaching and revoking leases
etcd provides [a lease mechanism][lease]. The primary use case of a lease is implementing distributed coordination mechanisms like distributed locks. The lease mechanism itself is simple: a lease can be created with the grant API, attached to a key with the put API, revoked with the revoke API, and will be expired by the wall clock time to live (TTL). However, users need to be aware about [the important properties of the APIs and usage][why] for implementing correct distributed coordination mechanisms.
[txn]: api.md#transactions
[linearizability]: https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf
[strict_serializability]: http://jepsen.io/consistency/models/strict-serializable
[serializable_isolation]: https://en.wikipedia.org/wiki/Isolation_(database_systems)#Serializable
[Linearizability]: #Linearizability
[lease]: https://web.stanford.edu/class/cs240/readings/89-leases.pdf
[why]: why.md#Notes

View File

@ -0,0 +1,61 @@
# What is this?
This directory provides an executable example of the scenarios described in [the article by Martin Kleppmann][fencing].
Generally speaking, a lease-based lock service cannot provide mutual exclusion to processes. This is because such a lease mechanism depends on the physical clock of both the lock service and client processes. Many factors (e.g. stop-the-world GC pause of a language runtime) can cause false expiration of a granted lease as depicted in the below figure: ![unsafe lock][unsafe-lock]
As discussed in [notes on the usage of lock and lease][why.md], such a problem can be solved with a technique called version number validation or fencing tokens. With this technique a shared resource (storage in the figures) needs to validate requests from clients based on their tokens like this: ![fencing tokens][fencing-tokens]
This directory contains two programs: `client` and `storage`. With `etcd`, you can reproduce the expired lease problem of distributed locking and a simple example solution of the validation technique which can avoid incorrect access from a client with an expired lease.
`storage` works as a very simple key value in-memory store which is accessible through HTTP and a custom JSON protocol. `client` works as client processes which tries to write a key/value to `storage` with coordination of etcd locking.
## How to build
For building `client` and `storage`, just execute `go build` in each directory.
## How to try
At first you need to start an etcd cluster, which works as lock service in the figures. On top of the etcd source directory, execute commands like below:
```
$ ./build # build etcd
$ goreman start
```
Then run `storage` command in `storage` directory:
```
$ ./storage
```
Now client processes ("Client 1" and "Client 2" in the figures) can be started. At first, execute below command for starting a client process which corresponds to "Client 1":
```
$ GODEBUG=gcstoptheworld=2 ./client 1
```
It will show an output like this:
```
client 1 starts
creted etcd client
acquired lock, version: 1029195466614598192
took 6.771998255s for allocation, took 36.217205ms for GC
emulated stop the world GC, make sure the /lock/* key disappeared and hit any key after executing client 2:
```
The process causes stop the world GC pause for making lease expiration intentionally and waits a keyboard input. Now another client process can be started like this:
```
$ ./client 2
client 2 starts
creted etcd client
acquired lock, version: 4703569812595502727
this is client 2, continuing
```
If things go well the second client process invoked as `./client 2` finishes soon. It successfully writes a key to `storage` process. After checking this, please hit any key for `./client 1` and resume the process. It will show an output like below:
```
resuming client 1
failed to write to storage: error: given version (4703569812595502721) differ from the existing version (4703569812595502727)
```
### Notes on the parameters related to stop the world GC pause
`client` program includes two constant values: `nrGarbageObjects` and `sessionTTL`. These parameters are configured for causing lease expiration with stop the world GC pause of go runtime. They heavily rely on resources of a machine for executing the example. If lease expiration doesn't happen on your machine, update these parameters and try again.
[why.md]: ../why.md#Notes-on-the-usage-of-lock-and-lease
[fencing]: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
[unsafe-lock]: https://martin.kleppmann.com/2016/02/unsafe-lock.png
[fencing-tokens]: https://martin.kleppmann.com/2016/02/fencing-tokens.png

View File

@ -0,0 +1 @@
client

View File

@ -0,0 +1,205 @@
// Copyright 2020 The etcd Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// An example distributed locking with fencing in the case of etcd
// Based on https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
// Important usage:
// If you are invoking this program as client 1, you need to configure GODEBUG env var like below:
// GODEBUG=gcstoptheworld=2 ./client 1
package main
import (
"bufio"
"bytes"
"encoding/json"
"fmt"
"go.etcd.io/etcd/clientv3"
"go.etcd.io/etcd/clientv3/concurrency"
"io/ioutil"
"net/http"
"os"
"runtime"
"strconv"
"time"
)
type node struct {
next *node
}
const (
// These const values might be need adjustment.
nrGarbageObjects = 100 * 1000 * 1000
sessionTTL = 1
)
func stopTheWorld() {
n := new(node)
root := n
allocStart := time.Now()
for i := 0; i < nrGarbageObjects; i++ {
n.next = new(node)
n = n.next
}
func(n *node) {}(root) // dummy usage of root for removing a compiler error
root = nil
allocDur := time.Since(allocStart)
gcStart := time.Now()
runtime.GC()
gcDur := time.Since(gcStart)
fmt.Printf("took %v for allocation, took %v for GC\n", allocDur, gcDur)
}
type request struct {
Op string `json:"op"`
Key string `json:"key"`
Val string `json:"val"`
Version int64 `json:"version"`
}
type response struct {
Val string `json:"val"`
Version int64 `json:"version"`
Err string `json:"err"`
}
func write(key string, value string, version int64) error {
req := request{
Op: "write",
Key: key,
Val: value,
Version: version,
}
reqBytes, err := json.Marshal(&req)
if err != nil {
fmt.Printf("failed to marshal request: %s\n", err)
os.Exit(1)
}
httpResp, err := http.Post("http://localhost:8080", "application/json", bytes.NewReader(reqBytes))
if err != nil {
fmt.Printf("failed to send a request to storage: %s\n", err)
os.Exit(1)
}
respBytes, err := ioutil.ReadAll(httpResp.Body)
if err != nil {
fmt.Printf("failed to read request body: %s\n", err)
os.Exit(1)
}
resp := new(response)
err = json.Unmarshal(respBytes, resp)
if err != nil {
fmt.Printf("failed to unmarshal response json: %s\n", err)
os.Exit(1)
}
if resp.Err != "" {
return fmt.Errorf("error: %s", resp.Err)
}
return nil
}
func read(key string) (string, int64) {
req := request{
Op: "read",
Key: key,
}
reqBytes, err := json.Marshal(&req)
if err != nil {
fmt.Printf("failed to marshal request: %s\n", err)
os.Exit(1)
}
httpResp, err := http.Post("http://localhost:8080", "application/json", bytes.NewReader(reqBytes))
if err != nil {
fmt.Printf("failed to send a request to storage: %s\n", err)
os.Exit(1)
}
respBytes, err := ioutil.ReadAll(httpResp.Body)
if err != nil {
fmt.Printf("failed to read request body: %s\n", err)
os.Exit(1)
}
resp := new(response)
err = json.Unmarshal(respBytes, resp)
if err != nil {
fmt.Printf("failed to unmarshal response json: %s\n", err)
os.Exit(1)
}
return resp.Val, resp.Version
}
func main() {
if len(os.Args) != 2 {
fmt.Printf("usage: %s <1 or 2>\n", os.Args[0])
return
}
mode, err := strconv.Atoi(os.Args[1])
if err != nil || mode != 1 && mode != 2 {
fmt.Printf("mode should be 1 or 2 (given value is %s)\n", os.Args[1])
return
}
fmt.Printf("client %d starts\n", mode)
client, err := clientv3.New(clientv3.Config{
Endpoints: []string{"http://127.0.0.1:2379", "http://127.0.0.1:22379", "http://127.0.0.1:32379"},
})
if err != nil {
fmt.Printf("failed to create an etcd client: %s\n", err)
os.Exit(1)
}
fmt.Printf("creted etcd client\n")
session, err := concurrency.NewSession(client, concurrency.WithTTL(sessionTTL))
if err != nil {
fmt.Printf("failed to create a session: %s\n", err)
os.Exit(1)
}
locker := concurrency.NewLocker(session, "/lock")
locker.Lock()
version := session.Lease()
fmt.Printf("acquired lock, version: %d\n", version)
if mode == 1 {
stopTheWorld()
fmt.Printf("emulated stop the world GC, make sure the /lock/* key disappeared and hit any key after executing client 2: ")
reader := bufio.NewReader(os.Stdin)
reader.ReadByte()
fmt.Printf("resuming client 1\n")
} else {
fmt.Printf("this is client 2, continuing\n")
}
err = write("key0", fmt.Sprintf("value from client %d", mode), int64(version))
if err != nil {
fmt.Printf("failed to write to storage: %s\n", err) // client 1 should show this message
} else {
fmt.Printf("successfully write a key to storage\n")
}
}

View File

@ -0,0 +1 @@
storage

View File

@ -0,0 +1,101 @@
// Copyright 2020 The etcd Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package main
import (
"encoding/json"
"fmt"
"io/ioutil"
"net/http"
"os"
"strings"
)
type value struct {
val string
version int64
}
var data = make(map[string]*value)
type request struct {
Op string `json:"op"`
Key string `json:"key"`
Val string `json:"val"`
Version int64 `json:"version"`
}
type response struct {
Val string `json:"val"`
Version int64 `json:"version"`
Err string `json:"err"`
}
func writeResponse(resp response, w http.ResponseWriter) {
wBytes, err := json.Marshal(resp)
if err != nil {
fmt.Printf("failed to marshal json: %s\n", err)
os.Exit(1)
}
_, err = w.Write(wBytes)
if err != nil {
fmt.Printf("failed to write a response: %s\n", err)
os.Exit(1)
}
}
func handler(w http.ResponseWriter, r *http.Request) {
rBytes, err := ioutil.ReadAll(r.Body)
if err != nil {
fmt.Printf("failed to read http request: %s\n", err)
os.Exit(1)
}
var req request
err = json.Unmarshal(rBytes, &req)
if err != nil {
fmt.Printf("failed to unmarshal json: %s\n", err)
os.Exit(1)
}
if strings.Compare(req.Op, "read") == 0 {
if val, ok := data[req.Key]; ok {
writeResponse(response{val.val, val.version, ""}, w)
} else {
writeResponse(response{"", -1, "key not found"}, w)
}
} else if strings.Compare(req.Op, "write") == 0 {
if val, ok := data[req.Key]; ok {
if req.Version != val.version {
writeResponse(response{"", -1, fmt.Sprintf("given version (%d) is different from the existing version (%d)", req.Version, val.version)}, w)
} else {
data[req.Key].val = req.Val
data[req.Key].version = req.Version
writeResponse(response{req.Val, req.Version, ""}, w)
}
} else {
data[req.Key] = &value{req.Val, req.Version}
writeResponse(response{req.Val, req.Version, ""}, w)
}
} else {
fmt.Printf("unknown op: %s\n", req.Op)
return
}
}
func main() {
http.HandleFunc("/", handler)
http.ListenAndServe(":8080", nil)
}

View File

@ -71,12 +71,32 @@ If an application reasons primarily about metadata or metadata ordering, such as
## Using etcd for distributed coordination
etcd has distributed coordination primitives such as event watches, leases, elections, and distributed shared locks out of the box. These primitives are both maintained and supported by the etcd developers; leaving these primitives to external libraries shirks the responsibility of developing foundational distributed software, essentially leaving the system incomplete. NewSQL databases usually expect these distributed coordination primitives to be authored by third parties. Likewise, ZooKeeper famously has a separate and independent [library][curator] of coordination recipes. Consul, which provides a native locking API, goes so far as to apologize that its “[not a bulletproof method][consul-bulletproof]”.
etcd has distributed coordination primitives such as event watches, leases, elections, and distributed shared locks out of the box (Note that in the case of the distributed shared lock, users need to be aware about its non obvious properties. The details are described below). These primitives are both maintained and supported by the etcd developers; leaving these primitives to external libraries shirks the responsibility of developing foundational distributed software, essentially leaving the system incomplete. NewSQL databases usually expect these distributed coordination primitives to be authored by third parties. Likewise, ZooKeeper famously has a separate and independent [library][curator] of coordination recipes. Consul, which provides a native locking API, goes so far as to apologize that its “[not a bulletproof method][consul-bulletproof]”.
In theory, its possible to build these primitives atop any storage systems providing strong consistency. However, the algorithms tend to be subtle; it is easy to develop a locking algorithm that appears to work, only to suddenly break due to thundering herd and timing skew. Furthermore, other primitives supported by etcd, such as transactional memory depend on etcds MVCC data model; simple strong consistency is not enough.
For distributed coordination, choosing etcd can help prevent operational headaches and save engineering effort.
### Notes on the usage of lock and lease
etcd provides [lock APIs][etcd-v3lock] which are based on [the lease mechanism][lease] and [its implementation in etcd][etcdlease]. The basic idea of the lease mechanism is: a server grants a token, which is called a lease, to a requesting client. When the server grants a lease, it associates a TTL with the lease. When the server detects the passage of time longer than the TTL, it revokes the lease. While the client holds a non revoked lease it can claim that it owns access to a resource associated with the lease. In the case of etcd, the resource is a key in the etcd keyspace. etcd provides lock APIs with this scheme. However, the lock APIs cannot be used as mutual exclusion mechanism by themselves. The APIs are called lock because [for historical reasons][chubby]. The lock APIs can, however, be used as an optimization mechanism of mutual exclusion as described below.
The most important aspect of the lease mechanism is that TTL is defined as a physical time interval. Both of the server and client measures passing of time with their own clocks. It allows a situation that the server revokes the lease but the client still claims it owns the lease.
Then how does the lease mechanism guarantees mutual exclusion of the locking mechanism? Actually, the lease mechanism itself doesn't guarantee mutual exclusion. Owning a lease cannot guarantee the owner holds a lock of the resource.
In the case of controlling mutual accesses to keys of etcd itself with etcd lock, mutual exclusion is implemented based on the mechanism of version number validation (it is sometimes called compare and swap in other systems like Consul). In etcd's RPCs like `Put` or `Txn`, we can specify required conditions about revision number and lease ID for the operations. If the conditions are not satisfied, the operation can fail. With this mechanism, etcd provides distributed locking for clients. It means that a client knows that it is acquiring a lock of a key when its requests are completed by etcd cluster successfully.
In distributed locking literature similar designs are described:
* In [the paper of Chubby][chubby], the concept of *sequencer* is introduced. We interpret that sequencer is an almost same to the combination of revision number and lease ID of etcd.
* In [How to do distributed locking][fencing], Martin Kleppmann introduced the idea of *fencing token*. The authors interpret that fencing token is revision number in the case of etcd. In [Note on fencing and distributed locks][fencing-zk] Flavio Junqueira discussed how the idea of fencing token should be implemented in the case of zookeeper.
* In [Practical Uses of Synchronized Clocks in Distributed Systems][physicalclock], we can find a description that Thor implements a distributed locking mechanism based on version number validation and lease.
Why do etcd and other systems provide lease if they provide mutual exclusion based on version number validation? Well, leases provide an optimization mechanism for reducing a number of aborted requests.
Note that in the case of etcd keys, it can be locked efficiently because of the mechanisms of lease and version number validation. If users need to protect resources which aren't related to etcd, the resources must provide the version number validation mechanism and consistency of replicas like keys of etcd. The lock feature of etcd itself cannot be used for protecting external resources.
The [lock directory][executable-example] contains an executable example which follows [the scenario described in the article by Martin Kleppmann][fencing]. The example shows how stop the world GC of go runtime can cause false revoking of etcd lock and how version number validation can prevent access to the shared resource from a client which acquires a stale lock.
[production-users]: ../../ADOPTERS.md
[grpc]: https://www.grpc.io
[consul-bulletproof]: https://www.consul.io/docs/internals/sessions.html
@ -116,3 +136,10 @@ For distributed coordination, choosing etcd can help prevent operational headach
[locksmith]: https://github.com/coreos/locksmith
[kubernetes]: https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/
[dbtester-comparison-results]: https://github.com/coreos/dbtester/tree/master/test-results/2018Q1-02-etcd-zookeeper-consul
[etcdlease]: https://godoc.org/github.com/etcd-io/etcd/clientv3#Lease
[lease]: https://web.stanford.edu/class/cs240/readings/89-leases.pdf
[chubby]: https://research.google/pubs/pub27897/
[fencing]: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
[physicalclock]: http://www.dainf.cefetpr.br/~tacla/SDII/PracticalUseOfClocks.pdf
[fencing-zk]: https://fpj.me/2016/02/10/note-on-fencing-and-distributed-locks/
[executable-example]: lock/README.md