Compare commits

..

45 Commits

Author SHA1 Message Date
Gyu-Ho Lee
bdee27b19e *: bump to v2.2.4 2016-01-13 14:07:39 -08:00
Anthony Romano
7f684641a3 Godeps: remove golang.org/x/net/netutil
Now using our own LimitListener to support KeepAlives.
2016-01-13 13:32:38 -08:00
Anthony Romano
d54cf26bed etcdmain: support keep alive listeners on limit listener connections
Fixes #4171
2016-01-13 13:32:26 -08:00
Xiang Li
d3e73aadab etcdmain: tls listener MUST be at the outer layer of all listeners
go HTTP library uses type assertion to determine if a connection
is a TLS connection. If we wrapper TLS Listener with any customized
Listener that can create customized Conn, HTTPs will be broken.

This commit fixes the issue.
2016-01-13 13:32:08 -08:00
Gyu-Ho Lee
e340928988 etcdctl: ignore value in updatedir command
Fixes coreos#4145.
client.KeysAPI ignores value if SetOptions.Dir is true.
2016-01-13 13:31:49 -08:00
Gyu-Ho Lee (etcd)
e6ffe22e16 *: bump to v2.2.3+git 2015-12-30 13:54:57 -08:00
Gyu-Ho Lee
05b564a394 *: bump to v2.2.3 2015-12-30 13:41:16 -08:00
Xiang Li
cb779b2305 etcdctl: fix syncWithPeerAPI by breaking the loop when there is no error 2015-12-30 11:24:27 -08:00
Xiang Li
22c3208fb3 etcdserver: always check if the data dir is writable before starting etcd 2015-12-30 10:22:52 -08:00
Xiang Li
e44372e430 etcdsever: avoid creating member dir before finishing validate bootstrap
This commit fixes the issue of creating member dir before validating
the configuration. When member dir exists, it indicates the local etcd
process is a valid etcd member. So we should only create member dir
after we finish configuration validation, joining validation or
discovery validation.
2015-12-30 10:19:51 -08:00
Xiang Li
05a90bc1e5 etcdmain: fix incomplete proxy config file
etcd might generate incomplete proxy config file after a power failure.
It is because we use ioutil.WriteFile. And iotuile.WriteFile does
not call Sync before closing the file.
2015-12-22 12:30:39 -08:00
Xiang Li
6751727809 etcdctl: support basic operations with etcd 0.4.
For CoreOS users, they will get a updated version of etcdctl without updating
the etcd server version. And the users cannot really control this behavior.
We do not want to suddenly break them without enough communication.

So we still want the most basic opeartions like get, set, watch of etcdctl2 work
with etcd 0.4. This patches solve the incompability issue.
2015-12-22 12:29:16 -08:00
Xiang Li
916106c3a2 client: support reset Endpoints.
ResetEndpoints is useful when the there is a scheduled cluster
changes or when manually manage the cluster without auto-sync
enabled.
2015-12-22 12:25:59 -08:00
Xiang Li
e0c7768f94 store: fix data race when modify event in watchHub.
The event got from watchHub should be considered as readonly.
To modify it, we first need to get a clone of it or there might
be a data race.
2015-12-14 14:08:39 -08:00
Andrei Korzhevskii
0fb2d5d4d3 client: fix goroutine leak in unreleased context
If headerTimeout is not zero then two context are created but only one is released.
cancelCtx in this case is never released which leads to goroutine leak inside it.
2015-12-14 13:58:43 -08:00
Xiang Li
fc61fc7c7a etcdctl: cluster health exit with non-zero when cluster is unhealthy 2015-12-14 13:58:10 -08:00
Yicheng Qin
09b81bad15 *: bump to v2.2.2+git 2015-11-19 14:24:59 -08:00
Yicheng Qin
b4bddf685b *: bump to v2.2.2 2015-11-19 14:24:35 -08:00
Xiang Li
af1c711270 auth: use canonical path for pre-defined guest role 2015-11-19 13:41:27 -08:00
Xiang Li
c269426be8 *: fix various data races detected by race detector
Conflicts:
	rafthttp/transport.go
2015-11-19 13:38:28 -08:00
Xiang Li
20b7df3c12 rafthttp: fix data races detected by go race detector
Conflicts:
	rafthttp/pipeline.go
2015-11-19 13:34:49 -08:00
Yicheng Qin
e342de3cc5 etcdmain: fix parsing discovery error
The discovery error is wrapped into a struct now, and cannot be compared
to predefined errors. Correct the comparison behavior to fix the
problem.
2015-11-19 13:26:50 -08:00
Yicheng Qin
26cc2111cd etcdmain: improve log when join discovery fails
Before this PR, the log is

```
2015/09/1 13:18:31 etcdmain: client: etcd cluster is unavailable or
misconfigured
```

It is quite hard for people to understand what happens.

Now we print out the exact reason for the failure, and explains the way
to handle it.
2015-11-19 13:26:42 -08:00
Jonathan Boulle
5d6457e658 godeps: bump coreos/pkg/capnslog
Update to catch coreos/pkg#43 which should fix SYSLOG_IDENTIFIER getting
set when etcd is logging to the journal.
2015-11-19 13:26:29 -08:00
Wojciech Tyczynski
53bc644168 client: regenerate code to unmarshal key response
Regenerate code for unmarshaling key response with a new version of
ugorji/go/codec.
2015-11-19 13:26:19 -08:00
Wojciech Tyczynski
ad3bb484ca Godeps: update ugorji/go/codec dependency
Update ugorji/go/codec dependency to the newer version.
2015-11-19 13:26:08 -08:00
mqliang
15f7b736e4 etcdctl:fix health check condition 2015-11-19 13:25:57 -08:00
Yicheng Qin
4dc835c718 *: bump to v2.2.1+git 2015-10-15 21:59:15 -07:00
Yicheng Qin
75f8282eef *: bump to v2.2.1 2015-10-15 21:31:51 -07:00
Gyu-Ho Lee
45c86af0eb etcdctl/command: mk command with PrevNoExist
This attempts to fix #3676. `PrevNoExist` checks if the key previously exists
and if so, it returns an error, which is how `mk` command is supposed to work.
The previous code ignores the previous key and overwrites with the later value.

/cc @yichengq
2015-10-15 15:26:52 -07:00
Yicheng Qin
71e5467807 Godeps: update prometheus dependency
prometheus updates its directory layout
(https://github.com/prometheus/client_golang#where-is-model-extraction-and-text)
and makes Godeps restore/save unable to work.

Remove all prometheus dependency manually and godep save again to fix
this problem.
2015-10-15 15:26:42 -07:00
Wojciech Tyczynski
0169fec873 client: regenerate code to unmarshal key response
Regenerate code for unmarshaling key response with a new version of
ugorji/go/codec
2015-10-15 15:24:21 -07:00
Wojciech Tyczynski
766023b1b0 Godeps: update ugorji/go/codec dependency
Update ugorji/go/codec dependency to the newer version (a bunch of fixed were made).
2015-10-15 15:24:12 -07:00
Yicheng Qin
ca9e63dde2 pkg/types: fix unwanted unescape in NewURLsMap
We use url.ParseQuery to parse names-to-urls string, but it has side
effect that unescape the string. If the initial-cluster string has ipv6
which contains `%25`, it will unescape it to `%` and make further url
parse failed.

Fix it by modifiying the parse process.

Go1.4 doesn't support literal IPv6 address w/ zone in
URI(https://github.com/golang/go/issues/6530), so we only enable tests
in Go1.5+.
2015-10-15 15:24:00 -07:00
Xiang Li
7659bbb1b2 etcdmain: print out error and suggestion for fixing notify issue 2015-10-15 15:23:49 -07:00
Xiang Li
f8b98d3925 etcdhttp: add Content-Type: application/json header to version handler 2015-10-15 15:23:34 -07:00
Yicheng Qin
9ee3ed777b etcdmain: exit after print out ErrDuplicateID
etcd should exit after printing log for unhandlable error.
2015-10-15 15:23:24 -07:00
Xiang Li
c9bd125490 etcdsever: mismatch error uses the same format as the corresponding flags 2015-10-15 15:22:52 -07:00
Guohua ouyang
ec49496111 proxy: improve log for retrying an unavailable endpoint
Fixes #3541

Signed-off-by: Guohua ouyang <guohuaouyang@gmail.com>
2015-10-15 15:22:40 -07:00
Xiang Li
baaefd18e2 etcdmain: better logging when user forget to set initial flags 2015-10-15 15:22:29 -07:00
Hitoshi Mitake
72c18eb7ba etcdctl: use a context with -total-timeout in simple commands
Like the commit 8ebc933111, this commit lets simple etcdctl commands
use a context with timeout value passed via -total-timeout.

This commit doesn't change complex commands like watch,
cluster-health, and import because it is not obvious that using the
context in the commands is good or not.
2015-10-15 15:22:13 -07:00
Hitoshi Mitake
2e87d71bc6 etcdctl: use user specified timeout value for entire command execution
etcdctl should be capable to use a user specified timeout value for
total command execution, not only per request timeout. This commit
adds a new option --total-timeout to the command. The value passed via
this option is used as a timeout value of entire command execution.

Fixes coreos#3517
2015-10-15 15:21:38 -07:00
Brandon Philips
217dccd617 raft: improve panic error message
Give a human being some insight into how we might have gotten to this
state based on feedback from #3504.
2015-10-15 15:20:12 -07:00
Yicheng Qin
3ceb5dd270 client: add Nodes to codecgen and regenerate 2015-10-15 15:19:58 -07:00
Yicheng Qin
49b77a59cf client: add Nodes type to faciliate sorting
This helps users to sort easily.
2015-10-15 15:19:52 -07:00
955 changed files with 23785 additions and 88471 deletions

1
.gitignore vendored
View File

@@ -9,4 +9,3 @@
*.swp
/hack/insta-discovery/.env
*.test
tools/functional-tester/docker/bin

View File

@@ -1,4 +1,4 @@
// Copyright 2016 CoreOS, Inc.
// Copyright 2014 CoreOS, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.

View File

@@ -1,19 +1,11 @@
language: go
sudo: false
go:
- 1.4
- 1.5
- tip
matrix:
allow_failures:
- go: tip
addons:
apt:
packages:
- libpcap-dev
install:
- go get github.com/barakmich/go-nyet
script:
- ./test
- INTEGRATION=y ./test

View File

@@ -35,7 +35,7 @@ Thanks for your contributions!
### Code style
The coding style suggested by the Golang community is used in etcd. See the [style doc](https://github.com/golang/go/wiki/CodeReviewComments) for details.
The coding style suggested by the Golang community is used in etcd. See the [style doc](https://code.google.com/p/go-wiki/wiki/CodeReviewComments) for details.
Please follow this style to make etcd easy to review, maintain and develop.

View File

@@ -1,4 +1,4 @@
# Snapshot Migration
## Snapshot Migration
You can migrate a snapshot of your data from a v0.4.9+ cluster into a new etcd 2.2 cluster using a snapshot migration. After snapshot migration, the etcd indexes of your data will change. Many etcd applications rely on these indexes to behave correctly. This operation should only be done while all etcd applications are stopped.
@@ -15,7 +15,7 @@ etcdctl --endpoint new_cluster.example.com import --snap backup.snap
```
If you have a large amount of data, you can specify more concurrent works to copy data in parallel by using `-c` flag.
If you have hidden keys to copy, you can use `--hidden` flag to specify. For example fleet uses `/_coreos.com/fleet` so to import those keys use `--hidden /_coreos.com`.
If you have hidden keys to copy, you can use `--hidden` flag to specify.
And the data will quickly copy into the new cluster:

View File

@@ -1,8 +1,8 @@
# Administration
## Administration
## Data Directory
### Data Directory
### Lifecycle
#### Lifecycle
When first started, etcd stores its configuration into a data directory specified by the data-dir configuration parameter.
Configuration is stored in the write ahead log and includes: the local member ID, cluster ID, and initial cluster configuration.
@@ -20,7 +20,7 @@ Once removed the member can be re-added with an empty data directory.
[remove-a-member]: runtime-configuration.md#remove-a-member
### Contents
#### Contents
The data directory has two sub-directories in it:
@@ -32,18 +32,18 @@ If `--wal-dir` flag is set, etcd will write the write ahead log files to the spe
[wal-pkg]: http://godoc.org/github.com/coreos/etcd/wal
[snap-pkg]: http://godoc.org/github.com/coreos/etcd/snap
## Cluster Management
### Cluster Management
### Lifecycle
#### Lifecycle
If you are spinning up multiple clusters for testing it is recommended that you specify a unique initial-cluster-token for the different clusters.
This can protect you from cluster corruption in case of mis-configuration because two members started with different cluster tokens will refuse members from each other.
### Monitoring
#### Monitoring
It is important to monitor your production etcd cluster for healthy information and runtime metrics.
#### Health Monitoring
##### Health Monitoring
At lowest level, etcd exposes health information via HTTP at `/health` in JSON format. If it returns `{"health": "true"}`, then the cluster is healthy. Please note the `/health` endpoint is still an experimental one as in etcd 2.2.
@@ -63,16 +63,16 @@ member fd422379fda50e48 is healthy: got healthy result from http://127.0.0.1:323
cluster is healthy
```
#### Runtime Metrics
##### Runtime Metrics
etcd uses [Prometheus](http://prometheus.io/) for metrics reporting in the server. You can read more through the runtime metrics [doc](metrics.md).
### Debugging
#### Debugging
Debugging a distributed system can be difficult. etcd provides several ways to make debug
easier.
#### Enabling Debug Logging
##### Enabling Debug Logging
When you want to debug etcd without stopping it, you can enable debug logging at runtime.
etcd exposes logging configuration at `/config/local/log`.
@@ -85,7 +85,7 @@ $ curl http://127.0.0.1:2379/config/local/log -XPUT -d '{"Level":"INFO"}'
$ # debug logging disabled
```
#### Debugging Variables
##### Debugging Variables
Debug variables are exposed for real-time debugging purposes. Developers who are familiar with etcd can utilize these variables to debug unexpected behavior. etcd exposes debug variables via HTTP at `/debug/vars` in JSON format. The debug variables contains
`cmdline`, `file_descriptor_limit`, `memstats` and `raft.status`.
@@ -107,7 +107,7 @@ Debug variables are exposed for real-time debugging purposes. Developers who are
}
```
### Optimal Cluster Size
#### Optimal Cluster Size
The recommended etcd cluster size is 3, 5 or 7, which is decided by the fault tolerance requirement. A 7-member cluster can provide enough fault tolerance in most cases. While larger cluster provides better fault tolerance the write performance reduces since data needs to be replicated to more machines.
@@ -152,7 +152,7 @@ This example will walk you through the process of migrating the infra1 member to
|infra2|10.0.1.12:2380|
```sh
$ export ETCDCTL_ENDPOINT=http://10.0.1.10:2379,http://10.0.1.11:2379,http://10.0.1.12:2379
$ export ETCDCTL_PEERS=http://10.0.1.10:2379,http://10.0.1.11:2379,http://10.0.1.12:2379
```
```sh
@@ -293,6 +293,6 @@ If timeout happens several times continuously, administrators should check statu
#### Maximum OS threads
By default, etcd uses the default configuration of the Go 1.4 runtime, which means that at most one operating system thread will be used to execute code simultaneously. (Note that this default behavior [has changed in Go 1.5](https://golang.org/doc/go1.5#runtime)).
By default, etcd uses the default configuration of the Go 1.4 runtime, which means that at most one operating system thread will be used to execute code simultaneously. (Note that this default behavior [may change in Go 1.5](https://docs.google.com/document/d/1At2Ls5_fhJQ59kDK2DFVhFu3g5mATSXqqV5QrxinasI/edit)).
When using etcd in heavy-load scenarios on machines with multiple cores it will usually be desirable to increase the number of threads that etcd can utilize. To do this, simply set the environment variable `GOMAXPROCS` to the desired number when starting etcd. For more information on this variable, see the Go [runtime](https://golang.org/pkg/runtime) documentation.

View File

@@ -234,50 +234,6 @@ curl http://127.0.0.1:2379/v2/keys/foo -XPUT -d value=bar -d ttl= -d prevExist=t
}
```
### Refreshing key TTL
Keys in etcd can be refreshed notifying watchers
this can be achieved by setting the refresh to true when updating a TTL
You cannot update the value of a key when refreshing it
```sh
curl http://127.0.0.1:2379/v2/keys/foo -XPUT -d value=bar -d ttl=5
curl http://127.0.0.1:2379/v2/keys/foo -XPUT -d ttl=5 -d refresh=true -d prevExist=true
```
```json
{
"action": "set",
"node": {
"createdIndex": 5,
"expiration": "2013-12-04T12:01:21.874888581-08:00",
"key": "/foo",
"modifiedIndex": 5,
"ttl": 5,
"value": "bar"
}
}
{
"action":"update",
"node":{
"key":"/foo",
"value":"bar",
"expiration": "2013-12-04T12:01:26.874888581-08:00",
"ttl":5,
"modifiedIndex":6,
"createdIndex":5
},
"prevNode":{
"key":"/foo",
"value":"bar",
"expiration":"2013-12-04T12:01:21.874888581-08:00",
"ttl":3,
"modifiedIndex":5,
"createdIndex":5
}
}
```
### Waiting for a change
@@ -403,7 +359,7 @@ curl 'http://127.0.0.1:2379/v2/keys/foo?wait=true&waitIndex=2008'
#### Connection being closed prematurely
The server may close a long polling connection before emitting any events.
This can happen due to a timeout or the server being shutdown.
This can happend due to a timeout or the server being shutdown.
Since the HTTP header is sent immediately upon accepting the connection, the response will be seen as empty: `200 OK` and empty body.
The clients should be prepared to deal with this scenario and retry the watch.
@@ -544,15 +500,13 @@ etcd can be used as a centralized coordination service in a cluster, and `Compar
This command will set the value of a key only if the client-provided conditions are equal to the current conditions.
_Note that `CompareAndSwap` does not work with [directories](#listing-a-directory). If an attempt is made to `CompareAndSwap` a directory, a 102 "Not a file" error will be returned._
The current comparable conditions are:
1. `prevValue` - checks the previous value of the key.
2. `prevIndex` - checks the previous modifiedIndex of the key.
3. `prevExist` - checks existence of the key: if `prevExist` is true, it is an `update` request; if `prevExist` is `false`, it is a `create` request.
3. `prevExist` - checks existence of the key: if `prevExist` is true, it is an `update` request; if prevExist is `false`, it is a `create` request.
Here is a simple example.
Let's create a key-value pair first: `foo=one`.
@@ -631,8 +585,6 @@ We successfully changed the value from "one" to "two" since we gave the correct
This command will delete a key only if the client-provided conditions are equal to the current conditions.
_Note that `CompareAndDelete` does not work with [directories](#listing-a-directory). If an attempt is made to `CompareAndDelete` a directory, a 102 "Not a file" error will be returned._
The current comparable conditions are:
1. `prevValue` - checks the previous value of the key.
@@ -1096,7 +1048,6 @@ curl http://127.0.0.1:2379/v2/stats/self
### Store Statistics
The store statistics include information about the operations that this node has handled.
Note that v2 `store Statistics` is stored in-memory. When a member stops, store statistics will reset on restart.
Operations that modify the store's state like create, delete, set and update are seen by the entire cluster and the number will increase on all nodes.
Operations like get and watch are node local and will only be seen on this node.
@@ -1126,6 +1077,6 @@ curl http://127.0.0.1:2379/v2/stats/store
## Cluster Config
See the [members API][members-api] for details on the cluster management.
See the [other etcd APIs][other-apis] for details on the cluster management.
[members-api]: members_api.md
[other-apis]: other_apis.md

View File

@@ -19,7 +19,7 @@ Each role has exact one associated Permission List. An permission list exists fo
The special static ROOT (named `root`) role has a full permissions on all key-value resources, the permission to manage user resources and settings resources. Only the ROOT role has the permission to manage user resources and modify settings resources. The ROOT role is built-in and does not need to be created.
There is also a special GUEST role, named 'guest'. These are the permissions given to unauthenticated requests to etcd. This role will be created automatically, and by default allows access to the full keyspace due to backward compatibility. (etcd did not previously authenticate any actions.). This role can be modified by a ROOT role holder at any time, to reduce the capabilities of unauthenticated users.
There is also a special GUEST role, named 'guest'. These are the permissions given to unauthenticated requests to etcd. This role will be created automatically, and by default allows access to the full keyspace due to backward compatability. (etcd did not previously authenticate any actions.). This role can be modified by a ROOT role holder at any time, to reduce the capabilities of unauthenticated users.
#### Permissions
@@ -124,7 +124,7 @@ The User JSON object is formed as follows:
Password is only passed when necessary.
**Get a List of Users**
**Get a list of users**
GET/HEAD /v2/auth/users
@@ -137,36 +137,7 @@ GET/HEAD /v2/auth/users
Content-type: application/json
200 Body:
{
"users": [
{
"user": "alice",
"roles": [
{
"role": "root",
"permissions": {
"kv": {
"read": ["*"],
"write": ["*"]
}
}
}
]
},
{
"user": "bob",
"roles": [
{
"role": "guest",
"permissions": {
"kv": {
"read": ["*"],
"write": ["*"]
}
}
}
]
}
]
"users": ["alice", "bob", "eve"]
}
**Get User Details**
@@ -184,26 +155,7 @@ GET/HEAD /v2/auth/users/alice
200 Body:
{
"user" : "alice",
"roles" : [
{
"role": "fleet",
"permissions" : {
"kv" : {
"read": [ "/fleet/" ],
"write": [ "/fleet/" ]
}
}
},
{
"role": "etcd",
"permissions" : {
"kv" : {
"read": [ "*" ],
"write": [ "*" ]
}
}
}
]
"roles" : ["fleet", "etcd"]
}
**Create Or Update A User**
@@ -261,6 +213,22 @@ A full role structure may look like this. A Permission List structure is used fo
}
```
**Get a list of Roles**
GET/HEAD /v2/auth/roles
Sent Headers:
Authorization: Basic <BasicAuthString>
Possible Status Codes:
200 OK
401 Unauthorized
200 Headers:
Content-type: application/json
200 Body:
{
"roles": ["fleet", "etcd", "quay"]
}
**Get Role Details**
GET/HEAD /v2/auth/roles/fleet
@@ -284,50 +252,6 @@ GET/HEAD /v2/auth/roles/fleet
}
}
**Get a list of Roles**
GET/HEAD /v2/auth/roles
Sent Headers:
Authorization: Basic <BasicAuthString>
Possible Status Codes:
200 OK
401 Unauthorized
200 Headers:
Content-type: application/json
200 Body:
{
"roles": [
{
"role": "fleet",
"permissions": {
"kv": {
"read": ["/fleet/"],
"write": ["/fleet/"]
}
}
},
{
"role": "etcd",
"permissions": {
"kv": {
"read": ["*"],
"write": ["*"]
}
}
},
{
"role": "quay",
"permissions": {
"kv": {
"read": ["*"],
"write": ["*"]
}
}
}
]
}
**Create Or Update A Role**
PUT /v2/auth/roles/rkt

View File

@@ -134,7 +134,7 @@ $ etcdctl role remove myrolename
## Enabling authentication
The minimal steps to enabling auth are as follows. The administrator can set up users and roles before or after enabling authentication, as a matter of preference.
The minimal steps to enabling auth follow. The administrator can set up users and roles before or after enabling authentication, as a matter of preference.
Make sure the root user is created:

View File

@@ -32,7 +32,7 @@ The consistent flag for read operations is removed in etcd 2.0.0. The normal rea
The read consistency guarantees are:
The consistent read guarantees the sequential consistency within one client that talks to one etcd server. Read/Write from one client to one etcd member should be observed in order. If one client write a value to an etcd server successfully, it should be able to get the value out of the server immediately.
The consistent read guarantees the sequential consistency within one client that talks to one etcd server. Read/Write from one client to one etcd member should be observed in order. If one client write a value to a etcd server successfully, it should be able to get the value out of the server immediately.
Each etcd member will proxy the request to leader and only return the result to user after the result is applied on the local member. Thus after the write succeed, the user is guaranteed to see the value on the member it sent the request to.
@@ -60,9 +60,9 @@ A size key needs to be provided inside a [discovery token][discoverytoken].
## HTTP Admin API
`v2/admin` on peer url and `v2/keys/_etcd` are unified under the new [v2/members API][members-api] to better explain which machines are part of an etcd cluster, and to simplify the keyspace for all your use cases.
`v2/admin` on peer url and `v2/keys/_etcd` are unified under the new [v2/member API][memberapi] to better explain which machines are part of an etcd cluster, and to simplify the keyspace for all your use cases.
[members-api]: members_api.md
[memberapi]: other_apis.md
## HTTP Key Value API
- The follower can now transparently proxy write requests to the leader. Clients will no longer see 307 redirections to the leader from etcd.

View File

@@ -14,7 +14,7 @@ GCE n1-highcpu-2 machine type
## Testing
Use [etcd v3 benchmark tool](../../tools/v3benchmark/).
Use [etcd v3 benchmark tool](../../hack/v3benchmark/).
## Performance

View File

@@ -1,76 +0,0 @@
# Watch Memory Usage Benchmark
*NOTE*: The watch features are under active development, and their memory usage may change as that development progresses. We do not expect it to significantly increase beyond the figures stated below.
A primary goal of etcd is supporting a very large number of watchers doing a massively large amount of watching. etcd aims to support O(10k) clients, O(100K) watch streams (O(10) streams per client) and O(10M) total watchings (O(100) watching per stream). The memory consumed by each individual watching accounts for the largest portion of etcd's overall usage, and is therefore the focus of current and future optimizations.
Three related components of etcd watch consume physical memory: each `grpc.Conn`, each watch stream, and each instance of the watching activity. `grpc.Conn` maintains the actual TCP connection and other gRPC connection state. Each `grpc.Conn` consumes O(10kb) of memory, and might have multiple watch streams attached.
Each watch stream is an independent HTTP2 connection which consumes another O(10kb) of memory.
Multiple watchings might share one watch stream.
Watching is the actual struct that tracks the changes on the key-value store. Each watching should only consume < O(1kb).
```
+-------+
| watch |
+---------> | foo |
| +-------+
+------+-----+
| stream |
+--------------> | |
| +------+-----+ +-------+
| | | watch |
| +---------> | bar |
+-----+------+ +-------+
| | +------------+
| conn +-------> | stream |
| | | |
+-----+------+ +------------+
|
|
|
| +------------+
+--------------> | stream |
| |
+------------+
```
The theoretical memory consumption of watch can be approximated with the formula:
`memory = c1 * number_of_conn + c2 * avg_number_of_stream_per_conn + c3 * avg_number_of_watch_stream`
## Testing Environment
etcd version
- git head https://github.com/coreos/etcd/commit/185097ffaa627b909007e772c175e8fefac17af3
GCE n1-standard-2 machine type
- 7.5 GB memory
- 2x CPUs
## Overall memory usage
The overall memory usage captures how much [RSS](https://en.wikipedia.org/wiki/Resident_set_size) etcd consumes with the client watchers. The result is not very accurate and might vary around 10%. It is still very meaningful, since the goal is to learn about the rough memory usage and the pattern (find out `c1`, `c2` and `c3`).
With the benchmark result, we can calculate roughly that `c1 = 17kb`, `c2 = 18kb` and `c3 = 350bytes`. So each additional client connection consumes 17kb of memory and each additional stream consumes 18kb of memory, and each additional watching only cause 350bytes. A single etcd server can maintain millions of watchings with a few GB of memory in normal case.
| clients | streams per client | watchings per stream | total watching | memory usage |
|---------|---------|-----------|----------------|--------------|
| 1k | 1 | 1 | 1k | 50MB |
| 2k | 1 | 1 | 2k | 90MB |
| 5k | 1 | 1 | 5k | 200MB |
| 1k | 10 | 1 | 10k | 217MB |
| 2k | 10 | 1 | 20k | 417MB |
| 5k | 10 | 1 | 50k | 980MB |
| 1k | 50 | 1 | 50k | 1001MB |
| 2k | 50 | 1 | 100k | 1960MB |
| 5k | 50 | 1 | 250k | 4700MB |
| 1k | 50 | 10 | 500k | 1171MB |
| 2k | 50 | 10 | 1M | 2371MB |
| 5k | 50 | 10 | 2.5M | 5710MB |
| 1k | 50 | 100 | 5M | 2380MB |
| 2k | 50 | 100 | 10M | 4672MB |
| 5k | 50 | 100 | 50M | *OOM* |

View File

@@ -1,98 +0,0 @@
# Storage Memory Usage Benchmark
<!---todo: link storage to storage design doc-->
Two components of etcd storage consume physical memory. The etcd process allocates an *in-memory index* to speed key lookup. The process's *page cache*, managed by the operating system, stores recently-accessed data from disk for quick re-use.
The in-memory index holds all the keys in a [B-tree][btree] data structure, along with pointers to the on-disk data (the values). Each key in the B-tree may contain multiple pointers, pointing to different versions of its values. The theoretical memory consumption of the in-memory index can hence be approximated with the formula:
`N * (c1 + avg_key_size) + N * (avg_versions_of_key) * (c2 + size_of_pointer)`
where `c1` is the key metadata overhead and `c2` is the version metadata overhead.
The graph shows the detailed structure of the in-memory index B-tree.
```
In mem index
+------------+
| key || ... |
+--------------+ | || |
| | +------------+
| | | v1 || ... |
| disk <----------------| || | Tree Node
| | +------------+
| | | v2 || ... |
| <----------------+ || |
| | +------------+
+--------------+ +-----+ | | |
| | | | |
| +------------+
|
|
^
------+
| ... |
| |
+-----+
| ... | Tree Node
| |
+-----+
| ... |
| |
------+
```
[Page cache memory][pagecache] is managed by the operating system and is not covered in detail in this document.
## Testing Environment
etcd version
- git head https://github.com/coreos/etcd/commit/776e9fb7be7eee5e6b58ab977c8887b4fe4d48db
GCE n1-standard-2 machine type
- 7.5 GB memory
- 2x CPUs
## In-memory index memory usage
In this test, we only benchmark the memory usage of the in-memory index. The goal is to find `c1` and `c2` mentioned above and to understand the hard limit of memory consumption of the storage.
We calculate the memory usage consumption via the Go runtime.ReadMemStats. We calculate the total allocated bytes difference before creating the index and after creating the index. It cannot perfectly reflect the memory usage of the in-memory index itself but can show the rough consumption pattern.
| N | versions | key size | memory usage |
|------|----------|----------|--------------|
| 100K | 1 | 64bytes | 22MB |
| 100K | 5 | 64bytes | 39MB |
| 1M | 1 | 64bytes | 218MB |
| 1M | 5 | 64bytes | 432MB |
| 100K | 1 | 256bytes | 41MB |
| 100K | 5 | 256bytes | 65MB |
| 1M | 1 | 256bytes | 409MB |
| 1M | 5 | 256bytes | 506MB |
Based on the result, we can calculate `c1=120bytes`, `c2=30bytes`. We only need two sets of data to calculate `c1` and `c2`, since they are the only unknown variable in the formula. The `c1=120bytes` and `c2=30bytes` are the average value of the 4 sets of `c1` and `c2` we calculated. The key metadata overhead is still relatively nontrivial (50%) for small key-value pairs. However, this is a significant improvement over the old store, which had at least 1000% overhead.
## Overall memory usage
The overall memory usage captures how much RSS etcd consumes with the storage. The value size should have very little impact on the overall memory usage of etcd, since we keep values on disk and only retain hot values in memory, managed by the OS page cache.
| N | versions | key size | value size | memory usage |
|------|----------|----------|------------|--------------|
| 100K | 1 | 64bytes | 256bytes | 40MB |
| 100K | 5 | 64bytes | 256bytes | 89MB |
| 1M | 1 | 64bytes | 256bytes | 470MB |
| 1M | 5 | 64bytes | 256bytes | 880MB |
| 100K | 1 | 64bytes | 1KB | 102MB |
| 100K | 5 | 64bytes | 1KB | 164MB |
| 1M | 1 | 64bytes | 1KB | 587MB |
| 1M | 5 | 64bytes | 1KB | 836MB |
Based on the result, we know the value size does not significantly impact the memory consumption. There is some minor increase due to more data held in the OS page cache.
[btree]: https://en.wikipedia.org/wiki/B-tree
[pagecache]: https://en.wikipedia.org/wiki/Page_cache

View File

@@ -1,6 +1,6 @@
# Branch Management
## Branch Management
## Guide
### Guide
- New development occurs on the [master branch](https://github.com/coreos/etcd/tree/master)
- Master branch should always have a green build!

View File

@@ -124,7 +124,7 @@ There two methods that can be used for discovery:
### etcd Discovery
To better understand the design about discovery service protocol, we suggest you read [this](discovery_protocol.md).
To better understand the design about discovery service protocol, we suggest you read [this](./discovery_protocol.md).
#### Lifetime of a Discovery URL
@@ -148,7 +148,7 @@ If you bootstrap an etcd cluster using discovery service with more than the expe
The URL you will use in this case will be `https://myetcd.local/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83` and the etcd members will use the `https://myetcd.local/v2/keys/discovery/6c007a14875d53d9bf0ef5a6fc0257c817f0fb83` directory for registration as they start.
**Each member must have a different name flag specified. `Hostname` or `machine-id` can be a good choice. Or discovery will fail due to duplicated name.**
Each member must have a different name flag specified. Or discovery will fail due to duplicated name.
Now we start etcd with those relevant flags for each member:
@@ -200,7 +200,7 @@ ETCD_DISCOVERY=https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573d
-discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
```
**Each member must have a different name flag specified. `Hostname` or `machine-id` can be a good choice. Or discovery will fail due to duplicated name.**
Each member must have a different name flag specified. Or discovery will fail due to duplicated name.
Now we start etcd with those relevant flags for each member:
@@ -285,13 +285,6 @@ The following DNS SRV records are looked up in the listed order:
If `_etcd-server-ssl._tcp.example.com` is found then etcd will attempt the bootstrapping process over SSL.
To help clients discover the etcd cluster, the following DNS SRV records are looked up in the listed order:
* _etcd-client._tcp.example.com
* _etcd-client-ssl._tcp.example.com
If `_etcd-client-ssl._tcp.example.com` is found, clients will attempt to communicate with the etcd cluster over SSL.
#### Create DNS SRV records
```
@@ -301,13 +294,6 @@ _etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 infra1.example.com.
_etcd-server._tcp.example.com. 300 IN SRV 0 0 2380 infra2.example.com.
```
```
$ dig +noall +answer SRV _etcd-client._tcp.example.com
_etcd-client._tcp.example.com. 300 IN SRV 0 0 2379 infra0.example.com.
_etcd-client._tcp.example.com. 300 IN SRV 0 0 2379 infra1.example.com.
_etcd-client._tcp.example.com. 300 IN SRV 0 0 2379 infra2.example.com.
```
```
$ dig +noall +answer infra0.example.com infra1.example.com infra2.example.com
infra0.example.com. 300 IN A 10.0.1.10
@@ -396,21 +382,9 @@ DNS SRV records can also be used to configure the list of peers for an etcd serv
$ etcd --proxy on -discovery-srv example.com
```
#### etcd client configuration
DNS SRV records can also be used to help clients discover the etcd cluster.
The official [etcd/client](../client) supports [DNS Discovery](https://godoc.org/github.com/coreos/etcd/client#Discoverer).
`etcdctl` also supports DNS Discovery by specifying the `--discovery-srv` option.
```
$ etcdctl --discovery-srv example.com set foo bar
```
#### Error Cases
You might see an error like `cannot find local etcd $name from SRV records.`. That means the etcd member fails to find itself from the cluster defined in SRV records. The resolved address in `-initial-advertise-peer-urls` *must match* one of the resolved addresses in the SRV targets.
You might see the an error like `cannot find local etcd $name from SRV records.`. That means the etcd member fails to find itself from the cluster defined in SRV records. The resolved address in `-initial-advertise-peer-urls` *must match* one of the resolved addresses in the SRV targets.
# 0.4 to 2.0+ Migration Guide

View File

@@ -1,274 +1,261 @@
# Configuration Flags
## Configuration Flags
etcd is configurable through command-line flags and environment variables. Options set on the command line take precedence over those from the environment.
The format of environment variable for flag `-my-flag` is `ETCD_MY_FLAG`. It applies to all flags.
The [official etcd ports](https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml?search=etcd) are 2379 for client requests, and 2380 for peer communication. Some legacy code and documentation still references ports 4001 and 7001, but all new etcd use and discussion should adopt the assigned ports.
To start etcd automatically using custom settings at startup in Linux, using a [systemd][systemd-intro] unit is highly recommended.
[systemd-intro]: http://freedesktop.org/wiki/Software/systemd/
## Member Flags
### Member Flags
### -name
##### -name
+ Human-readable name for this member.
+ default: "default"
+ env variable: ETCD_NAME
+ This value is referenced as this node's own entries listed in the `-initial-cluster` flag (Ex: `default=http://localhost:2380` or `default=http://localhost:2380,default=http://localhost:7001`). This needs to match the key used in the flag if you're using [static bootstrapping](clustering.md#static). When using discovery, each member must have a unique name. `Hostname` or `machine-id` can be a good choice.
+ This value is referenced as this node's own entries listed in the `-initial-cluster` flag (Ex: `default=http://localhost:2380` or `default=http://localhost:2380,default=http://localhost:7001`). This needs to match the key used in the flag if you're using [static boostrapping](clustering.md#static).
### -data-dir
##### -data-dir
+ Path to the data directory.
+ default: "${name}.etcd"
+ env variable: ETCD_DATA_DIR
### -wal-dir
##### -wal-dir
+ Path to the dedicated wal directory. If this flag is set, etcd will write the WAL files to the walDir rather than the dataDir. This allows a dedicated disk to be used, and helps avoid io competition between logging and other IO operations.
+ default: ""
+ env variable: ETCD_WAL_DIR
### -snapshot-count
##### -snapshot-count
+ Number of committed transactions to trigger a snapshot to disk.
+ default: "10000"
+ env variable: ETCD_SNAPSHOT_COUNT
### -heartbeat-interval
##### -heartbeat-interval
+ Time (in milliseconds) of a heartbeat interval.
+ default: "100"
+ env variable: ETCD_HEARTBEAT_INTERVAL
### -election-timeout
##### -election-timeout
+ Time (in milliseconds) for an election to timeout. See [Documentation/tuning.md](tuning.md#time-parameters) for details.
+ default: "1000"
+ env variable: ETCD_ELECTION_TIMEOUT
### -listen-peer-urls
##### -listen-peer-urls
+ List of URLs to listen on for peer traffic. This flag tells the etcd to accept incoming requests from its peers on the specified scheme://IP:port combinations. Scheme can be either http or https.If 0.0.0.0 is specified as the IP, etcd listens to the given port on all interfaces. If an IP address is given as well as a port, etcd will listen on the given port and interface. Multiple URLs may be used to specify a number of addresses and ports to listen on. The etcd will respond to requests from any of the listed addresses and ports.
+ default: "http://localhost:2380,http://localhost:7001"
+ env variable: ETCD_LISTEN_PEER_URLS
+ example: "http://10.0.0.1:2380"
+ invalid example: "http://example.com:2380" (domain name is invalid for binding)
### -listen-client-urls
##### -listen-client-urls
+ List of URLs to listen on for client traffic. This flag tells the etcd to accept incoming requests from the clients on the specified scheme://IP:port combinations. Scheme can be either http or https. If 0.0.0.0 is specified as the IP, etcd listens to the given port on all interfaces. If an IP address is given as well as a port, etcd will listen on the given port and interface. Multiple URLs may be used to specify a number of addresses and ports to listen on. The etcd will respond to requests from any of the listed addresses and ports.
+ default: "http://localhost:2379,http://localhost:4001"
+ env variable: ETCD_LISTEN_CLIENT_URLS
+ example: "http://10.0.0.1:2379"
+ invalid example: "http://example.com:2379" (domain name is invalid for binding)
### -max-snapshots
##### -max-snapshots
+ Maximum number of snapshot files to retain (0 is unlimited)
+ default: 5
+ env variable: ETCD_MAX_SNAPSHOTS
+ The default for users on Windows is unlimited, and manual purging down to 5 (or your preference for safety) is recommended.
### -max-wals
##### -max-wals
+ Maximum number of wal files to retain (0 is unlimited)
+ default: 5
+ env variable: ETCD_MAX_WALS
+ The default for users on Windows is unlimited, and manual purging down to 5 (or your preference for safety) is recommended.
### -cors
##### -cors
+ Comma-separated white list of origins for CORS (cross-origin resource sharing).
+ default: none
+ env variable: ETCD_CORS
## Clustering Flags
### Clustering Flags
`-initial` prefix flags are used in bootstrapping ([static bootstrap][build-cluster], [discovery-service bootstrap][discovery] or [runtime reconfiguration][reconfig]) a new member, and ignored when restarting an existing member.
`-discovery` prefix flags need to be set when using [discovery service][discovery].
### -initial-advertise-peer-urls
##### -initial-advertise-peer-urls
+ List of this member's peer URLs to advertise to the rest of the cluster. These addresses are used for communicating etcd data around the cluster. At least one must be routable to all cluster members. These URLs can contain domain names.
+ default: "http://localhost:2380,http://localhost:7001"
+ env variable: ETCD_INITIAL_ADVERTISE_PEER_URLS
+ example: "http://example.com:2380, http://10.0.0.1:2380"
### -initial-cluster
##### -initial-cluster
+ Initial cluster configuration for bootstrapping.
+ default: "default=http://localhost:2380,default=http://localhost:7001"
+ env variable: ETCD_INITIAL_CLUSTER
+ The key is the value of the `-name` flag for each node provided. The default uses `default` for the key because this is the default for the `-name` flag.
### -initial-cluster-state
##### -initial-cluster-state
+ Initial cluster state ("new" or "existing"). Set to `new` for all members present during initial static or DNS bootstrapping. If this option is set to `existing`, etcd will attempt to join the existing cluster. If the wrong value is set, etcd will attempt to start but fail safely.
+ default: "new"
+ env variable: ETCD_INITIAL_CLUSTER_STATE
[static bootstrap]: clustering.md#static
### -initial-cluster-token
##### -initial-cluster-token
+ Initial cluster token for the etcd cluster during bootstrap.
+ default: "etcd-cluster"
+ env variable: ETCD_INITIAL_CLUSTER_TOKEN
### -advertise-client-urls
##### -advertise-client-urls
+ List of this member's client URLs to advertise to the rest of the cluster. These URLs can contain domain names.
+ default: "http://localhost:2379,http://localhost:4001"
+ env variable: ETCD_ADVERTISE_CLIENT_URLS
+ example: "http://example.com:2379, http://10.0.0.1:2379"
+ Be careful if you are advertising URLs such as http://localhost:2379 from a cluster member and are using the proxy feature of etcd. This will cause loops, because the proxy will be forwarding requests to itself until its resources (memory, file descriptors) are eventually depleted.
### -discovery
##### -discovery
+ Discovery URL used to bootstrap the cluster.
+ default: none
+ env variable: ETCD_DISCOVERY
### -discovery-srv
##### -discovery-srv
+ DNS srv domain used to bootstrap the cluster.
+ default: none
+ env variable: ETCD_DISCOVERY_SRV
### -discovery-fallback
##### -discovery-fallback
+ Expected behavior ("exit" or "proxy") when discovery services fails.
+ default: "proxy"
+ env variable: ETCD_DISCOVERY_FALLBACK
### -discovery-proxy
##### -discovery-proxy
+ HTTP proxy to use for traffic to discovery service.
+ default: none
+ env variable: ETCD_DISCOVERY_PROXY
### -strict-reconfig-check
+ Reject reconfiguration requests that would cause quorum loss.
+ default: false
+ env variable: ETCD_STRICT_RECONFIG_CHECK
## Proxy Flags
### Proxy Flags
`-proxy` prefix flags configures etcd to run in [proxy mode][proxy].
### -proxy
##### -proxy
+ Proxy mode setting ("off", "readonly" or "on").
+ default: "off"
+ env variable: ETCD_PROXY
### -proxy-failure-wait
##### -proxy-failure-wait
+ Time (in milliseconds) an endpoint will be held in a failed state before being reconsidered for proxied requests.
+ default: 5000
+ env variable: ETCD_PROXY_FAILURE_WAIT
### -proxy-refresh-interval
##### -proxy-refresh-interval
+ Time (in milliseconds) of the endpoints refresh interval.
+ default: 30000
+ env variable: ETCD_PROXY_REFRESH_INTERVAL
### -proxy-dial-timeout
##### -proxy-dial-timeout
+ Time (in milliseconds) for a dial to timeout or 0 to disable the timeout
+ default: 1000
+ env variable: ETCD_PROXY_DIAL_TIMEOUT
### -proxy-write-timeout
##### -proxy-write-timeout
+ Time (in milliseconds) for a write to timeout or 0 to disable the timeout.
+ default: 5000
+ env variable: ETCD_PROXY_WRITE_TIMEOUT
### -proxy-read-timeout
##### -proxy-read-timeout
+ Time (in milliseconds) for a read to timeout or 0 to disable the timeout.
+ Don't change this value if you use watches because they are using long polling requests.
+ default: 0
+ env variable: ETCD_PROXY_READ_TIMEOUT
## Security Flags
### Security Flags
The security flags help to [build a secure etcd cluster][security].
### -ca-file [DEPRECATED]
##### -ca-file [DEPRECATED]
+ Path to the client server TLS CA file. `-ca-file ca.crt` could be replaced by `-trusted-ca-file ca.crt -client-cert-auth` and etcd will perform the same.
+ default: none
+ env variable: ETCD_CA_FILE
### -cert-file
##### -cert-file
+ Path to the client server TLS cert file.
+ default: none
+ env variable: ETCD_CERT_FILE
### -key-file
##### -key-file
+ Path to the client server TLS key file.
+ default: none
+ env variable: ETCD_KEY_FILE
### -client-cert-auth
##### -client-cert-auth
+ Enable client cert authentication.
+ default: false
+ env variable: ETCD_CLIENT_CERT_AUTH
### -trusted-ca-file
##### -trusted-ca-file
+ Path to the client server TLS trusted CA key file.
+ default: none
+ env variable: ETCD_TRUSTED_CA_FILE
### -peer-ca-file [DEPRECATED]
##### -peer-ca-file [DEPRECATED]
+ Path to the peer server TLS CA file. `-peer-ca-file ca.crt` could be replaced by `-peer-trusted-ca-file ca.crt -peer-client-cert-auth` and etcd will perform the same.
+ default: none
+ env variable: ETCD_PEER_CA_FILE
### -peer-cert-file
##### -peer-cert-file
+ Path to the peer server TLS cert file.
+ default: none
+ env variable: ETCD_PEER_CERT_FILE
### -peer-key-file
##### -peer-key-file
+ Path to the peer server TLS key file.
+ default: none
+ env variable: ETCD_PEER_KEY_FILE
### -peer-client-cert-auth
##### -peer-client-cert-auth
+ Enable peer client cert authentication.
+ default: false
+ env variable: ETCD_PEER_CLIENT_CERT_AUTH
### -peer-trusted-ca-file
##### -peer-trusted-ca-file
+ Path to the peer server TLS trusted CA file.
+ default: none
+ env variable: ETCD_PEER_TRUSTED_CA_FILE
## Logging Flags
### Logging Flags
### -debug
##### -debug
+ Drop the default log level to DEBUG for all subpackages.
+ default: false (INFO for all packages)
+ env variable: ETCD_DEBUG
### -log-package-levels
##### -log-package-levels
+ Set individual etcd subpackages to specific log levels. An example being `etcdserver=WARNING,security=DEBUG`
+ default: none (INFO for all packages)
+ env variable: ETCD_LOG_PACKAGE_LEVELS
## Unsafe Flags
### Unsafe Flags
Please be CAUTIOUS when using unsafe flags because it will break the guarantees given by the consensus protocol.
For example, it may panic if other members in the cluster are still alive.
Follow the instructions when using these flags.
### -force-new-cluster
+ Force to create a new one-member cluster. It commits configuration changes forcing to remove all existing members in the cluster and add itself. It needs to be set to [restore a backup][restore].
##### -force-new-cluster
+ Force to create a new one-member cluster. It commits configuration changes in force to remove all existing members in the cluster and add itself. It needs to be set to [restore a backup][restore].
+ default: false
+ env variable: ETCD_FORCE_NEW_CLUSTER
## Experimental Flags
### Experimental Flags
### -experimental-v3demo
##### -experimental-v3demo
+ Enable experimental [v3 demo API](rfc/v3api.proto).
+ default: false
+ env variable: ETCD_EXPERIMENTAL_V3DEMO
## Miscellaneous Flags
### Miscellaneous Flags
### -version
##### -version
+ Print the version and exit.
+ default: false
## Profiling flags
### -enable-pprof
+ Enable runtime profiling data via HTTP server. Address is at client URL + "/debug/pprof"
+ default: false
[build-cluster]: clustering.md#static
[reconfig]: runtime-configuration.md
[discovery]: clustering.md#discovery

View File

@@ -4,7 +4,7 @@ Discovery service protocol helps new etcd member to discover all other members i
Discovery service protocol is _only_ used in cluster bootstrap phase, and cannot be used for runtime reconfiguration or cluster monitoring.
The protocol uses a new discovery token to bootstrap one _unique_ etcd cluster. Remember that one discovery token can represent only one etcd cluster. As long as discovery protocol on this token starts, even if it fails halfway, it must not be used to bootstrap another etcd cluster.
The protocol uses a new discovery token to bootstrap one _unique_ etcd cluster. Remember that one discovery token can represent only one etcd cluster. As long as discovery protocol on this token starts, even if fails halfway, it must not be used to bootstrap another etcd cluster.
The rest of this article will walk through the discovery process with examples that correspond to a self-hosted discovery cluster. The public discovery service, discovery.etcd.io, functions the same way, but with a layer of polish to abstract away ugly URLs, generate UUIDs automatically, and provide some protections against excessive requests. At its core, the public discovery service still uses an etcd cluster as the data store as described in this document.
@@ -14,7 +14,7 @@ The idea of discovery protocol is to use an internal etcd cluster to coordinate
In the following example workflow, we will list each step of protocol in curl format for ease of understanding.
By convention the etcd discovery protocol uses the key prefix `_etcd/registry`. If `http://example.com` hosts an etcd cluster for discovery service, a full URL to discovery keyspace will be `http://example.com/v2/keys/_etcd/registry`. We will use this as the URL prefix in the example.
By convention the etcd discovery protocol uses the key prefix `_etcd/registry`. If `http://example.com` hosts a etcd cluster for discovery service, a full URL to discovery keyspace will be `http://example.com/v2/keys/_etcd/registry`. We will use this as the URL prefix in the example.
### Creating a New Discovery Token

View File

@@ -12,11 +12,9 @@ export HostIP="192.168.12.50"
The following `docker run` command will expose the etcd client API over ports 4001 and 2379, and expose the peer port over 2380.
This will run the latest release version of etcd. You can specify version if needed (e.g. `quay.io/coreos/etcd:v2.2.0`).
```
docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380:2380 -p 2379:2379 \
--name etcd quay.io/coreos/etcd \
--name etcd quay.io/coreos/etcd:v2.0.8 \
-name etcd0 \
-advertise-client-urls http://${HostIP}:2379,http://${HostIP}:4001 \
-listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \
@@ -46,7 +44,7 @@ The main difference being the value used for the `-initial-cluster` flag, which
```
docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380:2380 -p 2379:2379 \
--name etcd quay.io/coreos/etcd \
--name etcd quay.io/coreos/etcd:v2.0.8 \
-name etcd0 \
-advertise-client-urls http://192.168.12.50:2379,http://192.168.12.50:4001 \
-listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \
@@ -61,7 +59,7 @@ docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380
```
docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380:2380 -p 2379:2379 \
--name etcd quay.io/coreos/etcd \
--name etcd quay.io/coreos/etcd:v2.0.8 \
-name etcd1 \
-advertise-client-urls http://192.168.12.51:2379,http://192.168.12.51:4001 \
-listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \
@@ -76,7 +74,7 @@ docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380
```
docker run -d -v /usr/share/ca-certificates/:/etc/ssl/certs -p 4001:4001 -p 2380:2380 -p 2379:2379 \
--name etcd quay.io/coreos/etcd \
--name etcd quay.io/coreos/etcd:v2.0.8 \
-name etcd2 \
-advertise-client-urls http://192.168.12.52:2379,http://192.168.12.52:4001 \
-listen-client-urls http://0.0.0.0:2379,http://0.0.0.0:4001 \

View File

@@ -1,4 +1,4 @@
# Error Code
Error Code
======
This document describes the error code used in key space '/v2/keys'. Feel free to import 'github.com/coreos/etcd/error' to use.

View File

@@ -37,7 +37,7 @@ timeout.
A proxy is a redirection server to the etcd cluster. The proxy handles the
redirection of a client to the current configuration of the etcd cluster. A
typical use case is to start a proxy on a machine, and on first boot up of the
typical usecase is to start a proxy on a machine, and on first boot up of the
proxy specify both the `--proxy` flag and the `--initial-cluster` flag.
From there, any etcdctl client that starts up automatically speaks to the local
@@ -57,26 +57,24 @@ and their integration with the reconfiguration API.
Thus, a member that is down, even infinitely, will never be automatically
removed from the etcd cluster member list.
This makes sense because it's usually an application level / administrative
This makes sense because its usually an application level / administrative
action to determine whether a reconfiguration should happen based on health.
For more information, refer to
[Documentation/runtime-reconf-design.md](https://github.com/coreos/etcd/blob/master/Documentation/runtime-reconf-design.md).
For more information, refer to [Documentation/runtime-reconfiguration.md].
## 6) how does --endpoint work with etcdctl?
## 6) how does --peers work with etcdctl?
The `--endpoint` flag can specify any number of etcd cluster members in a comma
The `--peers` flag can specify any number of etcd cluster members in a comma
separated list. This list might be a subset, equal to, or more than the actual
etcd cluster member list itself.
If only one peer is specified via the `--endpoint` flag, the etcdctl discovers the
If only one peer is specified via the `--peers` flag, the etcdctl discovers the
rest of the cluster via the member list of that one peer, and then it randomly
chooses a member to use. Again, the client can use the `quorum=true` flag on
reads, which will always fail when using a member in the minority.
If peers from multiple clusters are specified via the `--endpoint` flag, etcdctl
If peers from multiple clusters are specified via the `--peers` flag, etcdctl
will randomly choose a peer, and the request will simply get routed to one of
the clusters. This is probably not what you want.
Note: --peers flag is now deprecated and --endpoint should be used instead,
as it might confuse users to give etcdctl a peerURL.

View File

@@ -1,35 +1,35 @@
# Glossary
## Glossary
This document defines the various terms used in etcd documentation, command line and source code.
## Node
### Node
Node is an instance of raft state machine.
It has a unique identification, and records other nodes' progress internally when it is the leader.
## Member
### Member
Member is an instance of etcd. It hosts a node, and provides service to clients.
## Cluster
### Cluster
Cluster consists of several members.
The node in each member follows raft consensus protocol to replicate logs. Cluster receives proposals from members, commits them and apply to local store.
## Peer
### Peer
Peer is another member of the same cluster.
## Proposal
### Proposal
A proposal is a request (for example a write request, a configuration change request) that needs to go through raft protocol.
## Client
### Client
Client is a caller of the cluster's HTTP API.
## Machine (deprecated)
### Machine (deprecated)
The alternative of Member in etcd before 2.0

View File

@@ -46,7 +46,7 @@ ExecStart=/usr/bin/etcd
There are several error cases:
0) Init has already run and the data directory is already configured
0) Init has already ran and the data directory is already configured
1) Discovery fails because of network timeout, etc
2) Discovery fails because the cluster is already full and etcd needs to fall back to proxy
3) Static cluster configuration fails because of conflict, misconfiguration or timeout

View File

@@ -1,24 +1,19 @@
# Libraries and Tools
## Libraries and Tools
**Tools**
- [etcdctl](https://github.com/coreos/etcd/tree/master/etcdctl) - A command line client for etcd
- [etcdctl](https://github.com/coreos/etcdctl) - A command line client for etcd
- [etcd-backup](https://github.com/fanhattan/etcd-backup) - A powerful command line utility for dumping/restoring etcd - Supports v2
- [etcd-dump](https://npmjs.org/package/etcd-dump) - Command line utility for dumping/restoring etcd.
- [etcd-fs](https://github.com/xetorthio/etcd-fs) - FUSE filesystem for etcd
- [etcddir](https://github.com/rekby/etcddir) - Realtime sync etcd and local directory. Work with windows and linux.
- [etcd-browser](https://github.com/henszey/etcd-browser) - A web-based key/value editor for etcd using AngularJS
- [etcd-lock](https://github.com/datawisesystems/etcd-lock) - Master election & distributed r/w lock implementation using etcd - Supports v2
- [etcd-console](https://github.com/matishsiao/etcd-console) - A web-base key/value editor for etcd using PHP
- [etcd-viewer](https://github.com/nikfoundas/etcd-viewer) - An etcd key-value store editor/viewer written in Java
- [etcdtool](https://github.com/mickep76/etcdtool) - Export/Import/Edit etcd directory as JSON/YAML/TOML and Validate directory using JSON schema
- [etcd-rest](https://github.com/mickep76/etcd-rest) - Create generic REST API in Go using etcd as a backend with validation using JSON schema
- [etcdsh](https://github.com/kamilhark/etcdsh) - A command line client with support of command history and tab completion. Supports v2
**Go libraries**
- [etcd/client](https://github.com/coreos/etcd/blob/master/client) - the officially maintained Go client
- [go-etcd](https://github.com/coreos/go-etcd) - the deprecated official client. May be useful for older (<2.0.0) versions of etcd.
- [go-etcd](https://github.com/coreos/go-etcd) - Supports v2
**Java libraries**
@@ -54,7 +49,6 @@
**C++ libraries**
- [edwardcapriolo/etcdcpp](https://github.com/edwardcapriolo/etcdcpp) - Supports v2
- [suryanathan/etcdcpp](https://github.com/suryanathan/etcdcpp) - Supports v2 (with waits)
**Clojure libraries**
@@ -68,7 +62,6 @@
**.Net Libraries**
- [wangjia184/etcdnet](https://github.com/wangjia184/etcdnet) - Supports v2
- [drusellers/etcetera](https://github.com/drusellers/etcetera)
**PHP Libraries**
@@ -87,6 +80,10 @@
- [efrecon/etcd-tcl](https://github.com/efrecon/etcd-tcl) - Supports v2, except wait.
A detailed recap of client functionalities can be found in the [clients compatibility matrix][clients-matrix.md].
[clients-matrix.md]: https://github.com/coreos/etcd/blob/master/Documentation/clients-matrix.md
**Chef Integration**
- [coderanger/etcd-chef](https://github.com/coderanger/etcd-chef)
@@ -114,7 +111,7 @@
- [configdb](https://git.autistici.org/ai/configdb/tree/master) - A REST relational abstraction on top of arbitrary database backends, aimed at storing configs and inventories.
- [scrz](https://github.com/scrz/scrz) - Container manager, stores configuration in etcd.
- [fleet](https://github.com/coreos/fleet) - Distributed init system
- [kubernetes/kubernetes](https://github.com/kubernetes/kubernetes) - Container cluster manager introduced by Google.
- [GoogleCloudPlatform/kubernetes](https://github.com/GoogleCloudPlatform/kubernetes) - Container cluster manager.
- [mailgun/vulcand](https://github.com/mailgun/vulcand) - HTTP proxy that uses etcd as a configuration backend.
- [duedil-ltd/discodns](https://github.com/duedil-ltd/discodns) - Simple DNS nameserver using etcd as a database for names and records.
- [skynetservices/skydns](https://github.com/skynetservices/skydns) - RFC compliant DNS server

View File

@@ -1,120 +0,0 @@
# Members API
* [List members](#list-members)
* [Add a member](#add-a-member)
* [Delete a member](#delete-a-member)
* [Change the peer urls of a member](#change-the-peer-urls-of-a-member)
## List members
Return an HTTP 200 OK response code and a representation of all members in the etcd cluster.
### Request
```
GET /v2/members HTTP/1.1
```
### Example
```sh
curl http://10.0.0.10:2379/v2/members
```
```json
{
"members": [
{
"id": "272e204152",
"name": "infra1",
"peerURLs": [
"http://10.0.0.10:2380"
],
"clientURLs": [
"http://10.0.0.10:2379"
]
},
{
"id": "2225373f43",
"name": "infra2",
"peerURLs": [
"http://10.0.0.11:2380"
],
"clientURLs": [
"http://10.0.0.11:2379"
]
},
]
}
```
## Add a member
Returns an HTTP 201 response code and the representation of added member with a newly generated a memberID when successful. Returns a string describing the failure condition when unsuccessful.
If the POST body is malformed an HTTP 400 will be returned. If the member exists in the cluster or existed in the cluster at some point in the past an HTTP 409 will be returned. If any of the given peerURLs exists in the cluster an HTTP 409 will be returned. If the cluster fails to process the request within timeout an HTTP 500 will be returned, though the request may be processed later.
### Request
```
POST /v2/members HTTP/1.1
{"peerURLs": ["http://10.0.0.10:2380"]}
```
### Example
```sh
curl http://10.0.0.10:2379/v2/members -XPOST \
-H "Content-Type: application/json" -d '{"peerURLs":["http://10.0.0.10:2380"]}'
```
```json
{
"id": "3777296169",
"peerURLs": [
"http://10.0.0.10:2380"
]
}
```
## Delete a member
Remove a member from the cluster. The member ID must be a hex-encoded uint64.
Returns 204 with empty content when successful. Returns a string describing the failure condition when unsuccessful.
If the member does not exist in the cluster an HTTP 500(TODO: fix this) will be returned. If the cluster fails to process the request within timeout an HTTP 500 will be returned, though the request may be processed later.
### Request
```
DELETE /v2/members/<id> HTTP/1.1
```
### Example
```sh
curl http://10.0.0.10:2379/v2/members/272e204152 -XDELETE
```
## Change the peer urls of a member
Change the peer urls of a given member. The member ID must be a hex-encoded uint64. Returns 204 with empty content when successful. Returns a string describing the failure condition when unsuccessful.
If the POST body is malformed an HTTP 400 will be returned. If the member does not exist in the cluster an HTTP 404 will be returned. If any of the given peerURLs exists in the cluster an HTTP 409 will be returned. If the cluster fails to process the request within timeout an HTTP 500 will be returned, though the request may be processed later.
### Request
```
PUT /v2/members/<id> HTTP/1.1
{"peerURLs": ["http://10.0.0.10:2380"]}
```
### Example
```sh
curl http://10.0.0.10:2379/v2/members/272e204152 -XPUT \
-H "Content-Type: application/json" -d '{"peerURLs":["http://10.0.0.10:2380"]}'
```

View File

@@ -1,95 +1,101 @@
# Metrics
## Metrics
**NOTE: The metrics feature is considered experimental. We may add/change/remove metrics without warning in future releases.**
**NOTE: The metrics feature is considered as an experimental. We might add/change/remove metrics without warning in the future releases.**
etcd uses [Prometheus](http://prometheus.io/) for metrics reporting in the server. The metrics can be used for real-time monitoring and debugging.
etcd only stores these data in memory. If a member restarts, metrics will reset.
The simplest way to see the available metrics is to cURL the metrics endpoint `/metrics` of etcd. The format is described [here](http://prometheus.io/docs/instrumenting/exposition_formats/).
You can also follow the doc [here](http://prometheus.io/docs/introduction/getting_started/) to start a Prometheus server and monitor etcd metrics.
You can also follow the doc [here](http://prometheus.io/docs/introduction/getting_started/) to start a Promethus server and monitor etcd metrics.
The naming of metrics follows the suggested [best practice of Prometheus](http://prometheus.io/docs/practices/naming/). A metric name has an `etcd` prefix as its namespace and a subsystem prefix (for example `wal` and `etcdserver`).
The naming of metrics follows the suggested [best practice of Promethus](http://prometheus.io/docs/practices/naming/). A metric name has an `etcd` prefix as its namespace and a subsystem prefix (for example `wal` and `etcdserver`).
etcd now exposes the following metrics:
## etcdserver
### etcdserver
| Name | Description | Type |
|-----------------------------------------|--------------------------------------------------|-----------|
| file_descriptors_used_total | The total number of file descriptors used | Gauge |
| proposal_durations_seconds | The latency distributions of committing proposal | Histogram |
| pending_proposal_total | The total number of pending proposals | Gauge |
| proposal_failed_total | The total number of failed proposals | Counter |
| Name | Description | Type |
|-----------------------------------------|--------------------------------------------------|---------|
| file_descriptors_used_total | The total number of file descriptors used | Gauge |
| proposal_durations_milliseconds | The latency distributions of committing proposal | Summary |
| pending_proposal_total | The total number of pending proposals | Gauge |
| proposal_failed_total | The total number of failed proposals | Counter |
High file descriptors (`file_descriptors_used_total`) usage (near the file descriptors limitation of the process) indicates a potential out of file descriptors issue. That might cause etcd fails to create new WAL files and panics.
[Proposal](glossary.md#proposal) durations (`proposal_durations_seconds`) give you a histogram about the proposal commit latency. Latency can be introduced into this process by network and disk IO.
[Proposal](glossary.md#proposal) durations (`proposal_durations_milliseconds`) give you an summary about the proposal commit latency. Latency can be introduced into this process by network and disk IO.
Pending proposal (`pending_proposal_total`) gives you an idea about how many proposal are in the queue and waiting for commit. An increasing pending number indicates a high client load or an unstable cluster.
Failed proposals (`proposal_failed_total`) are normally related to two issues: temporary failures related to a leader election or longer duration downtime caused by a loss of quorum in the cluster.
## wal
| Name | Description | Type |
|------------------------------------|--------------------------------------------------|-----------|
| fsync_durations_seconds | The latency distributions of fsync called by wal | Histogram |
| last_index_saved | The index of the last entry saved by wal | Gauge |
### store
Abnormally high fsync duration (`fsync_durations_seconds`) indicates disk issues and might cause the cluster to be unstable.
These metrics describe the accesses into the data store of etcd members that exist in the cluster. They
are useful to count what kind of actions are taken by users. It is also useful to see and whether all etcd members
"see" the same set of data mutations, and whether reads and watches (which are local) are equally distributed.
All these metrics are prefixed with `etcd_store_`.
## http requests
These metrics describe the serving of requests (non-watch events) served by etcd members in non-proxy mode: total
incoming requests, request failures and processing latency (inc. raft rounds for storage). They are useful for tracking
user-generated traffic hitting the etcd cluster .
All these metrics are prefixed with `etcd_http_`
| Name | Description | Type |
|--------------------------------|-----------------------------------------------------------------------------------------|--------------------|
| received_total | Total number of events after parsing and auth. | Counter(method) |
| failed_total | Total number of failed events.   | Counter(method,error) |
| successful_duration_second | Bucketed handling times of the requests, including raft rounds for writes. | Histogram(method) |
| Name | Description | Type |
|---------------------------|------------------------------------------------------------------------------------------|--------------------|
| reads_total | Total number of reads from store, should differ among etcd members (local reads). | Counter(action) |
| writes_total | Total number of writes to store, should be same among all etcd members. | Counter(action) |
| reads_failed_total | Number of failed reads from store (e.g. key missing) on local reads. | Counter(action) |
| writes_failed_total | Number of failed writes to store (e.g. failed compare and swap). | Counter(action) |
| expires_total | Total number of expired keys (due to TTL).   | Counter |
| watch_requests_totals | Total number of incoming watch requests to this etcd member (local watches). | Counter |
| watchers | Current count of active watchers on this etcd member. | Gauge |
Both `reads_total` and `writes_total` count both successful and failed requests. `reads_failed_total` and
`writes_failed_total` count failed requests. A lot of failed writes indicate possible contentions on keys (e.g. when
doing `compareAndSet`), and read failures indicate that some clients try to access keys that don't exist.
Example Prometheus queries that may be useful from these metrics (across all etcd members):
* `sum(rate(etcd_http_failed_total{job="etcd"}[1m]) by (method) / sum(rate(etcd_http_events_received_total{job="etcd"})[1m]) by (method)`
* `sum(rate(etcd_store_reads_total{job="etcd"}[1m])) by (action)`
`max(rate(etcd_store_writes_total{job="etcd"}[1m])) by (action)`
Shows the fraction of events that failed by HTTP method across all members, across a time window of `1m`.
* `sum(rate(etcd_http_received_total{job="etcd",method="GET})[1m]) by (method)`
`sum(rate(etcd_http_received_total{job="etcd",method~="GET})[1m]) by (method)`
Rate of reads and writes by action, across all servers across a time window of `1m`. The reason why `max` is used
for writes as opposed to `sum` for reads is because all of etcd nodes in the cluster apply all writes to their stores.
Shows the rate of successfull readonly/write queries across all servers, across a time window of `1m`.
* `sum(rate(etcd_store_watch_requests_total{job="etcd"}[1m]))`
Shows the rate of successful readonly/write queries across all servers, across a time window of `1m`.
Shows rate of new watch requests per second. Likely driven by how often watched keys change.
* `sum(etcd_store_watchers{job="etcd"})`
* `histogram_quantile(0.9, sum(increase(etcd_http_successful_processing_seconds{job="etcd",method="GET"}[5m]) ) by (le))`
`histogram_quantile(0.9, sum(increase(etcd_http_successful_processing_seconds{job="etcd",method!="GET"}[5m]) ) by (le))`
Show the 0.90-tile latency (in seconds) of read/write (respectively) event handling across all members, with a window of `5m`.
## snapshot
| Name | Description | Type |
|--------------------------------------------|------------------------------------------------------------|-----------|
| snapshot_save_total_durations_seconds | The total latency distributions of save called by snapshot | Histogram |
Abnormally high snapshot duration (`snapshot_save_total_durations_seconds`) indicates disk issues and might cause the cluster to be unstable.
Number of active watchers across all etcd servers.
## rafthttp
### wal
| Name | Description | Type | Labels |
|-----------------------------------|--------------------------------------------|--------------|--------------------------------|
| message_sent_latency_seconds | The latency distributions of messages sent | HistogramVec | sendingType, msgType, remoteID |
| message_sent_failed_total | The total number of failed messages sent | Summary | sendingType, msgType, remoteID |
| Name | Description | Type |
|------------------------------------|--------------------------------------------------|---------|
| fsync_durations_microseconds | The latency distributions of fsync called by wal | Summary |
| last_index_saved | The index of the last entry saved by wal | Gauge |
Abnormally high fsync duration (`fsync_durations_microseconds`) indicates disk issues and might cause the cluster to be unstable.
### snapshot
| Name | Description | Type |
|--------------------------------------------|------------------------------------------------------------|---------|
| snapshot_save_total_durations_microseconds | The total latency distributions of save called by snapshot | Summary |
Abnormally high snapshot duration (`snapshot_save_total_durations_microseconds`) indicates disk issues and might cause the cluster to be unstable.
Abnormally high message duration (`message_sent_latency_seconds`) indicates network issues and might cause the cluster to be unstable.
### rafthttp
| Name | Description | Type | Labels |
|-----------------------------------|--------------------------------------------|---------|--------------------------------|
| message_sent_latency_microseconds | The latency distributions of messages sent | Summary | sendingType, msgType, remoteID |
| message_sent_failed_total | The total number of failed messages sent | Summary | sendingType, msgType, remoteID |
Abnormally high message duration (`message_sent_latency_microseconds`) indicates network issues and might cause the cluster to be unstable.
An increase in message failures (`message_sent_failed_total`) indicates more severe network issues and might cause the cluster to be unstable.
@@ -100,7 +106,7 @@ Label `msgType` is the type of raft message. `MsgApp` is log replication message
Label `remoteID` is the member ID of the message destination.
## proxy
### proxy
etcd members operating in proxy mode do not do store operations. They forward all requests
to cluster instances.
@@ -124,8 +130,8 @@ Example Prometheus queries that may be useful from these metrics (across all etc
* `histogram_quantile(0.9, sum(increase(etcd_proxy_events_handling_time_seconds_bucket{job="etcd",method="GET"}[5m])) by (le))`
`histogram_quantile(0.9, sum(increase(etcd_proxy_events_handling_time_seconds_bucket{job="etcd",method!="GET"}[5m])) by (le))`
Show the 0.90-tile latency (in seconds) of handling of user requests across all proxy machines, with a window of `5m`.
Show the 0.90-tile latency (in seconds) of handling of user requestsacross all proxy machines, with a window of `5m`.
* `sum(rate(etcd_proxy_dropped_total{job="etcd"}[1m])) by (proxying_error)`
Number of failed request on the proxy. This should be 0, spikes here indicate connectivity issues to etcd cluster.

View File

@@ -1,28 +1,119 @@
# Miscellaneous APIs
## Members API
* [Getting the etcd version](#getting-the-etcd-version)
* [Checking health of an etcd member node](#checking-health-of-an-etcd-member-node)
* [List members](#list-members)
* [Add a member](#add-a-member)
* [Delete a member](#delete-a-member)
* [Change the peer urls of a member](#change-the-peer-urls-of-a-member)
## Getting the etcd version
## List members
The etcd version of a specific instance can be obtained from the `/version` endpoint.
Return an HTTP 200 OK response code and a representation of all members in the etcd cluster.
### Request
```
GET /v2/members HTTP/1.1
```
### Example
```sh
curl -L http://127.0.0.1:2379/version
```
```
etcd 2.0.12
```
## Checking health of an etcd member node
etcd provides a `/health` endpoint to verify the health of a particular member.
```sh
curl http://10.0.0.10:2379/health
curl http://10.0.0.10:2379/v2/members
```
```json
{"health": "true"}
{
"members": [
{
"id": "272e204152",
"name": "infra1",
"peerURLs": [
"http://10.0.0.10:2380"
],
"clientURLs": [
"http://10.0.0.10:2379"
]
},
{
"id": "2225373f43",
"name": "infra2",
"peerURLs": [
"http://10.0.0.11:2380"
],
"clientURLs": [
"http://10.0.0.11:2379"
]
},
]
}
```
## Add a member
Returns an HTTP 201 response code and the representation of added member with a newly generated a memberID when successful. Returns a string describing the failure condition when unsuccessful.
If the POST body is malformed an HTTP 400 will be returned. If the member exists in the cluster or existed in the cluster at some point in the past an HTTP 409 will be returned. If any of the given peerURLs exists in the cluster an HTTP 409 will be returned. If the cluster fails to process the request within timeout an HTTP 500 will be returned, though the request may be processed later.
### Request
```
POST /v2/members HTTP/1.1
{"peerURLs": ["http://10.0.0.10:2380"]}
```
### Example
```sh
curl http://10.0.0.10:2379/v2/members -XPOST \
-H "Content-Type: application/json" -d '{"peerURLs":["http://10.0.0.10:2380"]}'
```
```json
{
"id": "3777296169",
"peerURLs": [
"http://10.0.0.10:2380"
]
}
```
## Delete a member
Remove a member from the cluster. The member ID must be a hex-encoded uint64.
Returns 204 with empty content when successful. Returns a string describing the failure condition when unsuccessful.
If the member does not exist in the cluster an HTTP 500(TODO: fix this) will be returned. If the cluster fails to process the request within timeout an HTTP 500 will be returned, though the request may be processed later.
### Request
```
DELETE /v2/members/<id> HTTP/1.1
```
### Example
```sh
curl http://10.0.0.10:2379/v2/members/272e204152 -XDELETE
```
## Change the peer urls of a member
Change the peer urls of a given member. The member ID must be a hex-encoded uint64. Returns 204 with empty content when successful. Returns a string describing the failure condition when unsuccessful.
If the POST body is malformed an HTTP 400 will be returned. If the member does not exist in the cluster an HTTP 404 will be returned. If any of the given peerURLs exists in the cluster an HTTP 409 will be returned. If the cluster fails to process the request within timeout an HTTP 500 will be returned, though the request may be processed later.
#### Request
```
PUT /v2/members/<id> HTTP/1.1
{"peerURLs": ["http://10.0.0.10:2380"]}
```
#### Example
```sh
curl http://10.0.0.10:2379/v2/members/272e204152 -XPUT \
-H "Content-Type: application/json" -d '{"peerURLs":["http://10.0.0.10:2380"]}'
```

View File

@@ -0,0 +1,4 @@
etcd is being used successfully by many companies in production. It is,
however, under active development and systems like etcd are difficult to get
correct. If you are comfortable with bleeding-edge software please use etcd and
provide us with the feedback and testing young software needs.

View File

@@ -1,49 +0,0 @@
# Production Users
This document tracks people and use cases for etcd in production. By creating a list of production use cases we hope to build a community of advisors that we can reach out to with experience using various etcd applications, operation environments, and cluster sizes. The etcd development team may reach out periodically to check-in on your experience and update this list.
## discovery.etcd.io
- *Application*: https://github.com/coreos/discovery.etcd.io
- *Launched*: Feb. 2014
- *Cluster Size*: 5 members, 5 discovery proxies
- *Order of Data Size*: 100s of Megabytes
- *Operator*: CoreOS, brandon.philips@coreos.com
- *Environment*: AWS
- *Backups*: Periodic async to S3
discovery.etcd.io is the longest continuously running etcd backed service that we know about. It is the basis of automatic cluster bootstrap and was launched in Feb. 2014: https://coreos.com/blog/etcd-0.3.0-released/.
## OpenTable
- *Application*: OpenTable internal service discovery and cluster configuration management
- *Launched*: May 2014
- *Cluster Size*: 3 members each in 6 independent clusters; approximately 50 nodes reading / writing
- *Order of Data Size*: 10s of MB
- *Operator*: OpenTable, Inc; sschlansker@opentable.com
- *Environment*: AWS, VMWare
- *Backups*: None, all data can be re-created if necessary.
## cycoresys.com
- *Application*: multiple
- *Launched*: Jul. 2014
- *Cluster Size*: 3 members, _n_ proxies
- *Order of Data Size*: 100s of kilobytes
- *Operator*: CyCore Systems, Inc, sys@cycoresys.com
- *Environment*: Baremetal
- *Backups*: Periodic sync to Ceph RadosGW and DigitalOcean VM
CyCore Systems provides architecture and engineering for computing systems. This cluster provides microservices, virtual machines, databases, storage clusters to a number of clients. It is built on CoreOS machines, with each machine in the cluster running etcd as a peer or proxy.
## Radius Intelligence
- *Application*: multiple internal tools, Kubernetes clusters, bootstrappable system configs
- *Launched*: June 2015
- *Cluster Size*: 2 clusters of 5 and 3 members; approximately a dozen nodes read/write
- *Order of Data Size*: 100s of kilobytes
- *Operator*: Radius Intelligence; jcderr@radius.com
- *Environment*: AWS, CoreOS, Kubernetes
- *Backups*: None, all data can be recreated if necessary.
Radius Intelligence uses Kubernetes running CoreOS to containerize and scale internal toolsets. Examples include running [JetBrains TeamCity](https://www.jetbrains.com/teamcity/) and internal AWS security and cost reporting tools. etcd clusters back these clusters as well as provide some basic environment bootstrapping configuration keys.

View File

@@ -1,150 +1,37 @@
# Proxy
## Proxy
etcd can run as a transparent proxy. Doing so allows for easy discovery of etcd within your infrastructure, since it can run on each machine as a local service. In this mode, etcd acts as a reverse proxy and forwards client requests to an active etcd cluster. The etcd proxy does not participate in the consensus replication of the etcd cluster, thus it neither increases the resilience nor decreases the write performance of the etcd cluster.
etcd can now run as a transparent proxy. Running etcd as a proxy allows for easily discovery of etcd within your infrastructure, since it can run on each machine as a local service. In this mode, etcd acts as a reverse proxy and forwards client requests to an active etcd cluster. The etcd proxy does not participate in the consensus replication of the etcd cluster, thus it neither increases the resilience nor decreases the write performance of the etcd cluster.
etcd currently supports two proxy modes: `readwrite` and `readonly`. The default mode is `readwrite`, which forwards both read and write requests to the etcd cluster. A `readonly` etcd proxy only forwards read requests to the etcd cluster, and returns `HTTP 501` to all write requests.
etcd currently supports two proxy modes: `readwrite` and `readonly`. The default mode is `readwrite`, which forwards both read and write requests to the etcd cluster. A `readonly` etcd proxy only forwards read requests to the etcd cluster, and returns `HTTP 501` to all write requests.
The proxy will shuffle the list of cluster members periodically to avoid sending all connections to a single member.
The member list used by an etcd proxy consists of all client URLs advertised in the cluster. These client URLs are specified in each etcd cluster member's `advertise-client-urls` option.
The member list used by proxy consists of all client URLs advertised within the cluster, as specified in each members' `-advertise-client-urls` flag. If this flag is set incorrectly, requests sent to the proxy are forwarded to wrong addresses and then fail. Including URLs in the `-advertise-client-urls` flag that point to the proxy itself, e.g. http://localhost:2379, is even more problematic as it will cause loops, because the proxy keeps trying to forward requests to itself until its resources (memory, file descriptors) are eventually depleted. The fix for this problem is to restart etcd member with correct `-advertise-client-urls` flag. After client URLs list in proxy is recalculated, which happens every 30 seconds, requests will be forwarded correctly.
An etcd proxy examines several command-line options to discover its peer URLs. In order of precedence, these options are `discovery`, `discovery-srv`, and `initial-cluster`. The `initial-cluster` option is set to a comma-separated list of one or more etcd peer URLs used temporarily in order to discover the permanent cluster.
After establishing a list of peer URLs in this manner, the proxy retrieves the list of client URLs from the first reachable peer. These client URLs are specified by the `advertise-client-urls` option to etcd peers. The proxy then continues to connect to the first reachable etcd cluster member every thirty seconds to refresh the list of client URLs.
While etcd proxies therefore do not need to be given the `advertise-client-urls` option, as they retrieve this configuration from the cluster, this implies that `initial-cluster` must be set correctly for every proxy, and the `advertise-client-urls` option must be set correctly for every non-proxy, first-order cluster peer. Otherwise, requests to any etcd proxy would be forwarded improperly. Take special care not to set the `advertise-client-urls` option to URLs that point to the proxy itself, as such a configuration will cause the proxy to enter a loop, forwarding requests to itself until resources are exhausted. To correct either case, stop etcd and restart it with the correct URLs.
[This example Procfile](https://github.com/coreos/etcd/blob/master/Procfile) illustrates the difference in the etcd peer and proxy command lines used to configure and start a cluster with one proxy under the [goreman process management utility](https://github.com/mattn/goreman).
To summarize etcd proxy startup and peer discovery:
1. etcd proxies execute the following steps in order until the cluster *peer-urls* are known:
1. If `discovery` is set for the proxy, ask the given discovery service for
the *peer-urls*. The *peer-urls* will be the combined
`initial-advertise-peer-urls` of all first-order, non-proxy cluster
members.
2. If `discovery-srv` is set for the proxy, the *peer-urls* are discovered
from DNS.
3. If `initial-cluster` is set for the proxy, that will become the value of
*peer-urls*.
4. Otherwise use the default value of
`http://localhost:2380,http://localhost:7001`.
2. These *peer-urls* are used to contact the (non-proxy) members of the cluster
to find their *client-urls*. The *client-urls* will thus be the combined
`advertise-client-urls` of all cluster members (i.e. non-proxies).
3. Request of clients of the proxy will be forwarded (proxied) to these
*client-urls*.
Always start the first-order etcd cluster members first, then any proxies. A proxy must be able to reach the cluster members to retrieve its configuration, and will attempt connections somewhat aggressively in the absence of such a channel. Starting the members before any proxy ensures the proxy can discover the client URLs when it later starts.
## Using an etcd proxy
To start etcd in proxy mode, you need to provide three flags: `proxy`, `listen-client-urls`, and `initial-cluster` (or `discovery`).
### Using an etcd proxy
To start etcd in proxy mode, you need to provide three flags: `proxy`, `listen-client-urls`, and `initial-cluster` (or `discovery`).
To start a readwrite proxy, set `-proxy on`; To start a readonly proxy, set `-proxy readonly`.
The proxy will be listening on `listen-client-urls` and forward requests to the etcd cluster discovered from in `initial-cluster` or `discovery` url.
The proxy will be listening on `listen-client-urls` and forward requests to the etcd cluster discovered from in `initial-cluster` or `discovery` url.
### Start an etcd proxy with a static configuration
#### Start an etcd proxy with a static configuration
To start a proxy that will connect to a statically defined etcd cluster, specify the `initial-cluster` flag:
```
etcd -proxy on \
-listen-client-urls http://127.0.0.1:8080 \
-initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380
etcd -proxy on -listen-client-urls http://127.0.0.1:8080 -initial-cluster infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380,infra2=http://10.0.1.12:2380
```
### Start an etcd proxy with the discovery service
If you bootstrap an etcd cluster using the [discovery service][discovery-service], you can also start the proxy with the same `discovery`.
#### Start an etcd proxy with the discovery service
If you bootstrap an etcd cluster using the [discovery service][discovery-service], you can also start the proxy with the same `discovery`.
To start a proxy using the discovery service, specify the `discovery` flag. The proxy will wait until the etcd cluster defined at the `discovery` url finishes bootstrapping, and then start to forward the requests.
To start a proxy using the discovery service, specify the `discovery` flag. The proxy will wait until the etcd cluster defined at the `discovery` url finishes bootstrapping, and then start to forward the requests.
```
etcd -proxy on \
-listen-client-urls http://127.0.0.1:8080 \
-discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
etcd -proxy on -listen-client-urls http://127.0.0.1:8080 -discovery https://discovery.etcd.io/3e86b59982e49066c5d813af1c2e2579cbf573de
```
## Fallback to proxy mode with discovery service
If you bootstrap an etcd cluster using [discovery service][discovery-service] with more than the expected number of etcd members, the extra etcd processes will fall back to being `readwrite` proxies by default. They will forward the requests to the cluster as described above. For example, if you create a discovery url with `size=5`, and start ten etcd processes using that same discovery url, the result will be a cluster with five etcd members and five proxies. Note that this behaviour can be disabled with the `proxy-fallback` flag.
## Promote a proxy to a member of etcd cluster
A Proxy is in the part of etcd cluster that does not participate in consensus. A proxy will not promote itself to an etcd member that participates in consensus automatically in any case.
If you want to promote a proxy to an etcd member, there are four steps you need to follow:
- use etcdctl to add the proxy node as an etcd member into the existing cluster
- stop the etcd proxy process or service
- remove the existing proxy data directory
- restart the etcd process with new member configuration
## Example
We assume you have a one member etcd cluster with one proxy. The cluster information is listed below:
|Name|Address|
|------|---------|
|infra0|10.0.1.10|
|proxy0|10.0.1.11|
This example walks you through a case that you promote one proxy to an etcd member. The cluster will become a two member cluster after finishing the four steps.
### Add a new member into the existing cluster
First, use etcdctl to add the member to the cluster, which will output the environment variables need to correctly configure the new member:
``` bash
$ etcdctl -endpoint http://10.0.1.10:2379 member add infra1 http://10.0.1.11:2380
added member 9bf1b35fc7761a23 to cluster
ETCD_NAME="infra1"
ETCD_INITIAL_CLUSTER="infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380"
ETCD_INITIAL_CLUSTER_STATE=existing
```
### Stop the proxy process
Stop the existing proxy so we can wipe it's state on disk and reload it with the new configuration:
``` bash
px aux | grep etcd
kill %etcd_proxy_pid%
```
or (if you are running etcd proxy as etcd service under systemd)
``` bash
sudo systemctl stop etcd
```
### Remove the existing proxy data dir
``` bash
rm -rf %data_dir%/proxy
```
### Start etcd as a new member
Finally, start the reconfigured member and make sure it joins the cluster correctly:
``` bash
$ export ETCD_NAME="infra1"
$ export ETCD_INITIAL_CLUSTER="infra0=http://10.0.1.10:2380,infra1=http://10.0.1.11:2380"
$ export ETCD_INITIAL_CLUSTER_STATE=existing
$ etcd -listen-client-urls http://10.0.1.11:2379 \
-advertise-client-urls http://10.0.1.11:2379 \
-listen-peer-urls http://10.0.1.11:2380 \
-initial-advertise-peer-urls http://10.0.1.11:2380 \
-data-dir %data_dir%
```
If you are running etcd under systemd, you should modify the service file with correct configuration and restart the service:
``` bash
sudo systemd restart etcd
```
If you see an error, you can read the [add member troubleshooting doc](runtime-configuration.md#error-cases).
#### Fallback to proxy mode with discovery service
If you bootstrap a etcd cluster using [discovery service][discovery-service] with more than the expected number of etcd members, the extra etcd processes will fall back to being `readwrite` proxies by default. They will forward the requests to the cluster as described above. For example, if you create a discovery url with `size=5`, and start ten etcd processes using that same discovery url, the result will be a cluster with five etcd members and five proxies. Note that this behaviour can be disabled with the `proxy-fallback` flag.
[discovery-service]: clustering.md#discovery

View File

@@ -1,4 +1,4 @@
# Reporting Bugs
## Reporting Bugs
If you find bugs or documentation mistakes in etcd project, please let us know by [opening an issue](https://github.com/coreos/etcd/issues/new). We treat bugs and mistakes very seriously and believe no issue is too small. Before creating a bug report, please check there that one does not already exist.
@@ -20,7 +20,7 @@ We might ask you for further information to locate a bug. A duplicated bug repor
## Frequently Asked Questions
### How to get a stack trace
### How to get stack trace
``` bash
$ kill -QUIT $PID

View File

@@ -1,10 +1,4 @@
# Overview
The etcd v3 API is designed to give users a more efficient and cleaner abstraction compared to etcd v2. There are a number of semantic and protocol changes in this new API. For an overview [see Xiang Li's video](https://youtu.be/J5AioGtEPeQ?t=211).
To prove out the design of the v3 API the team has also built [a number of example recipes](https://github.com/coreos/etcd/tree/master/contrib/recipes), there is a [video discussing these recipes too](https://www.youtube.com/watch?v=fj-2RY-3yVU&feature=youtu.be&t=590).
# Design
## Design
1. Flatten binary key-value space
@@ -34,24 +28,13 @@ To prove out the design of the v3 API the team has also built [a number of examp
- easy for people to write simple etcd application
## Notes
### Request Size Limitation
The max request size is around 1MB. Since etcd replicates requests in a streaming fashion, a very large
request might block other requests for a long time. The use case for etcd is to store small configuration
values, so we prevent user from submitting large requests. This also applies to Txn requests. We might loosen
the size in the future a little bit or make it configurable.
## Protobuf Defined API
[api protobuf](../../etcdserver/etcdserverpb/rpc.proto)
[protobuf](./v3api.proto)
[kv protobuf](../../storage/storagepb/kv.proto)
### Examples
## Examples
### Put a key (foo=bar)
#### Put a key (foo=bar)
```
// A put is always successful
Put( PutRequest { key = foo, value = bar } )
@@ -64,7 +47,7 @@ PutResponse {
}
```
### Get a key (assume we have foo=bar)
#### Get a key (assume we have foo=bar)
```
Get ( RangeRequest { key = foo } )
@@ -85,7 +68,7 @@ RangeResponse {
}
```
### Range over a key space (assume we have foo0=bar0… foo100=bar100)
#### Range over a key space (assume we have foo0=bar0… foo100=bar100)
```
Range ( RangeRequest { key = foo, end_key = foo80, limit = 30 } )
@@ -114,7 +97,7 @@ RangeResponse {
}
```
### Finish a txn (assume we have foo0=bar0, foo1=bar1)
#### Finish a txn (assume we have foo0=bar0, foo1=bar1)
```
Txn(TxnRequest {
// mod_revision of foo0 is equal to 1, mod_revision of foo1 is greater than 1
@@ -146,7 +129,7 @@ TxnResponse {
}
```
### Watch on a key/range
#### Watch on a key/range
```
Watch( WatchRequest{

View File

@@ -0,0 +1,285 @@
syntax = "proto3";
// Interface exported by the server.
service etcd {
// Range gets the keys in the range from the store.
rpc Range(RangeRequest) returns (RangeResponse) {}
// Put puts the given key into the store.
// A put request increases the revision of the store,
// and generates one event in the event history.
rpc Put(PutRequest) returns (PutResponse) {}
// Delete deletes the given range from the store.
// A delete request increase the revision of the store,
// and generates one event in the event history.
rpc DeleteRange(DeleteRangeRequest) returns (DeleteRangeResponse) {}
// Txn processes all the requests in one transaction.
// A txn request increases the revision of the store,
// and generates events with the same revision in the event history.
rpc Txn(TxnRequest) returns (TxnResponse) {}
// Watch watches the events happening or happened in etcd. Both input and output
// are stream. One watch rpc can watch for multiple ranges and get a stream of
// events. The whole events history can be watched unless compacted.
rpc WatchRange(stream WatchRangeRequest) returns (stream WatchRangeResponse) {}
// Compact compacts the event history in etcd. User should compact the
// event history periodically, or it will grow infinitely.
rpc Compact(CompactionRequest) returns (CompactionResponse) {}
// LeaseCreate creates a lease. A lease has a TTL. The lease will expire if the
// server does not receive a keepAlive within TTL from the lease holder.
// All keys attached to the lease will be expired and deleted if the lease expires.
// The key expiration generates an event in event history.
rpc LeaseCreate(LeaseCreateRequest) returns (LeaseCreateResponse) {}
// LeaseRevoke revokes a lease. All the key attached to the lease will be expired and deleted.
rpc LeaseRevoke(LeaseRevokeRequest) returns (LeaseRevokeResponse) {}
// LeaseAttach attaches keys with a lease.
rpc LeaseAttach(LeaseAttachRequest) returns (LeaseAttachResponse) {}
// LeaseTxn likes Txn. It has two addition success and failure LeaseAttachRequest list.
// If the Txn is successful, then the success list will be executed. Or the failure list
// will be executed.
rpc LeaseTxn(LeaseTxnRequest) returns (LeaseTxnResponse) {}
// KeepAlive keeps the lease alive.
rpc LeaseKeepAlive(stream LeaseKeepAliveRequest) returns (stream LeaseKeepAliveResponse) {}
}
message ResponseHeader {
// an error type message?
string error = 1;
uint64 cluster_id = 2;
uint64 member_id = 3;
// revision of the store when the request was applied.
int64 revision = 4;
// term of raft when the request was applied.
uint64 raft_term = 5;
}
message RangeRequest {
// if the range_end is not given, the request returns the key.
bytes key = 1;
// if the range_end is given, it gets the keys in range [key, range_end).
bytes range_end = 2;
// limit the number of keys returned.
int64 limit = 3;
// range over the store at the given revision.
// if revision is less or equal to zero, range over the newest store.
// if the revision has been compacted, ErrCompaction will be returned in
// response.
int64 revision = 4;
}
message RangeResponse {
ResponseHeader header = 1;
repeated storagepb.KeyValue kvs = 2;
// more indicates if there are more keys to return in the requested range.
bool more = 3;
}
message PutRequest {
bytes key = 1;
bytes value = 2;
}
message PutResponse {
ResponseHeader header = 1;
}
message DeleteRangeRequest {
// if the range_end is not given, the request deletes the key.
bytes key = 1;
// if the range_end is given, it deletes the keys in range [key, range_end).
bytes range_end = 2;
}
message DeleteRangeResponse {
ResponseHeader header = 1;
}
message RequestUnion {
oneof request {
RangeRequest request_range = 1;
PutRequest request_put = 2;
DeleteRangeRequest request_delete_range = 3;
}
}
message ResponseUnion {
oneof response {
RangeResponse response_range = 1;
PutResponse response_put = 2;
DeleteRangeResponse response_delete_range = 3;
}
}
message Compare {
enum CompareResult {
EQUAL = 0;
GREATER = 1;
LESS = 2;
}
enum CompareTarget {
VERSION = 0;
CREATE = 1;
MOD = 2;
VALUE= 3;
}
CompareResult result = 1;
CompareTarget target = 2;
// key path
bytes key = 3;
oneof target_union {
// version of the given key
int64 version = 4;
// create revision of the given key
int64 create_revision = 5;
// last modified revision of the given key
int64 mod_revision = 6;
// value of the given key
bytes value = 7;
}
}
// If the comparisons succeed, then the success requests will be processed in order,
// and the response will contain their respective responses in order.
// If the comparisons fail, then the failure requests will be processed in order,
// and the response will contain their respective responses in order.
// From google paxosdb paper:
// Our implementation hinges around a powerful primitive which we call MultiOp. All other database
// operations except for iteration are implemented as a single call to MultiOp. A MultiOp is applied atomically
// and consists of three components:
// 1. A list of tests called guard. Each test in guard checks a single entry in the database. It may check
// for the absence or presence of a value, or compare with a given value. Two different tests in the guard
// may apply to the same or different entries in the database. All tests in the guard are applied and
// MultiOp returns the results. If all tests are true, MultiOp executes t op (see item 2 below), otherwise
// it executes f op (see item 3 below).
// 2. A list of database operations called t op. Each operation in the list is either an insert, delete, or
// lookup operation, and applies to a single database entry. Two different operations in the list may apply
// to the same or different entries in the database. These operations are executed
// if guard evaluates to
// true.
// 3. A list of database operations called f op. Like t op, but executed if guard evaluates to false.
message TxnRequest {
repeated Compare compare = 1;
repeated RequestUnion success = 2;
repeated RequestUnion failure = 3;
}
message TxnResponse {
ResponseHeader header = 1;
bool succeeded = 2;
repeated ResponseUnion responses = 3;
}
message KeyValue {
bytes key = 1;
int64 create_revision = 2;
// mod_revision is the last modified revision of the key.
int64 mod_revision = 3;
// version is the version of the key. A deletion resets
// the version to zero and any modification of the key
// increases its version.
int64 version = 4;
bytes value = 5;
}
message WatchRangeRequest {
// if the range_end is not given, the request returns the key.
bytes key = 1;
// if the range_end is given, it gets the keys in range [key, range_end).
bytes range_end = 2;
// start_revision is an optional revision (including) to watch from. No start_revision is "now".
int64 start_revision = 3;
// end_revision is an optional revision (excluding) to end watch. No end_revision is "forever".
int64 end_revision = 4;
bool progress_notification = 5;
}
message WatchRangeResponse {
ResponseHeader header = 1;
repeated Event events = 2;
}
message Event {
enum EventType {
PUT = 0;
DELETE = 1;
EXPIRE = 2;
}
EventType event_type = 1;
// a put event contains the current key-value
// a delete/expire event contains the previous
// key-value
KeyValue kv = 2;
}
// Compaction compacts the kv store upto the given revision (including).
// It removes the old versions of a key. It keeps the newest version of
// the key even if its latest modification revision is smaller than the given
// revision.
message CompactionRequest {
int64 revision = 1;
}
message CompactionResponse {
ResponseHeader header = 1;
}
message LeaseCreateRequest {
// advisory ttl in seconds
int64 ttl = 1;
}
message LeaseCreateResponse {
ResponseHeader header = 1;
int64 lease_id = 2;
// server decided ttl in second
int64 ttl = 3;
string error = 4;
}
message LeaseRevokeRequest {
int64 lease_id = 1;
}
message LeaseRevokeResponse {
ResponseHeader header = 1;
}
message LeaseTxnRequest {
TxnRequest request = 1;
repeated LeaseAttachRequest success = 2;
repeated LeaseAttachRequest failure = 3;
}
message LeaseTxnResponse {
ResponseHeader header = 1;
TxnResponse response = 2;
repeated LeaseAttachResponse attach_responses = 3;
}
message LeaseAttachRequest {
int64 lease_id = 1;
bytes key = 2;
}
message LeaseAttachResponse {
ResponseHeader header = 1;
}
message LeaseKeepAliveRequest {
int64 lease_id = 1;
}
message LeaseKeepAliveResponse {
ResponseHeader header = 1;
int64 lease_id = 2;
int64 ttl = 3;
}

View File

@@ -1,8 +1,8 @@
# Runtime Reconfiguration
## Runtime Reconfiguration
etcd comes with support for incremental runtime reconfiguration, which allows users to update the membership of the cluster at run time.
Reconfiguration requests can only be processed when the majority of the cluster members are functioning. It is **highly recommended** to always have a cluster size greater than two in production. It is unsafe to remove a member from a two member cluster. The majority of a two member cluster is also two. If there is a failure during the removal process, the cluster might not able to make progress and need to [restart from majority failure][majority failure].
Reconfiguration requests can only be processed when the the majority of the cluster members are functioning. It is **highly recommended** to always have a cluster size greater than two in production. It is unsafe to remove a member from a two member cluster. The majority of a two member cluster is also two. If there is a failure during the removal process, the cluster might not able to make progress and need to [restart from majority failure][majority failure].
To better understand the design behind runtime reconfiguration, we suggest you read [this](runtime-reconf-design.md).
@@ -60,24 +60,11 @@ All changes to the cluster are done one at a time:
* To decrease from 5 to 3 you will make two remove operations
All of these examples will use the `etcdctl` command line tool that ships with etcd.
If you want to use the members API directly you can find the documentation [here](members_api.md).
If you want to use the member API directly you can find the documentation [here](other_apis.md).
### Update a Member
#### Update advertise client URLs
If you would like to update the advertise client URLs of a member, you can simply restart
that member with updated client urls flag (`--advertise-client-urls`) or environment variable
(`ETCD_ADVERTISE_CLIENT_URLS`). The restarted member will self publish the updated URLs.
A wrongly updated client URL will not affect the health of the etcd cluster.
#### Update advertise peer URLs
If you would like to update the advertise peer URLs of a member, you have to first update
it explicitly via member command and then restart the member. The additional action is required
since updating peer URLs changes the cluster wide configuration and can affect the health of the etcd cluster.
To update the peer URLs, first, we need to find the target member's ID. You can list all members with `etcdctl`:
If you would like to update a member IP address (peerURLs), first, we need to find the target member's ID. You can list all members with `etcdctl`:
```sh
$ etcdctl member list
@@ -115,7 +102,7 @@ It is safe to remove the leader, however the cluster will be inactive while a ne
Adding a member is a two step process:
* Add the new member to the cluster via the [members API](members_api.md#post-v2members) or the `etcdctl member add` command.
* Add the new member to the cluster via the [members API](other_apis.md#post-v2members) or the `etcdctl member add` command.
* Start the new member with the new cluster configuration, including a list of the updated members (existing members + the new member).
Using `etcdctl` let's add the new member to the cluster by specifying its [name](configuration.md#-name) and [advertised peer URLs](configuration.md#-initial-advertise-peer-urls):
@@ -144,7 +131,7 @@ The new member will run as a part of the cluster and immediately begin catching
If you are adding multiple members the best practice is to configure a single member at a time and verify it starts correctly before adding more new members.
If you add a new member to a 1-node cluster, the cluster cannot make progress before the new member starts because it needs two members as majority to agree on the consensus. You will only see this behavior between the time `etcdctl member add` informs the cluster about the new member and the new member successfully establishing a connection to the existing one.
#### Error Cases When Adding Members
#### Error Cases
In the following case we have not included our new host in the list of enumerated nodes.
If this is a new cluster, the node must be added to the list of initial cluster members.
@@ -167,18 +154,10 @@ etcdserver: assign ids error: unmatched member while checking PeerURLs
exit 1
```
When we start etcd using the data directory of a removed member, etcd will exit automatically if it connects to any active member in the cluster:
When we start etcd using the data directory of a removed member, etcd will exit automatically if it connects to any alive member in the cluster:
```sh
$ etcd
etcd: this member has been permanently removed from the cluster. Exiting.
exit 1
```
### Strict Reconfiguration Check Mode (`-strict-reconfig-check`)
As described in the above, the best practice of adding new members is to configure a single member at a time and verify it starts correctly before adding more new members. This step by step approach is very important because if newly added members is not configured correctly (for example the peer URLs are incorrect), the cluster can lose quorum. The quorum loss happens since the newly added member are counted in the quorum even if that member is not reachable from other existing members. Also quorum loss might happen if there is a connectivity issue or there are operational issues.
For avoiding this problem, etcd provides an option `-strict-reconfig-check`. If this option is passed to etcd, etcd rejects reconfiguration requests if the number of started members will be less than a quorum of the reconfigured cluster.
It is recommended to enable this option. However, it is disabled by default because of keeping compatibility.

View File

@@ -1,10 +1,10 @@
# Design of Runtime Reconfiguration
### Design of Runtime Reconfiguration
Runtime reconfiguration is one of the hardest and most error prone features in a distributed system, especially in a consensus based system like etcd.
Read on to learn about the design of etcd's runtime reconfiguration commands and how we tackled these problems.
## Two Phase Config Changes Keep you Safe
### Two Phase Config Changes Keep you Safe
In etcd, every runtime reconfiguration has to go through [two phases](Documentation/runtime-configuration.md#add-a-new-member) for safety reasons. For example, to add a member you need to first inform cluster of new configuration and then start the new member.
@@ -22,7 +22,7 @@ Without the explicit workflow around cluster membership etcd would be vulnerable
We think runtime reconfiguration should be a low frequent operation. We made the decision to keep it explicit and user-driven to ensure configuration safety and keep your cluster always running smoothly under your control.
## Permanent Loss of Quorum Requires New Cluster
### Permanent Loss of Quorum Requires New Cluster
If a cluster permanently loses a majority of its members, a new cluster will need to be started from an old data directory to recover the previous state.
@@ -30,7 +30,7 @@ It is entirely possible to force removing the failed members from the existing c
If you have a correct deployment, the possibility of permanent majority lose is very low. But it is a severe enough problem that worth special care. We strongly suggest you to read the [disaster recovery documentation](admin_guide.md#disaster-recovery) and prepare for permanent majority lose before you put etcd into production.
## Do Not Use Public Discovery Service For Runtime Reconfiguration
### Do Not Use Public Discovery Service For Runtime Reconfiguration
The public discovery service should only be used for bootstrapping a cluster. To join member into an existing cluster, you should use runtime reconfiguration API.
@@ -38,10 +38,10 @@ Discovery service is designed for bootstrapping an etcd cluster in the cloud env
It seems that using public discovery service is a convenient way to do runtime reconfiguration, after all discovery service already has all the cluster configuration information. However relying on public discovery service brings troubles:
1. it introduces external dependencies for the entire life-cycle of your cluster, not just bootstrap time. If there is a network issue between your cluster and public discovery service, your cluster will suffer from it.
1. it introduces a external dependencies for the entire life-cycle of your cluster, not just bootstrap time. If there is a network issue between your cluster and public discover service, your cluster will suffer from it.
2. public discovery service must reflect correct runtime configuration of your cluster during it life-cycle. It has to provide security mechanism to avoid bad actions, and it is hard.
3. public discovery service has to keep tens of thousands of cluster configurations. Our public discovery service backend is not ready for that workload.
If you want to have a discovery service that supports runtime reconfiguration, the best choice is to build your private one.
If you want to have a discovery service that supports runtime reconfiguration, the best choice is to build your private one.

View File

@@ -1,10 +1,12 @@
# Security Model
# security model
etcd supports SSL/TLS as well as authentication through client certificates, both for clients to server as well as peer (server to server / cluster) communication.
To get up and running you first need to have a CA certificate and a signed key pair for one member. It is recommended to create and sign a new key pair for every member in a cluster.
For convenience, the [cfssl](https://github.com/cloudflare/cfssl) tool provides an easy interface to certificate generation, and we provide an example using the tool [here](https://github.com/coreos/etcd/tree/master/hack/tls-setup). You can also examine this [alternative guide to generating self-signed key pairs](http://www.g-loaded.eu/2005/11/10/be-your-own-ca/).
For convenience the [cfssl](https://github.com/cloudflare/cfssl) tool provides an easy interface to certificate generation, and we provide a full example using the tool at [here](../hack/tls-setup). Alternatively this site provides a good reference on how to generate self-signed key pairs:
http://www.g-loaded.eu/2005/11/10/be-your-own-ca/
## Basic setup
@@ -95,7 +97,7 @@ $ curl --cacert /path/to/ca.crt --cert /path/to/client.crt --key /path/to/client
-L https://127.0.0.1:2379/v2/keys/foo -XPUT -d value=bar -v
```
You should be able to see:
You should able to see:
```
...
@@ -136,7 +138,7 @@ $ etcd -name infra1 -data-dir infra1 \
# member2
$ etcd -name infra2 -data-dir infra2 \
-peer-client-cert-auth -peer-trusted-ca-file=/path/to/ca.crt -peer-cert-file=/path/to/member2.crt -peer-key-file=/path/to/member2.key \
-peer-client-cert-atuh -peer-trusted-ca-file=/path/to/ca.crt -peer-cert-file=/path/to/member2.crt -peer-key-file=/path/to/member2.key \
-initial-advertise-peer-urls=https://10.0.1.11:2380 -listen-peer-urls=https://10.0.1.11:2380 \
-discovery ${DISCOVERY_URL}
```
@@ -150,7 +152,7 @@ The etcd members will form a cluster and all communication between members in th
The internal protocol of etcd v2.0.x uses a lot of short-lived HTTP connections.
So, when enabling TLS you may need to increase the heartbeat interval and election timeouts to reduce internal cluster connection churn.
A reasonable place to start are these values: ` --heartbeat-interval 500 --election-timeout 2500`.
These issues are resolved in the etcd v2.1.x series of releases which uses fewer connections.
This issues is resolved in the etcd v2.1.x series of releases which uses fewer connections.
### I'm seeing a SSLv3 alert handshake failure when using SSL client authentication?

View File

@@ -1,16 +1,16 @@
# Tuning
## Tuning
The default settings in etcd should work well for installations on a local network where the average network latency is low.
However, when using etcd across multiple data centers or over networks with high latency you may need to tweak the heartbeat interval and election timeout settings.
The network isn't the only source of latency. Each request and response may be impacted by slow disks on both the leader and follower. Each of these timeouts represents the total time from request to successful response from the other machine.
## Time Parameters
### Time Parameters
The underlying distributed consensus protocol relies on two separate time parameters to ensure that nodes can handoff leadership if one stalls or goes offline.
The first parameter is called the *Heartbeat Interval*.
This is the frequency with which the leader will notify followers that it is still the leader.
For best practices, the parameter should be set around round-trip time between members.
For best pratices, the parameter should be set around round-trip time between members.
By default, etcd uses a `100ms` heartbeat interval.
The second parameter is the *Election Timeout*.
@@ -27,14 +27,11 @@ The election timeout should be set based on the heartbeat interval and average r
Election timeouts must be at least 10 times the round-trip time so it can account for variance in your network.
For example, if the round-trip time between your members is 10ms then you should have at least a 100ms election timeout.
The upper limit of election timeout is 50000ms, which should only be used when deploying global etcd cluster. First, 5s is the upper limit of average global round-trip time. A reasonable round-trip time for the continental united states is 130ms, and the time between US and japan is around 350-400ms. Because package gets delayed a lot, and network situation may be terrible, 5s is a safe value for it. Then, because election timeout should be an order of magnitude bigger than broadcast time, 50s becomes its maximum.
You should also set your election timeout to at least 5 to 10 times your heartbeat interval to account for variance in leader replication.
For a heartbeat interval of 50ms you should set your election timeout to at least 250ms - 500ms.
The upper limit of election timeout is 50000ms (50s), which should only be used when deploying a globally-distributed etcd cluster.
A reasonable round-trip time for the continental United States is 130ms, and the time between US and Japan is around 350-400ms.
If your network has uneven performance or regular packet delays/loss then it is possible that a couple of retries may be necessary to successfully send a packet. So 5s is a safe upper limit of global round-trip time.
As the election timeout should be an order of magnitude bigger than broadcast time, in the case of ~5s for a globally distributed cluster, then 50 seconds becomes a reasonable maximum.
The heartbeat interval and election timeout value should be the same for all members in one cluster. Setting different values for etcd members may disrupt cluster stability.
You can override the default values on the command line:
@@ -49,7 +46,7 @@ $ ETCD_HEARTBEAT_INTERVAL=100 ETCD_ELECTION_TIMEOUT=500 etcd
The values are specified in milliseconds.
## Snapshots
### Snapshots
etcd appends all key changes to a log file.
This log grows forever and is a complete linear history of every change made to the keys.

View File

@@ -1,4 +1,4 @@
# Upgrade etcd to 2.1
## Upgrade etcd to 2.1
In the general case, upgrading from etcd 2.0 to 2.1 can be a zero-downtime, rolling upgrade:
- one by one, stop the etcd v2.0 processes and replace them with etcd v2.1 processes
@@ -6,15 +6,15 @@ In the general case, upgrading from etcd 2.0 to 2.1 can be a zero-downtime, roll
Before [starting an upgrade](#upgrade-procedure), read through the rest of this guide to prepare.
## Upgrade Checklists
### Upgrade Checklists
### Upgrade Requirements
#### Upgrade Requirement
To upgrade an existing etcd deployment to 2.1, you must be running 2.0. If youre running a version of etcd before 2.0, you must upgrade to [2.0](https://github.com/coreos/etcd/releases/tag/v2.0.13) before upgrading to 2.1.
Also, to ensure a smooth rolling upgrade, your running cluster must be healthy. You can check the health of the cluster by using `etcdctl cluster-health` command.
### Preparedness
#### Preparedness
Before upgrading etcd, always test the services relying on etcd in a staging environment before deploying the upgrade to the production environment.
@@ -22,13 +22,13 @@ You might also want to [backup your data directory](admin_guide.md#backing-up-th
etcd 2.1 introduces a new [authentication](auth_api.md) feature, which is disabled by default. If your deployment depends on these, you may want to test the auth features before enabling them in production.
### Mixed Versions
#### Mixed Versions
While upgrading, an etcd cluster supports mixed versions of etcd members. The cluster is only considered upgraded once all its members are upgraded to 2.1.
Internally, etcd members negotiate with each other to determine the overall etcd cluster version, which controls the reported cluster version and the supported features. For example, if you are mid-upgrade, any 2.1 features (such as the the authentication feature mentioned above) wont be available.
### Limitations
#### Limitations
If you encounter any issues during the upgrade, you can attempt to restart the etcd process in trouble using a newer v2.1 binary to solve the problem. One known issue is that etcd v2.0.0 and v2.0.2 may panic during rolling upgrades due to an existing bug, which has been fixed since etcd v2.0.3.
@@ -36,7 +36,7 @@ It might take up to 2 minutes for the newly upgraded member to catch up with the
If you have even more data, this might take more time. If you have a data size larger than 100MB you should contact us before upgrading, so we can make sure the upgrades work smoothly.
### Downgrade
#### Downgrade
If all members have been upgraded to v2.1, the cluster will be upgraded to v2.1, and downgrade is **not possible**. If any member is still v2.0, the cluster will remain in v2.0, and you can go back to use v2.0 binary.

View File

@@ -1,4 +1,4 @@
# Upgrade etcd from 2.1 to 2.2
## Upgrade etcd from 2.1 to 2.2
In the general case, upgrading from etcd 2.1 to 2.2 can be a zero-downtime, rolling upgrade:
@@ -7,33 +7,33 @@ In the general case, upgrading from etcd 2.1 to 2.2 can be a zero-downtime, roll
Before [starting an upgrade](#upgrade-procedure), read through the rest of this guide to prepare.
## Upgrade Checklists
### Upgrade Checklists
### Upgrade Requirement
#### Upgrade Requirement
To upgrade an existing etcd deployment to 2.2, you must be running 2.1. If youre running a version of etcd before 2.1, you must upgrade to [2.1](https://github.com/coreos/etcd/releases/tag/v2.1.2) before upgrading to 2.2.
Also, to ensure a smooth rolling upgrade, your running cluster must be healthy. You can check the health of the cluster by using `etcdctl cluster-health` command.
### Preparedness
#### Preparedness
Before upgrading etcd, always test the services relying on etcd in a staging environment before deploying the upgrade to the production environment.
You might also want to [backup your data directory](admin_guide.md#backing-up-the-datastore) for a potential [downgrade](#downgrade).
### Mixed Versions
#### Mixed Versions
While upgrading, an etcd cluster supports mixed versions of etcd members. The cluster is only considered upgraded once all its members are upgraded to 2.2.
Internally, etcd members negotiate with each other to determine the overall etcd cluster version, which controls the reported cluster version and the supported features.
### Limitations
#### Limitations
If you have a data size larger than 100MB you should contact us before upgrading, so we can make sure the upgrades work smoothly.
Every etcd 2.2 member will do health checking across the cluster periodically. etcd 2.1 member does not support health checking. During the upgrade, etcd 2.2 member will log warning about the unhealthy state of etcd 2.1 member. You can ignore the warning.
### Downgrade
#### Downgrade
If all members have been upgraded to v2.2, the cluster will be upgraded to v2.2, and downgrade is **not possible**. If any member is still v2.1, the cluster will remain in v2.1, and you can go back to use v2.1 binary.

94
Godeps/Godeps.json generated
View File

@@ -1,6 +1,6 @@
{
"ImportPath": "github.com/coreos/etcd",
"GoVersion": "go1.5.1",
"GoVersion": "go1.4.2",
"Packages": [
"./..."
],
@@ -10,10 +10,6 @@
"Comment": "null-5",
"Rev": "75cd24fc2f2c2a2088577d12123ddee5f54e0675"
},
{
"ImportPath": "github.com/akrennmair/gopcap",
"Rev": "00e11033259acb75598ba416495bb708d864a010"
},
{
"ImportPath": "github.com/beorn7/perks/quantile",
"Rev": "b965b613227fddccbfffe13eae360ed3fa822f8d"
@@ -24,21 +20,17 @@
},
{
"ImportPath": "github.com/boltdb/bolt",
"Comment": "v1.1.0-65-gee4a088",
"Rev": "ee4a0888a9abe7eefe5a0992ca4cb06864839873"
"Comment": "v1.0-119-g90fef38",
"Rev": "90fef389f98027ca55594edd7dbd6e7f3926fdad"
},
{
"ImportPath": "github.com/cheggaaa/pb",
"Rev": "da1f27ad1d9509b16f65f52fd9d8138b0f2dc7b2"
"ImportPath": "github.com/bradfitz/http2",
"Rev": "3e36af6d3af0e56fa3da71099f864933dea3d9fb"
},
{
"ImportPath": "github.com/codegangsta/cli",
"Comment": "1.2.0-183-gb5232bb",
"Rev": "b5232bb2934f606f9f27a1305f1eea224e8e8b88"
},
{
"ImportPath": "github.com/coreos/gexpect",
"Rev": "5173270e159f5aa8fbc999dc7e3dcb50f4098a69"
"Comment": "1.2.0-26-gf7ebb76",
"Rev": "f7ebb761e83e21225d1d8954fde853bf8edd46c4"
},
{
"ImportPath": "github.com/coreos/go-semver/semver",
@@ -63,15 +55,9 @@
"ImportPath": "github.com/coreos/pkg/capnslog",
"Rev": "2c77715c4df99b5420ffcae14ead08f52104065d"
},
{
"ImportPath": "github.com/cpuguy83/go-md2man/md2man",
"Comment": "v1.0.4",
"Rev": "71acacd42f85e5e82f70a55327789582a5200a90"
},
{
"ImportPath": "github.com/gogo/protobuf/proto",
"Comment": "v0.1-118-ge8904f5",
"Rev": "e8904f58e872a473a5b91bc9bf3377d223555263"
"Rev": "64f27bf06efee53589314a6e5a4af34cdd85adf6"
},
{
"ImportPath": "github.com/golang/glog",
@@ -79,37 +65,20 @@
},
{
"ImportPath": "github.com/golang/protobuf/proto",
"Rev": "6aaa8d47701fa6cf07e914ec01fde3d4a1fe79c3"
"Rev": "5677a0e3d5e89854c9974e1256839ee23f8233ca"
},
{
"ImportPath": "github.com/google/btree",
"Rev": "cc6329d4279e3f025a53a83c397d2339b5705c45"
},
{
"ImportPath": "github.com/inconshreveable/mousetrap",
"Rev": "76626ae9c91c4f2a10f34cad8ce83ea42c93bb75"
},
{
"ImportPath": "github.com/jonboulle/clockwork",
"Rev": "72f9bd7c4e0c2a40055ab3d0f09654f730cce982"
},
{
"ImportPath": "github.com/kballard/go-shellquote",
"Rev": "d8ec1a69a250a17bb0e419c386eac1f3711dc142"
},
{
"ImportPath": "github.com/kr/pty",
"Comment": "release.r56-29-gf7ee69f",
"Rev": "f7ee69f31298ecbe5d2b349c711e2547a617d398"
},
{
"ImportPath": "github.com/matttproud/golang_protobuf_extensions/pbutil",
"Rev": "fc2b8d3a73c4867e51861bbdd5ae3c1f0869dd6a"
},
{
"ImportPath": "github.com/olekukonko/ts",
"Rev": "ecf753e7c962639ab5a1fb46f7da627d4c0a04b8"
},
{
"ImportPath": "github.com/prometheus/client_golang/prometheus",
"Comment": "0.7.0-52-ge51041b",
@@ -133,25 +102,8 @@
"Rev": "454a56f35412459b5e684fd5ec0f9211b94f002a"
},
{
"ImportPath": "github.com/russross/blackfriday",
"Comment": "v1.4-2-g300106c",
"Rev": "300106c228d52c8941d4b3de6054a6062a86dda3"
},
{
"ImportPath": "github.com/shurcooL/sanitized_anchor_name",
"Rev": "10ef21a441db47d8b13ebcc5fd2310f636973c77"
},
{
"ImportPath": "github.com/spacejam/loghisto",
"Rev": "323309774dec8b7430187e46cd0793974ccca04a"
},
{
"ImportPath": "github.com/spf13/cobra",
"Rev": "1c44ec8d3f1552cac48999f9306da23c4d8a288b"
},
{
"ImportPath": "github.com/spf13/pflag",
"Rev": "08b1a584251b5b62f458943640fc8ebd4d50aaa5"
"ImportPath": "github.com/rakyll/pb",
"Rev": "dc507ad06b7462501281bb4691ee43f0b1d1ec37"
},
{
"ImportPath": "github.com/stretchr/testify/assert",
@@ -175,27 +127,27 @@
},
{
"ImportPath": "golang.org/x/net/context",
"Rev": "04b9de9b512f58addf28c9853d50ebef61c3953e"
"Rev": "7dbad50ab5b31073856416cdcfeb2796d682f844"
},
{
"ImportPath": "golang.org/x/net/http2",
"Rev": "04b9de9b512f58addf28c9853d50ebef61c3953e"
},
{
"ImportPath": "golang.org/x/net/internal/timeseries",
"Rev": "04b9de9b512f58addf28c9853d50ebef61c3953e"
},
{
"ImportPath": "golang.org/x/net/trace",
"Rev": "04b9de9b512f58addf28c9853d50ebef61c3953e"
"ImportPath": "golang.org/x/oauth2",
"Rev": "3046bc76d6dfd7d3707f6640f85e42d9c4050f50"
},
{
"ImportPath": "golang.org/x/sys/unix",
"Rev": "9c60d1c508f5134d1ca726b4641db998f2523357"
},
{
"ImportPath": "google.golang.org/cloud/compute/metadata",
"Rev": "f20d6dcccb44ed49de45ae3703312cb46e627db1"
},
{
"ImportPath": "google.golang.org/cloud/internal",
"Rev": "f20d6dcccb44ed49de45ae3703312cb46e627db1"
},
{
"ImportPath": "google.golang.org/grpc",
"Rev": "e29d659177655e589850ba7d3d83f7ce12ef23dd"
"Rev": "f5ebd86be717593ab029545492c93ddf8914832b"
}
]
}

View File

@@ -1,5 +0,0 @@
#*
*~
/tools/pass/pass
/tools/pcaptest/pcaptest
/tools/tcpdump/tcpdump

View File

@@ -1,27 +0,0 @@
Copyright (c) 2009-2011 Andreas Krennmair. All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above
copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the
distribution.
* Neither the name of Andreas Krennmair nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

View File

@@ -1,11 +0,0 @@
# PCAP
This is a simple wrapper around libpcap for Go. Originally written by Andreas
Krennmair <ak@synflood.at> and only minorly touched up by Mark Smith <mark@qq.is>.
Please see the included pcaptest.go and tcpdump.go programs for instructions on
how to use this library.
Miek Gieben <miek@miek.nl> has created a more Go-like package and replaced functionality
with standard functions from the standard library. The package has also been renamed to
pcap.

View File

@@ -1,527 +0,0 @@
package pcap
import (
"encoding/binary"
"fmt"
"net"
"reflect"
"strings"
)
const (
TYPE_IP = 0x0800
TYPE_ARP = 0x0806
TYPE_IP6 = 0x86DD
TYPE_VLAN = 0x8100
IP_ICMP = 1
IP_INIP = 4
IP_TCP = 6
IP_UDP = 17
)
const (
ERRBUF_SIZE = 256
// According to pcap-linktype(7).
LINKTYPE_NULL = 0
LINKTYPE_ETHERNET = 1
LINKTYPE_TOKEN_RING = 6
LINKTYPE_ARCNET = 7
LINKTYPE_SLIP = 8
LINKTYPE_PPP = 9
LINKTYPE_FDDI = 10
LINKTYPE_ATM_RFC1483 = 100
LINKTYPE_RAW = 101
LINKTYPE_PPP_HDLC = 50
LINKTYPE_PPP_ETHER = 51
LINKTYPE_C_HDLC = 104
LINKTYPE_IEEE802_11 = 105
LINKTYPE_FRELAY = 107
LINKTYPE_LOOP = 108
LINKTYPE_LINUX_SLL = 113
LINKTYPE_LTALK = 104
LINKTYPE_PFLOG = 117
LINKTYPE_PRISM_HEADER = 119
LINKTYPE_IP_OVER_FC = 122
LINKTYPE_SUNATM = 123
LINKTYPE_IEEE802_11_RADIO = 127
LINKTYPE_ARCNET_LINUX = 129
LINKTYPE_LINUX_IRDA = 144
LINKTYPE_LINUX_LAPD = 177
)
type addrHdr interface {
SrcAddr() string
DestAddr() string
Len() int
}
type addrStringer interface {
String(addr addrHdr) string
}
func decodemac(pkt []byte) uint64 {
mac := uint64(0)
for i := uint(0); i < 6; i++ {
mac = (mac << 8) + uint64(pkt[i])
}
return mac
}
// Decode decodes the headers of a Packet.
func (p *Packet) Decode() {
if len(p.Data) <= 14 {
return
}
p.Type = int(binary.BigEndian.Uint16(p.Data[12:14]))
p.DestMac = decodemac(p.Data[0:6])
p.SrcMac = decodemac(p.Data[6:12])
if len(p.Data) >= 15 {
p.Payload = p.Data[14:]
}
switch p.Type {
case TYPE_IP:
p.decodeIp()
case TYPE_IP6:
p.decodeIp6()
case TYPE_ARP:
p.decodeArp()
case TYPE_VLAN:
p.decodeVlan()
}
}
func (p *Packet) headerString(headers []interface{}) string {
// If there's just one header, return that.
if len(headers) == 1 {
if hdr, ok := headers[0].(fmt.Stringer); ok {
return hdr.String()
}
}
// If there are two headers (IPv4/IPv6 -> TCP/UDP/IP..)
if len(headers) == 2 {
// Commonly the first header is an address.
if addr, ok := p.Headers[0].(addrHdr); ok {
if hdr, ok := p.Headers[1].(addrStringer); ok {
return fmt.Sprintf("%s %s", p.Time, hdr.String(addr))
}
}
}
// For IP in IP, we do a recursive call.
if len(headers) >= 2 {
if addr, ok := headers[0].(addrHdr); ok {
if _, ok := headers[1].(addrHdr); ok {
return fmt.Sprintf("%s > %s IP in IP: ",
addr.SrcAddr(), addr.DestAddr(), p.headerString(headers[1:]))
}
}
}
var typeNames []string
for _, hdr := range headers {
typeNames = append(typeNames, reflect.TypeOf(hdr).String())
}
return fmt.Sprintf("unknown [%s]", strings.Join(typeNames, ","))
}
// String prints a one-line representation of the packet header.
// The output is suitable for use in a tcpdump program.
func (p *Packet) String() string {
// If there are no headers, print "unsupported protocol".
if len(p.Headers) == 0 {
return fmt.Sprintf("%s unsupported protocol %d", p.Time, int(p.Type))
}
return fmt.Sprintf("%s %s", p.Time, p.headerString(p.Headers))
}
// Arphdr is a ARP packet header.
type Arphdr struct {
Addrtype uint16
Protocol uint16
HwAddressSize uint8
ProtAddressSize uint8
Operation uint16
SourceHwAddress []byte
SourceProtAddress []byte
DestHwAddress []byte
DestProtAddress []byte
}
func (arp *Arphdr) String() (s string) {
switch arp.Operation {
case 1:
s = "ARP request"
case 2:
s = "ARP Reply"
}
if arp.Addrtype == LINKTYPE_ETHERNET && arp.Protocol == TYPE_IP {
s = fmt.Sprintf("%012x (%s) > %012x (%s)",
decodemac(arp.SourceHwAddress), arp.SourceProtAddress,
decodemac(arp.DestHwAddress), arp.DestProtAddress)
} else {
s = fmt.Sprintf("addrtype = %d protocol = %d", arp.Addrtype, arp.Protocol)
}
return
}
func (p *Packet) decodeArp() {
if len(p.Payload) < 8 {
return
}
pkt := p.Payload
arp := new(Arphdr)
arp.Addrtype = binary.BigEndian.Uint16(pkt[0:2])
arp.Protocol = binary.BigEndian.Uint16(pkt[2:4])
arp.HwAddressSize = pkt[4]
arp.ProtAddressSize = pkt[5]
arp.Operation = binary.BigEndian.Uint16(pkt[6:8])
if len(pkt) < int(8+2*arp.HwAddressSize+2*arp.ProtAddressSize) {
return
}
arp.SourceHwAddress = pkt[8 : 8+arp.HwAddressSize]
arp.SourceProtAddress = pkt[8+arp.HwAddressSize : 8+arp.HwAddressSize+arp.ProtAddressSize]
arp.DestHwAddress = pkt[8+arp.HwAddressSize+arp.ProtAddressSize : 8+2*arp.HwAddressSize+arp.ProtAddressSize]
arp.DestProtAddress = pkt[8+2*arp.HwAddressSize+arp.ProtAddressSize : 8+2*arp.HwAddressSize+2*arp.ProtAddressSize]
p.Headers = append(p.Headers, arp)
if len(pkt) >= int(8+2*arp.HwAddressSize+2*arp.ProtAddressSize) {
p.Payload = p.Payload[8+2*arp.HwAddressSize+2*arp.ProtAddressSize:]
}
}
// IPadr is the header of an IP packet.
type Iphdr struct {
Version uint8
Ihl uint8
Tos uint8
Length uint16
Id uint16
Flags uint8
FragOffset uint16
Ttl uint8
Protocol uint8
Checksum uint16
SrcIp []byte
DestIp []byte
}
func (p *Packet) decodeIp() {
if len(p.Payload) < 20 {
return
}
pkt := p.Payload
ip := new(Iphdr)
ip.Version = uint8(pkt[0]) >> 4
ip.Ihl = uint8(pkt[0]) & 0x0F
ip.Tos = pkt[1]
ip.Length = binary.BigEndian.Uint16(pkt[2:4])
ip.Id = binary.BigEndian.Uint16(pkt[4:6])
flagsfrags := binary.BigEndian.Uint16(pkt[6:8])
ip.Flags = uint8(flagsfrags >> 13)
ip.FragOffset = flagsfrags & 0x1FFF
ip.Ttl = pkt[8]
ip.Protocol = pkt[9]
ip.Checksum = binary.BigEndian.Uint16(pkt[10:12])
ip.SrcIp = pkt[12:16]
ip.DestIp = pkt[16:20]
pEnd := int(ip.Length)
if pEnd > len(pkt) {
pEnd = len(pkt)
}
if len(pkt) >= pEnd && int(ip.Ihl*4) < pEnd {
p.Payload = pkt[ip.Ihl*4 : pEnd]
} else {
p.Payload = []byte{}
}
p.Headers = append(p.Headers, ip)
p.IP = ip
switch ip.Protocol {
case IP_TCP:
p.decodeTcp()
case IP_UDP:
p.decodeUdp()
case IP_ICMP:
p.decodeIcmp()
case IP_INIP:
p.decodeIp()
}
}
func (ip *Iphdr) SrcAddr() string { return net.IP(ip.SrcIp).String() }
func (ip *Iphdr) DestAddr() string { return net.IP(ip.DestIp).String() }
func (ip *Iphdr) Len() int { return int(ip.Length) }
type Vlanhdr struct {
Priority byte
DropEligible bool
VlanIdentifier int
Type int // Not actually part of the vlan header, but the type of the actual packet
}
func (v *Vlanhdr) String() {
fmt.Sprintf("VLAN Priority:%d Drop:%v Tag:%d", v.Priority, v.DropEligible, v.VlanIdentifier)
}
func (p *Packet) decodeVlan() {
pkt := p.Payload
vlan := new(Vlanhdr)
if len(pkt) < 4 {
return
}
vlan.Priority = (pkt[2] & 0xE0) >> 13
vlan.DropEligible = pkt[2]&0x10 != 0
vlan.VlanIdentifier = int(binary.BigEndian.Uint16(pkt[:2])) & 0x0FFF
vlan.Type = int(binary.BigEndian.Uint16(p.Payload[2:4]))
p.Headers = append(p.Headers, vlan)
if len(pkt) >= 5 {
p.Payload = p.Payload[4:]
}
switch vlan.Type {
case TYPE_IP:
p.decodeIp()
case TYPE_IP6:
p.decodeIp6()
case TYPE_ARP:
p.decodeArp()
}
}
type Tcphdr struct {
SrcPort uint16
DestPort uint16
Seq uint32
Ack uint32
DataOffset uint8
Flags uint16
Window uint16
Checksum uint16
Urgent uint16
Data []byte
}
const (
TCP_FIN = 1 << iota
TCP_SYN
TCP_RST
TCP_PSH
TCP_ACK
TCP_URG
TCP_ECE
TCP_CWR
TCP_NS
)
func (p *Packet) decodeTcp() {
if len(p.Payload) < 20 {
return
}
pkt := p.Payload
tcp := new(Tcphdr)
tcp.SrcPort = binary.BigEndian.Uint16(pkt[0:2])
tcp.DestPort = binary.BigEndian.Uint16(pkt[2:4])
tcp.Seq = binary.BigEndian.Uint32(pkt[4:8])
tcp.Ack = binary.BigEndian.Uint32(pkt[8:12])
tcp.DataOffset = (pkt[12] & 0xF0) >> 4
tcp.Flags = binary.BigEndian.Uint16(pkt[12:14]) & 0x1FF
tcp.Window = binary.BigEndian.Uint16(pkt[14:16])
tcp.Checksum = binary.BigEndian.Uint16(pkt[16:18])
tcp.Urgent = binary.BigEndian.Uint16(pkt[18:20])
if len(pkt) >= int(tcp.DataOffset*4) {
p.Payload = pkt[tcp.DataOffset*4:]
}
p.Headers = append(p.Headers, tcp)
p.TCP = tcp
}
func (tcp *Tcphdr) String(hdr addrHdr) string {
return fmt.Sprintf("TCP %s:%d > %s:%d %s SEQ=%d ACK=%d LEN=%d",
hdr.SrcAddr(), int(tcp.SrcPort), hdr.DestAddr(), int(tcp.DestPort),
tcp.FlagsString(), int64(tcp.Seq), int64(tcp.Ack), hdr.Len())
}
func (tcp *Tcphdr) FlagsString() string {
var sflags []string
if 0 != (tcp.Flags & TCP_SYN) {
sflags = append(sflags, "syn")
}
if 0 != (tcp.Flags & TCP_FIN) {
sflags = append(sflags, "fin")
}
if 0 != (tcp.Flags & TCP_ACK) {
sflags = append(sflags, "ack")
}
if 0 != (tcp.Flags & TCP_PSH) {
sflags = append(sflags, "psh")
}
if 0 != (tcp.Flags & TCP_RST) {
sflags = append(sflags, "rst")
}
if 0 != (tcp.Flags & TCP_URG) {
sflags = append(sflags, "urg")
}
if 0 != (tcp.Flags & TCP_NS) {
sflags = append(sflags, "ns")
}
if 0 != (tcp.Flags & TCP_CWR) {
sflags = append(sflags, "cwr")
}
if 0 != (tcp.Flags & TCP_ECE) {
sflags = append(sflags, "ece")
}
return fmt.Sprintf("[%s]", strings.Join(sflags, " "))
}
type Udphdr struct {
SrcPort uint16
DestPort uint16
Length uint16
Checksum uint16
}
func (p *Packet) decodeUdp() {
if len(p.Payload) < 8 {
return
}
pkt := p.Payload
udp := new(Udphdr)
udp.SrcPort = binary.BigEndian.Uint16(pkt[0:2])
udp.DestPort = binary.BigEndian.Uint16(pkt[2:4])
udp.Length = binary.BigEndian.Uint16(pkt[4:6])
udp.Checksum = binary.BigEndian.Uint16(pkt[6:8])
p.Headers = append(p.Headers, udp)
p.UDP = udp
if len(p.Payload) >= 8 {
p.Payload = pkt[8:]
}
}
func (udp *Udphdr) String(hdr addrHdr) string {
return fmt.Sprintf("UDP %s:%d > %s:%d LEN=%d CHKSUM=%d",
hdr.SrcAddr(), int(udp.SrcPort), hdr.DestAddr(), int(udp.DestPort),
int(udp.Length), int(udp.Checksum))
}
type Icmphdr struct {
Type uint8
Code uint8
Checksum uint16
Id uint16
Seq uint16
Data []byte
}
func (p *Packet) decodeIcmp() *Icmphdr {
if len(p.Payload) < 8 {
return nil
}
pkt := p.Payload
icmp := new(Icmphdr)
icmp.Type = pkt[0]
icmp.Code = pkt[1]
icmp.Checksum = binary.BigEndian.Uint16(pkt[2:4])
icmp.Id = binary.BigEndian.Uint16(pkt[4:6])
icmp.Seq = binary.BigEndian.Uint16(pkt[6:8])
p.Payload = pkt[8:]
p.Headers = append(p.Headers, icmp)
return icmp
}
func (icmp *Icmphdr) String(hdr addrHdr) string {
return fmt.Sprintf("ICMP %s > %s Type = %d Code = %d ",
hdr.SrcAddr(), hdr.DestAddr(), icmp.Type, icmp.Code)
}
func (icmp *Icmphdr) TypeString() (result string) {
switch icmp.Type {
case 0:
result = fmt.Sprintf("Echo reply seq=%d", icmp.Seq)
case 3:
switch icmp.Code {
case 0:
result = "Network unreachable"
case 1:
result = "Host unreachable"
case 2:
result = "Protocol unreachable"
case 3:
result = "Port unreachable"
default:
result = "Destination unreachable"
}
case 8:
result = fmt.Sprintf("Echo request seq=%d", icmp.Seq)
case 30:
result = "Traceroute"
}
return
}
type Ip6hdr struct {
// http://www.networksorcery.com/enp/protocol/ipv6.htm
Version uint8 // 4 bits
TrafficClass uint8 // 8 bits
FlowLabel uint32 // 20 bits
Length uint16 // 16 bits
NextHeader uint8 // 8 bits, same as Protocol in Iphdr
HopLimit uint8 // 8 bits
SrcIp []byte // 16 bytes
DestIp []byte // 16 bytes
}
func (p *Packet) decodeIp6() {
if len(p.Payload) < 40 {
return
}
pkt := p.Payload
ip6 := new(Ip6hdr)
ip6.Version = uint8(pkt[0]) >> 4
ip6.TrafficClass = uint8((binary.BigEndian.Uint16(pkt[0:2]) >> 4) & 0x00FF)
ip6.FlowLabel = binary.BigEndian.Uint32(pkt[0:4]) & 0x000FFFFF
ip6.Length = binary.BigEndian.Uint16(pkt[4:6])
ip6.NextHeader = pkt[6]
ip6.HopLimit = pkt[7]
ip6.SrcIp = pkt[8:24]
ip6.DestIp = pkt[24:40]
if len(p.Payload) >= 40 {
p.Payload = pkt[40:]
}
p.Headers = append(p.Headers, ip6)
switch ip6.NextHeader {
case IP_TCP:
p.decodeTcp()
case IP_UDP:
p.decodeUdp()
case IP_ICMP:
p.decodeIcmp()
case IP_INIP:
p.decodeIp()
}
}
func (ip6 *Ip6hdr) SrcAddr() string { return net.IP(ip6.SrcIp).String() }
func (ip6 *Ip6hdr) DestAddr() string { return net.IP(ip6.DestIp).String() }
func (ip6 *Ip6hdr) Len() int { return int(ip6.Length) }

View File

@@ -1,247 +0,0 @@
package pcap
import (
"bytes"
"testing"
"time"
)
var testSimpleTcpPacket *Packet = &Packet{
Data: []byte{
0x00, 0x00, 0x0c, 0x9f, 0xf0, 0x20, 0xbc, 0x30, 0x5b, 0xe8, 0xd3, 0x49,
0x08, 0x00, 0x45, 0x00, 0x01, 0xa4, 0x39, 0xdf, 0x40, 0x00, 0x40, 0x06,
0x55, 0x5a, 0xac, 0x11, 0x51, 0x49, 0xad, 0xde, 0xfe, 0xe1, 0xc5, 0xf7,
0x00, 0x50, 0xc5, 0x7e, 0x0e, 0x48, 0x49, 0x07, 0x42, 0x32, 0x80, 0x18,
0x00, 0x73, 0xab, 0xb1, 0x00, 0x00, 0x01, 0x01, 0x08, 0x0a, 0x03, 0x77,
0x37, 0x9c, 0x42, 0x77, 0x5e, 0x3a, 0x47, 0x45, 0x54, 0x20, 0x2f, 0x20,
0x48, 0x54, 0x54, 0x50, 0x2f, 0x31, 0x2e, 0x31, 0x0d, 0x0a, 0x48, 0x6f,
0x73, 0x74, 0x3a, 0x20, 0x77, 0x77, 0x77, 0x2e, 0x66, 0x69, 0x73, 0x68,
0x2e, 0x63, 0x6f, 0x6d, 0x0d, 0x0a, 0x43, 0x6f, 0x6e, 0x6e, 0x65, 0x63,
0x74, 0x69, 0x6f, 0x6e, 0x3a, 0x20, 0x6b, 0x65, 0x65, 0x70, 0x2d, 0x61,
0x6c, 0x69, 0x76, 0x65, 0x0d, 0x0a, 0x55, 0x73, 0x65, 0x72, 0x2d, 0x41,
0x67, 0x65, 0x6e, 0x74, 0x3a, 0x20, 0x4d, 0x6f, 0x7a, 0x69, 0x6c, 0x6c,
0x61, 0x2f, 0x35, 0x2e, 0x30, 0x20, 0x28, 0x58, 0x31, 0x31, 0x3b, 0x20,
0x4c, 0x69, 0x6e, 0x75, 0x78, 0x20, 0x78, 0x38, 0x36, 0x5f, 0x36, 0x34,
0x29, 0x20, 0x41, 0x70, 0x70, 0x6c, 0x65, 0x57, 0x65, 0x62, 0x4b, 0x69,
0x74, 0x2f, 0x35, 0x33, 0x35, 0x2e, 0x32, 0x20, 0x28, 0x4b, 0x48, 0x54,
0x4d, 0x4c, 0x2c, 0x20, 0x6c, 0x69, 0x6b, 0x65, 0x20, 0x47, 0x65, 0x63,
0x6b, 0x6f, 0x29, 0x20, 0x43, 0x68, 0x72, 0x6f, 0x6d, 0x65, 0x2f, 0x31,
0x35, 0x2e, 0x30, 0x2e, 0x38, 0x37, 0x34, 0x2e, 0x31, 0x32, 0x31, 0x20,
0x53, 0x61, 0x66, 0x61, 0x72, 0x69, 0x2f, 0x35, 0x33, 0x35, 0x2e, 0x32,
0x0d, 0x0a, 0x41, 0x63, 0x63, 0x65, 0x70, 0x74, 0x3a, 0x20, 0x74, 0x65,
0x78, 0x74, 0x2f, 0x68, 0x74, 0x6d, 0x6c, 0x2c, 0x61, 0x70, 0x70, 0x6c,
0x69, 0x63, 0x61, 0x74, 0x69, 0x6f, 0x6e, 0x2f, 0x78, 0x68, 0x74, 0x6d,
0x6c, 0x2b, 0x78, 0x6d, 0x6c, 0x2c, 0x61, 0x70, 0x70, 0x6c, 0x69, 0x63,
0x61, 0x74, 0x69, 0x6f, 0x6e, 0x2f, 0x78, 0x6d, 0x6c, 0x3b, 0x71, 0x3d,
0x30, 0x2e, 0x39, 0x2c, 0x2a, 0x2f, 0x2a, 0x3b, 0x71, 0x3d, 0x30, 0x2e,
0x38, 0x0d, 0x0a, 0x41, 0x63, 0x63, 0x65, 0x70, 0x74, 0x2d, 0x45, 0x6e,
0x63, 0x6f, 0x64, 0x69, 0x6e, 0x67, 0x3a, 0x20, 0x67, 0x7a, 0x69, 0x70,
0x2c, 0x64, 0x65, 0x66, 0x6c, 0x61, 0x74, 0x65, 0x2c, 0x73, 0x64, 0x63,
0x68, 0x0d, 0x0a, 0x41, 0x63, 0x63, 0x65, 0x70, 0x74, 0x2d, 0x4c, 0x61,
0x6e, 0x67, 0x75, 0x61, 0x67, 0x65, 0x3a, 0x20, 0x65, 0x6e, 0x2d, 0x55,
0x53, 0x2c, 0x65, 0x6e, 0x3b, 0x71, 0x3d, 0x30, 0x2e, 0x38, 0x0d, 0x0a,
0x41, 0x63, 0x63, 0x65, 0x70, 0x74, 0x2d, 0x43, 0x68, 0x61, 0x72, 0x73,
0x65, 0x74, 0x3a, 0x20, 0x49, 0x53, 0x4f, 0x2d, 0x38, 0x38, 0x35, 0x39,
0x2d, 0x31, 0x2c, 0x75, 0x74, 0x66, 0x2d, 0x38, 0x3b, 0x71, 0x3d, 0x30,
0x2e, 0x37, 0x2c, 0x2a, 0x3b, 0x71, 0x3d, 0x30, 0x2e, 0x33, 0x0d, 0x0a,
0x0d, 0x0a,
}}
func BenchmarkDecodeSimpleTcpPacket(b *testing.B) {
for i := 0; i < b.N; i++ {
testSimpleTcpPacket.Decode()
}
}
func TestDecodeSimpleTcpPacket(t *testing.T) {
p := testSimpleTcpPacket
p.Decode()
if p.DestMac != 0x00000c9ff020 {
t.Error("Dest mac", p.DestMac)
}
if p.SrcMac != 0xbc305be8d349 {
t.Error("Src mac", p.SrcMac)
}
if len(p.Headers) != 2 {
t.Error("Incorrect number of headers", len(p.Headers))
return
}
if ip, ipOk := p.Headers[0].(*Iphdr); ipOk {
if ip.Version != 4 {
t.Error("ip Version", ip.Version)
}
if ip.Ihl != 5 {
t.Error("ip header length", ip.Ihl)
}
if ip.Tos != 0 {
t.Error("ip TOS", ip.Tos)
}
if ip.Length != 420 {
t.Error("ip Length", ip.Length)
}
if ip.Id != 14815 {
t.Error("ip ID", ip.Id)
}
if ip.Flags != 0x02 {
t.Error("ip Flags", ip.Flags)
}
if ip.FragOffset != 0 {
t.Error("ip Fragoffset", ip.FragOffset)
}
if ip.Ttl != 64 {
t.Error("ip TTL", ip.Ttl)
}
if ip.Protocol != 6 {
t.Error("ip Protocol", ip.Protocol)
}
if ip.Checksum != 0x555A {
t.Error("ip Checksum", ip.Checksum)
}
if !bytes.Equal(ip.SrcIp, []byte{172, 17, 81, 73}) {
t.Error("ip Src", ip.SrcIp)
}
if !bytes.Equal(ip.DestIp, []byte{173, 222, 254, 225}) {
t.Error("ip Dest", ip.DestIp)
}
if tcp, tcpOk := p.Headers[1].(*Tcphdr); tcpOk {
if tcp.SrcPort != 50679 {
t.Error("tcp srcport", tcp.SrcPort)
}
if tcp.DestPort != 80 {
t.Error("tcp destport", tcp.DestPort)
}
if tcp.Seq != 0xc57e0e48 {
t.Error("tcp seq", tcp.Seq)
}
if tcp.Ack != 0x49074232 {
t.Error("tcp ack", tcp.Ack)
}
if tcp.DataOffset != 8 {
t.Error("tcp dataoffset", tcp.DataOffset)
}
if tcp.Flags != 0x18 {
t.Error("tcp flags", tcp.Flags)
}
if tcp.Window != 0x73 {
t.Error("tcp window", tcp.Window)
}
if tcp.Checksum != 0xabb1 {
t.Error("tcp checksum", tcp.Checksum)
}
if tcp.Urgent != 0 {
t.Error("tcp urgent", tcp.Urgent)
}
} else {
t.Error("Second header is not TCP header")
}
} else {
t.Error("First header is not IP header")
}
if string(p.Payload) != "GET / HTTP/1.1\r\nHost: www.fish.com\r\nConnection: keep-alive\r\nUser-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\nAccept-Encoding: gzip,deflate,sdch\r\nAccept-Language: en-US,en;q=0.8\r\nAccept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n\r\n" {
t.Error("--- PAYLOAD STRING ---\n", string(p.Payload), "\n--- PAYLOAD BYTES ---\n", p.Payload)
}
}
// Makes sure packet payload doesn't display the 6 trailing null of this packet
// as part of the payload. They're actually the ethernet trailer.
func TestDecodeSmallTcpPacketHasEmptyPayload(t *testing.T) {
p := &Packet{
// This packet is only 54 bits (an empty TCP RST), thus 6 trailing null
// bytes are added by the ethernet layer to make it the minimum packet size.
Data: []byte{
0xbc, 0x30, 0x5b, 0xe8, 0xd3, 0x49, 0xb8, 0xac, 0x6f, 0x92, 0xd5, 0xbf,
0x08, 0x00, 0x45, 0x00, 0x00, 0x28, 0x00, 0x00, 0x40, 0x00, 0x40, 0x06,
0x3f, 0x9f, 0xac, 0x11, 0x51, 0xc5, 0xac, 0x11, 0x51, 0x49, 0x00, 0x63,
0x9a, 0xef, 0x00, 0x00, 0x00, 0x00, 0x2e, 0xc1, 0x27, 0x83, 0x50, 0x14,
0x00, 0x00, 0xc3, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
}}
p.Decode()
if p.Payload == nil {
t.Error("Nil payload")
}
if len(p.Payload) != 0 {
t.Error("Non-empty payload:", p.Payload)
}
}
func TestDecodeVlanPacket(t *testing.T) {
p := &Packet{
Data: []byte{
0x00, 0x10, 0xdb, 0xff, 0x10, 0x00, 0x00, 0x15, 0x2c, 0x9d, 0xcc, 0x00, 0x81, 0x00, 0x01, 0xf7,
0x08, 0x00, 0x45, 0x00, 0x00, 0x28, 0x29, 0x8d, 0x40, 0x00, 0x7d, 0x06, 0x83, 0xa0, 0xac, 0x1b,
0xca, 0x8e, 0x45, 0x16, 0x94, 0xe2, 0xd4, 0x0a, 0x00, 0x50, 0xdf, 0xab, 0x9c, 0xc6, 0xcd, 0x1e,
0xe5, 0xd1, 0x50, 0x10, 0x01, 0x00, 0x5a, 0x74, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
}}
p.Decode()
if p.Type != TYPE_VLAN {
t.Error("Didn't detect vlan")
}
if len(p.Headers) != 3 {
t.Error("Incorrect number of headers:", len(p.Headers))
for i, h := range p.Headers {
t.Errorf("Header %d: %#v", i, h)
}
t.FailNow()
}
if _, ok := p.Headers[0].(*Vlanhdr); !ok {
t.Errorf("First header isn't vlan: %q", p.Headers[0])
}
if _, ok := p.Headers[1].(*Iphdr); !ok {
t.Errorf("Second header isn't IP: %q", p.Headers[1])
}
if _, ok := p.Headers[2].(*Tcphdr); !ok {
t.Errorf("Third header isn't TCP: %q", p.Headers[2])
}
}
func TestDecodeFuzzFallout(t *testing.T) {
testData := []struct {
Data []byte
}{
{[]byte("000000000000\x81\x000")},
{[]byte("000000000000\x81\x00000")},
{[]byte("000000000000\x86\xdd0")},
{[]byte("000000000000\b\x000")},
{[]byte("000000000000\b\x060")},
{[]byte{}},
{[]byte("000000000000\b\x0600000000")},
{[]byte("000000000000\x86\xdd000000\x01000000000000000000000000000000000")},
{[]byte("000000000000\x81\x0000\b\x0600000000")},
{[]byte("000000000000\b\x00n0000000000000000000")},
{[]byte("000000000000\x86\xdd000000\x0100000000000000000000000000000000000")},
{[]byte("000000000000\x81\x0000\b\x00g0000000000000000000")},
//{[]byte()},
{[]byte("000000000000\b\x00400000000\x110000000000")},
{[]byte("0nMء\xfe\x13\x13\x81\x00gr\b\x00&x\xc9\xe5b'\x1e0\x00\x04\x00\x0020596224")},
{[]byte("000000000000\x81\x0000\b\x00400000000\x110000000000")},
{[]byte("000000000000\b\x00000000000\x0600\xff0000000")},
{[]byte("000000000000\x86\xdd000000\x06000000000000000000000000000000000")},
{[]byte("000000000000\x81\x0000\b\x00000000000\x0600b0000000")},
{[]byte("000000000000\x81\x0000\b\x00400000000\x060000000000")},
{[]byte("000000000000\x86\xdd000000\x11000000000000000000000000000000000")},
{[]byte("000000000000\x86\xdd000000\x0600000000000000000000000000000000000000000000M")},
{[]byte("000000000000\b\x00500000000\x0600000000000")},
{[]byte("0nM\xd80\xfe\x13\x13\x81\x00gr\b\x00&x\xc9\xe5b'\x1e0\x00\x04\x00\x0020596224")},
}
for _, entry := range testData {
pkt := &Packet{
Time: time.Now(),
Caplen: uint32(len(entry.Data)),
Len: uint32(len(entry.Data)),
Data: entry.Data,
}
pkt.Decode()
/*
func() {
defer func() {
if err := recover(); err != nil {
t.Fatalf("%d. %q failed: %v", idx, string(entry.Data), err)
}
}()
pkt.Decode()
}()
*/
}
}

View File

@@ -1,206 +0,0 @@
package pcap
import (
"encoding/binary"
"fmt"
"io"
"time"
)
// FileHeader is the parsed header of a pcap file.
// http://wiki.wireshark.org/Development/LibpcapFileFormat
type FileHeader struct {
MagicNumber uint32
VersionMajor uint16
VersionMinor uint16
TimeZone int32
SigFigs uint32
SnapLen uint32
Network uint32
}
type PacketTime struct {
Sec int32
Usec int32
}
// Convert the PacketTime to a go Time struct.
func (p *PacketTime) Time() time.Time {
return time.Unix(int64(p.Sec), int64(p.Usec)*1000)
}
// Packet is a single packet parsed from a pcap file.
//
// Convenient access to IP, TCP, and UDP headers is provided after Decode()
// is called if the packet is of the appropriate type.
type Packet struct {
Time time.Time // packet send/receive time
Caplen uint32 // bytes stored in the file (caplen <= len)
Len uint32 // bytes sent/received
Data []byte // packet data
Type int // protocol type, see LINKTYPE_*
DestMac uint64
SrcMac uint64
Headers []interface{} // decoded headers, in order
Payload []byte // remaining non-header bytes
IP *Iphdr // IP header (for IP packets, after decoding)
TCP *Tcphdr // TCP header (for TCP packets, after decoding)
UDP *Udphdr // UDP header (for UDP packets after decoding)
}
// Reader parses pcap files.
type Reader struct {
flip bool
buf io.Reader
err error
fourBytes []byte
twoBytes []byte
sixteenBytes []byte
Header FileHeader
}
// NewReader reads pcap data from an io.Reader.
func NewReader(reader io.Reader) (*Reader, error) {
r := &Reader{
buf: reader,
fourBytes: make([]byte, 4),
twoBytes: make([]byte, 2),
sixteenBytes: make([]byte, 16),
}
switch magic := r.readUint32(); magic {
case 0xa1b2c3d4:
r.flip = false
case 0xd4c3b2a1:
r.flip = true
default:
return nil, fmt.Errorf("pcap: bad magic number: %0x", magic)
}
r.Header = FileHeader{
MagicNumber: 0xa1b2c3d4,
VersionMajor: r.readUint16(),
VersionMinor: r.readUint16(),
TimeZone: r.readInt32(),
SigFigs: r.readUint32(),
SnapLen: r.readUint32(),
Network: r.readUint32(),
}
return r, nil
}
// Next returns the next packet or nil if no more packets can be read.
func (r *Reader) Next() *Packet {
d := r.sixteenBytes
r.err = r.read(d)
if r.err != nil {
return nil
}
timeSec := asUint32(d[0:4], r.flip)
timeUsec := asUint32(d[4:8], r.flip)
capLen := asUint32(d[8:12], r.flip)
origLen := asUint32(d[12:16], r.flip)
data := make([]byte, capLen)
if r.err = r.read(data); r.err != nil {
return nil
}
return &Packet{
Time: time.Unix(int64(timeSec), int64(timeUsec)),
Caplen: capLen,
Len: origLen,
Data: data,
}
}
func (r *Reader) read(data []byte) error {
var err error
n, err := r.buf.Read(data)
for err == nil && n != len(data) {
var chunk int
chunk, err = r.buf.Read(data[n:])
n += chunk
}
if len(data) == n {
return nil
}
return err
}
func (r *Reader) readUint32() uint32 {
data := r.fourBytes
if r.err = r.read(data); r.err != nil {
return 0
}
return asUint32(data, r.flip)
}
func (r *Reader) readInt32() int32 {
data := r.fourBytes
if r.err = r.read(data); r.err != nil {
return 0
}
return int32(asUint32(data, r.flip))
}
func (r *Reader) readUint16() uint16 {
data := r.twoBytes
if r.err = r.read(data); r.err != nil {
return 0
}
return asUint16(data, r.flip)
}
// Writer writes a pcap file.
type Writer struct {
writer io.Writer
buf []byte
}
// NewWriter creates a Writer that stores output in an io.Writer.
// The FileHeader is written immediately.
func NewWriter(writer io.Writer, header *FileHeader) (*Writer, error) {
w := &Writer{
writer: writer,
buf: make([]byte, 24),
}
binary.LittleEndian.PutUint32(w.buf, header.MagicNumber)
binary.LittleEndian.PutUint16(w.buf[4:], header.VersionMajor)
binary.LittleEndian.PutUint16(w.buf[6:], header.VersionMinor)
binary.LittleEndian.PutUint32(w.buf[8:], uint32(header.TimeZone))
binary.LittleEndian.PutUint32(w.buf[12:], header.SigFigs)
binary.LittleEndian.PutUint32(w.buf[16:], header.SnapLen)
binary.LittleEndian.PutUint32(w.buf[20:], header.Network)
if _, err := writer.Write(w.buf); err != nil {
return nil, err
}
return w, nil
}
// Writer writes a packet to the underlying writer.
func (w *Writer) Write(pkt *Packet) error {
binary.LittleEndian.PutUint32(w.buf, uint32(pkt.Time.Unix()))
binary.LittleEndian.PutUint32(w.buf[4:], uint32(pkt.Time.Nanosecond()))
binary.LittleEndian.PutUint32(w.buf[8:], uint32(pkt.Time.Unix()))
binary.LittleEndian.PutUint32(w.buf[12:], pkt.Len)
if _, err := w.writer.Write(w.buf[:16]); err != nil {
return err
}
_, err := w.writer.Write(pkt.Data)
return err
}
func asUint32(data []byte, flip bool) uint32 {
if flip {
return binary.BigEndian.Uint32(data)
}
return binary.LittleEndian.Uint32(data)
}
func asUint16(data []byte, flip bool) uint16 {
if flip {
return binary.BigEndian.Uint16(data)
}
return binary.LittleEndian.Uint16(data)
}

View File

@@ -1,266 +0,0 @@
// Interface to both live and offline pcap parsing.
package pcap
/*
#cgo linux LDFLAGS: -lpcap
#cgo freebsd LDFLAGS: -lpcap
#cgo darwin LDFLAGS: -lpcap
#cgo windows CFLAGS: -I C:/WpdPack/Include
#cgo windows,386 LDFLAGS: -L C:/WpdPack/Lib -lwpcap
#cgo windows,amd64 LDFLAGS: -L C:/WpdPack/Lib/x64 -lwpcap
#include <stdlib.h>
#include <pcap.h>
// Workaround for not knowing how to cast to const u_char**
int hack_pcap_next_ex(pcap_t *p, struct pcap_pkthdr **pkt_header,
u_char **pkt_data) {
return pcap_next_ex(p, pkt_header, (const u_char **)pkt_data);
}
*/
import "C"
import (
"errors"
"net"
"syscall"
"time"
"unsafe"
)
type Pcap struct {
cptr *C.pcap_t
}
type Stat struct {
PacketsReceived uint32
PacketsDropped uint32
PacketsIfDropped uint32
}
type Interface struct {
Name string
Description string
Addresses []IFAddress
// TODO: add more elements
}
type IFAddress struct {
IP net.IP
Netmask net.IPMask
// TODO: add broadcast + PtP dst ?
}
func (p *Pcap) Next() (pkt *Packet) {
rv, _ := p.NextEx()
return rv
}
// Openlive opens a device and returns a *Pcap handler
func Openlive(device string, snaplen int32, promisc bool, timeout_ms int32) (handle *Pcap, err error) {
var buf *C.char
buf = (*C.char)(C.calloc(ERRBUF_SIZE, 1))
h := new(Pcap)
var pro int32
if promisc {
pro = 1
}
dev := C.CString(device)
defer C.free(unsafe.Pointer(dev))
h.cptr = C.pcap_open_live(dev, C.int(snaplen), C.int(pro), C.int(timeout_ms), buf)
if nil == h.cptr {
handle = nil
err = errors.New(C.GoString(buf))
} else {
handle = h
}
C.free(unsafe.Pointer(buf))
return
}
func Openoffline(file string) (handle *Pcap, err error) {
var buf *C.char
buf = (*C.char)(C.calloc(ERRBUF_SIZE, 1))
h := new(Pcap)
cf := C.CString(file)
defer C.free(unsafe.Pointer(cf))
h.cptr = C.pcap_open_offline(cf, buf)
if nil == h.cptr {
handle = nil
err = errors.New(C.GoString(buf))
} else {
handle = h
}
C.free(unsafe.Pointer(buf))
return
}
func (p *Pcap) NextEx() (pkt *Packet, result int32) {
var pkthdr *C.struct_pcap_pkthdr
var buf_ptr *C.u_char
var buf unsafe.Pointer
result = int32(C.hack_pcap_next_ex(p.cptr, &pkthdr, &buf_ptr))
buf = unsafe.Pointer(buf_ptr)
if nil == buf {
return
}
pkt = new(Packet)
pkt.Time = time.Unix(int64(pkthdr.ts.tv_sec), int64(pkthdr.ts.tv_usec)*1000)
pkt.Caplen = uint32(pkthdr.caplen)
pkt.Len = uint32(pkthdr.len)
pkt.Data = C.GoBytes(buf, C.int(pkthdr.caplen))
return
}
func (p *Pcap) Close() {
C.pcap_close(p.cptr)
}
func (p *Pcap) Geterror() error {
return errors.New(C.GoString(C.pcap_geterr(p.cptr)))
}
func (p *Pcap) Getstats() (stat *Stat, err error) {
var cstats _Ctype_struct_pcap_stat
if -1 == C.pcap_stats(p.cptr, &cstats) {
return nil, p.Geterror()
}
stats := new(Stat)
stats.PacketsReceived = uint32(cstats.ps_recv)
stats.PacketsDropped = uint32(cstats.ps_drop)
stats.PacketsIfDropped = uint32(cstats.ps_ifdrop)
return stats, nil
}
func (p *Pcap) Setfilter(expr string) (err error) {
var bpf _Ctype_struct_bpf_program
cexpr := C.CString(expr)
defer C.free(unsafe.Pointer(cexpr))
if -1 == C.pcap_compile(p.cptr, &bpf, cexpr, 1, 0) {
return p.Geterror()
}
if -1 == C.pcap_setfilter(p.cptr, &bpf) {
C.pcap_freecode(&bpf)
return p.Geterror()
}
C.pcap_freecode(&bpf)
return nil
}
func Version() string {
return C.GoString(C.pcap_lib_version())
}
func (p *Pcap) Datalink() int {
return int(C.pcap_datalink(p.cptr))
}
func (p *Pcap) Setdatalink(dlt int) error {
if -1 == C.pcap_set_datalink(p.cptr, C.int(dlt)) {
return p.Geterror()
}
return nil
}
func DatalinkValueToName(dlt int) string {
if name := C.pcap_datalink_val_to_name(C.int(dlt)); name != nil {
return C.GoString(name)
}
return ""
}
func DatalinkValueToDescription(dlt int) string {
if desc := C.pcap_datalink_val_to_description(C.int(dlt)); desc != nil {
return C.GoString(desc)
}
return ""
}
func Findalldevs() (ifs []Interface, err error) {
var buf *C.char
buf = (*C.char)(C.calloc(ERRBUF_SIZE, 1))
defer C.free(unsafe.Pointer(buf))
var alldevsp *C.pcap_if_t
if -1 == C.pcap_findalldevs((**C.pcap_if_t)(&alldevsp), buf) {
return nil, errors.New(C.GoString(buf))
}
defer C.pcap_freealldevs((*C.pcap_if_t)(alldevsp))
dev := alldevsp
var i uint32
for i = 0; dev != nil; dev = (*C.pcap_if_t)(dev.next) {
i++
}
ifs = make([]Interface, i)
dev = alldevsp
for j := uint32(0); dev != nil; dev = (*C.pcap_if_t)(dev.next) {
var iface Interface
iface.Name = C.GoString(dev.name)
iface.Description = C.GoString(dev.description)
iface.Addresses = findalladdresses(dev.addresses)
// TODO: add more elements
ifs[j] = iface
j++
}
return
}
func findalladdresses(addresses *_Ctype_struct_pcap_addr) (retval []IFAddress) {
// TODO - make it support more than IPv4 and IPv6?
retval = make([]IFAddress, 0, 1)
for curaddr := addresses; curaddr != nil; curaddr = (*_Ctype_struct_pcap_addr)(curaddr.next) {
var a IFAddress
var err error
if a.IP, err = sockaddr_to_IP((*syscall.RawSockaddr)(unsafe.Pointer(curaddr.addr))); err != nil {
continue
}
if a.Netmask, err = sockaddr_to_IP((*syscall.RawSockaddr)(unsafe.Pointer(curaddr.addr))); err != nil {
continue
}
retval = append(retval, a)
}
return
}
func sockaddr_to_IP(rsa *syscall.RawSockaddr) (IP []byte, err error) {
switch rsa.Family {
case syscall.AF_INET:
pp := (*syscall.RawSockaddrInet4)(unsafe.Pointer(rsa))
IP = make([]byte, 4)
for i := 0; i < len(IP); i++ {
IP[i] = pp.Addr[i]
}
return
case syscall.AF_INET6:
pp := (*syscall.RawSockaddrInet6)(unsafe.Pointer(rsa))
IP = make([]byte, 16)
for i := 0; i < len(IP); i++ {
IP[i] = pp.Addr[i]
}
return
}
err = errors.New("Unsupported address type")
return
}
func (p *Pcap) Inject(data []byte) (err error) {
buf := (*C.char)(C.malloc((C.size_t)(len(data))))
for i := 0; i < len(data); i++ {
*(*byte)(unsafe.Pointer(uintptr(unsafe.Pointer(buf)) + uintptr(i))) = data[i]
}
if -1 == C.pcap_sendpacket(p.cptr, (*C.u_char)(unsafe.Pointer(buf)), (C.int)(len(data))) {
err = p.Geterror()
}
C.free(unsafe.Pointer(buf))
return
}

View File

@@ -1,49 +0,0 @@
package main
import (
"flag"
"fmt"
"os"
"runtime/pprof"
"time"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/akrennmair/gopcap"
)
func main() {
var filename *string = flag.String("file", "", "filename")
var decode *bool = flag.Bool("d", false, "If true, decode each packet")
var cpuprofile *string = flag.String("cpuprofile", "", "filename")
flag.Parse()
h, err := pcap.Openoffline(*filename)
if err != nil {
fmt.Printf("Couldn't create pcap reader: %v", err)
}
if *cpuprofile != "" {
if out, err := os.Create(*cpuprofile); err == nil {
pprof.StartCPUProfile(out)
defer func() {
pprof.StopCPUProfile()
out.Close()
}()
} else {
panic(err)
}
}
i, nilPackets := 0, 0
start := time.Now()
for pkt, code := h.NextEx(); code != -2; pkt, code = h.NextEx() {
if pkt == nil {
nilPackets++
} else if *decode {
pkt.Decode()
}
i++
}
duration := time.Since(start)
fmt.Printf("Took %v to process %v packets, %v per packet, %d nil packets\n", duration, i, duration/time.Duration(i), nilPackets)
}

View File

@@ -1,96 +0,0 @@
package main
// Parses a pcap file, writes it back to disk, then verifies the files
// are the same.
import (
"bufio"
"flag"
"fmt"
"io"
"os"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/akrennmair/gopcap"
)
var input *string = flag.String("input", "", "input file")
var output *string = flag.String("output", "", "output file")
var decode *bool = flag.Bool("decode", false, "print decoded packets")
func copyPcap(dest, src string) {
f, err := os.Open(src)
if err != nil {
fmt.Printf("couldn't open %q: %v\n", src, err)
return
}
defer f.Close()
reader, err := pcap.NewReader(bufio.NewReader(f))
if err != nil {
fmt.Printf("couldn't create reader: %v\n", err)
return
}
w, err := os.Create(dest)
if err != nil {
fmt.Printf("couldn't open %q: %v\n", dest, err)
return
}
defer w.Close()
buf := bufio.NewWriter(w)
writer, err := pcap.NewWriter(buf, &reader.Header)
if err != nil {
fmt.Printf("couldn't create writer: %v\n", err)
return
}
for {
pkt := reader.Next()
if pkt == nil {
break
}
if *decode {
pkt.Decode()
fmt.Println(pkt.String())
}
writer.Write(pkt)
}
buf.Flush()
}
func check(dest, src string) {
f, err := os.Open(src)
if err != nil {
fmt.Printf("couldn't open %q: %v\n", src, err)
return
}
defer f.Close()
freader := bufio.NewReader(f)
g, err := os.Open(dest)
if err != nil {
fmt.Printf("couldn't open %q: %v\n", src, err)
return
}
defer g.Close()
greader := bufio.NewReader(g)
for {
fb, ferr := freader.ReadByte()
gb, gerr := greader.ReadByte()
if ferr == io.EOF && gerr == io.EOF {
break
}
if fb == gb {
continue
}
fmt.Println("FAIL")
return
}
fmt.Println("PASS")
}
func main() {
flag.Parse()
copyPcap(*output, *input)
check(*output, *input)
}

View File

@@ -1,82 +0,0 @@
package main
import (
"flag"
"fmt"
"time"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/akrennmair/gopcap"
)
func min(x uint32, y uint32) uint32 {
if x < y {
return x
}
return y
}
func main() {
var device *string = flag.String("d", "", "device")
var file *string = flag.String("r", "", "file")
var expr *string = flag.String("e", "", "filter expression")
flag.Parse()
var h *pcap.Pcap
var err error
ifs, err := pcap.Findalldevs()
if len(ifs) == 0 {
fmt.Printf("Warning: no devices found : %s\n", err)
} else {
for i := 0; i < len(ifs); i++ {
fmt.Printf("dev %d: %s (%s)\n", i+1, ifs[i].Name, ifs[i].Description)
}
}
if *device != "" {
h, err = pcap.Openlive(*device, 65535, true, 0)
if h == nil {
fmt.Printf("Openlive(%s) failed: %s\n", *device, err)
return
}
} else if *file != "" {
h, err = pcap.Openoffline(*file)
if h == nil {
fmt.Printf("Openoffline(%s) failed: %s\n", *file, err)
return
}
} else {
fmt.Printf("usage: pcaptest [-d <device> | -r <file>]\n")
return
}
defer h.Close()
fmt.Printf("pcap version: %s\n", pcap.Version())
if *expr != "" {
fmt.Printf("Setting filter: %s\n", *expr)
err := h.Setfilter(*expr)
if err != nil {
fmt.Printf("Warning: setting filter failed: %s\n", err)
}
}
for pkt := h.Next(); pkt != nil; pkt = h.Next() {
fmt.Printf("time: %d.%06d (%s) caplen: %d len: %d\nData:",
int64(pkt.Time.Second()), int64(pkt.Time.Nanosecond()),
time.Unix(int64(pkt.Time.Second()), 0).String(), int64(pkt.Caplen), int64(pkt.Len))
for i := uint32(0); i < pkt.Caplen; i++ {
if i%32 == 0 {
fmt.Printf("\n")
}
if 32 <= pkt.Data[i] && pkt.Data[i] <= 126 {
fmt.Printf("%c", pkt.Data[i])
} else {
fmt.Printf(".")
}
}
fmt.Printf("\n\n")
}
}

View File

@@ -1,121 +0,0 @@
package main
import (
"bufio"
"flag"
"fmt"
"os"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/akrennmair/gopcap"
)
const (
TYPE_IP = 0x0800
TYPE_ARP = 0x0806
TYPE_IP6 = 0x86DD
IP_ICMP = 1
IP_INIP = 4
IP_TCP = 6
IP_UDP = 17
)
var out *bufio.Writer
var errout *bufio.Writer
func main() {
var device *string = flag.String("i", "", "interface")
var snaplen *int = flag.Int("s", 65535, "snaplen")
var hexdump *bool = flag.Bool("X", false, "hexdump")
expr := ""
out = bufio.NewWriter(os.Stdout)
errout = bufio.NewWriter(os.Stderr)
flag.Usage = func() {
fmt.Fprintf(errout, "usage: %s [ -i interface ] [ -s snaplen ] [ -X ] [ expression ]\n", os.Args[0])
errout.Flush()
os.Exit(1)
}
flag.Parse()
if len(flag.Args()) > 0 {
expr = flag.Arg(0)
}
if *device == "" {
devs, err := pcap.Findalldevs()
if err != nil {
fmt.Fprintf(errout, "tcpdump: couldn't find any devices: %s\n", err)
}
if 0 == len(devs) {
flag.Usage()
}
*device = devs[0].Name
}
h, err := pcap.Openlive(*device, int32(*snaplen), true, 0)
if h == nil {
fmt.Fprintf(errout, "tcpdump: %s\n", err)
errout.Flush()
return
}
defer h.Close()
if expr != "" {
ferr := h.Setfilter(expr)
if ferr != nil {
fmt.Fprintf(out, "tcpdump: %s\n", ferr)
out.Flush()
}
}
for pkt := h.Next(); pkt != nil; pkt = h.Next() {
pkt.Decode()
fmt.Fprintf(out, "%s\n", pkt.String())
if *hexdump {
Hexdump(pkt)
}
out.Flush()
}
}
func min(a, b int) int {
if a < b {
return a
}
return b
}
func Hexdump(pkt *pcap.Packet) {
for i := 0; i < len(pkt.Data); i += 16 {
Dumpline(uint32(i), pkt.Data[i:min(i+16, len(pkt.Data))])
}
}
func Dumpline(addr uint32, line []byte) {
fmt.Fprintf(out, "\t0x%04x: ", int32(addr))
var i uint16
for i = 0; i < 16 && i < uint16(len(line)); i++ {
if i%2 == 0 {
out.WriteString(" ")
}
fmt.Fprintf(out, "%02x", line[i])
}
for j := i; j <= 16; j++ {
if j%2 == 0 {
out.WriteString(" ")
}
out.WriteString(" ")
}
out.WriteString(" ")
for i = 0; i < 16 && i < uint16(len(line)); i++ {
if line[i] >= 32 && line[i] <= 126 {
fmt.Fprintf(out, "%c", line[i])
} else {
out.WriteString(".")
}
}
out.WriteString("\n")
}

View File

@@ -1,18 +1,54 @@
TEST=.
BENCH=.
COVERPROFILE=/tmp/c.out
BRANCH=`git rev-parse --abbrev-ref HEAD`
COMMIT=`git rev-parse --short HEAD`
GOLDFLAGS="-X main.branch $(BRANCH) -X main.commit $(COMMIT)"
default: build
race:
@go test -v -race -test.run="TestSimulate_(100op|1000op)"
bench:
go test -v -test.run=NOTHINCONTAINSTHIS -test.bench=$(BENCH)
# http://cloc.sourceforge.net/
cloc:
@cloc --not-match-f='Makefile|_test.go' .
cover: fmt
go test -coverprofile=$(COVERPROFILE) -test.run=$(TEST) $(COVERFLAG) .
go tool cover -html=$(COVERPROFILE)
rm $(COVERPROFILE)
cpuprofile: fmt
@go test -c
@./bolt.test -test.v -test.run=$(TEST) -test.cpuprofile cpu.prof
# go get github.com/kisielk/errcheck
errcheck:
@errcheck -ignorepkg=bytes -ignore=os:Remove github.com/boltdb/bolt
@echo "=== errcheck ==="
@errcheck github.com/boltdb/bolt
test:
@go test -v -cover .
@go test -v ./cmd/bolt
fmt:
@go fmt ./...
.PHONY: fmt test
get:
@go get -d ./...
build: get
@mkdir -p bin
@go build -ldflags=$(GOLDFLAGS) -a -o bin/bolt ./cmd/bolt
test: fmt
@go get github.com/stretchr/testify/assert
@echo "=== TESTS ==="
@go test -v -cover -test.run=$(TEST)
@echo ""
@echo ""
@echo "=== CLI ==="
@go test -v -test.run=$(TEST) ./cmd/bolt
@echo ""
@echo ""
@echo "=== RACE DETECTOR ==="
@go test -v -race -test.run="TestSimulate_(100op|1000op)"
.PHONY: bench cloc cover cpuprofile fmt memprofile test

View File

@@ -1,8 +1,8 @@
Bolt [![Build Status](https://drone.io/github.com/boltdb/bolt/status.png)](https://drone.io/github.com/boltdb/bolt/latest) [![Coverage Status](https://coveralls.io/repos/boltdb/bolt/badge.svg?branch=master)](https://coveralls.io/r/boltdb/bolt?branch=master) [![GoDoc](https://godoc.org/github.com/boltdb/bolt?status.svg)](https://godoc.org/github.com/boltdb/bolt) ![Version](https://img.shields.io/badge/version-1.0-green.svg)
Bolt [![Build Status](https://drone.io/github.com/boltdb/bolt/status.png)](https://drone.io/github.com/boltdb/bolt/latest) [![Coverage Status](https://coveralls.io/repos/boltdb/bolt/badge.png?branch=master)](https://coveralls.io/r/boltdb/bolt?branch=master) [![GoDoc](https://godoc.org/github.com/boltdb/bolt?status.png)](https://godoc.org/github.com/boltdb/bolt) ![Version](http://img.shields.io/badge/version-1.0-green.png)
====
Bolt is a pure Go key/value store inspired by [Howard Chu's][hyc_symas]
[LMDB project][lmdb]. The goal of the project is to provide a simple,
Bolt is a pure Go key/value store inspired by [Howard Chu's][hyc_symas] and
the [LMDB project][lmdb]. The goal of the project is to provide a simple,
fast, and reliable database for projects that don't require a full database
server such as Postgres or MySQL.
@@ -13,6 +13,7 @@ and setting values. That's it.
[hyc_symas]: https://twitter.com/hyc_symas
[lmdb]: http://symas.com/mdb/
## Project Status
Bolt is stable and the API is fixed. Full unit test coverage and randomized
@@ -21,36 +22,6 @@ Bolt is currently in high-load production environments serving databases as
large as 1TB. Many companies such as Shopify and Heroku use Bolt-backed
services every day.
## Table of Contents
- [Getting Started](#getting-started)
- [Installing](#installing)
- [Opening a database](#opening-a-database)
- [Transactions](#transactions)
- [Read-write transactions](#read-write-transactions)
- [Read-only transactions](#read-only-transactions)
- [Batch read-write transactions](#batch-read-write-transactions)
- [Managing transactions manually](#managing-transactions-manually)
- [Using buckets](#using-buckets)
- [Using key/value pairs](#using-keyvalue-pairs)
- [Autoincrementing integer for the bucket](#autoincrementing-integer-for-the-bucket)
- [Iterating over keys](#iterating-over-keys)
- [Prefix scans](#prefix-scans)
- [Range scans](#range-scans)
- [ForEach()](#foreach)
- [Nested buckets](#nested-buckets)
- [Database backups](#database-backups)
- [Statistics](#statistics)
- [Read-Only Mode](#read-only-mode)
- [Mobile Use (iOS/Android)](#mobile-use-iosandroid)
- [Resources](#resources)
- [Comparison with other databases](#comparison-with-other-databases)
- [Postgres, MySQL, & other relational databases](#postgres-mysql--other-relational-databases)
- [LevelDB, RocksDB](#leveldb-rocksdb)
- [LMDB](#lmdb)
- [Caveats & Limitations](#caveats--limitations)
- [Reading the Source](#reading-the-source)
- [Other Projects Using Bolt](#other-projects-using-bolt)
## Getting Started
@@ -209,8 +180,8 @@ and then safely close your transaction if an error is returned. This is the
recommended way to use Bolt transactions.
However, sometimes you may want to manually start and end your transactions.
You can use the `Tx.Begin()` function directly but **please** be sure to close
the transaction.
You can use the `Tx.Begin()` function directly but _please_ be sure to close the
transaction.
```go
// Start a writable transaction.
@@ -285,7 +256,7 @@ db.View(func(tx *bolt.Tx) error {
```
The `Get()` function does not return an error because its operation is
guaranteed to work (unless there is some kind of system failure). If the key
guarenteed to work (unless there is some kind of system failure). If the key
exists then it will return its byte slice value. If it doesn't exist then it
will return `nil`. It's important to note that you can have a zero-length value
set to a key which is different than the key not existing.
@@ -297,49 +268,6 @@ transaction is open. If you need to use a value outside of the transaction
then you must use `copy()` to copy it to another byte slice.
### Autoincrementing integer for the bucket
By using the `NextSequence()` function, you can let Bolt determine a sequence
which can be used as the unique identifier for your key/value pairs. See the
example below.
```go
// CreateUser saves u to the store. The new user ID is set on u once the data is persisted.
func (s *Store) CreateUser(u *User) error {
return s.db.Update(func(tx *bolt.Tx) error {
// Retrieve the users bucket.
// This should be created when the DB is first opened.
b := tx.Bucket([]byte("users"))
// Generate ID for the user.
// This returns an error only if the Tx is closed or not writeable.
// That can't happen in an Update() call so I ignore the error check.
id, _ = b.NextSequence()
u.ID = int(id)
// Marshal user data into bytes.
buf, err := json.Marshal(u)
if err != nil {
return err
}
// Persist bytes to users bucket.
return b.Put(itob(u.ID), buf)
})
}
// itob returns an 8-byte big endian representation of v.
func itob(v int) []byte {
b := make([]byte, 8)
binary.BigEndian.PutUint64(b, uint64(v))
return b
}
type User struct {
ID int
...
}
```
### Iterating over keys
Bolt stores its keys in byte-sorted order within a bucket. This makes sequential
@@ -348,9 +276,7 @@ iteration over these keys extremely fast. To iterate over keys we'll use a
```go
db.View(func(tx *bolt.Tx) error {
// Assume bucket exists and has keys
b := tx.Bucket([]byte("MyBucket"))
c := b.Cursor()
for k, v := c.First(); k != nil; k, v = c.Next() {
@@ -374,15 +300,10 @@ Next() Move to the next key.
Prev() Move to the previous key.
```
Each of those functions has a return signature of `(key []byte, value []byte)`.
When you have iterated to the end of the cursor then `Next()` will return a
`nil` key. You must seek to a position using `First()`, `Last()`, or `Seek()`
before calling `Next()` or `Prev()`. If you do not seek to a position then
these functions will return a `nil` key.
During iteration, if the key is non-`nil` but the value is `nil`, that means
the key refers to a bucket rather than a value. Use `Bucket.Bucket()` to
access the sub-bucket.
When you have iterated to the end of the cursor then `Next()` will return `nil`.
You must seek to a position using `First()`, `Last()`, or `Seek()` before
calling `Next()` or `Prev()`. If you do not seek to a position then these
functions will return `nil`.
#### Prefix scans
@@ -391,7 +312,6 @@ To iterate over a key prefix, you can combine `Seek()` and `bytes.HasPrefix()`:
```go
db.View(func(tx *bolt.Tx) error {
// Assume bucket exists and has keys
c := tx.Bucket([]byte("MyBucket")).Cursor()
prefix := []byte("1234")
@@ -411,7 +331,7 @@ date range like this:
```go
db.View(func(tx *bolt.Tx) error {
// Assume our events bucket exists and has RFC3339 encoded time keys.
// Assume our events bucket has RFC3339 encoded time keys.
c := tx.Bucket([]byte("Events")).Cursor()
// Our time range spans the 90's decade.
@@ -435,9 +355,7 @@ all the keys in a bucket:
```go
db.View(func(tx *bolt.Tx) error {
// Assume bucket exists and has keys
b := tx.Bucket([]byte("MyBucket"))
b.ForEach(func(k, v []byte) error {
fmt.Printf("key=%s, value=%s\n", k, v)
return nil
@@ -464,11 +382,8 @@ func (*Bucket) DeleteBucket(key []byte) error
Bolt is a single file so it's easy to backup. You can use the `Tx.WriteTo()`
function to write a consistent view of the database to a writer. If you call
this from a read-only transaction, it will perform a hot backup and not block
your other database reads and writes.
By default, it will use a regular file handle which will utilize the operating
system's page cache. See the [`Tx`](https://godoc.org/github.com/boltdb/bolt#Tx)
documentation for information about optimizing for larger-than-RAM datasets.
your other database reads and writes. It will also use `O_DIRECT` when available
to prevent page cache trashing.
One common use case is to backup over HTTP so you can use tools like `cURL` to
do database backups:
@@ -550,84 +465,6 @@ if err != nil {
}
```
### Mobile Use (iOS/Android)
Bolt is able to run on mobile devices by leveraging the binding feature of the
[gomobile](https://github.com/golang/mobile) tool. Create a struct that will
contain your database logic and a reference to a `*bolt.DB` with a initializing
contstructor that takes in a filepath where the database file will be stored.
Neither Android nor iOS require extra permissions or cleanup from using this method.
```go
func NewBoltDB(filepath string) *BoltDB {
db, err := bolt.Open(filepath+"/demo.db", 0600, nil)
if err != nil {
log.Fatal(err)
}
return &BoltDB{db}
}
type BoltDB struct {
db *bolt.DB
...
}
func (b *BoltDB) Path() string {
return b.db.Path()
}
func (b *BoltDB) Close() {
b.db.Close()
}
```
Database logic should be defined as methods on this wrapper struct.
To initialize this struct from the native language (both platforms now sync
their local storage to the cloud. These snippets disable that functionality for the
database file):
#### Android
```java
String path;
if (android.os.Build.VERSION.SDK_INT >=android.os.Build.VERSION_CODES.LOLLIPOP){
path = getNoBackupFilesDir().getAbsolutePath();
} else{
path = getFilesDir().getAbsolutePath();
}
Boltmobiledemo.BoltDB boltDB = Boltmobiledemo.NewBoltDB(path)
```
#### iOS
```objc
- (void)demo {
NSString* path = [NSSearchPathForDirectoriesInDomains(NSLibraryDirectory,
NSUserDomainMask,
YES) objectAtIndex:0];
GoBoltmobiledemoBoltDB * demo = GoBoltmobiledemoNewBoltDB(path);
[self addSkipBackupAttributeToItemAtPath:demo.path];
//Some DB Logic would go here
[demo close];
}
- (BOOL)addSkipBackupAttributeToItemAtPath:(NSString *) filePathString
{
NSURL* URL= [NSURL fileURLWithPath: filePathString];
assert([[NSFileManager defaultManager] fileExistsAtPath: [URL path]]);
NSError *error = nil;
BOOL success = [URL setResourceValue: [NSNumber numberWithBool: YES]
forKey: NSURLIsExcludedFromBackupKey error: &error];
if(!success){
NSLog(@"Error excluding %@ from backup %@", [URL lastPathComponent], error);
}
return success;
}
```
## Resources
@@ -663,7 +500,7 @@ they are libraries bundled into the application, however, their underlying
structure is a log-structured merge-tree (LSM tree). An LSM tree optimizes
random writes by using a write ahead log and multi-tiered, sorted files called
SSTables. Bolt uses a B+tree internally and only a single file. Both approaches
have trade-offs.
have trade offs.
If you require a high random write throughput (>10,000 w/sec) or you need to use
spinning disks then LevelDB could be a good choice. If your application is
@@ -699,8 +536,9 @@ It's important to pick the right tool for the job and Bolt is no exception.
Here are a few things to note when evaluating and using Bolt:
* Bolt is good for read intensive workloads. Sequential write performance is
also fast but random writes can be slow. You can use `DB.Batch()` or add a
write-ahead log to help mitigate this issue.
also fast but random writes can be slow. You can add a write-ahead log or
[transaction coalescer](https://github.com/boltdb/coalescer) in front of Bolt
to mitigate this issue.
* Bolt uses a B+tree internally so there can be a lot of random page access.
SSDs provide a significant performance boost over spinning disks.
@@ -730,13 +568,11 @@ Here are a few things to note when evaluating and using Bolt:
can in memory and will release memory as needed to other processes. This means
that Bolt can show very high memory usage when working with large databases.
However, this is expected and the OS will release memory as needed. Bolt can
handle databases much larger than the available physical RAM, provided its
memory-map fits in the process virtual address space. It may be problematic
on 32-bits systems.
handle databases much larger than the available physical RAM.
* The data structures in the Bolt database are memory mapped so the data file
will be endian specific. This means that you cannot copy a Bolt file from a
little endian machine to a big endian machine and have it work. For most
little endian machine to a big endian machine and have it work. For most
users this is not a concern since most modern CPUs are little endian.
* Because of the way pages are laid out on disk, Bolt cannot truncate data files
@@ -751,56 +587,6 @@ Here are a few things to note when evaluating and using Bolt:
[page-allocation]: https://github.com/boltdb/bolt/issues/308#issuecomment-74811638
## Reading the Source
Bolt is a relatively small code base (<3KLOC) for an embedded, serializable,
transactional key/value database so it can be a good starting point for people
interested in how databases work.
The best places to start are the main entry points into Bolt:
- `Open()` - Initializes the reference to the database. It's responsible for
creating the database if it doesn't exist, obtaining an exclusive lock on the
file, reading the meta pages, & memory-mapping the file.
- `DB.Begin()` - Starts a read-only or read-write transaction depending on the
value of the `writable` argument. This requires briefly obtaining the "meta"
lock to keep track of open transactions. Only one read-write transaction can
exist at a time so the "rwlock" is acquired during the life of a read-write
transaction.
- `Bucket.Put()` - Writes a key/value pair into a bucket. After validating the
arguments, a cursor is used to traverse the B+tree to the page and position
where they key & value will be written. Once the position is found, the bucket
materializes the underlying page and the page's parent pages into memory as
"nodes". These nodes are where mutations occur during read-write transactions.
These changes get flushed to disk during commit.
- `Bucket.Get()` - Retrieves a key/value pair from a bucket. This uses a cursor
to move to the page & position of a key/value pair. During a read-only
transaction, the key and value data is returned as a direct reference to the
underlying mmap file so there's no allocation overhead. For read-write
transactions, this data may reference the mmap file or one of the in-memory
node values.
- `Cursor` - This object is simply for traversing the B+tree of on-disk pages
or in-memory nodes. It can seek to a specific key, move to the first or last
value, or it can move forward or backward. The cursor handles the movement up
and down the B+tree transparently to the end user.
- `Tx.Commit()` - Converts the in-memory dirty nodes and the list of free pages
into pages to be written to disk. Writing to disk then occurs in two phases.
First, the dirty pages are written to disk and an `fsync()` occurs. Second, a
new meta page with an incremented transaction ID is written and another
`fsync()` occurs. This two phase write ensures that partially written data
pages are ignored in the event of a crash since the meta page pointing to them
is never written. Partially written meta pages are invalidated because they
are written with a checksum.
If you have additional notes that could be helpful for others, please submit
them via pull request.
## Other Projects Using Bolt
Below is a list of public, open source projects that use Bolt:
@@ -811,30 +597,25 @@ Below is a list of public, open source projects that use Bolt:
* [Skybox Analytics](https://github.com/skybox/skybox) - A standalone funnel analysis tool for web analytics.
* [Scuttlebutt](https://github.com/benbjohnson/scuttlebutt) - Uses Bolt to store and process all Twitter mentions of GitHub projects.
* [Wiki](https://github.com/peterhellberg/wiki) - A tiny wiki using Goji, BoltDB and Blackfriday.
* [ChainStore](https://github.com/pressly/chainstore) - Simple key-value interface to a variety of storage engines organized as a chain of operations.
* [ChainStore](https://github.com/nulayer/chainstore) - Simple key-value interface to a variety of storage engines organized as a chain of operations.
* [MetricBase](https://github.com/msiebuhr/MetricBase) - Single-binary version of Graphite.
* [Gitchain](https://github.com/gitchain/gitchain) - Decentralized, peer-to-peer Git repositories aka "Git meets Bitcoin".
* [event-shuttle](https://github.com/sclasen/event-shuttle) - A Unix system service to collect and reliably deliver messages to Kafka.
* [ipxed](https://github.com/kelseyhightower/ipxed) - Web interface and api for ipxed.
* [BoltStore](https://github.com/yosssi/boltstore) - Session store using Bolt.
* [photosite/session](https://godoc.org/bitbucket.org/kardianos/photosite/session) - Sessions for a photo viewing site.
* [photosite/session](http://godoc.org/bitbucket.org/kardianos/photosite/session) - Sessions for a photo viewing site.
* [LedisDB](https://github.com/siddontang/ledisdb) - A high performance NoSQL, using Bolt as optional storage.
* [ipLocator](https://github.com/AndreasBriese/ipLocator) - A fast ip-geo-location-server using bolt with bloom filters.
* [cayley](https://github.com/google/cayley) - Cayley is an open-source graph database using Bolt as optional backend.
* [bleve](http://www.blevesearch.com/) - A pure Go search engine similar to ElasticSearch that uses Bolt as the default storage backend.
* [tentacool](https://github.com/optiflows/tentacool) - REST api server to manage system stuff (IP, DNS, Gateway...) on a linux server.
* [SkyDB](https://github.com/skydb/sky) - Behavioral analytics database.
* [Seaweed File System](https://github.com/chrislusf/seaweedfs) - Highly scalable distributed key~file system with O(1) disk read.
* [InfluxDB](https://influxdata.com) - Scalable datastore for metrics, events, and real-time analytics.
* [Seaweed File System](https://github.com/chrislusf/weed-fs) - Highly scalable distributed key~file system with O(1) disk read.
* [InfluxDB](http://influxdb.com) - Scalable datastore for metrics, events, and real-time analytics.
* [Freehold](http://tshannon.bitbucket.org/freehold/) - An open, secure, and lightweight platform for your files and data.
* [Prometheus Annotation Server](https://github.com/oliver006/prom_annotation_server) - Annotation server for PromDash & Prometheus service monitoring system.
* [Consul](https://github.com/hashicorp/consul) - Consul is service discovery and configuration made easy. Distributed, highly available, and datacenter-aware.
* [Kala](https://github.com/ajvb/kala) - Kala is a modern job scheduler optimized to run on a single node. It is persistent, JSON over HTTP API, ISO 8601 duration notation, and dependent jobs.
* [Kala](https://github.com/ajvb/kala) - Kala is a modern job scheduler optimized to run on a single node. It is persistant, JSON over HTTP API, ISO 8601 duration notation, and dependent jobs.
* [drive](https://github.com/odeke-em/drive) - drive is an unofficial Google Drive command line client for \*NIX operating systems.
* [stow](https://github.com/djherbis/stow) - a persistence manager for objects
backed by boltdb.
* [buckets](https://github.com/joyrexus/buckets) - a bolt wrapper streamlining
simple tx and key scans.
* [Request Baskets](https://github.com/darklynx/request-baskets) - A web service to collect arbitrary HTTP requests and inspect them via REST API or simple web UI, similar to [RequestBin](http://requestb.in/) service
If you are using Bolt in a project please send a pull request to add it to the list.

138
Godeps/_workspace/src/github.com/boltdb/bolt/batch.go generated vendored Normal file
View File

@@ -0,0 +1,138 @@
package bolt
import (
"errors"
"fmt"
"sync"
"time"
)
// Batch calls fn as part of a batch. It behaves similar to Update,
// except:
//
// 1. concurrent Batch calls can be combined into a single Bolt
// transaction.
//
// 2. the function passed to Batch may be called multiple times,
// regardless of whether it returns error or not.
//
// This means that Batch function side effects must be idempotent and
// take permanent effect only after a successful return is seen in
// caller.
//
// The maximum batch size and delay can be adjusted with DB.MaxBatchSize
// and DB.MaxBatchDelay, respectively.
//
// Batch is only useful when there are multiple goroutines calling it.
func (db *DB) Batch(fn func(*Tx) error) error {
errCh := make(chan error, 1)
db.batchMu.Lock()
if (db.batch == nil) || (db.batch != nil && len(db.batch.calls) >= db.MaxBatchSize) {
// There is no existing batch, or the existing batch is full; start a new one.
db.batch = &batch{
db: db,
}
db.batch.timer = time.AfterFunc(db.MaxBatchDelay, db.batch.trigger)
}
db.batch.calls = append(db.batch.calls, call{fn: fn, err: errCh})
if len(db.batch.calls) >= db.MaxBatchSize {
// wake up batch, it's ready to run
go db.batch.trigger()
}
db.batchMu.Unlock()
err := <-errCh
if err == trySolo {
err = db.Update(fn)
}
return err
}
type call struct {
fn func(*Tx) error
err chan<- error
}
type batch struct {
db *DB
timer *time.Timer
start sync.Once
calls []call
}
// trigger runs the batch if it hasn't already been run.
func (b *batch) trigger() {
b.start.Do(b.run)
}
// run performs the transactions in the batch and communicates results
// back to DB.Batch.
func (b *batch) run() {
b.db.batchMu.Lock()
b.timer.Stop()
// Make sure no new work is added to this batch, but don't break
// other batches.
if b.db.batch == b {
b.db.batch = nil
}
b.db.batchMu.Unlock()
retry:
for len(b.calls) > 0 {
var failIdx = -1
err := b.db.Update(func(tx *Tx) error {
for i, c := range b.calls {
if err := safelyCall(c.fn, tx); err != nil {
failIdx = i
return err
}
}
return nil
})
if failIdx >= 0 {
// take the failing transaction out of the batch. it's
// safe to shorten b.calls here because db.batch no longer
// points to us, and we hold the mutex anyway.
c := b.calls[failIdx]
b.calls[failIdx], b.calls = b.calls[len(b.calls)-1], b.calls[:len(b.calls)-1]
// tell the submitter re-run it solo, continue with the rest of the batch
c.err <- trySolo
continue retry
}
// pass success, or bolt internal errors, to all callers
for _, c := range b.calls {
if c.err != nil {
c.err <- err
}
}
break retry
}
}
// trySolo is a special sentinel error value used for signaling that a
// transaction function should be re-run. It should never be seen by
// callers.
var trySolo = errors.New("batch function returned an error and should be re-run solo")
type panicked struct {
reason interface{}
}
func (p panicked) Error() string {
if err, ok := p.reason.(error); ok {
return err.Error()
}
return fmt.Sprintf("panic: %v", p.reason)
}
func safelyCall(fn func(*Tx) error, tx *Tx) (err error) {
defer func() {
if p := recover(); p != nil {
err = panicked{p}
}
}()
return fn(tx)
}

View File

@@ -0,0 +1,170 @@
package bolt_test
import (
"bytes"
"encoding/binary"
"errors"
"hash/fnv"
"sync"
"testing"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
)
func validateBatchBench(b *testing.B, db *TestDB) {
var rollback = errors.New("sentinel error to cause rollback")
validate := func(tx *bolt.Tx) error {
bucket := tx.Bucket([]byte("bench"))
h := fnv.New32a()
buf := make([]byte, 4)
for id := uint32(0); id < 1000; id++ {
binary.LittleEndian.PutUint32(buf, id)
h.Reset()
h.Write(buf[:])
k := h.Sum(nil)
v := bucket.Get(k)
if v == nil {
b.Errorf("not found id=%d key=%x", id, k)
continue
}
if g, e := v, []byte("filler"); !bytes.Equal(g, e) {
b.Errorf("bad value for id=%d key=%x: %s != %q", id, k, g, e)
}
if err := bucket.Delete(k); err != nil {
return err
}
}
// should be empty now
c := bucket.Cursor()
for k, v := c.First(); k != nil; k, v = c.Next() {
b.Errorf("unexpected key: %x = %q", k, v)
}
return rollback
}
if err := db.Update(validate); err != nil && err != rollback {
b.Error(err)
}
}
func BenchmarkDBBatchAutomatic(b *testing.B) {
db := NewTestDB()
defer db.Close()
db.MustCreateBucket([]byte("bench"))
b.ResetTimer()
for i := 0; i < b.N; i++ {
start := make(chan struct{})
var wg sync.WaitGroup
for round := 0; round < 1000; round++ {
wg.Add(1)
go func(id uint32) {
defer wg.Done()
<-start
h := fnv.New32a()
buf := make([]byte, 4)
binary.LittleEndian.PutUint32(buf, id)
h.Write(buf[:])
k := h.Sum(nil)
insert := func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("bench"))
return b.Put(k, []byte("filler"))
}
if err := db.Batch(insert); err != nil {
b.Error(err)
return
}
}(uint32(round))
}
close(start)
wg.Wait()
}
b.StopTimer()
validateBatchBench(b, db)
}
func BenchmarkDBBatchSingle(b *testing.B) {
db := NewTestDB()
defer db.Close()
db.MustCreateBucket([]byte("bench"))
b.ResetTimer()
for i := 0; i < b.N; i++ {
start := make(chan struct{})
var wg sync.WaitGroup
for round := 0; round < 1000; round++ {
wg.Add(1)
go func(id uint32) {
defer wg.Done()
<-start
h := fnv.New32a()
buf := make([]byte, 4)
binary.LittleEndian.PutUint32(buf, id)
h.Write(buf[:])
k := h.Sum(nil)
insert := func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("bench"))
return b.Put(k, []byte("filler"))
}
if err := db.Update(insert); err != nil {
b.Error(err)
return
}
}(uint32(round))
}
close(start)
wg.Wait()
}
b.StopTimer()
validateBatchBench(b, db)
}
func BenchmarkDBBatchManual10x100(b *testing.B) {
db := NewTestDB()
defer db.Close()
db.MustCreateBucket([]byte("bench"))
b.ResetTimer()
for i := 0; i < b.N; i++ {
start := make(chan struct{})
var wg sync.WaitGroup
for major := 0; major < 10; major++ {
wg.Add(1)
go func(id uint32) {
defer wg.Done()
<-start
insert100 := func(tx *bolt.Tx) error {
h := fnv.New32a()
buf := make([]byte, 4)
for minor := uint32(0); minor < 100; minor++ {
binary.LittleEndian.PutUint32(buf, uint32(id*100+minor))
h.Reset()
h.Write(buf[:])
k := h.Sum(nil)
b := tx.Bucket([]byte("bench"))
if err := b.Put(k, []byte("filler")); err != nil {
return err
}
}
return nil
}
if err := db.Update(insert100); err != nil {
b.Fatal(err)
}
}(uint32(major))
}
close(start)
wg.Wait()
}
b.StopTimer()
validateBatchBench(b, db)
}

View File

@@ -0,0 +1,148 @@
package bolt_test
import (
"encoding/binary"
"fmt"
"io/ioutil"
"log"
"math/rand"
"net/http"
"net/http/httptest"
"os"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
)
// Set this to see how the counts are actually updated.
const verbose = false
// Counter updates a counter in Bolt for every URL path requested.
type counter struct {
db *bolt.DB
}
func (c counter) ServeHTTP(rw http.ResponseWriter, req *http.Request) {
// Communicates the new count from a successful database
// transaction.
var result uint64
increment := func(tx *bolt.Tx) error {
b, err := tx.CreateBucketIfNotExists([]byte("hits"))
if err != nil {
return err
}
key := []byte(req.URL.String())
// Decode handles key not found for us.
count := decode(b.Get(key)) + 1
b.Put(key, encode(count))
// All good, communicate new count.
result = count
return nil
}
if err := c.db.Batch(increment); err != nil {
http.Error(rw, err.Error(), 500)
return
}
if verbose {
log.Printf("server: %s: %d", req.URL.String(), result)
}
rw.Header().Set("Content-Type", "application/octet-stream")
fmt.Fprintf(rw, "%d\n", result)
}
func client(id int, base string, paths []string) error {
// Process paths in random order.
rng := rand.New(rand.NewSource(int64(id)))
permutation := rng.Perm(len(paths))
for i := range paths {
path := paths[permutation[i]]
resp, err := http.Get(base + path)
if err != nil {
return err
}
defer resp.Body.Close()
buf, err := ioutil.ReadAll(resp.Body)
if err != nil {
return err
}
if verbose {
log.Printf("client: %s: %s", path, buf)
}
}
return nil
}
func ExampleDB_Batch() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Start our web server
count := counter{db}
srv := httptest.NewServer(count)
defer srv.Close()
// Decrease the batch size to make things more interesting.
db.MaxBatchSize = 3
// Get every path multiple times concurrently.
const clients = 10
paths := []string{
"/foo",
"/bar",
"/baz",
"/quux",
"/thud",
"/xyzzy",
}
errors := make(chan error, clients)
for i := 0; i < clients; i++ {
go func(id int) {
errors <- client(id, srv.URL, paths)
}(i)
}
// Check all responses to make sure there's no error.
for i := 0; i < clients; i++ {
if err := <-errors; err != nil {
fmt.Printf("client error: %v", err)
return
}
}
// Check the final result
db.View(func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("hits"))
c := b.Cursor()
for k, v := c.First(); k != nil; k, v = c.Next() {
fmt.Printf("hits to %s: %d\n", k, decode(v))
}
return nil
})
// Output:
// hits to /bar: 10
// hits to /baz: 10
// hits to /foo: 10
// hits to /quux: 10
// hits to /thud: 10
// hits to /xyzzy: 10
}
// encode marshals a counter.
func encode(n uint64) []byte {
buf := make([]byte, 8)
binary.BigEndian.PutUint64(buf, n)
return buf
}
// decode unmarshals a counter. Nil buffers are decoded as 0.
func decode(buf []byte) uint64 {
if buf == nil {
return 0
}
return binary.BigEndian.Uint64(buf)
}

View File

@@ -0,0 +1,167 @@
package bolt_test
import (
"testing"
"time"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
)
// Ensure two functions can perform updates in a single batch.
func TestDB_Batch(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.MustCreateBucket([]byte("widgets"))
// Iterate over multiple updates in separate goroutines.
n := 2
ch := make(chan error)
for i := 0; i < n; i++ {
go func(i int) {
ch <- db.Batch(func(tx *bolt.Tx) error {
return tx.Bucket([]byte("widgets")).Put(u64tob(uint64(i)), []byte{})
})
}(i)
}
// Check all responses to make sure there's no error.
for i := 0; i < n; i++ {
if err := <-ch; err != nil {
t.Fatal(err)
}
}
// Ensure data is correct.
db.MustView(func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("widgets"))
for i := 0; i < n; i++ {
if v := b.Get(u64tob(uint64(i))); v == nil {
t.Errorf("key not found: %d", i)
}
}
return nil
})
}
func TestDB_Batch_Panic(t *testing.T) {
db := NewTestDB()
defer db.Close()
var sentinel int
var bork = &sentinel
var problem interface{}
var err error
// Execute a function inside a batch that panics.
func() {
defer func() {
if p := recover(); p != nil {
problem = p
}
}()
err = db.Batch(func(tx *bolt.Tx) error {
panic(bork)
})
}()
// Verify there is no error.
if g, e := err, error(nil); g != e {
t.Fatalf("wrong error: %v != %v", g, e)
}
// Verify the panic was captured.
if g, e := problem, bork; g != e {
t.Fatalf("wrong error: %v != %v", g, e)
}
}
func TestDB_BatchFull(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.MustCreateBucket([]byte("widgets"))
const size = 3
// buffered so we never leak goroutines
ch := make(chan error, size)
put := func(i int) {
ch <- db.Batch(func(tx *bolt.Tx) error {
return tx.Bucket([]byte("widgets")).Put(u64tob(uint64(i)), []byte{})
})
}
db.MaxBatchSize = size
// high enough to never trigger here
db.MaxBatchDelay = 1 * time.Hour
go put(1)
go put(2)
// Give the batch a chance to exhibit bugs.
time.Sleep(10 * time.Millisecond)
// not triggered yet
select {
case <-ch:
t.Fatalf("batch triggered too early")
default:
}
go put(3)
// Check all responses to make sure there's no error.
for i := 0; i < size; i++ {
if err := <-ch; err != nil {
t.Fatal(err)
}
}
// Ensure data is correct.
db.MustView(func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("widgets"))
for i := 1; i <= size; i++ {
if v := b.Get(u64tob(uint64(i))); v == nil {
t.Errorf("key not found: %d", i)
}
}
return nil
})
}
func TestDB_BatchTime(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.MustCreateBucket([]byte("widgets"))
const size = 1
// buffered so we never leak goroutines
ch := make(chan error, size)
put := func(i int) {
ch <- db.Batch(func(tx *bolt.Tx) error {
return tx.Bucket([]byte("widgets")).Put(u64tob(uint64(i)), []byte{})
})
}
db.MaxBatchSize = 1000
db.MaxBatchDelay = 0
go put(1)
// Batch must trigger by time alone.
// Check all responses to make sure there's no error.
for i := 0; i < size; i++ {
if err := <-ch; err != nil {
t.Fatal(err)
}
}
// Ensure data is correct.
db.MustView(func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("widgets"))
for i := 1; i <= size; i++ {
if v := b.Get(u64tob(uint64(i))); v == nil {
t.Errorf("key not found: %d", i)
}
}
return nil
})
}

View File

@@ -1,9 +0,0 @@
// +build arm64
package bolt
// maxMapSize represents the largest mmap size supported by Bolt.
const maxMapSize = 0xFFFFFFFFFFFF // 256TB
// maxAllocSize is the size used when creating array pointers.
const maxAllocSize = 0x7FFFFFFF

View File

@@ -4,6 +4,8 @@ import (
"syscall"
)
var odirect = syscall.O_DIRECT
// fdatasync flushes written data to a file descriptor.
func fdatasync(db *DB) error {
return syscall.Fdatasync(int(db.file.Fd()))

View File

@@ -11,6 +11,8 @@ const (
msInvalidate // invalidate cached data
)
var odirect int
func msync(db *DB) error {
_, _, errno := syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(db.data)), uintptr(db.datasz), msInvalidate)
if errno != 0 {

View File

@@ -1,9 +0,0 @@
// +build ppc64le
package bolt
// maxMapSize represents the largest mmap size supported by Bolt.
const maxMapSize = 0xFFFFFFFFFFFF // 256TB
// maxAllocSize is the size used when creating array pointers.
const maxAllocSize = 0x7FFFFFFF

View File

@@ -1,9 +0,0 @@
// +build s390x
package bolt
// maxMapSize represents the largest mmap size supported by Bolt.
const maxMapSize = 0xFFFFFFFFFFFF // 256TB
// maxAllocSize is the size used when creating array pointers.
const maxAllocSize = 0x7FFFFFFF

View File

@@ -0,0 +1,36 @@
package bolt_test
import (
"fmt"
"path/filepath"
"reflect"
"runtime"
"testing"
)
// assert fails the test if the condition is false.
func assert(tb testing.TB, condition bool, msg string, v ...interface{}) {
if !condition {
_, file, line, _ := runtime.Caller(1)
fmt.Printf("\033[31m%s:%d: "+msg+"\033[39m\n\n", append([]interface{}{filepath.Base(file), line}, v...)...)
tb.FailNow()
}
}
// ok fails the test if an err is not nil.
func ok(tb testing.TB, err error) {
if err != nil {
_, file, line, _ := runtime.Caller(1)
fmt.Printf("\033[31m%s:%d: unexpected error: %s\033[39m\n\n", filepath.Base(file), line, err.Error())
tb.FailNow()
}
}
// equals fails the test if exp is not equal to act.
func equals(tb testing.TB, exp, act interface{}) {
if !reflect.DeepEqual(exp, act) {
_, file, line, _ := runtime.Caller(1)
fmt.Printf("\033[31m%s:%d:\n\n\texp: %#v\n\n\tgot: %#v\033[39m\n\n", filepath.Base(file), line, exp, act)
tb.FailNow()
}
}

View File

@@ -46,8 +46,19 @@ func funlock(f *os.File) error {
// mmap memory maps a DB's data file.
func mmap(db *DB, sz int) error {
// Truncate and fsync to ensure file size metadata is flushed.
// https://github.com/boltdb/bolt/issues/284
if !db.NoGrowSync && !db.readOnly {
if err := db.file.Truncate(int64(sz)); err != nil {
return fmt.Errorf("file resize error: %s", err)
}
if err := db.file.Sync(); err != nil {
return fmt.Errorf("file sync error: %s", err)
}
}
// Map the data file to memory.
b, err := syscall.Mmap(int(db.file.Fd()), 0, sz, syscall.PROT_READ, syscall.MAP_SHARED|db.MmapFlags)
b, err := syscall.Mmap(int(db.file.Fd()), 0, sz, syscall.PROT_READ, syscall.MAP_SHARED)
if err != nil {
return err
}

View File

@@ -2,12 +2,11 @@ package bolt
import (
"fmt"
"github.com/coreos/etcd/Godeps/_workspace/src/golang.org/x/sys/unix"
"os"
"syscall"
"time"
"unsafe"
"github.com/coreos/etcd/Godeps/_workspace/src/golang.org/x/sys/unix"
)
// flock acquires an advisory lock on a file descriptor.
@@ -56,8 +55,19 @@ func funlock(f *os.File) error {
// mmap memory maps a DB's data file.
func mmap(db *DB, sz int) error {
// Truncate and fsync to ensure file size metadata is flushed.
// https://github.com/boltdb/bolt/issues/284
if !db.NoGrowSync && !db.readOnly {
if err := db.file.Truncate(int64(sz)); err != nil {
return fmt.Errorf("file resize error: %s", err)
}
if err := db.file.Sync(); err != nil {
return fmt.Errorf("file sync error: %s", err)
}
}
// Map the data file to memory.
b, err := unix.Mmap(int(db.file.Fd()), 0, sz, syscall.PROT_READ, syscall.MAP_SHARED|db.MmapFlags)
b, err := unix.Mmap(int(db.file.Fd()), 0, sz, syscall.PROT_READ, syscall.MAP_SHARED)
if err != nil {
return err
}

View File

@@ -8,37 +8,7 @@ import (
"unsafe"
)
// LockFileEx code derived from golang build filemutex_windows.go @ v1.5.1
var (
modkernel32 = syscall.NewLazyDLL("kernel32.dll")
procLockFileEx = modkernel32.NewProc("LockFileEx")
procUnlockFileEx = modkernel32.NewProc("UnlockFileEx")
)
const (
// see https://msdn.microsoft.com/en-us/library/windows/desktop/aa365203(v=vs.85).aspx
flagLockExclusive = 2
flagLockFailImmediately = 1
// see https://msdn.microsoft.com/en-us/library/windows/desktop/ms681382(v=vs.85).aspx
errLockViolation syscall.Errno = 0x21
)
func lockFileEx(h syscall.Handle, flags, reserved, locklow, lockhigh uint32, ol *syscall.Overlapped) (err error) {
r, _, err := procLockFileEx.Call(uintptr(h), uintptr(flags), uintptr(reserved), uintptr(locklow), uintptr(lockhigh), uintptr(unsafe.Pointer(ol)))
if r == 0 {
return err
}
return nil
}
func unlockFileEx(h syscall.Handle, reserved, locklow, lockhigh uint32, ol *syscall.Overlapped) (err error) {
r, _, err := procUnlockFileEx.Call(uintptr(h), uintptr(reserved), uintptr(locklow), uintptr(lockhigh), uintptr(unsafe.Pointer(ol)), 0)
if r == 0 {
return err
}
return nil
}
var odirect int
// fdatasync flushes written data to a file descriptor.
func fdatasync(db *DB) error {
@@ -46,37 +16,13 @@ func fdatasync(db *DB) error {
}
// flock acquires an advisory lock on a file descriptor.
func flock(f *os.File, exclusive bool, timeout time.Duration) error {
var t time.Time
for {
// If we're beyond our timeout then return an error.
// This can only occur after we've attempted a flock once.
if t.IsZero() {
t = time.Now()
} else if timeout > 0 && time.Since(t) > timeout {
return ErrTimeout
}
var flag uint32 = flagLockFailImmediately
if exclusive {
flag |= flagLockExclusive
}
err := lockFileEx(syscall.Handle(f.Fd()), flag, 0, 1, 0, &syscall.Overlapped{})
if err == nil {
return nil
} else if err != errLockViolation {
return err
}
// Wait for a bit and try again.
time.Sleep(50 * time.Millisecond)
}
func flock(f *os.File, _ bool, _ time.Duration) error {
return nil
}
// funlock releases an advisory lock on a file descriptor.
func funlock(f *os.File) error {
return unlockFileEx(syscall.Handle(f.Fd()), 0, 1, 0, &syscall.Overlapped{})
return nil
}
// mmap memory maps a DB's data file.

View File

@@ -2,6 +2,8 @@
package bolt
var odirect int
// fdatasync flushes written data to a file descriptor.
func fdatasync(db *DB) error {
return db.file.Sync()

View File

@@ -11,7 +11,7 @@ const (
MaxKeySize = 32768
// MaxValueSize is the maximum length of a value, in bytes.
MaxValueSize = (1 << 31) - 2
MaxValueSize = 4294967295
)
const (
@@ -99,7 +99,6 @@ func (b *Bucket) Cursor() *Cursor {
// Bucket retrieves a nested bucket by name.
// Returns nil if the bucket does not exist.
// The bucket instance is only valid for the lifetime of the transaction.
func (b *Bucket) Bucket(name []byte) *Bucket {
if b.buckets != nil {
if child := b.buckets[string(name)]; child != nil {
@@ -149,7 +148,6 @@ func (b *Bucket) openBucket(value []byte) *Bucket {
// CreateBucket creates a new bucket at the given key and returns the new bucket.
// Returns an error if the key already exists, if the bucket name is blank, or if the bucket name is too long.
// The bucket instance is only valid for the lifetime of the transaction.
func (b *Bucket) CreateBucket(key []byte) (*Bucket, error) {
if b.tx.db == nil {
return nil, ErrTxClosed
@@ -194,7 +192,6 @@ func (b *Bucket) CreateBucket(key []byte) (*Bucket, error) {
// CreateBucketIfNotExists creates a new bucket if it doesn't already exist and returns a reference to it.
// Returns an error if the bucket name is blank, or if the bucket name is too long.
// The bucket instance is only valid for the lifetime of the transaction.
func (b *Bucket) CreateBucketIfNotExists(key []byte) (*Bucket, error) {
child, err := b.CreateBucket(key)
if err == ErrBucketExists {
@@ -273,7 +270,6 @@ func (b *Bucket) Get(key []byte) []byte {
// Put sets the value for a key in the bucket.
// If the key exist then its previous value will be overwritten.
// Supplied value must remain valid for the life of the transaction.
// Returns an error if the bucket was created from a read-only transaction, if the key is blank, if the key is too large, or if the value is too large.
func (b *Bucket) Put(key []byte, value []byte) error {
if b.tx.db == nil {
@@ -350,8 +346,7 @@ func (b *Bucket) NextSequence() (uint64, error) {
// ForEach executes a function for each key/value pair in a bucket.
// If the provided function returns an error then the iteration is stopped and
// the error is returned to the caller. The provided function must not modify
// the bucket; this will result in undefined behavior.
// the error is returned to the caller.
func (b *Bucket) ForEach(fn func(k, v []byte) error) error {
if b.tx.db == nil {
return ErrTxClosed

File diff suppressed because it is too large Load Diff

View File

@@ -825,10 +825,7 @@ func (cmd *StatsCommand) Run(args ...string) error {
fmt.Fprintln(cmd.Stdout, "Bucket statistics")
fmt.Fprintf(cmd.Stdout, "\tTotal number of buckets: %d\n", s.BucketN)
percentage = 0
if s.BucketN != 0 {
percentage = int(float32(s.InlineBucketN) * 100.0 / float32(s.BucketN))
}
percentage = int(float32(s.InlineBucketN) * 100.0 / float32(s.BucketN))
fmt.Fprintf(cmd.Stdout, "\tTotal number on inlined buckets: %d (%d%%)\n", s.InlineBucketN, percentage)
percentage = 0
if s.LeafInuse != 0 {

View File

@@ -0,0 +1,145 @@
package main_test
import (
"bytes"
"io/ioutil"
"os"
"strconv"
"testing"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt/cmd/bolt"
)
// Ensure the "info" command can print information about a database.
func TestInfoCommand_Run(t *testing.T) {
db := MustOpen(0666, nil)
db.DB.Close()
defer db.Close()
// Run the info command.
m := NewMain()
if err := m.Run("info", db.Path); err != nil {
t.Fatal(err)
}
}
// Ensure the "stats" command can execute correctly.
func TestStatsCommand_Run(t *testing.T) {
// Ignore
if os.Getpagesize() != 4096 {
t.Skip("system does not use 4KB page size")
}
db := MustOpen(0666, nil)
defer db.Close()
if err := db.Update(func(tx *bolt.Tx) error {
// Create "foo" bucket.
b, err := tx.CreateBucket([]byte("foo"))
if err != nil {
return err
}
for i := 0; i < 10; i++ {
if err := b.Put([]byte(strconv.Itoa(i)), []byte(strconv.Itoa(i))); err != nil {
return err
}
}
// Create "bar" bucket.
b, err = tx.CreateBucket([]byte("bar"))
if err != nil {
return err
}
for i := 0; i < 100; i++ {
if err := b.Put([]byte(strconv.Itoa(i)), []byte(strconv.Itoa(i))); err != nil {
return err
}
}
// Create "baz" bucket.
b, err = tx.CreateBucket([]byte("baz"))
if err != nil {
return err
}
if err := b.Put([]byte("key"), []byte("value")); err != nil {
return err
}
return nil
}); err != nil {
t.Fatal(err)
}
db.DB.Close()
// Generate expected result.
exp := "Aggregate statistics for 3 buckets\n\n" +
"Page count statistics\n" +
"\tNumber of logical branch pages: 0\n" +
"\tNumber of physical branch overflow pages: 0\n" +
"\tNumber of logical leaf pages: 1\n" +
"\tNumber of physical leaf overflow pages: 0\n" +
"Tree statistics\n" +
"\tNumber of keys/value pairs: 111\n" +
"\tNumber of levels in B+tree: 1\n" +
"Page size utilization\n" +
"\tBytes allocated for physical branch pages: 0\n" +
"\tBytes actually used for branch data: 0 (0%)\n" +
"\tBytes allocated for physical leaf pages: 4096\n" +
"\tBytes actually used for leaf data: 1996 (48%)\n" +
"Bucket statistics\n" +
"\tTotal number of buckets: 3\n" +
"\tTotal number on inlined buckets: 2 (66%)\n" +
"\tBytes used for inlined buckets: 236 (11%)\n"
// Run the command.
m := NewMain()
if err := m.Run("stats", db.Path); err != nil {
t.Fatal(err)
} else if m.Stdout.String() != exp {
t.Fatalf("unexpected stdout:\n\n%s", m.Stdout.String())
}
}
// Main represents a test wrapper for main.Main that records output.
type Main struct {
*main.Main
Stdin bytes.Buffer
Stdout bytes.Buffer
Stderr bytes.Buffer
}
// NewMain returns a new instance of Main.
func NewMain() *Main {
m := &Main{Main: main.NewMain()}
m.Main.Stdin = &m.Stdin
m.Main.Stdout = &m.Stdout
m.Main.Stderr = &m.Stderr
return m
}
// MustOpen creates a Bolt database in a temporary location.
func MustOpen(mode os.FileMode, options *bolt.Options) *DB {
// Create temporary path.
f, _ := ioutil.TempFile("", "bolt-")
f.Close()
os.Remove(f.Name())
db, err := bolt.Open(f.Name(), mode, options)
if err != nil {
panic(err.Error())
}
return &DB{DB: db, Path: f.Name()}
}
// DB is a test wrapper for bolt.DB.
type DB struct {
*bolt.DB
Path string
}
// Close closes and removes the database.
func (db *DB) Close() error {
defer os.Remove(db.Path)
return db.DB.Close()
}

View File

@@ -34,13 +34,6 @@ func (c *Cursor) First() (key []byte, value []byte) {
p, n := c.bucket.pageNode(c.bucket.root)
c.stack = append(c.stack, elemRef{page: p, node: n, index: 0})
c.first()
// If we land on an empty page then move to the next value.
// https://github.com/boltdb/bolt/issues/450
if c.stack[len(c.stack)-1].count() == 0 {
c.next()
}
k, v, flags := c.keyValue()
if (flags & uint32(bucketLeafFlag)) != 0 {
return k, nil
@@ -216,37 +209,28 @@ func (c *Cursor) last() {
// next moves to the next leaf element and returns the key and value.
// If the cursor is at the last leaf element then it stays there and returns nil.
func (c *Cursor) next() (key []byte, value []byte, flags uint32) {
for {
// Attempt to move over one element until we're successful.
// Move up the stack as we hit the end of each page in our stack.
var i int
for i = len(c.stack) - 1; i >= 0; i-- {
elem := &c.stack[i]
if elem.index < elem.count()-1 {
elem.index++
break
}
// Attempt to move over one element until we're successful.
// Move up the stack as we hit the end of each page in our stack.
var i int
for i = len(c.stack) - 1; i >= 0; i-- {
elem := &c.stack[i]
if elem.index < elem.count()-1 {
elem.index++
break
}
// If we've hit the root page then stop and return. This will leave the
// cursor on the last element of the last page.
if i == -1 {
return nil, nil, 0
}
// Otherwise start from where we left off in the stack and find the
// first element of the first leaf page.
c.stack = c.stack[:i+1]
c.first()
// If this is an empty page then restart and move back up the stack.
// https://github.com/boltdb/bolt/issues/450
if c.stack[len(c.stack)-1].count() == 0 {
continue
}
return c.keyValue()
}
// If we've hit the root page then stop and return. This will leave the
// cursor on the last element of the last page.
if i == -1 {
return nil, nil, 0
}
// Otherwise start from where we left off in the stack and find the
// first element of the first leaf page.
c.stack = c.stack[:i+1]
c.first()
return c.keyValue()
}
// search recursively performs a binary search against a given page/node until it finds a given key.

View File

@@ -0,0 +1,511 @@
package bolt_test
import (
"bytes"
"encoding/binary"
"fmt"
"os"
"sort"
"testing"
"testing/quick"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
)
// Ensure that a cursor can return a reference to the bucket that created it.
func TestCursor_Bucket(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
b, _ := tx.CreateBucket([]byte("widgets"))
c := b.Cursor()
equals(t, b, c.Bucket())
return nil
})
}
// Ensure that a Tx cursor can seek to the appropriate keys.
func TestCursor_Seek(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("widgets"))
ok(t, err)
ok(t, b.Put([]byte("foo"), []byte("0001")))
ok(t, b.Put([]byte("bar"), []byte("0002")))
ok(t, b.Put([]byte("baz"), []byte("0003")))
_, err = b.CreateBucket([]byte("bkt"))
ok(t, err)
return nil
})
db.View(func(tx *bolt.Tx) error {
c := tx.Bucket([]byte("widgets")).Cursor()
// Exact match should go to the key.
k, v := c.Seek([]byte("bar"))
equals(t, []byte("bar"), k)
equals(t, []byte("0002"), v)
// Inexact match should go to the next key.
k, v = c.Seek([]byte("bas"))
equals(t, []byte("baz"), k)
equals(t, []byte("0003"), v)
// Low key should go to the first key.
k, v = c.Seek([]byte(""))
equals(t, []byte("bar"), k)
equals(t, []byte("0002"), v)
// High key should return no key.
k, v = c.Seek([]byte("zzz"))
assert(t, k == nil, "")
assert(t, v == nil, "")
// Buckets should return their key but no value.
k, v = c.Seek([]byte("bkt"))
equals(t, []byte("bkt"), k)
assert(t, v == nil, "")
return nil
})
}
func TestCursor_Delete(t *testing.T) {
db := NewTestDB()
defer db.Close()
var count = 1000
// Insert every other key between 0 and $count.
db.Update(func(tx *bolt.Tx) error {
b, _ := tx.CreateBucket([]byte("widgets"))
for i := 0; i < count; i += 1 {
k := make([]byte, 8)
binary.BigEndian.PutUint64(k, uint64(i))
b.Put(k, make([]byte, 100))
}
b.CreateBucket([]byte("sub"))
return nil
})
db.Update(func(tx *bolt.Tx) error {
c := tx.Bucket([]byte("widgets")).Cursor()
bound := make([]byte, 8)
binary.BigEndian.PutUint64(bound, uint64(count/2))
for key, _ := c.First(); bytes.Compare(key, bound) < 0; key, _ = c.Next() {
if err := c.Delete(); err != nil {
return err
}
}
c.Seek([]byte("sub"))
err := c.Delete()
equals(t, err, bolt.ErrIncompatibleValue)
return nil
})
db.View(func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("widgets"))
equals(t, b.Stats().KeyN, count/2+1)
return nil
})
}
// Ensure that a Tx cursor can seek to the appropriate keys when there are a
// large number of keys. This test also checks that seek will always move
// forward to the next key.
//
// Related: https://github.com/boltdb/bolt/pull/187
func TestCursor_Seek_Large(t *testing.T) {
db := NewTestDB()
defer db.Close()
var count = 10000
// Insert every other key between 0 and $count.
db.Update(func(tx *bolt.Tx) error {
b, _ := tx.CreateBucket([]byte("widgets"))
for i := 0; i < count; i += 100 {
for j := i; j < i+100; j += 2 {
k := make([]byte, 8)
binary.BigEndian.PutUint64(k, uint64(j))
b.Put(k, make([]byte, 100))
}
}
return nil
})
db.View(func(tx *bolt.Tx) error {
c := tx.Bucket([]byte("widgets")).Cursor()
for i := 0; i < count; i++ {
seek := make([]byte, 8)
binary.BigEndian.PutUint64(seek, uint64(i))
k, _ := c.Seek(seek)
// The last seek is beyond the end of the the range so
// it should return nil.
if i == count-1 {
assert(t, k == nil, "")
continue
}
// Otherwise we should seek to the exact key or the next key.
num := binary.BigEndian.Uint64(k)
if i%2 == 0 {
equals(t, uint64(i), num)
} else {
equals(t, uint64(i+1), num)
}
}
return nil
})
}
// Ensure that a cursor can iterate over an empty bucket without error.
func TestCursor_EmptyBucket(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
db.View(func(tx *bolt.Tx) error {
c := tx.Bucket([]byte("widgets")).Cursor()
k, v := c.First()
assert(t, k == nil, "")
assert(t, v == nil, "")
return nil
})
}
// Ensure that a Tx cursor can reverse iterate over an empty bucket without error.
func TestCursor_EmptyBucketReverse(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
db.View(func(tx *bolt.Tx) error {
c := tx.Bucket([]byte("widgets")).Cursor()
k, v := c.Last()
assert(t, k == nil, "")
assert(t, v == nil, "")
return nil
})
}
// Ensure that a Tx cursor can iterate over a single root with a couple elements.
func TestCursor_Iterate_Leaf(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("baz"), []byte{})
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte{0})
tx.Bucket([]byte("widgets")).Put([]byte("bar"), []byte{1})
return nil
})
tx, _ := db.Begin(false)
c := tx.Bucket([]byte("widgets")).Cursor()
k, v := c.First()
equals(t, string(k), "bar")
equals(t, v, []byte{1})
k, v = c.Next()
equals(t, string(k), "baz")
equals(t, v, []byte{})
k, v = c.Next()
equals(t, string(k), "foo")
equals(t, v, []byte{0})
k, v = c.Next()
assert(t, k == nil, "")
assert(t, v == nil, "")
k, v = c.Next()
assert(t, k == nil, "")
assert(t, v == nil, "")
tx.Rollback()
}
// Ensure that a Tx cursor can iterate in reverse over a single root with a couple elements.
func TestCursor_LeafRootReverse(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("baz"), []byte{})
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte{0})
tx.Bucket([]byte("widgets")).Put([]byte("bar"), []byte{1})
return nil
})
tx, _ := db.Begin(false)
c := tx.Bucket([]byte("widgets")).Cursor()
k, v := c.Last()
equals(t, string(k), "foo")
equals(t, v, []byte{0})
k, v = c.Prev()
equals(t, string(k), "baz")
equals(t, v, []byte{})
k, v = c.Prev()
equals(t, string(k), "bar")
equals(t, v, []byte{1})
k, v = c.Prev()
assert(t, k == nil, "")
assert(t, v == nil, "")
k, v = c.Prev()
assert(t, k == nil, "")
assert(t, v == nil, "")
tx.Rollback()
}
// Ensure that a Tx cursor can restart from the beginning.
func TestCursor_Restart(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("bar"), []byte{})
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte{})
return nil
})
tx, _ := db.Begin(false)
c := tx.Bucket([]byte("widgets")).Cursor()
k, _ := c.First()
equals(t, string(k), "bar")
k, _ = c.Next()
equals(t, string(k), "foo")
k, _ = c.First()
equals(t, string(k), "bar")
k, _ = c.Next()
equals(t, string(k), "foo")
tx.Rollback()
}
// Ensure that a Tx can iterate over all elements in a bucket.
func TestCursor_QuickCheck(t *testing.T) {
f := func(items testdata) bool {
db := NewTestDB()
defer db.Close()
// Bulk insert all values.
tx, _ := db.Begin(true)
tx.CreateBucket([]byte("widgets"))
b := tx.Bucket([]byte("widgets"))
for _, item := range items {
ok(t, b.Put(item.Key, item.Value))
}
ok(t, tx.Commit())
// Sort test data.
sort.Sort(items)
// Iterate over all items and check consistency.
var index = 0
tx, _ = db.Begin(false)
c := tx.Bucket([]byte("widgets")).Cursor()
for k, v := c.First(); k != nil && index < len(items); k, v = c.Next() {
equals(t, k, items[index].Key)
equals(t, v, items[index].Value)
index++
}
equals(t, len(items), index)
tx.Rollback()
return true
}
if err := quick.Check(f, qconfig()); err != nil {
t.Error(err)
}
}
// Ensure that a transaction can iterate over all elements in a bucket in reverse.
func TestCursor_QuickCheck_Reverse(t *testing.T) {
f := func(items testdata) bool {
db := NewTestDB()
defer db.Close()
// Bulk insert all values.
tx, _ := db.Begin(true)
tx.CreateBucket([]byte("widgets"))
b := tx.Bucket([]byte("widgets"))
for _, item := range items {
ok(t, b.Put(item.Key, item.Value))
}
ok(t, tx.Commit())
// Sort test data.
sort.Sort(revtestdata(items))
// Iterate over all items and check consistency.
var index = 0
tx, _ = db.Begin(false)
c := tx.Bucket([]byte("widgets")).Cursor()
for k, v := c.Last(); k != nil && index < len(items); k, v = c.Prev() {
equals(t, k, items[index].Key)
equals(t, v, items[index].Value)
index++
}
equals(t, len(items), index)
tx.Rollback()
return true
}
if err := quick.Check(f, qconfig()); err != nil {
t.Error(err)
}
}
// Ensure that a Tx cursor can iterate over subbuckets.
func TestCursor_QuickCheck_BucketsOnly(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("widgets"))
ok(t, err)
_, err = b.CreateBucket([]byte("foo"))
ok(t, err)
_, err = b.CreateBucket([]byte("bar"))
ok(t, err)
_, err = b.CreateBucket([]byte("baz"))
ok(t, err)
return nil
})
db.View(func(tx *bolt.Tx) error {
var names []string
c := tx.Bucket([]byte("widgets")).Cursor()
for k, v := c.First(); k != nil; k, v = c.Next() {
names = append(names, string(k))
assert(t, v == nil, "")
}
equals(t, names, []string{"bar", "baz", "foo"})
return nil
})
}
// Ensure that a Tx cursor can reverse iterate over subbuckets.
func TestCursor_QuickCheck_BucketsOnly_Reverse(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("widgets"))
ok(t, err)
_, err = b.CreateBucket([]byte("foo"))
ok(t, err)
_, err = b.CreateBucket([]byte("bar"))
ok(t, err)
_, err = b.CreateBucket([]byte("baz"))
ok(t, err)
return nil
})
db.View(func(tx *bolt.Tx) error {
var names []string
c := tx.Bucket([]byte("widgets")).Cursor()
for k, v := c.Last(); k != nil; k, v = c.Prev() {
names = append(names, string(k))
assert(t, v == nil, "")
}
equals(t, names, []string{"foo", "baz", "bar"})
return nil
})
}
func ExampleCursor() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Start a read-write transaction.
db.Update(func(tx *bolt.Tx) error {
// Create a new bucket.
tx.CreateBucket([]byte("animals"))
// Insert data into a bucket.
b := tx.Bucket([]byte("animals"))
b.Put([]byte("dog"), []byte("fun"))
b.Put([]byte("cat"), []byte("lame"))
b.Put([]byte("liger"), []byte("awesome"))
// Create a cursor for iteration.
c := b.Cursor()
// Iterate over items in sorted key order. This starts from the
// first key/value pair and updates the k/v variables to the
// next key/value on each iteration.
//
// The loop finishes at the end of the cursor when a nil key is returned.
for k, v := c.First(); k != nil; k, v = c.Next() {
fmt.Printf("A %s is %s.\n", k, v)
}
return nil
})
// Output:
// A cat is lame.
// A dog is fun.
// A liger is awesome.
}
func ExampleCursor_reverse() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Start a read-write transaction.
db.Update(func(tx *bolt.Tx) error {
// Create a new bucket.
tx.CreateBucket([]byte("animals"))
// Insert data into a bucket.
b := tx.Bucket([]byte("animals"))
b.Put([]byte("dog"), []byte("fun"))
b.Put([]byte("cat"), []byte("lame"))
b.Put([]byte("liger"), []byte("awesome"))
// Create a cursor for iteration.
c := b.Cursor()
// Iterate over items in reverse sorted key order. This starts
// from the last key/value pair and updates the k/v variables to
// the previous key/value on each iteration.
//
// The loop finishes at the beginning of the cursor when a nil key
// is returned.
for k, v := c.Last(); k != nil; k, v = c.Prev() {
fmt.Printf("A %s is %s.\n", k, v)
}
return nil
})
// Output:
// A liger is awesome.
// A dog is fun.
// A cat is lame.
}

View File

@@ -1,10 +1,8 @@
package bolt
import (
"errors"
"fmt"
"hash/fnv"
"log"
"os"
"runtime"
"runtime/debug"
@@ -26,14 +24,13 @@ const magic uint32 = 0xED0CDAED
// IgnoreNoSync specifies whether the NoSync field of a DB is ignored when
// syncing changes to a file. This is required as some operating systems,
// such as OpenBSD, do not have a unified buffer cache (UBC) and writes
// must be synchronized using the msync(2) syscall.
// must be synchronzied using the msync(2) syscall.
const IgnoreNoSync = runtime.GOOS == "openbsd"
// Default values if not set in a DB instance.
const (
DefaultMaxBatchSize int = 1000
DefaultMaxBatchDelay = 10 * time.Millisecond
DefaultAllocSize = 16 * 1024 * 1024
)
// DB represents a collection of buckets persisted to a file on disk.
@@ -66,10 +63,6 @@ type DB struct {
// https://github.com/boltdb/bolt/issues/284
NoGrowSync bool
// If you want to read the entire database fast, you can set MmapFlag to
// syscall.MAP_POPULATE on Linux 2.6.23+ for sequential read-ahead.
MmapFlags int
// MaxBatchSize is the maximum size of a batch. Default value is
// copied from DefaultMaxBatchSize in Open.
//
@@ -86,17 +79,11 @@ type DB struct {
// Do not change concurrently with calls to Batch.
MaxBatchDelay time.Duration
// AllocSize is the amount of space allocated when the database
// needs to create new pages. This is done to amortize the cost
// of truncate() and fsync() when growing the data file.
AllocSize int
path string
file *os.File
dataref []byte // mmap'ed readonly, write throws SEGV
data *[maxMapSize]byte
datasz int
filesz int // current on disk file size
meta0 *meta
meta1 *meta
pageSize int
@@ -149,12 +136,10 @@ func Open(path string, mode os.FileMode, options *Options) (*DB, error) {
options = DefaultOptions
}
db.NoGrowSync = options.NoGrowSync
db.MmapFlags = options.MmapFlags
// Set default values for later DB operations.
db.MaxBatchSize = DefaultMaxBatchSize
db.MaxBatchDelay = DefaultMaxBatchDelay
db.AllocSize = DefaultAllocSize
flag := os.O_RDWR
if options.ReadOnly {
@@ -187,7 +172,7 @@ func Open(path string, mode os.FileMode, options *Options) (*DB, error) {
// Initialize the database if it doesn't exist.
if info, err := db.file.Stat(); err != nil {
return nil, err
return nil, fmt.Errorf("stat error: %s", err)
} else if info.Size() == 0 {
// Initialize new files with meta pages.
if err := db.init(); err != nil {
@@ -199,14 +184,14 @@ func Open(path string, mode os.FileMode, options *Options) (*DB, error) {
if _, err := db.file.ReadAt(buf[:], 0); err == nil {
m := db.pageInBuffer(buf[:], 0).meta()
if err := m.validate(); err != nil {
return nil, err
return nil, fmt.Errorf("meta0 error: %s", err)
}
db.pageSize = int(m.pageSize)
}
}
// Memory map the data file.
if err := db.mmap(options.InitialMmapSize); err != nil {
if err := db.mmap(0); err != nil {
_ = db.close()
return nil, err
}
@@ -263,10 +248,10 @@ func (db *DB) mmap(minsz int) error {
// Validate the meta pages.
if err := db.meta0.validate(); err != nil {
return err
return fmt.Errorf("meta0 error: %s", err)
}
if err := db.meta1.validate(); err != nil {
return err
return fmt.Errorf("meta1 error: %s", err)
}
return nil
@@ -281,7 +266,7 @@ func (db *DB) munmap() error {
}
// mmapSize determines the appropriate size for the mmap given the current size
// of the database. The minimum size is 32KB and doubles until it reaches 1GB.
// of the database. The minimum size is 1MB and doubles until it reaches 1GB.
// Returns an error if the new mmap size is greater than the max allowed.
func (db *DB) mmapSize(size int) (int, error) {
// Double the size from 32KB until 1GB.
@@ -397,9 +382,7 @@ func (db *DB) close() error {
// No need to unlock read-only file.
if !db.readOnly {
// Unlock the file.
if err := funlock(db.file); err != nil {
log.Printf("bolt.Close(): funlock error: %s", err)
}
_ = funlock(db.file)
}
// Close the file descriptor.
@@ -418,15 +401,11 @@ func (db *DB) close() error {
// will cause the calls to block and be serialized until the current write
// transaction finishes.
//
// Transactions should not be dependent on one another. Opening a read
// Transactions should not be depedent on one another. Opening a read
// transaction and a write transaction in the same goroutine can cause the
// writer to deadlock because the database periodically needs to re-mmap itself
// as it grows and it cannot do that while a read transaction is open.
//
// If a long running read transaction (for example, a snapshot transaction) is
// needed, you might want to set DB.InitialMmapSize to a large enough value
// to avoid potential blocking of write transaction.
//
// IMPORTANT: You must close read-only transactions after you are finished or
// else the database will not reclaim old pages.
func (db *DB) Begin(writable bool) (*Tx, error) {
@@ -610,136 +589,6 @@ func (db *DB) View(fn func(*Tx) error) error {
return nil
}
// Batch calls fn as part of a batch. It behaves similar to Update,
// except:
//
// 1. concurrent Batch calls can be combined into a single Bolt
// transaction.
//
// 2. the function passed to Batch may be called multiple times,
// regardless of whether it returns error or not.
//
// This means that Batch function side effects must be idempotent and
// take permanent effect only after a successful return is seen in
// caller.
//
// The maximum batch size and delay can be adjusted with DB.MaxBatchSize
// and DB.MaxBatchDelay, respectively.
//
// Batch is only useful when there are multiple goroutines calling it.
func (db *DB) Batch(fn func(*Tx) error) error {
errCh := make(chan error, 1)
db.batchMu.Lock()
if (db.batch == nil) || (db.batch != nil && len(db.batch.calls) >= db.MaxBatchSize) {
// There is no existing batch, or the existing batch is full; start a new one.
db.batch = &batch{
db: db,
}
db.batch.timer = time.AfterFunc(db.MaxBatchDelay, db.batch.trigger)
}
db.batch.calls = append(db.batch.calls, call{fn: fn, err: errCh})
if len(db.batch.calls) >= db.MaxBatchSize {
// wake up batch, it's ready to run
go db.batch.trigger()
}
db.batchMu.Unlock()
err := <-errCh
if err == trySolo {
err = db.Update(fn)
}
return err
}
type call struct {
fn func(*Tx) error
err chan<- error
}
type batch struct {
db *DB
timer *time.Timer
start sync.Once
calls []call
}
// trigger runs the batch if it hasn't already been run.
func (b *batch) trigger() {
b.start.Do(b.run)
}
// run performs the transactions in the batch and communicates results
// back to DB.Batch.
func (b *batch) run() {
b.db.batchMu.Lock()
b.timer.Stop()
// Make sure no new work is added to this batch, but don't break
// other batches.
if b.db.batch == b {
b.db.batch = nil
}
b.db.batchMu.Unlock()
retry:
for len(b.calls) > 0 {
var failIdx = -1
err := b.db.Update(func(tx *Tx) error {
for i, c := range b.calls {
if err := safelyCall(c.fn, tx); err != nil {
failIdx = i
return err
}
}
return nil
})
if failIdx >= 0 {
// take the failing transaction out of the batch. it's
// safe to shorten b.calls here because db.batch no longer
// points to us, and we hold the mutex anyway.
c := b.calls[failIdx]
b.calls[failIdx], b.calls = b.calls[len(b.calls)-1], b.calls[:len(b.calls)-1]
// tell the submitter re-run it solo, continue with the rest of the batch
c.err <- trySolo
continue retry
}
// pass success, or bolt internal errors, to all callers
for _, c := range b.calls {
if c.err != nil {
c.err <- err
}
}
break retry
}
}
// trySolo is a special sentinel error value used for signaling that a
// transaction function should be re-run. It should never be seen by
// callers.
var trySolo = errors.New("batch function returned an error and should be re-run solo")
type panicked struct {
reason interface{}
}
func (p panicked) Error() string {
if err, ok := p.reason.(error); ok {
return err.Error()
}
return fmt.Sprintf("panic: %v", p.reason)
}
func safelyCall(fn func(*Tx) error, tx *Tx) (err error) {
defer func() {
if p := recover(); p != nil {
err = panicked{p}
}
}()
return fn(tx)
}
// Sync executes fdatasync() against the database file handle.
//
// This is not necessary under normal operation, however, if you use NoSync
@@ -806,36 +655,6 @@ func (db *DB) allocate(count int) (*page, error) {
return p, nil
}
// grow grows the size of the database to the given sz.
func (db *DB) grow(sz int) error {
// Ignore if the new size is less than available file size.
if sz <= db.filesz {
return nil
}
// If the data is smaller than the alloc size then only allocate what's needed.
// Once it goes over the allocation size then allocate in chunks.
if db.datasz < db.AllocSize {
sz = db.datasz
} else {
sz += db.AllocSize
}
// Truncate and fsync to ensure file size metadata is flushed.
// https://github.com/boltdb/bolt/issues/284
if !db.NoGrowSync && !db.readOnly {
if err := db.file.Truncate(int64(sz)); err != nil {
return fmt.Errorf("file resize error: %s", err)
}
if err := db.file.Sync(); err != nil {
return fmt.Errorf("file sync error: %s", err)
}
}
db.filesz = sz
return nil
}
func (db *DB) IsReadOnly() bool {
return db.readOnly
}
@@ -853,19 +672,6 @@ type Options struct {
// Open database in read-only mode. Uses flock(..., LOCK_SH |LOCK_NB) to
// grab a shared lock (UNIX).
ReadOnly bool
// Sets the DB.MmapFlags flag before memory mapping the file.
MmapFlags int
// InitialMmapSize is the initial mmap size of the database
// in bytes. Read transactions won't block write transaction
// if the InitialMmapSize is large enough to hold database mmap
// size. (See DB.Begin for more information)
//
// If <=0, the initial map size is 0.
// If initialMmapSize is smaller than the previous database size,
// it takes no effect.
InitialMmapSize int
}
// DefaultOptions represent the options used if nil options are passed into Open().

913
Godeps/_workspace/src/github.com/boltdb/bolt/db_test.go generated vendored Normal file
View File

@@ -0,0 +1,913 @@
package bolt_test
import (
"encoding/binary"
"errors"
"flag"
"fmt"
"io/ioutil"
"os"
"regexp"
"runtime"
"sort"
"strings"
"testing"
"time"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
)
var statsFlag = flag.Bool("stats", false, "show performance stats")
// Ensure that opening a database with a bad path returns an error.
func TestOpen_BadPath(t *testing.T) {
db, err := bolt.Open("", 0666, nil)
assert(t, err != nil, "err: %s", err)
assert(t, db == nil, "")
}
// Ensure that a database can be opened without error.
func TestOpen(t *testing.T) {
path := tempfile()
defer os.Remove(path)
db, err := bolt.Open(path, 0666, nil)
assert(t, db != nil, "")
ok(t, err)
equals(t, db.Path(), path)
ok(t, db.Close())
}
// Ensure that opening an already open database file will timeout.
func TestOpen_Timeout(t *testing.T) {
if runtime.GOOS == "windows" {
t.Skip("timeout not supported on windows")
}
if runtime.GOOS == "solaris" {
t.Skip("solaris fcntl locks don't support intra-process locking")
}
path := tempfile()
defer os.Remove(path)
// Open a data file.
db0, err := bolt.Open(path, 0666, nil)
assert(t, db0 != nil, "")
ok(t, err)
// Attempt to open the database again.
start := time.Now()
db1, err := bolt.Open(path, 0666, &bolt.Options{Timeout: 100 * time.Millisecond})
assert(t, db1 == nil, "")
equals(t, bolt.ErrTimeout, err)
assert(t, time.Since(start) > 100*time.Millisecond, "")
db0.Close()
}
// Ensure that opening an already open database file will wait until its closed.
func TestOpen_Wait(t *testing.T) {
if runtime.GOOS == "windows" {
t.Skip("timeout not supported on windows")
}
if runtime.GOOS == "solaris" {
t.Skip("solaris fcntl locks don't support intra-process locking")
}
path := tempfile()
defer os.Remove(path)
// Open a data file.
db0, err := bolt.Open(path, 0666, nil)
assert(t, db0 != nil, "")
ok(t, err)
// Close it in just a bit.
time.AfterFunc(100*time.Millisecond, func() { db0.Close() })
// Attempt to open the database again.
start := time.Now()
db1, err := bolt.Open(path, 0666, &bolt.Options{Timeout: 200 * time.Millisecond})
assert(t, db1 != nil, "")
ok(t, err)
assert(t, time.Since(start) > 100*time.Millisecond, "")
}
// Ensure that opening a database does not increase its size.
// https://github.com/boltdb/bolt/issues/291
func TestOpen_Size(t *testing.T) {
// Open a data file.
db := NewTestDB()
path := db.Path()
defer db.Close()
// Insert until we get above the minimum 4MB size.
ok(t, db.Update(func(tx *bolt.Tx) error {
b, _ := tx.CreateBucketIfNotExists([]byte("data"))
for i := 0; i < 10000; i++ {
ok(t, b.Put([]byte(fmt.Sprintf("%04d", i)), make([]byte, 1000)))
}
return nil
}))
// Close database and grab the size.
db.DB.Close()
sz := fileSize(path)
if sz == 0 {
t.Fatalf("unexpected new file size: %d", sz)
}
// Reopen database, update, and check size again.
db0, err := bolt.Open(path, 0666, nil)
ok(t, err)
ok(t, db0.Update(func(tx *bolt.Tx) error { return tx.Bucket([]byte("data")).Put([]byte{0}, []byte{0}) }))
ok(t, db0.Close())
newSz := fileSize(path)
if newSz == 0 {
t.Fatalf("unexpected new file size: %d", newSz)
}
// Compare the original size with the new size.
if sz != newSz {
t.Fatalf("unexpected file growth: %d => %d", sz, newSz)
}
}
// Ensure that opening a database beyond the max step size does not increase its size.
// https://github.com/boltdb/bolt/issues/303
func TestOpen_Size_Large(t *testing.T) {
if testing.Short() {
t.Skip("short mode")
}
// Open a data file.
db := NewTestDB()
path := db.Path()
defer db.Close()
// Insert until we get above the minimum 4MB size.
var index uint64
for i := 0; i < 10000; i++ {
ok(t, db.Update(func(tx *bolt.Tx) error {
b, _ := tx.CreateBucketIfNotExists([]byte("data"))
for j := 0; j < 1000; j++ {
ok(t, b.Put(u64tob(index), make([]byte, 50)))
index++
}
return nil
}))
}
// Close database and grab the size.
db.DB.Close()
sz := fileSize(path)
if sz == 0 {
t.Fatalf("unexpected new file size: %d", sz)
} else if sz < (1 << 30) {
t.Fatalf("expected larger initial size: %d", sz)
}
// Reopen database, update, and check size again.
db0, err := bolt.Open(path, 0666, nil)
ok(t, err)
ok(t, db0.Update(func(tx *bolt.Tx) error { return tx.Bucket([]byte("data")).Put([]byte{0}, []byte{0}) }))
ok(t, db0.Close())
newSz := fileSize(path)
if newSz == 0 {
t.Fatalf("unexpected new file size: %d", newSz)
}
// Compare the original size with the new size.
if sz != newSz {
t.Fatalf("unexpected file growth: %d => %d", sz, newSz)
}
}
// Ensure that a re-opened database is consistent.
func TestOpen_Check(t *testing.T) {
path := tempfile()
defer os.Remove(path)
db, err := bolt.Open(path, 0666, nil)
ok(t, err)
ok(t, db.View(func(tx *bolt.Tx) error { return <-tx.Check() }))
db.Close()
db, err = bolt.Open(path, 0666, nil)
ok(t, err)
ok(t, db.View(func(tx *bolt.Tx) error { return <-tx.Check() }))
db.Close()
}
// Ensure that the database returns an error if the file handle cannot be open.
func TestDB_Open_FileError(t *testing.T) {
path := tempfile()
defer os.Remove(path)
_, err := bolt.Open(path+"/youre-not-my-real-parent", 0666, nil)
assert(t, err.(*os.PathError) != nil, "")
equals(t, path+"/youre-not-my-real-parent", err.(*os.PathError).Path)
equals(t, "open", err.(*os.PathError).Op)
}
// Ensure that write errors to the meta file handler during initialization are returned.
func TestDB_Open_MetaInitWriteError(t *testing.T) {
t.Skip("pending")
}
// Ensure that a database that is too small returns an error.
func TestDB_Open_FileTooSmall(t *testing.T) {
path := tempfile()
defer os.Remove(path)
db, err := bolt.Open(path, 0666, nil)
ok(t, err)
db.Close()
// corrupt the database
ok(t, os.Truncate(path, int64(os.Getpagesize())))
db, err = bolt.Open(path, 0666, nil)
equals(t, errors.New("file size too small"), err)
}
// Ensure that a database can be opened in read-only mode by multiple processes
// and that a database can not be opened in read-write mode and in read-only
// mode at the same time.
func TestOpen_ReadOnly(t *testing.T) {
if runtime.GOOS == "solaris" {
t.Skip("solaris fcntl locks don't support intra-process locking")
}
bucket, key, value := []byte(`bucket`), []byte(`key`), []byte(`value`)
path := tempfile()
defer os.Remove(path)
// Open in read-write mode.
db, err := bolt.Open(path, 0666, nil)
ok(t, db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket(bucket)
if err != nil {
return err
}
return b.Put(key, value)
}))
assert(t, db != nil, "")
assert(t, !db.IsReadOnly(), "")
ok(t, err)
ok(t, db.Close())
// Open in read-only mode.
db0, err := bolt.Open(path, 0666, &bolt.Options{ReadOnly: true})
ok(t, err)
defer db0.Close()
// Opening in read-write mode should return an error.
_, err = bolt.Open(path, 0666, &bolt.Options{Timeout: time.Millisecond * 100})
assert(t, err != nil, "")
// And again (in read-only mode).
db1, err := bolt.Open(path, 0666, &bolt.Options{ReadOnly: true})
ok(t, err)
defer db1.Close()
// Verify both read-only databases are accessible.
for _, db := range []*bolt.DB{db0, db1} {
// Verify is is in read only mode indeed.
assert(t, db.IsReadOnly(), "")
// Read-only databases should not allow updates.
assert(t,
bolt.ErrDatabaseReadOnly == db.Update(func(*bolt.Tx) error {
panic(`should never get here`)
}),
"")
// Read-only databases should not allow beginning writable txns.
_, err = db.Begin(true)
assert(t, bolt.ErrDatabaseReadOnly == err, "")
// Verify the data.
ok(t, db.View(func(tx *bolt.Tx) error {
b := tx.Bucket(bucket)
if b == nil {
return fmt.Errorf("expected bucket `%s`", string(bucket))
}
got := string(b.Get(key))
expected := string(value)
if got != expected {
return fmt.Errorf("expected `%s`, got `%s`", expected, got)
}
return nil
}))
}
}
// TODO(benbjohnson): Test corruption at every byte of the first two pages.
// Ensure that a database cannot open a transaction when it's not open.
func TestDB_Begin_DatabaseNotOpen(t *testing.T) {
var db bolt.DB
tx, err := db.Begin(false)
assert(t, tx == nil, "")
equals(t, err, bolt.ErrDatabaseNotOpen)
}
// Ensure that a read-write transaction can be retrieved.
func TestDB_BeginRW(t *testing.T) {
db := NewTestDB()
defer db.Close()
tx, err := db.Begin(true)
assert(t, tx != nil, "")
ok(t, err)
assert(t, tx.DB() == db.DB, "")
equals(t, tx.Writable(), true)
ok(t, tx.Commit())
}
// Ensure that opening a transaction while the DB is closed returns an error.
func TestDB_BeginRW_Closed(t *testing.T) {
var db bolt.DB
tx, err := db.Begin(true)
equals(t, err, bolt.ErrDatabaseNotOpen)
assert(t, tx == nil, "")
}
func TestDB_Close_PendingTx_RW(t *testing.T) { testDB_Close_PendingTx(t, true) }
func TestDB_Close_PendingTx_RO(t *testing.T) { testDB_Close_PendingTx(t, false) }
// Ensure that a database cannot close while transactions are open.
func testDB_Close_PendingTx(t *testing.T, writable bool) {
db := NewTestDB()
defer db.Close()
// Start transaction.
tx, err := db.Begin(true)
if err != nil {
t.Fatal(err)
}
// Open update in separate goroutine.
done := make(chan struct{})
go func() {
db.Close()
close(done)
}()
// Ensure database hasn't closed.
time.Sleep(100 * time.Millisecond)
select {
case <-done:
t.Fatal("database closed too early")
default:
}
// Commit transaction.
if err := tx.Commit(); err != nil {
t.Fatal(err)
}
// Ensure database closed now.
time.Sleep(100 * time.Millisecond)
select {
case <-done:
default:
t.Fatal("database did not close")
}
}
// Ensure a database can provide a transactional block.
func TestDB_Update(t *testing.T) {
db := NewTestDB()
defer db.Close()
err := db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
b := tx.Bucket([]byte("widgets"))
b.Put([]byte("foo"), []byte("bar"))
b.Put([]byte("baz"), []byte("bat"))
b.Delete([]byte("foo"))
return nil
})
ok(t, err)
err = db.View(func(tx *bolt.Tx) error {
assert(t, tx.Bucket([]byte("widgets")).Get([]byte("foo")) == nil, "")
equals(t, []byte("bat"), tx.Bucket([]byte("widgets")).Get([]byte("baz")))
return nil
})
ok(t, err)
}
// Ensure a closed database returns an error while running a transaction block
func TestDB_Update_Closed(t *testing.T) {
var db bolt.DB
err := db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
return nil
})
equals(t, err, bolt.ErrDatabaseNotOpen)
}
// Ensure a panic occurs while trying to commit a managed transaction.
func TestDB_Update_ManualCommit(t *testing.T) {
db := NewTestDB()
defer db.Close()
var ok bool
db.Update(func(tx *bolt.Tx) error {
func() {
defer func() {
if r := recover(); r != nil {
ok = true
}
}()
tx.Commit()
}()
return nil
})
assert(t, ok, "expected panic")
}
// Ensure a panic occurs while trying to rollback a managed transaction.
func TestDB_Update_ManualRollback(t *testing.T) {
db := NewTestDB()
defer db.Close()
var ok bool
db.Update(func(tx *bolt.Tx) error {
func() {
defer func() {
if r := recover(); r != nil {
ok = true
}
}()
tx.Rollback()
}()
return nil
})
assert(t, ok, "expected panic")
}
// Ensure a panic occurs while trying to commit a managed transaction.
func TestDB_View_ManualCommit(t *testing.T) {
db := NewTestDB()
defer db.Close()
var ok bool
db.Update(func(tx *bolt.Tx) error {
func() {
defer func() {
if r := recover(); r != nil {
ok = true
}
}()
tx.Commit()
}()
return nil
})
assert(t, ok, "expected panic")
}
// Ensure a panic occurs while trying to rollback a managed transaction.
func TestDB_View_ManualRollback(t *testing.T) {
db := NewTestDB()
defer db.Close()
var ok bool
db.Update(func(tx *bolt.Tx) error {
func() {
defer func() {
if r := recover(); r != nil {
ok = true
}
}()
tx.Rollback()
}()
return nil
})
assert(t, ok, "expected panic")
}
// Ensure a write transaction that panics does not hold open locks.
func TestDB_Update_Panic(t *testing.T) {
db := NewTestDB()
defer db.Close()
func() {
defer func() {
if r := recover(); r != nil {
t.Log("recover: update", r)
}
}()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
panic("omg")
})
}()
// Verify we can update again.
err := db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
ok(t, err)
// Verify that our change persisted.
err = db.Update(func(tx *bolt.Tx) error {
assert(t, tx.Bucket([]byte("widgets")) != nil, "")
return nil
})
}
// Ensure a database can return an error through a read-only transactional block.
func TestDB_View_Error(t *testing.T) {
db := NewTestDB()
defer db.Close()
err := db.View(func(tx *bolt.Tx) error {
return errors.New("xxx")
})
equals(t, errors.New("xxx"), err)
}
// Ensure a read transaction that panics does not hold open locks.
func TestDB_View_Panic(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
return nil
})
func() {
defer func() {
if r := recover(); r != nil {
t.Log("recover: view", r)
}
}()
db.View(func(tx *bolt.Tx) error {
assert(t, tx.Bucket([]byte("widgets")) != nil, "")
panic("omg")
})
}()
// Verify that we can still use read transactions.
db.View(func(tx *bolt.Tx) error {
assert(t, tx.Bucket([]byte("widgets")) != nil, "")
return nil
})
}
// Ensure that an error is returned when a database write fails.
func TestDB_Commit_WriteFail(t *testing.T) {
t.Skip("pending") // TODO(benbjohnson)
}
// Ensure that DB stats can be returned.
func TestDB_Stats(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
stats := db.Stats()
equals(t, 2, stats.TxStats.PageCount)
equals(t, 0, stats.FreePageN)
equals(t, 2, stats.PendingPageN)
}
// Ensure that database pages are in expected order and type.
func TestDB_Consistency(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
for i := 0; i < 10; i++ {
db.Update(func(tx *bolt.Tx) error {
ok(t, tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar")))
return nil
})
}
db.Update(func(tx *bolt.Tx) error {
p, _ := tx.Page(0)
assert(t, p != nil, "")
equals(t, "meta", p.Type)
p, _ = tx.Page(1)
assert(t, p != nil, "")
equals(t, "meta", p.Type)
p, _ = tx.Page(2)
assert(t, p != nil, "")
equals(t, "free", p.Type)
p, _ = tx.Page(3)
assert(t, p != nil, "")
equals(t, "free", p.Type)
p, _ = tx.Page(4)
assert(t, p != nil, "")
equals(t, "leaf", p.Type)
p, _ = tx.Page(5)
assert(t, p != nil, "")
equals(t, "freelist", p.Type)
p, _ = tx.Page(6)
assert(t, p == nil, "")
return nil
})
}
// Ensure that DB stats can be substracted from one another.
func TestDBStats_Sub(t *testing.T) {
var a, b bolt.Stats
a.TxStats.PageCount = 3
a.FreePageN = 4
b.TxStats.PageCount = 10
b.FreePageN = 14
diff := b.Sub(&a)
equals(t, 7, diff.TxStats.PageCount)
// free page stats are copied from the receiver and not subtracted
equals(t, 14, diff.FreePageN)
}
func ExampleDB_Update() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Execute several commands within a write transaction.
err := db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("widgets"))
if err != nil {
return err
}
if err := b.Put([]byte("foo"), []byte("bar")); err != nil {
return err
}
return nil
})
// If our transactional block didn't return an error then our data is saved.
if err == nil {
db.View(func(tx *bolt.Tx) error {
value := tx.Bucket([]byte("widgets")).Get([]byte("foo"))
fmt.Printf("The value of 'foo' is: %s\n", value)
return nil
})
}
// Output:
// The value of 'foo' is: bar
}
func ExampleDB_View() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Insert data into a bucket.
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("people"))
b := tx.Bucket([]byte("people"))
b.Put([]byte("john"), []byte("doe"))
b.Put([]byte("susy"), []byte("que"))
return nil
})
// Access data from within a read-only transactional block.
db.View(func(tx *bolt.Tx) error {
v := tx.Bucket([]byte("people")).Get([]byte("john"))
fmt.Printf("John's last name is %s.\n", v)
return nil
})
// Output:
// John's last name is doe.
}
func ExampleDB_Begin_ReadOnly() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Create a bucket.
db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
// Create several keys in a transaction.
tx, _ := db.Begin(true)
b := tx.Bucket([]byte("widgets"))
b.Put([]byte("john"), []byte("blue"))
b.Put([]byte("abby"), []byte("red"))
b.Put([]byte("zephyr"), []byte("purple"))
tx.Commit()
// Iterate over the values in sorted key order.
tx, _ = db.Begin(false)
c := tx.Bucket([]byte("widgets")).Cursor()
for k, v := c.First(); k != nil; k, v = c.Next() {
fmt.Printf("%s likes %s\n", k, v)
}
tx.Rollback()
// Output:
// abby likes red
// john likes blue
// zephyr likes purple
}
// TestDB represents a wrapper around a Bolt DB to handle temporary file
// creation and automatic cleanup on close.
type TestDB struct {
*bolt.DB
}
// NewTestDB returns a new instance of TestDB.
func NewTestDB() *TestDB {
db, err := bolt.Open(tempfile(), 0666, nil)
if err != nil {
panic("cannot open db: " + err.Error())
}
return &TestDB{db}
}
// MustView executes a read-only function. Panic on error.
func (db *TestDB) MustView(fn func(tx *bolt.Tx) error) {
if err := db.DB.View(func(tx *bolt.Tx) error {
return fn(tx)
}); err != nil {
panic(err.Error())
}
}
// MustUpdate executes a read-write function. Panic on error.
func (db *TestDB) MustUpdate(fn func(tx *bolt.Tx) error) {
if err := db.DB.View(func(tx *bolt.Tx) error {
return fn(tx)
}); err != nil {
panic(err.Error())
}
}
// MustCreateBucket creates a new bucket. Panic on error.
func (db *TestDB) MustCreateBucket(name []byte) {
if err := db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte(name))
return err
}); err != nil {
panic(err.Error())
}
}
// Close closes the database and deletes the underlying file.
func (db *TestDB) Close() {
// Log statistics.
if *statsFlag {
db.PrintStats()
}
// Check database consistency after every test.
db.MustCheck()
// Close database and remove file.
defer os.Remove(db.Path())
db.DB.Close()
}
// PrintStats prints the database stats
func (db *TestDB) PrintStats() {
var stats = db.Stats()
fmt.Printf("[db] %-20s %-20s %-20s\n",
fmt.Sprintf("pg(%d/%d)", stats.TxStats.PageCount, stats.TxStats.PageAlloc),
fmt.Sprintf("cur(%d)", stats.TxStats.CursorCount),
fmt.Sprintf("node(%d/%d)", stats.TxStats.NodeCount, stats.TxStats.NodeDeref),
)
fmt.Printf(" %-20s %-20s %-20s\n",
fmt.Sprintf("rebal(%d/%v)", stats.TxStats.Rebalance, truncDuration(stats.TxStats.RebalanceTime)),
fmt.Sprintf("spill(%d/%v)", stats.TxStats.Spill, truncDuration(stats.TxStats.SpillTime)),
fmt.Sprintf("w(%d/%v)", stats.TxStats.Write, truncDuration(stats.TxStats.WriteTime)),
)
}
// MustCheck runs a consistency check on the database and panics if any errors are found.
func (db *TestDB) MustCheck() {
db.Update(func(tx *bolt.Tx) error {
// Collect all the errors.
var errors []error
for err := range tx.Check() {
errors = append(errors, err)
if len(errors) > 10 {
break
}
}
// If errors occurred, copy the DB and print the errors.
if len(errors) > 0 {
var path = tempfile()
tx.CopyFile(path, 0600)
// Print errors.
fmt.Print("\n\n")
fmt.Printf("consistency check failed (%d errors)\n", len(errors))
for _, err := range errors {
fmt.Println(err)
}
fmt.Println("")
fmt.Println("db saved to:")
fmt.Println(path)
fmt.Print("\n\n")
os.Exit(-1)
}
return nil
})
}
// CopyTempFile copies a database to a temporary file.
func (db *TestDB) CopyTempFile() {
path := tempfile()
db.View(func(tx *bolt.Tx) error { return tx.CopyFile(path, 0600) })
fmt.Println("db copied to: ", path)
}
// tempfile returns a temporary file path.
func tempfile() string {
f, _ := ioutil.TempFile("", "bolt-")
f.Close()
os.Remove(f.Name())
return f.Name()
}
// mustContainKeys checks that a bucket contains a given set of keys.
func mustContainKeys(b *bolt.Bucket, m map[string]string) {
found := make(map[string]string)
b.ForEach(func(k, _ []byte) error {
found[string(k)] = ""
return nil
})
// Check for keys found in bucket that shouldn't be there.
var keys []string
for k, _ := range found {
if _, ok := m[string(k)]; !ok {
keys = append(keys, k)
}
}
if len(keys) > 0 {
sort.Strings(keys)
panic(fmt.Sprintf("keys found(%d): %s", len(keys), strings.Join(keys, ",")))
}
// Check for keys not found in bucket that should be there.
for k, _ := range m {
if _, ok := found[string(k)]; !ok {
keys = append(keys, k)
}
}
if len(keys) > 0 {
sort.Strings(keys)
panic(fmt.Sprintf("keys not found(%d): %s", len(keys), strings.Join(keys, ",")))
}
}
func trunc(b []byte, length int) []byte {
if length < len(b) {
return b[:length]
}
return b
}
func truncDuration(d time.Duration) string {
return regexp.MustCompile(`^(\d+)(\.\d+)`).ReplaceAllString(d.String(), "$1")
}
func fileSize(path string) int64 {
fi, err := os.Stat(path)
if err != nil {
return 0
}
return fi.Size()
}
func warn(v ...interface{}) { fmt.Fprintln(os.Stderr, v...) }
func warnf(msg string, v ...interface{}) { fmt.Fprintf(os.Stderr, msg+"\n", v...) }
// u64tob converts a uint64 into an 8-byte slice.
func u64tob(v uint64) []byte {
b := make([]byte, 8)
binary.BigEndian.PutUint64(b, v)
return b
}
// btou64 converts an 8-byte slice into an uint64.
func btou64(b []byte) uint64 { return binary.BigEndian.Uint64(b) }

View File

@@ -0,0 +1,156 @@
package bolt
import (
"math/rand"
"reflect"
"sort"
"testing"
"unsafe"
)
// Ensure that a page is added to a transaction's freelist.
func TestFreelist_free(t *testing.T) {
f := newFreelist()
f.free(100, &page{id: 12})
if !reflect.DeepEqual([]pgid{12}, f.pending[100]) {
t.Fatalf("exp=%v; got=%v", []pgid{12}, f.pending[100])
}
}
// Ensure that a page and its overflow is added to a transaction's freelist.
func TestFreelist_free_overflow(t *testing.T) {
f := newFreelist()
f.free(100, &page{id: 12, overflow: 3})
if exp := []pgid{12, 13, 14, 15}; !reflect.DeepEqual(exp, f.pending[100]) {
t.Fatalf("exp=%v; got=%v", exp, f.pending[100])
}
}
// Ensure that a transaction's free pages can be released.
func TestFreelist_release(t *testing.T) {
f := newFreelist()
f.free(100, &page{id: 12, overflow: 1})
f.free(100, &page{id: 9})
f.free(102, &page{id: 39})
f.release(100)
f.release(101)
if exp := []pgid{9, 12, 13}; !reflect.DeepEqual(exp, f.ids) {
t.Fatalf("exp=%v; got=%v", exp, f.ids)
}
f.release(102)
if exp := []pgid{9, 12, 13, 39}; !reflect.DeepEqual(exp, f.ids) {
t.Fatalf("exp=%v; got=%v", exp, f.ids)
}
}
// Ensure that a freelist can find contiguous blocks of pages.
func TestFreelist_allocate(t *testing.T) {
f := &freelist{ids: []pgid{3, 4, 5, 6, 7, 9, 12, 13, 18}}
if id := int(f.allocate(3)); id != 3 {
t.Fatalf("exp=3; got=%v", id)
}
if id := int(f.allocate(1)); id != 6 {
t.Fatalf("exp=6; got=%v", id)
}
if id := int(f.allocate(3)); id != 0 {
t.Fatalf("exp=0; got=%v", id)
}
if id := int(f.allocate(2)); id != 12 {
t.Fatalf("exp=12; got=%v", id)
}
if id := int(f.allocate(1)); id != 7 {
t.Fatalf("exp=7; got=%v", id)
}
if id := int(f.allocate(0)); id != 0 {
t.Fatalf("exp=0; got=%v", id)
}
if id := int(f.allocate(0)); id != 0 {
t.Fatalf("exp=0; got=%v", id)
}
if exp := []pgid{9, 18}; !reflect.DeepEqual(exp, f.ids) {
t.Fatalf("exp=%v; got=%v", exp, f.ids)
}
if id := int(f.allocate(1)); id != 9 {
t.Fatalf("exp=9; got=%v", id)
}
if id := int(f.allocate(1)); id != 18 {
t.Fatalf("exp=18; got=%v", id)
}
if id := int(f.allocate(1)); id != 0 {
t.Fatalf("exp=0; got=%v", id)
}
if exp := []pgid{}; !reflect.DeepEqual(exp, f.ids) {
t.Fatalf("exp=%v; got=%v", exp, f.ids)
}
}
// Ensure that a freelist can deserialize from a freelist page.
func TestFreelist_read(t *testing.T) {
// Create a page.
var buf [4096]byte
page := (*page)(unsafe.Pointer(&buf[0]))
page.flags = freelistPageFlag
page.count = 2
// Insert 2 page ids.
ids := (*[3]pgid)(unsafe.Pointer(&page.ptr))
ids[0] = 23
ids[1] = 50
// Deserialize page into a freelist.
f := newFreelist()
f.read(page)
// Ensure that there are two page ids in the freelist.
if exp := []pgid{23, 50}; !reflect.DeepEqual(exp, f.ids) {
t.Fatalf("exp=%v; got=%v", exp, f.ids)
}
}
// Ensure that a freelist can serialize into a freelist page.
func TestFreelist_write(t *testing.T) {
// Create a freelist and write it to a page.
var buf [4096]byte
f := &freelist{ids: []pgid{12, 39}, pending: make(map[txid][]pgid)}
f.pending[100] = []pgid{28, 11}
f.pending[101] = []pgid{3}
p := (*page)(unsafe.Pointer(&buf[0]))
f.write(p)
// Read the page back out.
f2 := newFreelist()
f2.read(p)
// Ensure that the freelist is correct.
// All pages should be present and in reverse order.
if exp := []pgid{3, 11, 12, 28, 39}; !reflect.DeepEqual(exp, f2.ids) {
t.Fatalf("exp=%v; got=%v", exp, f2.ids)
}
}
func Benchmark_FreelistRelease10K(b *testing.B) { benchmark_FreelistRelease(b, 10000) }
func Benchmark_FreelistRelease100K(b *testing.B) { benchmark_FreelistRelease(b, 100000) }
func Benchmark_FreelistRelease1000K(b *testing.B) { benchmark_FreelistRelease(b, 1000000) }
func Benchmark_FreelistRelease10000K(b *testing.B) { benchmark_FreelistRelease(b, 10000000) }
func benchmark_FreelistRelease(b *testing.B, size int) {
ids := randomPgids(size)
pending := randomPgids(len(ids) / 400)
b.ResetTimer()
for i := 0; i < b.N; i++ {
f := &freelist{ids: ids, pending: map[txid][]pgid{1: pending}}
f.release(1)
}
}
func randomPgids(n int) []pgid {
rand.Seed(42)
pgids := make(pgids, n)
for i := range pgids {
pgids[i] = pgid(rand.Int63())
}
sort.Sort(pgids)
return pgids
}

View File

@@ -0,0 +1,156 @@
package bolt
import (
"testing"
"unsafe"
)
// Ensure that a node can insert a key/value.
func TestNode_put(t *testing.T) {
n := &node{inodes: make(inodes, 0), bucket: &Bucket{tx: &Tx{meta: &meta{pgid: 1}}}}
n.put([]byte("baz"), []byte("baz"), []byte("2"), 0, 0)
n.put([]byte("foo"), []byte("foo"), []byte("0"), 0, 0)
n.put([]byte("bar"), []byte("bar"), []byte("1"), 0, 0)
n.put([]byte("foo"), []byte("foo"), []byte("3"), 0, leafPageFlag)
if len(n.inodes) != 3 {
t.Fatalf("exp=3; got=%d", len(n.inodes))
}
if k, v := n.inodes[0].key, n.inodes[0].value; string(k) != "bar" || string(v) != "1" {
t.Fatalf("exp=<bar,1>; got=<%s,%s>", k, v)
}
if k, v := n.inodes[1].key, n.inodes[1].value; string(k) != "baz" || string(v) != "2" {
t.Fatalf("exp=<baz,2>; got=<%s,%s>", k, v)
}
if k, v := n.inodes[2].key, n.inodes[2].value; string(k) != "foo" || string(v) != "3" {
t.Fatalf("exp=<foo,3>; got=<%s,%s>", k, v)
}
if n.inodes[2].flags != uint32(leafPageFlag) {
t.Fatalf("not a leaf: %d", n.inodes[2].flags)
}
}
// Ensure that a node can deserialize from a leaf page.
func TestNode_read_LeafPage(t *testing.T) {
// Create a page.
var buf [4096]byte
page := (*page)(unsafe.Pointer(&buf[0]))
page.flags = leafPageFlag
page.count = 2
// Insert 2 elements at the beginning. sizeof(leafPageElement) == 16
nodes := (*[3]leafPageElement)(unsafe.Pointer(&page.ptr))
nodes[0] = leafPageElement{flags: 0, pos: 32, ksize: 3, vsize: 4} // pos = sizeof(leafPageElement) * 2
nodes[1] = leafPageElement{flags: 0, pos: 23, ksize: 10, vsize: 3} // pos = sizeof(leafPageElement) + 3 + 4
// Write data for the nodes at the end.
data := (*[4096]byte)(unsafe.Pointer(&nodes[2]))
copy(data[:], []byte("barfooz"))
copy(data[7:], []byte("helloworldbye"))
// Deserialize page into a leaf.
n := &node{}
n.read(page)
// Check that there are two inodes with correct data.
if !n.isLeaf {
t.Fatal("expected leaf")
}
if len(n.inodes) != 2 {
t.Fatalf("exp=2; got=%d", len(n.inodes))
}
if k, v := n.inodes[0].key, n.inodes[0].value; string(k) != "bar" || string(v) != "fooz" {
t.Fatalf("exp=<bar,fooz>; got=<%s,%s>", k, v)
}
if k, v := n.inodes[1].key, n.inodes[1].value; string(k) != "helloworld" || string(v) != "bye" {
t.Fatalf("exp=<helloworld,bye>; got=<%s,%s>", k, v)
}
}
// Ensure that a node can serialize into a leaf page.
func TestNode_write_LeafPage(t *testing.T) {
// Create a node.
n := &node{isLeaf: true, inodes: make(inodes, 0), bucket: &Bucket{tx: &Tx{db: &DB{}, meta: &meta{pgid: 1}}}}
n.put([]byte("susy"), []byte("susy"), []byte("que"), 0, 0)
n.put([]byte("ricki"), []byte("ricki"), []byte("lake"), 0, 0)
n.put([]byte("john"), []byte("john"), []byte("johnson"), 0, 0)
// Write it to a page.
var buf [4096]byte
p := (*page)(unsafe.Pointer(&buf[0]))
n.write(p)
// Read the page back in.
n2 := &node{}
n2.read(p)
// Check that the two pages are the same.
if len(n2.inodes) != 3 {
t.Fatalf("exp=3; got=%d", len(n2.inodes))
}
if k, v := n2.inodes[0].key, n2.inodes[0].value; string(k) != "john" || string(v) != "johnson" {
t.Fatalf("exp=<john,johnson>; got=<%s,%s>", k, v)
}
if k, v := n2.inodes[1].key, n2.inodes[1].value; string(k) != "ricki" || string(v) != "lake" {
t.Fatalf("exp=<ricki,lake>; got=<%s,%s>", k, v)
}
if k, v := n2.inodes[2].key, n2.inodes[2].value; string(k) != "susy" || string(v) != "que" {
t.Fatalf("exp=<susy,que>; got=<%s,%s>", k, v)
}
}
// Ensure that a node can split into appropriate subgroups.
func TestNode_split(t *testing.T) {
// Create a node.
n := &node{inodes: make(inodes, 0), bucket: &Bucket{tx: &Tx{db: &DB{}, meta: &meta{pgid: 1}}}}
n.put([]byte("00000001"), []byte("00000001"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000002"), []byte("00000002"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000003"), []byte("00000003"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000004"), []byte("00000004"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000005"), []byte("00000005"), []byte("0123456701234567"), 0, 0)
// Split between 2 & 3.
n.split(100)
var parent = n.parent
if len(parent.children) != 2 {
t.Fatalf("exp=2; got=%d", len(parent.children))
}
if len(parent.children[0].inodes) != 2 {
t.Fatalf("exp=2; got=%d", len(parent.children[0].inodes))
}
if len(parent.children[1].inodes) != 3 {
t.Fatalf("exp=3; got=%d", len(parent.children[1].inodes))
}
}
// Ensure that a page with the minimum number of inodes just returns a single node.
func TestNode_split_MinKeys(t *testing.T) {
// Create a node.
n := &node{inodes: make(inodes, 0), bucket: &Bucket{tx: &Tx{db: &DB{}, meta: &meta{pgid: 1}}}}
n.put([]byte("00000001"), []byte("00000001"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000002"), []byte("00000002"), []byte("0123456701234567"), 0, 0)
// Split.
n.split(20)
if n.parent != nil {
t.Fatalf("expected nil parent")
}
}
// Ensure that a node that has keys that all fit on a page just returns one leaf.
func TestNode_split_SinglePage(t *testing.T) {
// Create a node.
n := &node{inodes: make(inodes, 0), bucket: &Bucket{tx: &Tx{db: &DB{}, meta: &meta{pgid: 1}}}}
n.put([]byte("00000001"), []byte("00000001"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000002"), []byte("00000002"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000003"), []byte("00000003"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000004"), []byte("00000004"), []byte("0123456701234567"), 0, 0)
n.put([]byte("00000005"), []byte("00000005"), []byte("0123456701234567"), 0, 0)
// Split.
n.split(4096)
if n.parent != nil {
t.Fatalf("expected nil parent")
}
}

View File

@@ -0,0 +1,72 @@
package bolt
import (
"reflect"
"sort"
"testing"
"testing/quick"
)
// Ensure that the page type can be returned in human readable format.
func TestPage_typ(t *testing.T) {
if typ := (&page{flags: branchPageFlag}).typ(); typ != "branch" {
t.Fatalf("exp=branch; got=%v", typ)
}
if typ := (&page{flags: leafPageFlag}).typ(); typ != "leaf" {
t.Fatalf("exp=leaf; got=%v", typ)
}
if typ := (&page{flags: metaPageFlag}).typ(); typ != "meta" {
t.Fatalf("exp=meta; got=%v", typ)
}
if typ := (&page{flags: freelistPageFlag}).typ(); typ != "freelist" {
t.Fatalf("exp=freelist; got=%v", typ)
}
if typ := (&page{flags: 20000}).typ(); typ != "unknown<4e20>" {
t.Fatalf("exp=unknown<4e20>; got=%v", typ)
}
}
// Ensure that the hexdump debugging function doesn't blow up.
func TestPage_dump(t *testing.T) {
(&page{id: 256}).hexdump(16)
}
func TestPgids_merge(t *testing.T) {
a := pgids{4, 5, 6, 10, 11, 12, 13, 27}
b := pgids{1, 3, 8, 9, 25, 30}
c := a.merge(b)
if !reflect.DeepEqual(c, pgids{1, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 25, 27, 30}) {
t.Errorf("mismatch: %v", c)
}
a = pgids{4, 5, 6, 10, 11, 12, 13, 27, 35, 36}
b = pgids{8, 9, 25, 30}
c = a.merge(b)
if !reflect.DeepEqual(c, pgids{4, 5, 6, 8, 9, 10, 11, 12, 13, 25, 27, 30, 35, 36}) {
t.Errorf("mismatch: %v", c)
}
}
func TestPgids_merge_quick(t *testing.T) {
if err := quick.Check(func(a, b pgids) bool {
// Sort incoming lists.
sort.Sort(a)
sort.Sort(b)
// Merge the two lists together.
got := a.merge(b)
// The expected value should be the two lists combined and sorted.
exp := append(a, b...)
sort.Sort(exp)
if !reflect.DeepEqual(exp, got) {
t.Errorf("\nexp=%+v\ngot=%+v\n", exp, got)
return false
}
return true
}, nil); err != nil {
t.Fatal(err)
}
}

View File

@@ -0,0 +1,79 @@
package bolt_test
import (
"bytes"
"flag"
"fmt"
"math/rand"
"os"
"reflect"
"testing/quick"
"time"
)
// testing/quick defaults to 5 iterations and a random seed.
// You can override these settings from the command line:
//
// -quick.count The number of iterations to perform.
// -quick.seed The seed to use for randomizing.
// -quick.maxitems The maximum number of items to insert into a DB.
// -quick.maxksize The maximum size of a key.
// -quick.maxvsize The maximum size of a value.
//
var qcount, qseed, qmaxitems, qmaxksize, qmaxvsize int
func init() {
flag.IntVar(&qcount, "quick.count", 5, "")
flag.IntVar(&qseed, "quick.seed", int(time.Now().UnixNano())%100000, "")
flag.IntVar(&qmaxitems, "quick.maxitems", 1000, "")
flag.IntVar(&qmaxksize, "quick.maxksize", 1024, "")
flag.IntVar(&qmaxvsize, "quick.maxvsize", 1024, "")
flag.Parse()
fmt.Fprintln(os.Stderr, "seed:", qseed)
fmt.Fprintf(os.Stderr, "quick settings: count=%v, items=%v, ksize=%v, vsize=%v\n", qcount, qmaxitems, qmaxksize, qmaxvsize)
}
func qconfig() *quick.Config {
return &quick.Config{
MaxCount: qcount,
Rand: rand.New(rand.NewSource(int64(qseed))),
}
}
type testdata []testdataitem
func (t testdata) Len() int { return len(t) }
func (t testdata) Swap(i, j int) { t[i], t[j] = t[j], t[i] }
func (t testdata) Less(i, j int) bool { return bytes.Compare(t[i].Key, t[j].Key) == -1 }
func (t testdata) Generate(rand *rand.Rand, size int) reflect.Value {
n := rand.Intn(qmaxitems-1) + 1
items := make(testdata, n)
for i := 0; i < n; i++ {
item := &items[i]
item.Key = randByteSlice(rand, 1, qmaxksize)
item.Value = randByteSlice(rand, 0, qmaxvsize)
}
return reflect.ValueOf(items)
}
type revtestdata []testdataitem
func (t revtestdata) Len() int { return len(t) }
func (t revtestdata) Swap(i, j int) { t[i], t[j] = t[j], t[i] }
func (t revtestdata) Less(i, j int) bool { return bytes.Compare(t[i].Key, t[j].Key) == 1 }
type testdataitem struct {
Key []byte
Value []byte
}
func randByteSlice(rand *rand.Rand, minSize, maxSize int) []byte {
n := rand.Intn(maxSize-minSize) + minSize
b := make([]byte, n)
for i := 0; i < n; i++ {
b[i] = byte(rand.Intn(255))
}
return b
}

View File

@@ -0,0 +1,327 @@
package bolt_test
import (
"bytes"
"fmt"
"math/rand"
"sync"
"testing"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
)
func TestSimulate_1op_1p(t *testing.T) { testSimulate(t, 100, 1) }
func TestSimulate_10op_1p(t *testing.T) { testSimulate(t, 10, 1) }
func TestSimulate_100op_1p(t *testing.T) { testSimulate(t, 100, 1) }
func TestSimulate_1000op_1p(t *testing.T) { testSimulate(t, 1000, 1) }
func TestSimulate_10000op_1p(t *testing.T) { testSimulate(t, 10000, 1) }
func TestSimulate_10op_10p(t *testing.T) { testSimulate(t, 10, 10) }
func TestSimulate_100op_10p(t *testing.T) { testSimulate(t, 100, 10) }
func TestSimulate_1000op_10p(t *testing.T) { testSimulate(t, 1000, 10) }
func TestSimulate_10000op_10p(t *testing.T) { testSimulate(t, 10000, 10) }
func TestSimulate_100op_100p(t *testing.T) { testSimulate(t, 100, 100) }
func TestSimulate_1000op_100p(t *testing.T) { testSimulate(t, 1000, 100) }
func TestSimulate_10000op_100p(t *testing.T) { testSimulate(t, 10000, 100) }
func TestSimulate_10000op_1000p(t *testing.T) { testSimulate(t, 10000, 1000) }
// Randomly generate operations on a given database with multiple clients to ensure consistency and thread safety.
func testSimulate(t *testing.T, threadCount, parallelism int) {
if testing.Short() {
t.Skip("skipping test in short mode.")
}
rand.Seed(int64(qseed))
// A list of operations that readers and writers can perform.
var readerHandlers = []simulateHandler{simulateGetHandler}
var writerHandlers = []simulateHandler{simulateGetHandler, simulatePutHandler}
var versions = make(map[int]*QuickDB)
versions[1] = NewQuickDB()
db := NewTestDB()
defer db.Close()
var mutex sync.Mutex
// Run n threads in parallel, each with their own operation.
var wg sync.WaitGroup
var threads = make(chan bool, parallelism)
var i int
for {
threads <- true
wg.Add(1)
writable := ((rand.Int() % 100) < 20) // 20% writers
// Choose an operation to execute.
var handler simulateHandler
if writable {
handler = writerHandlers[rand.Intn(len(writerHandlers))]
} else {
handler = readerHandlers[rand.Intn(len(readerHandlers))]
}
// Execute a thread for the given operation.
go func(writable bool, handler simulateHandler) {
defer wg.Done()
// Start transaction.
tx, err := db.Begin(writable)
if err != nil {
t.Fatal("tx begin: ", err)
}
// Obtain current state of the dataset.
mutex.Lock()
var qdb = versions[tx.ID()]
if writable {
qdb = versions[tx.ID()-1].Copy()
}
mutex.Unlock()
// Make sure we commit/rollback the tx at the end and update the state.
if writable {
defer func() {
mutex.Lock()
versions[tx.ID()] = qdb
mutex.Unlock()
ok(t, tx.Commit())
}()
} else {
defer tx.Rollback()
}
// Ignore operation if we don't have data yet.
if qdb == nil {
return
}
// Execute handler.
handler(tx, qdb)
// Release a thread back to the scheduling loop.
<-threads
}(writable, handler)
i++
if i > threadCount {
break
}
}
// Wait until all threads are done.
wg.Wait()
}
type simulateHandler func(tx *bolt.Tx, qdb *QuickDB)
// Retrieves a key from the database and verifies that it is what is expected.
func simulateGetHandler(tx *bolt.Tx, qdb *QuickDB) {
// Randomly retrieve an existing exist.
keys := qdb.Rand()
if len(keys) == 0 {
return
}
// Retrieve root bucket.
b := tx.Bucket(keys[0])
if b == nil {
panic(fmt.Sprintf("bucket[0] expected: %08x\n", trunc(keys[0], 4)))
}
// Drill into nested buckets.
for _, key := range keys[1 : len(keys)-1] {
b = b.Bucket(key)
if b == nil {
panic(fmt.Sprintf("bucket[n] expected: %v -> %v\n", keys, key))
}
}
// Verify key/value on the final bucket.
expected := qdb.Get(keys)
actual := b.Get(keys[len(keys)-1])
if !bytes.Equal(actual, expected) {
fmt.Println("=== EXPECTED ===")
fmt.Println(expected)
fmt.Println("=== ACTUAL ===")
fmt.Println(actual)
fmt.Println("=== END ===")
panic("value mismatch")
}
}
// Inserts a key into the database.
func simulatePutHandler(tx *bolt.Tx, qdb *QuickDB) {
var err error
keys, value := randKeys(), randValue()
// Retrieve root bucket.
b := tx.Bucket(keys[0])
if b == nil {
b, err = tx.CreateBucket(keys[0])
if err != nil {
panic("create bucket: " + err.Error())
}
}
// Create nested buckets, if necessary.
for _, key := range keys[1 : len(keys)-1] {
child := b.Bucket(key)
if child != nil {
b = child
} else {
b, err = b.CreateBucket(key)
if err != nil {
panic("create bucket: " + err.Error())
}
}
}
// Insert into database.
if err := b.Put(keys[len(keys)-1], value); err != nil {
panic("put: " + err.Error())
}
// Insert into in-memory database.
qdb.Put(keys, value)
}
// QuickDB is an in-memory database that replicates the functionality of the
// Bolt DB type except that it is entirely in-memory. It is meant for testing
// that the Bolt database is consistent.
type QuickDB struct {
sync.RWMutex
m map[string]interface{}
}
// NewQuickDB returns an instance of QuickDB.
func NewQuickDB() *QuickDB {
return &QuickDB{m: make(map[string]interface{})}
}
// Get retrieves the value at a key path.
func (db *QuickDB) Get(keys [][]byte) []byte {
db.RLock()
defer db.RUnlock()
m := db.m
for _, key := range keys[:len(keys)-1] {
value := m[string(key)]
if value == nil {
return nil
}
switch value := value.(type) {
case map[string]interface{}:
m = value
case []byte:
return nil
}
}
// Only return if it's a simple value.
if value, ok := m[string(keys[len(keys)-1])].([]byte); ok {
return value
}
return nil
}
// Put inserts a value into a key path.
func (db *QuickDB) Put(keys [][]byte, value []byte) {
db.Lock()
defer db.Unlock()
// Build buckets all the way down the key path.
m := db.m
for _, key := range keys[:len(keys)-1] {
if _, ok := m[string(key)].([]byte); ok {
return // Keypath intersects with a simple value. Do nothing.
}
if m[string(key)] == nil {
m[string(key)] = make(map[string]interface{})
}
m = m[string(key)].(map[string]interface{})
}
// Insert value into the last key.
m[string(keys[len(keys)-1])] = value
}
// Rand returns a random key path that points to a simple value.
func (db *QuickDB) Rand() [][]byte {
db.RLock()
defer db.RUnlock()
if len(db.m) == 0 {
return nil
}
var keys [][]byte
db.rand(db.m, &keys)
return keys
}
func (db *QuickDB) rand(m map[string]interface{}, keys *[][]byte) {
i, index := 0, rand.Intn(len(m))
for k, v := range m {
if i == index {
*keys = append(*keys, []byte(k))
if v, ok := v.(map[string]interface{}); ok {
db.rand(v, keys)
}
return
}
i++
}
panic("quickdb rand: out-of-range")
}
// Copy copies the entire database.
func (db *QuickDB) Copy() *QuickDB {
db.RLock()
defer db.RUnlock()
return &QuickDB{m: db.copy(db.m)}
}
func (db *QuickDB) copy(m map[string]interface{}) map[string]interface{} {
clone := make(map[string]interface{}, len(m))
for k, v := range m {
switch v := v.(type) {
case map[string]interface{}:
clone[k] = db.copy(v)
default:
clone[k] = v
}
}
return clone
}
func randKey() []byte {
var min, max = 1, 1024
n := rand.Intn(max-min) + min
b := make([]byte, n)
for i := 0; i < n; i++ {
b[i] = byte(rand.Intn(255))
}
return b
}
func randKeys() [][]byte {
var keys [][]byte
var count = rand.Intn(2) + 2
for i := 0; i < count; i++ {
keys = append(keys, randKey())
}
return keys
}
func randValue() []byte {
n := rand.Intn(8192)
b := make([]byte, n)
for i := 0; i < n; i++ {
b[i] = byte(rand.Intn(255))
}
return b
}

View File

@@ -29,14 +29,6 @@ type Tx struct {
pages map[pgid]*page
stats TxStats
commitHandlers []func()
// WriteFlag specifies the flag for write-related methods like WriteTo().
// Tx opens the database file with the specified flag to copy the data.
//
// By default, the flag is unset, which works well for mostly in-memory
// workloads. For databases that are much larger than available RAM,
// set the flag to syscall.O_DIRECT to avoid trashing the page cache.
WriteFlag int
}
// init initializes the transaction.
@@ -95,21 +87,18 @@ func (tx *Tx) Stats() TxStats {
// Bucket retrieves a bucket by name.
// Returns nil if the bucket does not exist.
// The bucket instance is only valid for the lifetime of the transaction.
func (tx *Tx) Bucket(name []byte) *Bucket {
return tx.root.Bucket(name)
}
// CreateBucket creates a new bucket.
// Returns an error if the bucket already exists, if the bucket name is blank, or if the bucket name is too long.
// The bucket instance is only valid for the lifetime of the transaction.
func (tx *Tx) CreateBucket(name []byte) (*Bucket, error) {
return tx.root.CreateBucket(name)
}
// CreateBucketIfNotExists creates a new bucket if it doesn't already exist.
// Returns an error if the bucket name is blank, or if the bucket name is too long.
// The bucket instance is only valid for the lifetime of the transaction.
func (tx *Tx) CreateBucketIfNotExists(name []byte) (*Bucket, error) {
return tx.root.CreateBucketIfNotExists(name)
}
@@ -168,8 +157,6 @@ func (tx *Tx) Commit() error {
// Free the old root bucket.
tx.meta.root.root = tx.root.root
opgid := tx.meta.pgid
// Free the freelist and allocate new pages for it. This will overestimate
// the size of the freelist but not underestimate the size (which would be bad).
tx.db.freelist.free(tx.meta.txid, tx.db.page(tx.meta.freelist))
@@ -184,14 +171,6 @@ func (tx *Tx) Commit() error {
}
tx.meta.freelist = p.id
// If the high water mark has moved up then attempt to grow the database.
if tx.meta.pgid > opgid {
if err := tx.db.grow(int(tx.meta.pgid+1) * tx.db.pageSize); err != nil {
tx.rollback()
return err
}
}
// Write dirty pages to disk.
startTime = time.Now()
if err := tx.write(); err != nil {
@@ -257,8 +236,7 @@ func (tx *Tx) close() {
var freelistPendingN = tx.db.freelist.pending_count()
var freelistAlloc = tx.db.freelist.size()
// Remove transaction ref & writer lock.
tx.db.rwtx = nil
// Remove writer lock.
tx.db.rwlock.Unlock()
// Merge statistics.
@@ -272,16 +250,11 @@ func (tx *Tx) close() {
} else {
tx.db.removeTx(tx)
}
// Clear all references.
tx.db = nil
tx.meta = nil
tx.root = Bucket{tx: tx}
tx.pages = nil
}
// Copy writes the entire database to a writer.
// This function exists for backwards compatibility. Use WriteTo() instead.
// This function exists for backwards compatibility. Use WriteTo() in
func (tx *Tx) Copy(w io.Writer) error {
_, err := tx.WriteTo(w)
return err
@@ -290,18 +263,21 @@ func (tx *Tx) Copy(w io.Writer) error {
// WriteTo writes the entire database to a writer.
// If err == nil then exactly tx.Size() bytes will be written into the writer.
func (tx *Tx) WriteTo(w io.Writer) (n int64, err error) {
// Attempt to open reader with WriteFlag
f, err := os.OpenFile(tx.db.path, os.O_RDONLY|tx.WriteFlag, 0)
if err != nil {
return 0, err
// Attempt to open reader directly.
var f *os.File
if f, err = os.OpenFile(tx.db.path, os.O_RDONLY|odirect, 0); err != nil {
// Fallback to a regular open if that doesn't work.
if f, err = os.OpenFile(tx.db.path, os.O_RDONLY, 0); err != nil {
return 0, err
}
}
defer func() { _ = f.Close() }()
// Copy the meta pages.
tx.db.metalock.Lock()
n, err = io.CopyN(w, f, int64(tx.db.pageSize*2))
tx.db.metalock.Unlock()
if err != nil {
_ = f.Close()
return n, fmt.Errorf("meta copy: %s", err)
}
@@ -309,6 +285,7 @@ func (tx *Tx) WriteTo(w io.Writer) (n int64, err error) {
wn, err := io.CopyN(w, f, tx.Size()-int64(tx.db.pageSize*2))
n += wn
if err != nil {
_ = f.Close()
return n, err
}
@@ -515,7 +492,7 @@ func (tx *Tx) writeMeta() error {
}
// page returns a reference to the page with a given id.
// If page has been written to then a temporary buffered page is returned.
// If page has been written to then a temporary bufferred page is returned.
func (tx *Tx) page(id pgid) *page {
// Check the dirty pages first.
if tx.pages != nil {

456
Godeps/_workspace/src/github.com/boltdb/bolt/tx_test.go generated vendored Normal file
View File

@@ -0,0 +1,456 @@
package bolt_test
import (
"errors"
"fmt"
"os"
"testing"
"github.com/coreos/etcd/Godeps/_workspace/src/github.com/boltdb/bolt"
)
// Ensure that committing a closed transaction returns an error.
func TestTx_Commit_Closed(t *testing.T) {
db := NewTestDB()
defer db.Close()
tx, _ := db.Begin(true)
tx.CreateBucket([]byte("foo"))
ok(t, tx.Commit())
equals(t, tx.Commit(), bolt.ErrTxClosed)
}
// Ensure that rolling back a closed transaction returns an error.
func TestTx_Rollback_Closed(t *testing.T) {
db := NewTestDB()
defer db.Close()
tx, _ := db.Begin(true)
ok(t, tx.Rollback())
equals(t, tx.Rollback(), bolt.ErrTxClosed)
}
// Ensure that committing a read-only transaction returns an error.
func TestTx_Commit_ReadOnly(t *testing.T) {
db := NewTestDB()
defer db.Close()
tx, _ := db.Begin(false)
equals(t, tx.Commit(), bolt.ErrTxNotWritable)
}
// Ensure that a transaction can retrieve a cursor on the root bucket.
func TestTx_Cursor(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.CreateBucket([]byte("woojits"))
c := tx.Cursor()
k, v := c.First()
equals(t, "widgets", string(k))
assert(t, v == nil, "")
k, v = c.Next()
equals(t, "woojits", string(k))
assert(t, v == nil, "")
k, v = c.Next()
assert(t, k == nil, "")
assert(t, v == nil, "")
return nil
})
}
// Ensure that creating a bucket with a read-only transaction returns an error.
func TestTx_CreateBucket_ReadOnly(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.View(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("foo"))
assert(t, b == nil, "")
equals(t, bolt.ErrTxNotWritable, err)
return nil
})
}
// Ensure that creating a bucket on a closed transaction returns an error.
func TestTx_CreateBucket_Closed(t *testing.T) {
db := NewTestDB()
defer db.Close()
tx, _ := db.Begin(true)
tx.Commit()
b, err := tx.CreateBucket([]byte("foo"))
assert(t, b == nil, "")
equals(t, bolt.ErrTxClosed, err)
}
// Ensure that a Tx can retrieve a bucket.
func TestTx_Bucket(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
b := tx.Bucket([]byte("widgets"))
assert(t, b != nil, "")
return nil
})
}
// Ensure that a Tx retrieving a non-existent key returns nil.
func TestTx_Get_Missing(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
value := tx.Bucket([]byte("widgets")).Get([]byte("no_such_key"))
assert(t, value == nil, "")
return nil
})
}
// Ensure that a bucket can be created and retrieved.
func TestTx_CreateBucket(t *testing.T) {
db := NewTestDB()
defer db.Close()
// Create a bucket.
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("widgets"))
assert(t, b != nil, "")
ok(t, err)
return nil
})
// Read the bucket through a separate transaction.
db.View(func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("widgets"))
assert(t, b != nil, "")
return nil
})
}
// Ensure that a bucket can be created if it doesn't already exist.
func TestTx_CreateBucketIfNotExists(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucketIfNotExists([]byte("widgets"))
assert(t, b != nil, "")
ok(t, err)
b, err = tx.CreateBucketIfNotExists([]byte("widgets"))
assert(t, b != nil, "")
ok(t, err)
b, err = tx.CreateBucketIfNotExists([]byte{})
assert(t, b == nil, "")
equals(t, bolt.ErrBucketNameRequired, err)
b, err = tx.CreateBucketIfNotExists(nil)
assert(t, b == nil, "")
equals(t, bolt.ErrBucketNameRequired, err)
return nil
})
// Read the bucket through a separate transaction.
db.View(func(tx *bolt.Tx) error {
b := tx.Bucket([]byte("widgets"))
assert(t, b != nil, "")
return nil
})
}
// Ensure that a bucket cannot be created twice.
func TestTx_CreateBucket_Exists(t *testing.T) {
db := NewTestDB()
defer db.Close()
// Create a bucket.
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("widgets"))
assert(t, b != nil, "")
ok(t, err)
return nil
})
// Create the same bucket again.
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket([]byte("widgets"))
assert(t, b == nil, "")
equals(t, bolt.ErrBucketExists, err)
return nil
})
}
// Ensure that a bucket is created with a non-blank name.
func TestTx_CreateBucket_NameRequired(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
b, err := tx.CreateBucket(nil)
assert(t, b == nil, "")
equals(t, bolt.ErrBucketNameRequired, err)
return nil
})
}
// Ensure that a bucket can be deleted.
func TestTx_DeleteBucket(t *testing.T) {
db := NewTestDB()
defer db.Close()
// Create a bucket and add a value.
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
return nil
})
// Delete the bucket and make sure we can't get the value.
db.Update(func(tx *bolt.Tx) error {
ok(t, tx.DeleteBucket([]byte("widgets")))
assert(t, tx.Bucket([]byte("widgets")) == nil, "")
return nil
})
db.Update(func(tx *bolt.Tx) error {
// Create the bucket again and make sure there's not a phantom value.
b, err := tx.CreateBucket([]byte("widgets"))
assert(t, b != nil, "")
ok(t, err)
assert(t, tx.Bucket([]byte("widgets")).Get([]byte("foo")) == nil, "")
return nil
})
}
// Ensure that deleting a bucket on a closed transaction returns an error.
func TestTx_DeleteBucket_Closed(t *testing.T) {
db := NewTestDB()
defer db.Close()
tx, _ := db.Begin(true)
tx.Commit()
equals(t, tx.DeleteBucket([]byte("foo")), bolt.ErrTxClosed)
}
// Ensure that deleting a bucket with a read-only transaction returns an error.
func TestTx_DeleteBucket_ReadOnly(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.View(func(tx *bolt.Tx) error {
equals(t, tx.DeleteBucket([]byte("foo")), bolt.ErrTxNotWritable)
return nil
})
}
// Ensure that nothing happens when deleting a bucket that doesn't exist.
func TestTx_DeleteBucket_NotFound(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
equals(t, bolt.ErrBucketNotFound, tx.DeleteBucket([]byte("widgets")))
return nil
})
}
// Ensure that no error is returned when a tx.ForEach function does not return
// an error.
func TestTx_ForEach_NoError(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
equals(t, nil, tx.ForEach(func(name []byte, b *bolt.Bucket) error {
return nil
}))
return nil
})
}
// Ensure that an error is returned when a tx.ForEach function returns an error.
func TestTx_ForEach_WithError(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
err := errors.New("foo")
equals(t, err, tx.ForEach(func(name []byte, b *bolt.Bucket) error {
return err
}))
return nil
})
}
// Ensure that Tx commit handlers are called after a transaction successfully commits.
func TestTx_OnCommit(t *testing.T) {
var x int
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.OnCommit(func() { x += 1 })
tx.OnCommit(func() { x += 2 })
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
equals(t, 3, x)
}
// Ensure that Tx commit handlers are NOT called after a transaction rolls back.
func TestTx_OnCommit_Rollback(t *testing.T) {
var x int
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.OnCommit(func() { x += 1 })
tx.OnCommit(func() { x += 2 })
tx.CreateBucket([]byte("widgets"))
return errors.New("rollback this commit")
})
equals(t, 0, x)
}
// Ensure that the database can be copied to a file path.
func TestTx_CopyFile(t *testing.T) {
db := NewTestDB()
defer db.Close()
var dest = tempfile()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
tx.Bucket([]byte("widgets")).Put([]byte("baz"), []byte("bat"))
return nil
})
ok(t, db.View(func(tx *bolt.Tx) error { return tx.CopyFile(dest, 0600) }))
db2, err := bolt.Open(dest, 0600, nil)
ok(t, err)
defer db2.Close()
db2.View(func(tx *bolt.Tx) error {
equals(t, []byte("bar"), tx.Bucket([]byte("widgets")).Get([]byte("foo")))
equals(t, []byte("bat"), tx.Bucket([]byte("widgets")).Get([]byte("baz")))
return nil
})
}
type failWriterError struct{}
func (failWriterError) Error() string {
return "error injected for tests"
}
type failWriter struct {
// fail after this many bytes
After int
}
func (f *failWriter) Write(p []byte) (n int, err error) {
n = len(p)
if n > f.After {
n = f.After
err = failWriterError{}
}
f.After -= n
return n, err
}
// Ensure that Copy handles write errors right.
func TestTx_CopyFile_Error_Meta(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
tx.Bucket([]byte("widgets")).Put([]byte("baz"), []byte("bat"))
return nil
})
err := db.View(func(tx *bolt.Tx) error { return tx.Copy(&failWriter{}) })
equals(t, err.Error(), "meta copy: error injected for tests")
}
// Ensure that Copy handles write errors right.
func TestTx_CopyFile_Error_Normal(t *testing.T) {
db := NewTestDB()
defer db.Close()
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
tx.Bucket([]byte("widgets")).Put([]byte("baz"), []byte("bat"))
return nil
})
err := db.View(func(tx *bolt.Tx) error { return tx.Copy(&failWriter{3 * db.Info().PageSize}) })
equals(t, err.Error(), "error injected for tests")
}
func ExampleTx_Rollback() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Create a bucket.
db.Update(func(tx *bolt.Tx) error {
_, err := tx.CreateBucket([]byte("widgets"))
return err
})
// Set a value for a key.
db.Update(func(tx *bolt.Tx) error {
return tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
})
// Update the key but rollback the transaction so it never saves.
tx, _ := db.Begin(true)
b := tx.Bucket([]byte("widgets"))
b.Put([]byte("foo"), []byte("baz"))
tx.Rollback()
// Ensure that our original value is still set.
db.View(func(tx *bolt.Tx) error {
value := tx.Bucket([]byte("widgets")).Get([]byte("foo"))
fmt.Printf("The value for 'foo' is still: %s\n", value)
return nil
})
// Output:
// The value for 'foo' is still: bar
}
func ExampleTx_CopyFile() {
// Open the database.
db, _ := bolt.Open(tempfile(), 0666, nil)
defer os.Remove(db.Path())
defer db.Close()
// Create a bucket and a key.
db.Update(func(tx *bolt.Tx) error {
tx.CreateBucket([]byte("widgets"))
tx.Bucket([]byte("widgets")).Put([]byte("foo"), []byte("bar"))
return nil
})
// Copy the database to another file.
toFile := tempfile()
db.View(func(tx *bolt.Tx) error { return tx.CopyFile(toFile, 0666) })
defer os.Remove(toFile)
// Open the cloned database.
db2, _ := bolt.Open(toFile, 0666, nil)
defer db2.Close()
// Ensure that the key exists in the copy.
db2.View(func(tx *bolt.Tx) error {
value := tx.Bucket([]byte("widgets")).Get([]byte("foo"))
fmt.Printf("The value for 'foo' in the clone is: %s\n", value)
return nil
})
// Output:
// The value for 'foo' in the clone is: bar
}

View File

@@ -0,0 +1 @@
*~

View File

@@ -0,0 +1,19 @@
# This file is like Go's AUTHORS file: it lists Copyright holders.
# The list of humans who have contributd is in the CONTRIBUTORS file.
#
# To contribute to this project, because it will eventually be folded
# back in to Go itself, you need to submit a CLA:
#
# http://golang.org/doc/contribute.html#copyright
#
# Then you get added to CONTRIBUTORS and you or your company get added
# to the AUTHORS file.
Blake Mizerany <blake.mizerany@gmail.com> github=bmizerany
Daniel Morsing <daniel.morsing@gmail.com> github=DanielMorsing
Gabriel Aszalos <gabriel.aszalos@gmail.com> github=gbbr
Google, Inc.
Keith Rarick <kr@xph.us> github=kr
Matthew Keenan <tank.en.mate@gmail.com> <github@mattkeenan.net> github=mattkeenan
Matt Layher <mdlayher@gmail.com> github=mdlayher
Tatsuhiro Tsujikawa <tatsuhiro.t@gmail.com> github=tatsuhiro-t

View File

@@ -0,0 +1,19 @@
# This file is like Go's CONTRIBUTORS file: it lists humans.
# The list of copyright holders (which may be companies) are in the AUTHORS file.
#
# To contribute to this project, because it will eventually be folded
# back in to Go itself, you need to submit a CLA:
#
# http://golang.org/doc/contribute.html#copyright
#
# Then you get added to CONTRIBUTORS and you or your company get added
# to the AUTHORS file.
Blake Mizerany <blake.mizerany@gmail.com> github=bmizerany
Brad Fitzpatrick <bradfitz@golang.org> github=bradfitz
Daniel Morsing <daniel.morsing@gmail.com> github=DanielMorsing
Gabriel Aszalos <gabriel.aszalos@gmail.com> github=gbbr
Keith Rarick <kr@xph.us> github=kr
Matthew Keenan <tank.en.mate@gmail.com> <github@mattkeenan.net> github=mattkeenan
Matt Layher <mdlayher@gmail.com> github=mdlayher
Tatsuhiro Tsujikawa <tatsuhiro.t@gmail.com> github=tatsuhiro-t

View File

@@ -17,15 +17,8 @@ RUN apt-get install -y --no-install-recommends \
libcunit1-dev libssl-dev libxml2-dev libevent-dev \
automake autoconf
# The list of packages nghttp2 recommends for h2load:
RUN apt-get install -y --no-install-recommends make binutils \
autoconf automake autotools-dev \
libtool pkg-config zlib1g-dev libcunit1-dev libssl-dev libxml2-dev \
libev-dev libevent-dev libjansson-dev libjemalloc-dev \
cython python3.4-dev python-setuptools
# Note: setting NGHTTP2_VER before the git clone, so an old git clone isn't cached:
ENV NGHTTP2_VER 895da9a
ENV NGHTTP2_VER af24f8394e43f4
RUN cd /root && git clone https://github.com/tatsuhiro-t/nghttp2.git
WORKDIR /root/nghttp2
@@ -38,9 +31,9 @@ RUN make
RUN make install
WORKDIR /root
RUN wget http://curl.haxx.se/download/curl-7.45.0.tar.gz
RUN tar -zxvf curl-7.45.0.tar.gz
WORKDIR /root/curl-7.45.0
RUN wget http://curl.haxx.se/download/curl-7.40.0.tar.gz
RUN tar -zxvf curl-7.40.0.tar.gz
WORKDIR /root/curl-7.40.0
RUN ./configure --with-ssl --with-nghttp2=/usr/local
RUN make
RUN make install

View File

@@ -0,0 +1,5 @@
We only accept contributions from users who have gone through Go's
contribution process (signed a CLA).
Please acknowledge whether you have (and use the same email) if
sending a pull request.

View File

@@ -0,0 +1,7 @@
Copyright 2014 Google & the Go AUTHORS
Go AUTHORS are:
See https://code.google.com/p/go/source/browse/AUTHORS
Licensed under the terms of Go itself:
https://code.google.com/p/go/source/browse/LICENSE

View File

@@ -10,11 +10,8 @@ Status:
* The client work has just started but shares a lot of code
is coming along much quicker.
Docs are at https://godoc.org/golang.org/x/net/http2
Docs are at https://godoc.org/github.com/bradfitz/http2
Demo test server at https://http2.golang.org/
Help & bug reports welcome!
Contributing: https://golang.org/doc/contribute.html
Bugs: https://golang.org/issue/new?title=x/net/http2:+
Help & bug reports welcome.

View File

@@ -0,0 +1,75 @@
// Copyright 2014 The Go Authors.
// See https://code.google.com/p/go/source/browse/CONTRIBUTORS
// Licensed under the same terms as Go itself:
// https://code.google.com/p/go/source/browse/LICENSE
package http2
import (
"errors"
)
// buffer is an io.ReadWriteCloser backed by a fixed size buffer.
// It never allocates, but moves old data as new data is written.
type buffer struct {
buf []byte
r, w int
closed bool
err error // err to return to reader
}
var (
errReadEmpty = errors.New("read from empty buffer")
errWriteFull = errors.New("write on full buffer")
)
// Read copies bytes from the buffer into p.
// It is an error to read when no data is available.
func (b *buffer) Read(p []byte) (n int, err error) {
n = copy(p, b.buf[b.r:b.w])
b.r += n
if b.closed && b.r == b.w {
err = b.err
} else if b.r == b.w && n == 0 {
err = errReadEmpty
}
return n, err
}
// Len returns the number of bytes of the unread portion of the buffer.
func (b *buffer) Len() int {
return b.w - b.r
}
// Write copies bytes from p into the buffer.
// It is an error to write more data than the buffer can hold.
func (b *buffer) Write(p []byte) (n int, err error) {
if b.closed {
return 0, errors.New("closed")
}
// Slide existing data to beginning.
if b.r > 0 && len(p) > len(b.buf)-b.w {
copy(b.buf, b.buf[b.r:b.w])
b.w -= b.r
b.r = 0
}
// Write new data.
n = copy(b.buf[b.w:], p)
b.w += n
if n < len(p) {
err = errWriteFull
}
return n, err
}
// Close marks the buffer as closed. Future calls to Write will
// return an error. Future calls to Read, once the buffer is
// empty, will return err.
func (b *buffer) Close(err error) {
if !b.closed {
b.closed = true
b.err = err
}
}

View File

@@ -0,0 +1,73 @@
// Copyright 2014 The Go Authors.
// See https://code.google.com/p/go/source/browse/CONTRIBUTORS
// Licensed under the same terms as Go itself:
// https://code.google.com/p/go/source/browse/LICENSE
package http2
import (
"io"
"reflect"
"testing"
)
var bufferReadTests = []struct {
buf buffer
read, wn int
werr error
wp []byte
wbuf buffer
}{
{
buffer{[]byte{'a', 0}, 0, 1, false, nil},
5, 1, nil, []byte{'a'},
buffer{[]byte{'a', 0}, 1, 1, false, nil},
},
{
buffer{[]byte{'a', 0}, 0, 1, true, io.EOF},
5, 1, io.EOF, []byte{'a'},
buffer{[]byte{'a', 0}, 1, 1, true, io.EOF},
},
{
buffer{[]byte{0, 'a'}, 1, 2, false, nil},
5, 1, nil, []byte{'a'},
buffer{[]byte{0, 'a'}, 2, 2, false, nil},
},
{
buffer{[]byte{0, 'a'}, 1, 2, true, io.EOF},
5, 1, io.EOF, []byte{'a'},
buffer{[]byte{0, 'a'}, 2, 2, true, io.EOF},
},
{
buffer{[]byte{}, 0, 0, false, nil},
5, 0, errReadEmpty, []byte{},
buffer{[]byte{}, 0, 0, false, nil},
},
{
buffer{[]byte{}, 0, 0, true, io.EOF},
5, 0, io.EOF, []byte{},
buffer{[]byte{}, 0, 0, true, io.EOF},
},
}
func TestBufferRead(t *testing.T) {
for i, tt := range bufferReadTests {
read := make([]byte, tt.read)
n, err := tt.buf.Read(read)
if n != tt.wn {
t.Errorf("#%d: wn = %d want %d", i, n, tt.wn)
continue
}
if err != tt.werr {
t.Errorf("#%d: werr = %v want %v", i, err, tt.werr)
continue
}
read = read[:n]
if !reflect.DeepEqual(read, tt.wp) {
t.Errorf("#%d: read = %+v want %+v", i, read, tt.wp)
}
if !reflect.DeepEqual(tt.buf, tt.wbuf) {
t.Errorf("#%d: buf = %+v want %+v", i, tt.buf, tt.wbuf)
}
}
}

View File

@@ -1,6 +1,7 @@
// Copyright 2014 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// Copyright 2014 The Go Authors.
// See https://code.google.com/p/go/source/browse/CONTRIBUTORS
// Licensed under the same terms as Go itself:
// https://code.google.com/p/go/source/browse/LICENSE
package http2
@@ -75,16 +76,3 @@ func (e StreamError) Error() string {
type goAwayFlowError struct{}
func (goAwayFlowError) Error() string { return "connection exceeded flow control window size" }
// connErrorReason wraps a ConnectionError with an informative error about why it occurs.
// Errors of this type are only returned by the frame parser functions
// and converted into ConnectionError(ErrCodeProtocol).
type connError struct {
Code ErrCode
Reason string
}
func (e connError) Error() string {
return fmt.Sprintf("http2: connection error: %v: %v", e.Code, e.Reason)
}

View File

@@ -1,6 +1,9 @@
// Copyright 2014 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// See https://code.google.com/p/go/source/browse/CONTRIBUTORS
// Licensed under the same terms as Go itself:
// https://code.google.com/p/go/source/browse/LICENSE
package http2

View File

@@ -1,6 +1,7 @@
// Copyright 2014 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// Copyright 2014 The Go Authors.
// See https://code.google.com/p/go/source/browse/CONTRIBUTORS
// Licensed under the same terms as Go itself:
// https://code.google.com/p/go/source/browse/LICENSE
// Flow control

View File

@@ -1,6 +1,7 @@
// Copyright 2014 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// Copyright 2014 The Go Authors.
// See https://code.google.com/p/go/source/browse/CONTRIBUTORS
// Licensed under the same terms as Go itself:
// https://code.google.com/p/go/source/browse/LICENSE
package http2

View File

@@ -1,6 +1,7 @@
// Copyright 2014 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// Copyright 2014 The Go Authors.
// See https://code.google.com/p/go/source/browse/CONTRIBUTORS
// Licensed under the same terms as Go itself:
// https://code.google.com/p/go/source/browse/LICENSE
package http2
@@ -10,7 +11,6 @@ import (
"errors"
"fmt"
"io"
"log"
"sync"
)
@@ -85,8 +85,8 @@ const (
// Continuation Frame
FlagContinuationEndHeaders Flags = 0x4
FlagPushPromiseEndHeaders Flags = 0x4
FlagPushPromisePadded Flags = 0x8
FlagPushPromiseEndHeaders = 0x4
FlagPushPromisePadded = 0x8
)
var flagName = map[FrameType]map[Flags]string{
@@ -172,12 +172,6 @@ func (h FrameHeader) Header() FrameHeader { return h }
func (h FrameHeader) String() string {
var buf bytes.Buffer
buf.WriteString("[FrameHeader ")
h.writeDebug(&buf)
buf.WriteByte(']')
return buf.String()
}
func (h FrameHeader) writeDebug(buf *bytes.Buffer) {
buf.WriteString(h.Type.String())
if h.Flags != 0 {
buf.WriteString(" flags=")
@@ -194,14 +188,15 @@ func (h FrameHeader) writeDebug(buf *bytes.Buffer) {
if name != "" {
buf.WriteString(name)
} else {
fmt.Fprintf(buf, "0x%x", 1<<i)
fmt.Fprintf(&buf, "0x%x", 1<<i)
}
}
}
if h.StreamID != 0 {
fmt.Fprintf(buf, " stream=%d", h.StreamID)
fmt.Fprintf(&buf, " stream=%d", h.StreamID)
}
fmt.Fprintf(buf, " len=%d", h.Length)
fmt.Fprintf(&buf, " len=%d]", h.Length)
return buf.String()
}
func (h *FrameHeader) checkValid() {
@@ -261,11 +256,6 @@ type Frame interface {
type Framer struct {
r io.Reader
lastFrame Frame
errReason string
// lastHeaderStream is non-zero if the last frame was an
// unfinished HEADERS/CONTINUATION.
lastHeaderStream uint32
maxReadSize uint32
headerBuf [frameHeaderLen]byte
@@ -282,29 +272,18 @@ type Framer struct {
wbuf []byte
// AllowIllegalWrites permits the Framer's Write methods to
// write frames that do not conform to the HTTP/2 spec. This
// write frames that do not conform to the HTTP/2 spec. This
// permits using the Framer to test other HTTP/2
// implementations' conformance to the spec.
// If false, the Write methods will prefer to return an error
// rather than comply.
AllowIllegalWrites bool
// AllowIllegalReads permits the Framer's ReadFrame method
// to return non-compliant frames or frame orders.
// This is for testing and permits using the Framer to test
// other HTTP/2 implementations' conformance to the spec.
AllowIllegalReads bool
// TODO: track which type of frame & with which flags was sent
// last. Then return an error (unless AllowIllegalWrites) if
// we're in the middle of a header block and a
// non-Continuation or Continuation on a different stream is
// attempted to be written.
logReads bool
debugFramer *Framer // only use for logging written writes
debugFramerBuf *bytes.Buffer
}
func (f *Framer) startWrite(ftype FrameType, flags Flags, streamID uint32) {
@@ -332,10 +311,6 @@ func (f *Framer) endWrite() error {
byte(length>>16),
byte(length>>8),
byte(length))
if logFrameWrites {
f.logWrite()
}
n, err := f.w.Write(f.wbuf)
if err == nil && n != len(f.wbuf) {
err = io.ErrShortWrite
@@ -343,24 +318,6 @@ func (f *Framer) endWrite() error {
return err
}
func (f *Framer) logWrite() {
if f.debugFramer == nil {
f.debugFramerBuf = new(bytes.Buffer)
f.debugFramer = NewFramer(nil, f.debugFramerBuf)
f.debugFramer.logReads = false // we log it ourselves, saying "wrote" below
// Let us read anything, even if we accidentally wrote it
// in the wrong order:
f.debugFramer.AllowIllegalReads = true
}
f.debugFramerBuf.Write(f.wbuf)
fr, err := f.debugFramer.ReadFrame()
if err != nil {
log.Printf("http2: Framer %p: failed to decode just-written frame", f)
return
}
log.Printf("http2: Framer %p: wrote %v", f, summarizeFrame(fr))
}
func (f *Framer) writeByte(v byte) { f.wbuf = append(f.wbuf, v) }
func (f *Framer) writeBytes(v []byte) { f.wbuf = append(f.wbuf, v...) }
func (f *Framer) writeUint16(v uint16) { f.wbuf = append(f.wbuf, byte(v>>8), byte(v)) }
@@ -376,9 +333,8 @@ const (
// NewFramer returns a Framer that writes frames to w and reads them from r.
func NewFramer(w io.Writer, r io.Reader) *Framer {
fr := &Framer{
w: w,
r: r,
logReads: logFrameReads,
w: w,
r: r,
}
fr.getReadBuf = func(size uint32) []byte {
if cap(fr.readBuf) >= int(size) {
@@ -406,22 +362,10 @@ func (fr *Framer) SetMaxReadFrameSize(v uint32) {
// sends a frame that is larger than declared with SetMaxReadFrameSize.
var ErrFrameTooLarge = errors.New("http2: frame too large")
// terminalReadFrameError reports whether err is an unrecoverable
// error from ReadFrame and no other frames should be read.
func terminalReadFrameError(err error) bool {
if _, ok := err.(StreamError); ok {
return false
}
return err != nil
}
// ReadFrame reads a single frame. The returned Frame is only valid
// until the next call to ReadFrame.
//
// If the frame is larger than previously set with SetMaxReadFrameSize, the
// returned error is ErrFrameTooLarge. Other errors may be of type
// ConnectionError, StreamError, or anything else from from the underlying
// reader.
// If the frame is larger than previously set with SetMaxReadFrameSize,
// the returned error is ErrFrameTooLarge.
func (fr *Framer) ReadFrame() (Frame, error) {
if fr.lastFrame != nil {
fr.lastFrame.invalidate()
@@ -439,66 +383,10 @@ func (fr *Framer) ReadFrame() (Frame, error) {
}
f, err := typeFrameParser(fh.Type)(fh, payload)
if err != nil {
if ce, ok := err.(connError); ok {
return nil, fr.connError(ce.Code, ce.Reason)
}
return nil, err
}
if err := fr.checkFrameOrder(f); err != nil {
return nil, err
}
if fr.logReads {
log.Printf("http2: Framer %p: read %v", fr, summarizeFrame(f))
}
return f, nil
}
// connError returns ConnectionError(code) but first
// stashes away a public reason to the caller can optionally relay it
// to the peer before hanging up on them. This might help others debug
// their implementations.
func (fr *Framer) connError(code ErrCode, reason string) error {
fr.errReason = reason
return ConnectionError(code)
}
// checkFrameOrder reports an error if f is an invalid frame to return
// next from ReadFrame. Mostly it checks whether HEADERS and
// CONTINUATION frames are contiguous.
func (fr *Framer) checkFrameOrder(f Frame) error {
last := fr.lastFrame
fr.lastFrame = f
if fr.AllowIllegalReads {
return nil
}
fh := f.Header()
if fr.lastHeaderStream != 0 {
if fh.Type != FrameContinuation {
return fr.connError(ErrCodeProtocol,
fmt.Sprintf("got %s for stream %d; expected CONTINUATION following %s for stream %d",
fh.Type, fh.StreamID,
last.Header().Type, fr.lastHeaderStream))
}
if fh.StreamID != fr.lastHeaderStream {
return fr.connError(ErrCodeProtocol,
fmt.Sprintf("got CONTINUATION for stream %d; expected stream %d",
fh.StreamID, fr.lastHeaderStream))
}
} else if fh.Type == FrameContinuation {
return fr.connError(ErrCodeProtocol, fmt.Sprintf("unexpected CONTINUATION for stream %d", fh.StreamID))
}
switch fh.Type {
case FrameHeaders, FrameContinuation:
if fh.Flags.Has(FlagHeadersEndHeaders) {
fr.lastHeaderStream = 0
} else {
fr.lastHeaderStream = fh.StreamID
}
}
return nil
return f, nil
}
// A DataFrame conveys arbitrary, variable-length sequences of octets
@@ -529,7 +417,7 @@ func parseDataFrame(fh FrameHeader, payload []byte) (Frame, error) {
// field is 0x0, the recipient MUST respond with a
// connection error (Section 5.4.1) of type
// PROTOCOL_ERROR.
return nil, connError{ErrCodeProtocol, "DATA frame with stream ID 0"}
return nil, ConnectionError(ErrCodeProtocol)
}
f := &DataFrame{
FrameHeader: fh,
@@ -547,7 +435,7 @@ func parseDataFrame(fh FrameHeader, payload []byte) (Frame, error) {
// length of the frame payload, the recipient MUST
// treat this as a connection error.
// Filed: https://github.com/http2/http2-spec/issues/610
return nil, connError{ErrCodeProtocol, "pad size larger than data payload"}
return nil, ConnectionError(ErrCodeProtocol)
}
f.data = payload[:len(payload)-int(padSize)]
return f, nil
@@ -687,8 +575,6 @@ type PingFrame struct {
Data [8]byte
}
func (f *PingFrame) IsAck() bool { return f.Flags.Has(FlagPingAck) }
func parsePingFrame(fh FrameHeader, payload []byte) (Frame, error) {
if len(payload) != 8 {
return nil, ConnectionError(ErrCodeFrameSize)
@@ -777,7 +663,7 @@ func parseUnknownFrame(fh FrameHeader, p []byte) (Frame, error) {
// See http://http2.github.io/http2-spec/#rfc.section.6.9
type WindowUpdateFrame struct {
FrameHeader
Increment uint32 // never read with high bit set
Increment uint32
}
func parseWindowUpdateFrame(fh FrameHeader, p []byte) (Frame, error) {
@@ -854,7 +740,7 @@ func parseHeadersFrame(fh FrameHeader, p []byte) (_ Frame, err error) {
// is received whose stream identifier field is 0x0, the recipient MUST
// respond with a connection error (Section 5.4.1) of type
// PROTOCOL_ERROR.
return nil, connError{ErrCodeProtocol, "HEADERS frame with stream ID 0"}
return nil, ConnectionError(ErrCodeProtocol)
}
var padLength uint8
if fh.Flags.Has(FlagHeadersPadded) {
@@ -984,10 +870,10 @@ func (p PriorityParam) IsZero() bool {
func parsePriorityFrame(fh FrameHeader, payload []byte) (Frame, error) {
if fh.StreamID == 0 {
return nil, connError{ErrCodeProtocol, "PRIORITY frame with stream ID 0"}
return nil, ConnectionError(ErrCodeProtocol)
}
if len(payload) != 5 {
return nil, connError{ErrCodeFrameSize, fmt.Sprintf("PRIORITY frame payload size was %d; want 5", len(payload))}
return nil, ConnectionError(ErrCodeFrameSize)
}
v := binary.BigEndian.Uint32(payload[:4])
streamID := v & 0x7fffffff // mask off high bit
@@ -1057,12 +943,13 @@ type ContinuationFrame struct {
}
func parseContinuationFrame(fh FrameHeader, p []byte) (Frame, error) {
if fh.StreamID == 0 {
return nil, connError{ErrCodeProtocol, "CONTINUATION frame with stream ID 0"}
}
return &ContinuationFrame{fh, p}, nil
}
func (f *ContinuationFrame) StreamEnded() bool {
return f.FrameHeader.Flags.Has(FlagDataEndStream)
}
func (f *ContinuationFrame) HeaderBlockFragment() []byte {
f.checkValid()
return f.headerFragBuf
@@ -1224,46 +1111,3 @@ type streamEnder interface {
type headersEnder interface {
HeadersEnded() bool
}
func summarizeFrame(f Frame) string {
var buf bytes.Buffer
f.Header().writeDebug(&buf)
switch f := f.(type) {
case *SettingsFrame:
n := 0
f.ForeachSetting(func(s Setting) error {
n++
if n == 1 {
buf.WriteString(", settings:")
}
fmt.Fprintf(&buf, " %v=%v,", s.ID, s.Val)
return nil
})
if n > 0 {
buf.Truncate(buf.Len() - 1) // remove trailing comma
}
case *DataFrame:
data := f.Data()
const max = 256
if len(data) > max {
data = data[:max]
}
fmt.Fprintf(&buf, " data=%q", data)
if len(f.Data()) > max {
fmt.Fprintf(&buf, " (%d bytes omitted)", len(f.Data())-max)
}
case *WindowUpdateFrame:
if f.StreamID == 0 {
buf.WriteString(" (conn)")
}
fmt.Fprintf(&buf, " incr=%v", f.Increment)
case *PingFrame:
fmt.Fprintf(&buf, " ping=%q", f.Data[:])
case *GoAwayFrame:
fmt.Fprintf(&buf, " LastStreamID=%v ErrCode=%v Debug=%q",
f.LastStreamID, f.ErrCode, f.debugData)
case *RSTStreamFrame:
fmt.Fprintf(&buf, " ErrCode=%v", f.ErrCode)
}
return buf.String()
}

View File

@@ -6,8 +6,6 @@ package http2
import (
"bytes"
"fmt"
"io"
"reflect"
"strings"
"testing"
@@ -26,25 +24,6 @@ func TestFrameSizes(t *testing.T) {
}
}
func TestFrameTypeString(t *testing.T) {
tests := []struct {
ft FrameType
want string
}{
{FrameData, "DATA"},
{FramePing, "PING"},
{FrameGoAway, "GOAWAY"},
{0xf, "UNKNOWN_FRAME_TYPE_15"},
}
for i, tt := range tests {
got := tt.ft.String()
if got != tt.want {
t.Errorf("%d. String(FrameType %d) = %q; want %q", i, int(tt.ft), got, tt.want)
}
}
}
func TestWriteRST(t *testing.T) {
fr, buf := testFramer()
var streamID uint32 = 1<<24 + 2<<16 + 3<<8 + 4
@@ -266,7 +245,6 @@ func TestWriteContinuation(t *testing.T) {
t.Errorf("test %q: %v", tt.name, err)
continue
}
fr.AllowIllegalReads = true
f, err := fr.ReadFrame()
if err != nil {
t.Errorf("test %q: failed to read the frame back: %v", tt.name, err)
@@ -598,138 +576,3 @@ func TestWritePushPromise(t *testing.T) {
t.Fatalf("parsed back:\n%#v\nwant:\n%#v", f, want)
}
}
// test checkFrameOrder and that HEADERS and CONTINUATION frames can't be intermingled.
func TestReadFrameOrder(t *testing.T) {
head := func(f *Framer, id uint32, end bool) {
f.WriteHeaders(HeadersFrameParam{
StreamID: id,
BlockFragment: []byte("foo"), // unused, but non-empty
EndHeaders: end,
})
}
cont := func(f *Framer, id uint32, end bool) {
f.WriteContinuation(id, end, []byte("foo"))
}
tests := [...]struct {
name string
w func(*Framer)
atLeast int
wantErr string
}{
0: {
w: func(f *Framer) {
head(f, 1, true)
},
},
1: {
w: func(f *Framer) {
head(f, 1, true)
head(f, 2, true)
},
},
2: {
wantErr: "got HEADERS for stream 2; expected CONTINUATION following HEADERS for stream 1",
w: func(f *Framer) {
head(f, 1, false)
head(f, 2, true)
},
},
3: {
wantErr: "got DATA for stream 1; expected CONTINUATION following HEADERS for stream 1",
w: func(f *Framer) {
head(f, 1, false)
},
},
4: {
w: func(f *Framer) {
head(f, 1, false)
cont(f, 1, true)
head(f, 2, true)
},
},
5: {
wantErr: "got CONTINUATION for stream 2; expected stream 1",
w: func(f *Framer) {
head(f, 1, false)
cont(f, 2, true)
head(f, 2, true)
},
},
6: {
wantErr: "unexpected CONTINUATION for stream 1",
w: func(f *Framer) {
cont(f, 1, true)
},
},
7: {
wantErr: "unexpected CONTINUATION for stream 1",
w: func(f *Framer) {
cont(f, 1, false)
},
},
8: {
wantErr: "HEADERS frame with stream ID 0",
w: func(f *Framer) {
head(f, 0, true)
},
},
9: {
wantErr: "CONTINUATION frame with stream ID 0",
w: func(f *Framer) {
cont(f, 0, true)
},
},
10: {
wantErr: "unexpected CONTINUATION for stream 1",
atLeast: 5,
w: func(f *Framer) {
head(f, 1, false)
cont(f, 1, false)
cont(f, 1, false)
cont(f, 1, false)
cont(f, 1, true)
cont(f, 1, false)
},
},
}
for i, tt := range tests {
buf := new(bytes.Buffer)
f := NewFramer(buf, buf)
f.AllowIllegalWrites = true
tt.w(f)
f.WriteData(1, true, nil) // to test transition away from last step
var err error
n := 0
var log bytes.Buffer
for {
var got Frame
got, err = f.ReadFrame()
fmt.Fprintf(&log, " read %v, %v\n", got, err)
if err != nil {
break
}
n++
}
if err == io.EOF {
err = nil
}
ok := tt.wantErr == ""
if ok && err != nil {
t.Errorf("%d. after %d good frames, ReadFrame = %v; want success\n%s", i, n, err, log.Bytes())
continue
}
if !ok && err != ConnectionError(ErrCodeProtocol) {
t.Errorf("%d. after %d good frames, ReadFrame = %v; want ConnectionError(ErrCodeProtocol)\n%s", i, n, err, log.Bytes())
continue
}
if f.errReason != tt.wantErr {
t.Errorf("%d. framer eror = %q; want %q\n%s", i, f.errReason, tt.wantErr, log.Bytes())
}
if n < tt.atLeast {
t.Errorf("%d. framer only read %d frames; want at least %d\n%s", i, n, tt.atLeast, log.Bytes())
}
}
}

View File

@@ -1,6 +1,9 @@
// Copyright 2014 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// See https://code.google.com/p/go/source/browse/CONTRIBUTORS
// Licensed under the same terms as Go itself:
// https://code.google.com/p/go/source/browse/LICENSE
// Defensive debug-only utility to track that functions run on the
// goroutine that they're supposed to.
@@ -11,20 +14,16 @@ import (
"bytes"
"errors"
"fmt"
"os"
"runtime"
"strconv"
"sync"
)
var DebugGoroutines = os.Getenv("DEBUG_HTTP2_GOROUTINES") == "1"
var DebugGoroutines = false
type goroutineLock uint64
func newGoroutineLock() goroutineLock {
if !DebugGoroutines {
return 0
}
return goroutineLock(curGoroutineID())
}

View File

@@ -1,6 +1,9 @@
// Copyright 2014 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.
// See https://code.google.com/p/go/source/browse/CONTRIBUTORS
// Licensed under the same terms as Go itself:
// https://code.google.com/p/go/source/browse/LICENSE
package http2
@@ -11,10 +14,7 @@ import (
)
func TestGoroutineLock(t *testing.T) {
oldDebug := DebugGoroutines
DebugGoroutines = true
defer func() { DebugGoroutines = oldDebug }()
g := newGoroutineLock()
g.check()

Some files were not shown because too many files have changed in this diff Show More