From 13b3c64c10df0ba6b54a99406eee090eedd4e0ed Mon Sep 17 00:00:00 2001
From: Xiang Li
Date: Fri, 4 Sep 2015 13:48:00 -0700
Subject: [PATCH] doc: add monitoring section to admin doc

---
 Documentation/admin_guide.md | 68 ++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/Documentation/admin_guide.md b/Documentation/admin_guide.md
index cea3d3c54..26f9eadf0 100644
--- a/Documentation/admin_guide.md
+++ b/Documentation/admin_guide.md
@@ -39,6 +39,74 @@ If `--wal-dir` flag is set, etcd will write the write ahead log files to the spe
 If you are spinning up multiple clusters for testing it is recommended that you specify a unique initial-cluster-token for the different clusters. This can protect you from cluster corruption in case of mis-configuration because two members started with different cluster tokens will refuse members from each other.
 
+#### Monitoring
+
+It is important to monitor your production etcd cluster for health information and runtime metrics.
+
+##### Health Monitoring
+
+At the lowest level, etcd exposes health information via HTTP at `/health` in JSON format. If it returns `{"health": "true"}`, then the cluster is healthy. Please note that the `/health` endpoint is still experimental as of etcd 2.2.
+
+```
+$ curl -L http://127.0.0.1:2379/health
+
+{"health": "true"}
+```
+
+You can also use etcdctl to check cluster-wide health information. It will contact all the members of the cluster and collect the health information for you.
+
+```
+$ ./etcdctl cluster-health
+member 8211f1d0f64f3269 is healthy: got healthy result from http://127.0.0.1:12379
+member 91bc3c398fb3c146 is healthy: got healthy result from http://127.0.0.1:22379
+member fd422379fda50e48 is healthy: got healthy result from http://127.0.0.1:32379
+cluster is healthy
+```
+
+##### Runtime Metrics
+
+etcd uses [Prometheus](http://prometheus.io/) for metrics reporting in the server. You can read more in the runtime metrics [doc](metrics.md).
+
+#### Debugging
+
+Debugging a distributed system can be difficult. etcd provides several ways to make debugging easier.
+
+##### Enabling Debug Logging
+
+When you want to debug etcd without stopping it, you can enable debug logging at runtime. etcd exposes its logging configuration at `/config/local/log`.
+
+```
+$ curl http://127.0.0.1:2379/config/local/log -XPUT -d '{"Level":"DEBUG"}'
+$ # debug logging enabled
+$
+$ curl http://127.0.0.1:2379/config/local/log -XPUT -d '{"Level":"INFO"}'
+$ # debug logging disabled
+```
+
+##### Debugging Variables
+
+Debug variables are exposed for real-time debugging purposes. Developers who are familiar with etcd can utilize these variables to debug unexpected behavior. etcd exposes debug variables via HTTP at `/debug/vars` in JSON format. The debug variables include `cmdline`, `file_descriptor_limit`, `memstats`, and `raft.status`.
+
+`cmdline` contains the command line arguments passed to etcd.
+
+`file_descriptor_limit` is the maximum number of file descriptors etcd can use.
+
+`memstats` is well explained [here](http://golang.org/pkg/runtime/#MemStats).
+
+`raft.status` is useful for debugging low-level raft issues if you are familiar with raft internals. In most cases, you do not need to check `raft.status`.
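+
+You can fetch all of the debug variables with a single request, in the same way as the health and logging examples above. This is a quick illustration; the address assumes the default client URL used throughout this guide:
+
+```
+$ curl -L http://127.0.0.1:2379/debug/vars
+```
+
+The response will look similar to the following: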
+
+```json
+{
+"cmdline": ["./etcd"],
+"file_descriptor_limit": 0,
+"memstats": {"Alloc":4105744,"TotalAlloc":42337320,"Sys":12560632,"...":"..."},
+"raft.status": {"id":"ce2a822cea30bfca","term":5,"vote":"ce2a822cea30bfca","commit":23509,"lead":"ce2a822cea30bfca","raftState":"StateLeader","progress":{"ce2a822cea30bfca":{"match":23509,"next":23510,"state":"ProgressStateProbe"}}}
+}
+```
+
 #### Optimal Cluster Size
 
 The recommended etcd cluster size is 3, 5, or 7, chosen according to the fault tolerance required. A 7-member cluster can provide enough fault tolerance in most cases. While a larger cluster provides better fault tolerance, write performance suffers since data needs to be replicated to more machines.
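+
+To see why odd sizes are recommended, it can help to work through the quorum arithmetic: a cluster of N members needs a majority (N/2 + 1, using integer division) to commit writes, so it can tolerate the loss of N minus that majority. The snippet below is only an illustrative shell calculation, not an etcd command:
+
+```
+$ for n in 1 2 3 4 5 6 7; do q=$(( n / 2 + 1 )); echo "size=$n quorum=$q tolerates=$(( n - q ))"; done
+size=1 quorum=1 tolerates=0
+size=2 quorum=2 tolerates=0
+size=3 quorum=2 tolerates=1
+size=4 quorum=3 tolerates=1
+size=5 quorum=3 tolerates=2
+size=6 quorum=4 tolerates=2
+size=7 quorum=4 tolerates=3
+```
+
+Note that an even-sized cluster tolerates no more failures than the next smaller odd-sized one, which is why 3, 5, and 7 are the recommended sizes.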