# Metrics
**NOTE: The metrics feature is considered experimental. We may add/change/remove metrics without warning in future releases.**
etcd uses [Prometheus](http://prometheus.io/) for metrics reporting in the server. The metrics can be used for real-time monitoring and debugging.
The simplest way to see the available metrics is to cURL the metrics endpoint `/metrics` of etcd. The format is described [here](http://prometheus.io/docs/instrumenting/exposition_formats/).
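For example, assuming a member serving the default client URL (adjust the address to your `--listen-client-urls` setting), `curl -L http://localhost:2379/metrics` prints every currently exported metric in that format.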
You can also follow the doc [here](http://prometheus.io/docs/introduction/getting_started/) to start a Prometheus server and monitor etcd metrics.
The naming of metrics follows the suggested [best practice of Prometheus](http://prometheus.io/docs/practices/naming/). A metric name has an `etcd` prefix as its namespace and a subsystem prefix (for example `wal` and `etcdserver`).
etcd now exposes the following metrics:
## etcdserver
| Name | Description | Type |
|-----------------------------------------|--------------------------------------------------|-----------|
| file_descriptors_used_total | The total number of file descriptors used | Gauge |
| proposal_durations_seconds | The latency distributions of committing proposal | Histogram |
| pending_proposal_total | The total number of pending proposals | Gauge |
| proposal_failed_total | The total number of failed proposals | Counter |
High file descriptor usage (`file_descriptors_used_total` approaching the file descriptor limit of the process) indicates a potential file descriptor exhaustion issue, which might cause etcd to fail to create new WAL files and panic.
[Proposal](glossary.md#proposal) durations (`proposal_durations_seconds`) give you a histogram about the proposal commit latency. Latency can be introduced into this process by network and disk IO.
The pending proposal count (`pending_proposal_total`) gives you an idea of how many proposals are queued and waiting to be committed. An increasing pending number indicates a high client load or an unstable cluster.
Failed proposals (`proposal_failed_total`) are normally related to two issues: temporary failures related to a leader election or longer duration downtime caused by a loss of quorum in the cluster.
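Example Prometheus queries that may be useful from these metrics (a sketch only: the fully qualified names below assume an `etcd_server_` prefix following the naming convention above, and a `job="etcd"` scrape label; check the `/metrics` output of a member for the exact names):
* `histogram_quantile(0.9, sum(rate(etcd_server_proposal_durations_seconds_bucket{job="etcd"}[5m])) by (le))`
Shows the 90th percentile proposal commit latency (in seconds) across all members, with a window of `5m`.
* `sum(rate(etcd_server_proposal_failed_total{job="etcd"}[1m]))`
Shows the cluster-wide rate of failed proposals across a time window of `1m`.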
## wal
| Name | Description | Type |
|------------------------------------|--------------------------------------------------|-----------|
| fsync_durations_seconds | The latency distributions of fsync called by wal | Histogram |
| last_index_saved | The index of the last entry saved by wal | Gauge |
Abnormally high fsync duration (`fsync_durations_seconds`) indicates disk issues and might cause the cluster to be unstable.
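An example query (a sketch: it assumes the fully qualified name `etcd_wal_fsync_durations_seconds` per the naming convention above and a `job="etcd"` scrape label): `histogram_quantile(0.99, sum(rate(etcd_wal_fsync_durations_seconds_bucket{job="etcd"}[5m])) by (le))` shows the 99th percentile fsync latency (in seconds) across all members, with a window of `5m`.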
## http requests
These metrics describe the serving of requests (non-watch events) served by etcd members in non-proxy mode: total incoming requests, request failures, and processing latency (including raft rounds for storage). They are useful for tracking user-generated traffic hitting the etcd cluster.
All these metrics are prefixed with `etcd_http_`.
| Name | Description | Type |
|--------------------------------|-----------------------------------------------------------------------------------------|--------------------|
| received_total | Total number of events after parsing and auth. | Counter(method) |
| failed_total | Total number of failed events.   | Counter(method,error) |
| successful_duration_second | Bucketed handling times of the requests, including raft rounds for writes. | Histogram(method) |
Example Prometheus queries that may be useful from these metrics (across all etcd members):
* `sum(rate(etcd_http_failed_total{job="etcd"}[1m])) by (method) / sum(rate(etcd_http_received_total{job="etcd"}[1m])) by (method)`
Shows the fraction of events that failed by HTTP method across all members, across a time window of `1m`.
* `sum(rate(etcd_http_received_total{job="etcd",method="GET"}[1m])) by (method)`
`sum(rate(etcd_http_received_total{job="etcd",method!="GET"}[1m])) by (method)`
Shows the rate of successful read-only/write queries across all servers, across a time window of `1m`.
* `histogram_quantile(0.9, sum(increase(etcd_http_successful_duration_second_bucket{job="etcd",method="GET"}[5m])) by (le))`
`histogram_quantile(0.9, sum(increase(etcd_http_successful_duration_second_bucket{job="etcd",method!="GET"}[5m])) by (le))`
Shows the 90th percentile latency (in seconds) of read/write (respectively) event handling across all members, with a window of `5m`.
## snapshot
| Name | Description | Type |
|--------------------------------------------|------------------------------------------------------------|-----------|
| snapshot_save_total_durations_seconds | The total latency distributions of save called by snapshot | Histogram |
Abnormally high snapshot duration (`snapshot_save_total_durations_seconds`) indicates disk issues and might cause the cluster to be unstable.
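An example query (a sketch: it assumes the fully qualified name `etcd_snapshot_save_total_durations_seconds`, i.e. only the `etcd` namespace prefix in front of the name above, and a `job="etcd"` scrape label): `histogram_quantile(0.9, sum(rate(etcd_snapshot_save_total_durations_seconds_bucket{job="etcd"}[5m])) by (le))` shows the 90th percentile snapshot save latency (in seconds) across all members, with a window of `5m`.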
## rafthttp
| Name | Description | Type | Labels |
|-----------------------------------|--------------------------------------------|--------------|--------------------------------|
| message_sent_latency_seconds | The latency distributions of messages sent | HistogramVec | sendingType, msgType, remoteID |
| message_sent_failed_total | The total number of failed messages sent | Summary | sendingType, msgType, remoteID |
Abnormally high message duration (`message_sent_latency_seconds`) indicates network issues and might cause the cluster to be unstable.
An increase in message failures (`message_sent_failed_total`) indicates more severe network issues and might cause the cluster to be unstable.
Label `sendingType` is the connection type used to send messages. `message`, `msgapp` and `msgappv2` use HTTP streaming, while `pipeline` issues an HTTP request for each message.
Label `msgType` is the type of raft message. `MsgApp` is the log replication message; `MsgSnap` is the snapshot install message; `MsgProp` is the proposal forwarding message; the others are used to maintain internal raft status. If you have a large snapshot, you would expect long `MsgSnap` sending latency. For other types of messages, you would expect low latency, comparable to your ping latency if you have enough network bandwidth.
Label `remoteID` is the member ID of the message destination.
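An example query (a sketch: it assumes the fully qualified name `etcd_rafthttp_message_sent_latency_seconds` per the naming convention above and a `job="etcd"` scrape label): `histogram_quantile(0.9, sum(rate(etcd_rafthttp_message_sent_latency_seconds_bucket{job="etcd"}[5m])) by (le, msgType))` shows the 90th percentile send latency (in seconds) per raft message type across all members, with a window of `5m`.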
## proxy
etcd members operating in proxy mode do not perform store operations; they forward all requests to cluster members.
Tracking the rate of requests coming from a proxy allows one to pin down which machine is performing most reads/writes.
All these metrics are prefixed with `etcd_proxy_`.
| Name | Description | Type |
|---------------------------|-----------------------------------------------------------------------------------------|--------------------|
| requests_total            | Total number of requests by this proxy instance.                                         | Counter(method) |
| handled_total | Total number of fully handled requests, with responses from etcd members. | Counter(method) |
| dropped_total | Total number of dropped requests due to forwarding errors to etcd members.  | Counter(method,error) |
| handling_duration_seconds | Bucketed handling times by HTTP method, including round trip to member instances. | Histogram(method) |
Example Prometheus queries that may be useful from these metrics (across all etcd servers):
* `sum(rate(etcd_proxy_handled_total{job="etcd"}[1m])) by (method)`
Rate of requests (by HTTP method) handled by all proxies, across a window of `1m`.
* `histogram_quantile(0.9, sum(increase(etcd_proxy_handling_duration_seconds_bucket{job="etcd",method="GET"}[5m])) by (le))`
`histogram_quantile(0.9, sum(increase(etcd_proxy_handling_duration_seconds_bucket{job="etcd",method!="GET"}[5m])) by (le))`
Shows the 90th percentile latency (in seconds) of handling user requests across all proxy machines, with a window of `5m`.
* `sum(rate(etcd_proxy_dropped_total{job="etcd"}[1m])) by (proxying_error)`
Rate of failed requests on the proxy. This should be 0; spikes here indicate connectivity issues to the etcd cluster.