# Metrics

**NOTE: The metrics feature is considered experimental. We may add/change/remove metrics without warning in future releases.**

etcd uses [Prometheus](http://prometheus.io/) for metrics reporting in the server. The metrics can be used for real-time monitoring and debugging.

The simplest way to see the available metrics is to cURL the metrics endpoint `/metrics` of etcd. The format is described [here](http://prometheus.io/docs/instrumenting/exposition_formats/).

You can also follow the doc [here](http://prometheus.io/docs/introduction/getting_started/) to start a Prometheus server and monitor etcd metrics.

The naming of metrics follows the suggested [best practice of Prometheus](http://prometheus.io/docs/practices/naming/). A metric name has an `etcd` prefix as its namespace and a subsystem prefix (for example `wal` and `etcdserver`).

etcd now exposes the following metrics:

## etcdserver

| Name                        | Description                                       | Type      |
|-----------------------------|---------------------------------------------------|-----------|
| file_descriptors_used_total | The total number of file descriptors used         | Gauge     |
| proposal_durations_seconds  | The latency distributions of committing proposals | Histogram |
| pending_proposal_total      | The total number of pending proposals             | Gauge     |
| proposal_failed_total       | The total number of failed proposals              | Counter   |

High file descriptor usage (`file_descriptors_used_total` close to the process's file descriptor limit) indicates a potential file descriptor exhaustion issue. If the file descriptors are exhausted, etcd may fail to create new WAL files and panic.

[Proposal](glossary.md#proposal) durations (`proposal_durations_seconds`) provide a histogram of proposal commit latency. Latency can be introduced into this process by network and disk IO.

Pending proposals (`pending_proposal_total`) indicate how many proposals are queued and waiting for commit. An increasing pending count indicates a high client load or an unstable cluster.

Failed proposals (`proposal_failed_total`) are normally related to two issues: temporary failures related to a leader election, or longer downtime caused by a loss of quorum in the cluster.

## wal

| Name                    | Description                                      | Type      |
|-------------------------|--------------------------------------------------|-----------|
| fsync_durations_seconds | The latency distributions of fsync called by wal | Histogram |
| last_index_saved        | The index of the last entry saved by wal         | Gauge     |

Abnormally high fsync duration (`fsync_durations_seconds`) indicates disk issues and might cause the cluster to be unstable.

## http requests

These metrics describe the serving of requests (non-watch events) by etcd members in non-proxy mode: total incoming requests, request failures, and processing latency (including raft rounds for storage). They are useful for tracking user-generated traffic hitting the etcd cluster.

All these metrics are prefixed with `etcd_http_`.

| Name                       | Description                                                                 | Type                  |
|----------------------------|-----------------------------------------------------------------------------|-----------------------|
| received_total             | Total number of events after parsing and auth.                              | Counter(method)       |
| failed_total               | Total number of failed events.                                              | Counter(method,error) |
| successful_duration_second | Bucketed handling times of the requests, including raft rounds for writes.  | Histogram(method)     |

Example Prometheus queries that may be useful from these metrics (across all etcd members):

* `sum(rate(etcd_http_failed_total{job="etcd"}[1m])) by (method) / sum(rate(etcd_http_received_total{job="etcd"}[1m])) by (method)`

    Shows the fraction of events that failed by HTTP method across all members, across a time window of `1m`.

* `sum(rate(etcd_http_received_total{job="etcd",method="GET"}[1m])) by (method)`
  `sum(rate(etcd_http_received_total{job="etcd",method!="GET"}[1m])) by (method)`

    Shows the rate of successful readonly/write queries across all servers, across a time window of `1m`.

* `histogram_quantile(0.9, sum(increase(etcd_http_successful_processing_seconds{job="etcd",method="GET"}[5m])) by (le))`
  `histogram_quantile(0.9, sum(increase(etcd_http_successful_processing_seconds{job="etcd",method!="GET"}[5m])) by (le))`

    Show the 0.90-tile latency (in seconds) of read/write (respectively) event handling across all members, with a window of `5m`.
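The queries above can also be evaluated programmatically through the Prometheus HTTP API. The sketch below is a minimal example in Go, assuming a Prometheus server reachable at `localhost:9090` that scrapes the etcd members under the job label `etcd`; both are assumptions to adjust for your deployment.

```go
// Minimal sketch: run one of the example queries against the Prometheus HTTP
// API and print the raw JSON result. The Prometheus address (localhost:9090)
// and the job label value ("etcd") are assumptions about your setup.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func main() {
	query := `sum(rate(etcd_http_failed_total{job="etcd"}[1m])) by (method)`

	u := url.URL{
		Scheme:   "http",
		Host:     "localhost:9090", // assumed Prometheus address
		Path:     "/api/v1/query",
		RawQuery: url.Values{"query": {query}}.Encode(),
	}

	resp, err := http.Get(u.String())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The response is JSON containing one sample per HTTP method label.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```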
## snapshot

| Name                                  | Description                                                | Type      |
|---------------------------------------|------------------------------------------------------------|-----------|
| snapshot_save_total_durations_seconds | The total latency distributions of save called by snapshot | Histogram |

Abnormally high snapshot duration (`snapshot_save_total_durations_seconds`) indicates disk issues and might cause the cluster to be unstable.

## rafthttp

| Name                         | Description                                | Type         | Labels                         |
|------------------------------|--------------------------------------------|--------------|--------------------------------|
| message_sent_latency_seconds | The latency distributions of messages sent | HistogramVec | sendingType, msgType, remoteID |
| message_sent_failed_total    | The total number of failed messages sent   | Summary      | sendingType, msgType, remoteID |

Abnormally high message duration (`message_sent_latency_seconds`) indicates network issues and might cause the cluster to be unstable.

An increase in message failures (`message_sent_failed_total`) indicates more severe network issues and might cause the cluster to be unstable.

Label `sendingType` is the connection type used to send messages. `message`, `msgapp` and `msgappv2` use HTTP streaming, while `pipeline` does an HTTP request for each message.

Label `msgType` is the type of raft message. `MsgApp` is the log replication message; `MsgSnap` is the snapshot install message; `MsgProp` is the proposal forward message; the others are used to maintain raft internal status. If you have a large snapshot, you would expect a long `MsgSnap` sending latency. For other types of messages, you would expect low latency, comparable to your ping latency if you have enough network bandwidth.

Label `remoteID` is the member ID of the message destination.
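To spot a problematic peer without going through Prometheus, you can read the raw exposition format directly from a member, as mentioned in the introduction. The sketch below assumes a member serving `/metrics` at `http://127.0.0.1:2379` and assumes the fully qualified name of the failure counter is `etcd_rafthttp_message_sent_failed_total`; confirm both against your own `/metrics` output.

```go
// Minimal sketch: cURL-style read of one member's /metrics endpoint, printing
// the per-peer send-failure samples. The member URL and the fully prefixed
// metric name are assumptions; check them against your /metrics output.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:2379/metrics") // assumed member client URL
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Comment lines ("# HELP", "# TYPE") never match the metric-name
		// prefix, so only samples (one per label combination) are printed.
		if strings.HasPrefix(line, "etcd_rafthttp_message_sent_failed_total") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```

Each printed sample carries the `sendingType`, `msgType` and `remoteID` labels, so a counter that keeps growing for a single `remoteID` points at the peer with the network problem.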
## proxy

etcd members operating in proxy mode do not perform store operations. They forward all requests to cluster instances.

Tracking the rate of requests coming from a proxy allows one to pin down which machine is performing most reads/writes.

All these metrics are prefixed with `etcd_proxy_`.

| Name                      | Description                                                                        | Type                  |
|---------------------------|------------------------------------------------------------------------------------|-----------------------|
| requests_total            | Total number of requests by this proxy instance.                                   | Counter(method)       |
| handled_total             | Total number of fully handled requests, with responses from etcd members.          | Counter(method)       |
| dropped_total             | Total number of dropped requests due to forwarding errors to etcd members.         | Counter(method,error) |
| handling_duration_seconds | Bucketed handling times by HTTP method, including round trip to member instances.  | Histogram(method)     |

Example Prometheus queries that may be useful from these metrics (across all etcd servers):

* `sum(rate(etcd_proxy_handled_total{job="etcd"}[1m])) by (method)`

    Rate of requests (by HTTP method) handled by all proxies, across a window of `1m`.

* `histogram_quantile(0.9, sum(increase(etcd_proxy_events_handling_time_seconds_bucket{job="etcd",method="GET"}[5m])) by (le))`
  `histogram_quantile(0.9, sum(increase(etcd_proxy_events_handling_time_seconds_bucket{job="etcd",method!="GET"}[5m])) by (le))`

    Show the 0.90-tile latency (in seconds) of handling of user requests across all proxy machines, with a window of `5m`.

* `sum(rate(etcd_proxy_dropped_total{job="etcd"}[1m])) by (proxying_error)`

    Rate of failed requests on the proxies. This should be 0; spikes here indicate connectivity issues to the etcd cluster.
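Because the dropped-request rate should normally be 0, it is a natural candidate for a simple programmatic check. The sketch below runs the last query above through the Prometheus HTTP API and warns on any non-zero rate; the Prometheus address and the `job="etcd"` label are assumptions about your setup.

```go
// Minimal sketch: evaluate the dropped-requests query and warn when any
// series is non-zero. Prometheus address and job label are assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strconv"
)

// Just enough of the Prometheus HTTP API response to read vector samples.
type queryResponse struct {
	Data struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			Value  []interface{}     `json:"value"` // [timestamp, "value"]
		} `json:"result"`
	} `json:"data"`
}

func main() {
	query := `sum(rate(etcd_proxy_dropped_total{job="etcd"}[1m])) by (proxying_error)`
	resp, err := http.Get("http://localhost:9090/api/v1/query?query=" + url.QueryEscape(query))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var qr queryResponse
	if err := json.NewDecoder(resp.Body).Decode(&qr); err != nil {
		panic(err)
	}
	for _, r := range qr.Data.Result {
		if len(r.Value) != 2 {
			continue
		}
		s, _ := r.Value[1].(string)
		rate, _ := strconv.ParseFloat(s, 64)
		if rate > 0 {
			// Non-zero drop rate: the proxy is failing to forward requests.
			fmt.Printf("proxy dropping requests (%s): %.3f/s\n", r.Metric["proxying_error"], rate)
		}
	}
}
```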