Metrics
etcd uses Prometheus for metrics reporting. The metrics can be used for real-time monitoring and debugging. etcd does not persist its metrics; if a member restarts, the metrics will be reset.
The simplest way to see the available metrics is to cURL the metrics endpoint /metrics. The response uses the Prometheus text-based exposition format.
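As a rough illustration of reading that endpoint programmatically, the sketch below fetches the metrics text and parses the label-free samples. The URL `http://localhost:2379/metrics` is an assumption (a local member on the default client port), and `fetch_metrics` and `parse_simple_metrics` are hypothetical helper names, not part of etcd.

```python
import urllib.request


def fetch_metrics(url="http://localhost:2379/metrics"):
    """Fetch the raw metrics text. The default URL is an assumed local endpoint."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def parse_simple_metrics(text):
    """Parse label-free samples ('name value') and skip everything else."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, '# HELP' / '# TYPE' comments, and labelled samples.
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            samples[parts[0]] = float(parts[1])
    return samples


# Hand-written sample in the text exposition format (illustrative values).
sample_text = """\
# HELP etcd_server_has_leader Whether or not a leader exists. 1 is existence, 0 is not.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
# TYPE etcd_server_leader_changes_seen_total counter
etcd_server_leader_changes_seen_total 3
"""
parsed = parse_simple_metrics(sample_text)
```

Against a running member, `parse_simple_metrics(fetch_metrics())` would return the same kind of dictionary. Labelled series and histogram buckets are deliberately ignored here to keep the parser short.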
Follow the Prometheus getting started doc to spin up a Prometheus server to collect etcd metrics.
The naming of metrics follows the suggested Prometheus best practices. A metric name has an etcd or etcd_debugging prefix as its namespace and a subsystem prefix (for example wal and etcdserver).
etcd namespace metrics
The metrics under the etcd prefix are for monitoring and alerting. They are stable, high-level metrics. Any change to these metrics will be included in the release notes.
Metrics that are etcd2 related are documented in the v2 metrics guide.
server
These metrics describe the status of the etcd server. In order to detect outages or problems for troubleshooting, the server metrics of every production etcd cluster should be closely monitored.
All these metrics are prefixed with etcd_server_.
Name | Description | Type |
---|---|---|
has_leader | Whether or not a leader exists. 1 is existence, 0 is not. | Gauge |
leader_changes_seen_total | The number of leader changes seen. | Counter |
proposals_committed_total | The total number of consensus proposals committed. | Gauge |
has_leader indicates whether the member has a leader. If a member does not have a leader, it is totally unavailable. If all the members in the cluster do not have any leader, the entire cluster is totally unavailable.
leader_changes_seen_total counts the number of leader changes the member has seen since its start. Rapid leadership changes impact the performance of etcd significantly. It also signals that the leader is unstable, perhaps due to network connectivity issues or excessive load hitting the etcd cluster.
proposals_committed_total records the total number of consensus proposals committed. This gauge should increase over time if the cluster is healthy. Several healthy members of an etcd cluster may have different total committed proposals at once. This discrepancy may be due to recovering from peers after starting, lagging behind the leader, or being the leader and therefore having the most commits. It is important to monitor this metric across all the members in the cluster; a consistently large lag between a single member and its leader indicates that member is slow or unhealthy.
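The cross-member comparison of proposals_committed_total can be sketched as follows. This is a minimal illustration, not an etcd or Prometheus API: the member names, the sample values, and the `max_lag` threshold of 1000 are all assumptions chosen for the example.

```python
def commit_lag(committed_by_member):
    """Return each member's lag behind the highest committed-proposal count."""
    highest = max(committed_by_member.values())
    return {member: highest - value for member, value in committed_by_member.items()}


def lagging_members(committed_by_member, max_lag=1000):
    """List members whose lag exceeds max_lag (an assumed, illustrative threshold)."""
    lag = commit_lag(committed_by_member)
    return sorted(member for member, value in lag.items() if value > max_lag)


# Hypothetical readings of etcd_server_proposals_committed_total, one per member.
sample = {"infra0": 120500, "infra1": 120498, "infra2": 98000}
lag = commit_lag(sample)
slow = lagging_members(sample)
```

A single reading only shows a momentary gap. To match the "consistently large lag" condition, the same check would need to hold over several consecutive scrapes before a member is treated as slow or unhealthy.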
disk
These metrics describe the status of the disk operations.
All these metrics are prefixed with etcd_disk_.
Name | Description | Type |
---|---|---|
wal_fsync_duration_seconds | The latency distributions of fsync called by wal | Histogram |
backend_commit_duration_seconds | The latency distributions of commit called by backend. | Histogram |
A wal_fsync is called when etcd persists its log entries to disk before applying them.
A backend_commit is called when etcd commits an incremental snapshot of its most recent changes to disk.
High disk operation latencies (wal_fsync_duration_seconds or backend_commit_duration_seconds) often indicate disk issues. They may cause high request latency or make the cluster unstable.
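To make the histogram type concrete, the sketch below estimates a latency quantile from cumulative buckets, using the same linear-interpolation idea as the Prometheus `histogram_quantile` function. It is a simplified reimplementation for illustration only, and the bucket bounds and counts are invented sample data rather than real etcd output.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket,
    mirroring the `le` label on Prometheus `_bucket` series.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Quantile falls in the +Inf bucket (or an empty one):
                # report the last finite bound seen so far.
                return prev_bound
            # Interpolate linearly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented cumulative counts for a wal_fsync-style latency histogram (seconds).
sample_buckets = [(0.001, 0), (0.002, 50), (0.004, 90), (0.008, 99), (math.inf, 100)]
median = histogram_quantile(0.5, sample_buckets)
```

The estimate is only as precise as the bucket layout allows: every observation inside a bucket is assumed to be spread evenly between its bounds.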
network
These metrics describe the status of the network.
All these metrics are prefixed with etcd_network_.
Name | Description | Type |
---|---|---|
sent_bytes_total | The total number of bytes sent to the member with ID To. | Counter(To) |
received_bytes_total | The total number of bytes received from the member with ID From. | Counter(From) |
round_trip_time_seconds | Round-Trip-Time histogram between members. | Histogram(To) |
sent_bytes_total counts the total number of bytes sent to a specific member. Usually the leader member sends more data than other members since it is responsible for transmitting replicated data.
received_bytes_total counts the total number of bytes received from a specific member. Usually follower members receive data only from the leader member.
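Because these are counters, the useful signal is usually their rate of change rather than the raw value. The sketch below derives a per-second rate from two samples and tolerates a counter reset, which matters here because etcd metrics start from zero again when a member restarts. `counter_rate` is a hypothetical helper and the numbers in the usage lines are invented.

```python
def counter_rate(prev_value, curr_value, elapsed_seconds):
    """Per-second rate between two counter samples, tolerating a reset."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if curr_value < prev_value:
        # The counter restarted from zero (for example after a member restart),
        # so only the growth since the reset is known.
        delta = curr_value
    else:
        delta = curr_value - prev_value
    return delta / elapsed_seconds


# Two invented scrapes of a sent_bytes_total series, 30 seconds apart.
steady = counter_rate(1000, 4000, 30)
# The same calculation when the member restarted between scrapes.
after_restart = counter_rate(5000, 600, 60)
```

The reset case underestimates the true rate, since bytes sent between the last scrape and the restart are lost. The Prometheus `rate()` function applies a similar reset adjustment on the server side.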
gRPC requests
These metrics describe the requests served by a specific etcd member: total received requests, total failed requests, and processing latency. They are useful for tracking user-generated traffic hitting the etcd cluster.
All these metrics are prefixed with etcd_grpc_.
Name | Description | Type |
---|---|---|
requests_total | Total number of received requests | Counter(method) |
requests_failed_total | Total number of failed requests. | Counter(method,error) |
unary_requests_duration_seconds | Bucketed handling duration of the requests. | Histogram(method) |
Example Prometheus queries that may be useful from these metrics (across all etcd members):
- sum(rate(etcd_grpc_requests_failed_total{job="etcd"}[1m])) by (grpc_method) / sum(rate(etcd_grpc_requests_total{job="etcd"}[1m])) by (grpc_method)
  Shows the fraction of requests that failed, by gRPC method, across all members over a time window of 1m.
- sum(rate(etcd_grpc_requests_total{job="etcd",grpc_method="PUT"}[1m])) by (grpc_method)
  Shows the rate of PUT requests across all members over a time window of 1m.
- histogram_quantile(0.9, sum(rate(etcd_grpc_unary_requests_duration_seconds_bucket{job="etcd",grpc_method="PUT"}[5m])) by (le))
  Shows the 90th-percentile latency (in seconds) of PUT request handling across all members over a time window of 5m.
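Queries like these can also be issued from a script through the Prometheus HTTP API instant-query endpoint, `/api/v1/query`. The sketch below is a minimal example under stated assumptions: the server address `http://localhost:9090` is assumed, and `build_query_url` and `run_query` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# PUT request rate across all members, written with balanced parentheses.
PUT_RATE = (
    'sum(rate(etcd_grpc_requests_total{job="etcd",grpc_method="PUT"}[1m]))'
    " by (grpc_method)"
)


def build_query_url(expr, base="http://localhost:9090"):
    """Build an instant-query URL. The default base address is an assumption."""
    query = urllib.parse.urlencode({"query": expr})
    return base.rstrip("/") + "/api/v1/query?" + query


def run_query(expr, base="http://localhost:9090"):
    """Run the query and return the 'result' list from the JSON response."""
    with urllib.request.urlopen(build_query_url(expr, base), timeout=10) as resp:
        body = json.load(resp)
    return body["data"]["result"]


url = build_query_url(PUT_RATE)
```

Calling `run_query(PUT_RATE)` requires a reachable Prometheus server that scrapes etcd under the job name `etcd`; the URL construction alone works offline.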
etcd_debugging namespace metrics
The metrics under the etcd_debugging prefix are for debugging. They are very implementation dependent and volatile. They might be changed or removed without any warning in new etcd releases. Some of the metrics might be moved to the etcd prefix when they become more stable.
etcdserver
Name | Description | Type |
---|---|---|
proposal_duration_seconds | The latency distributions of committing proposal | Histogram |
proposals_pending | The current number of pending proposals | Gauge |
proposals_failed_total | The total number of failed proposals | Counter |
Proposal duration (proposal_duration_seconds) provides a proposal commit latency histogram. The reported latency reflects network and disk IO delays in etcd.
Proposals pending (proposals_pending) indicates how many proposals are queued for commit. A rising number of pending proposals suggests there is a high client load or the cluster is unstable.
Failed proposals (proposals_failed_total) are normally related to two issues: temporary failures related to a leader election or longer duration downtime caused by a loss of quorum in the cluster.
snapshot
Name | Description | Type |
---|---|---|
snapshot_save_total_duration_seconds | The total latency distributions of save called by snapshot | Histogram |
Abnormally high snapshot duration (snapshot_save_total_duration_seconds) indicates disk issues and might cause the cluster to be unstable.
Prometheus supplied metrics
The Prometheus client library provides a number of metrics under the go and process namespaces. There are a few that are particularly interesting.
Name | Description | Type |
---|---|---|
process_open_fds | Number of open file descriptors. | Gauge |
process_max_fds | Maximum number of open file descriptors. | Gauge |
Heavy file descriptor (process_open_fds) usage (i.e., near the process's file descriptor limit, process_max_fds) indicates a potential file descriptor exhaustion issue. If the file descriptors are exhausted, etcd may panic because it cannot create new WAL files.
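The comparison between the two gauges can be expressed as a simple ratio check. This is a sketch only: the 0.8 threshold is an assumed value for illustration, not an etcd recommendation, and the function names are hypothetical.

```python
def fd_usage_ratio(open_fds, max_fds):
    """Fraction of the file descriptor limit currently in use."""
    if max_fds <= 0:
        raise ValueError("max_fds must be positive")
    return open_fds / max_fds


def fd_exhaustion_risk(open_fds, max_fds, threshold=0.8):
    """True when usage reaches the threshold (0.8 is an assumed default)."""
    return fd_usage_ratio(open_fds, max_fds) >= threshold


# Invented readings of process_open_fds and process_max_fds.
ratio = fd_usage_ratio(512, 1024)
at_risk = fd_exhaustion_risk(900, 1024)
```

The inputs would come from the process_open_fds and process_max_fds samples of a single member, since each process has its own limit.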