Conversation

@tjungblu
Contributor

With the help of @dgoodwin we were able to identify better thresholds based on our fleet telemetry. Along with that, I wanted to contribute some minor improvements we have made to our alerts over the years.

Here's a summary by Claude:

Alert Severity Changes

  • etcdMembersDown: Increased severity from warning to critical (alerts/alerts.libsonnet:10)

Improved Alert Descriptions

  • etcdInsufficientMembers: Enhanced description with detailed troubleshooting guidance about control plane nodes, network connectivity, and the impact on Kubernetes APIs (alerts/alerts.libsonnet:20-21)

Alert Query Improvements

  • etcdHighNumberOfLeaderChanges: Rewrote query to use changes(etcd_server_is_leader) instead of increase(etcd_server_leader_changes_seen_total), changed time window from 15m to 10m (alerts/alerts.libsonnet:30)
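For concreteness, a hedged sketch of what the rendered Prometheus rule could look like after this change. The new expression is taken from the diff discussed below; the job matcher, `for` duration, and annotation text are illustrative assumptions, not the mixin's actual rendered output.

```yaml
# Sketch of the rewritten rule after Jsonnet templating; %(etcd_selector)s
# is filled in with a placeholder job matcher here.
- alert: etcdHighNumberOfLeaderChanges
  expr: |
    avg by (job) (changes(etcd_server_is_leader{job=~".*etcd.*"}[10m])) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": high number of leader changes within the last 10 minutes.'
```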

More Aggressive Disk Performance Thresholds

  • etcdHighFsyncDurations (warning): Lowered threshold from 0.5s to 0.05s (alerts/alerts.libsonnet:47)
  • etcdHighFsyncDurations (critical): Lowered threshold from 1s to 0.07s (alerts/alerts.libsonnet:56)
  • etcdHighCommitDurations (warning): Lowered threshold from 0.25s to 0.08s (alerts/alerts.libsonnet:65)
  • etcdHighCommitDurations (critical): Added new critical alert at 0.1s threshold (alerts/alerts.libsonnet:74-87)
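For reference, a hedged sketch of rendered rules with the new thresholds. etcd exposes fsync and commit latency as histograms (`etcd_disk_wal_fsync_duration_seconds` and `etcd_disk_backend_commit_duration_seconds`), so the `histogram_quantile` form below is the natural shape; the job matcher and rule layout are illustrative assumptions, while the 0.05s and 0.1s thresholds come from the summary above.

```yaml
# Sketch only: warning-tier fsync alert and the new critical commit alert,
# with placeholder selectors. Actual expressions live in alerts/alerts.libsonnet.
- alert: etcdHighFsyncDurations
  expr: |
    histogram_quantile(0.99,
      rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.05
  labels:
    severity: warning
- alert: etcdHighCommitDurations
  expr: |
    histogram_quantile(0.99,
      rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.1
  labels:
    severity: critical
```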

Database Quota Alerts

  • etcdDatabaseQuotaLowSpace: Added tiered alerts at 65% (info), 75% (warning), and lowered critical from 95% to 85% (alerts/alerts.libsonnet:89-121)
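The tiered quota alerts can be sketched as percentage-of-quota rules built from the database-size and backend-quota gauges. The metric names are real etcd metrics, but the exact expression shape and selector below are assumptions; only the 65/75/85% tiers come from the summary above.

```yaml
# Sketch of the tiered quota alerts: same expression, three thresholds.
- alert: etcdDatabaseQuotaLowSpace
  expr: |
    (last_over_time(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[5m])
     / last_over_time(etcd_server_quota_backend_bytes{job=~".*etcd.*"}[5m])) * 100 > 65
  labels:
    severity: info
- alert: etcdDatabaseQuotaLowSpace
  expr: |
    (last_over_time(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[5m])
     / last_over_time(etcd_server_quota_backend_bytes{job=~".*etcd.*"}[5m])) * 100 > 85
  labels:
    severity: critical
```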

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tjungblu
Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

@tjungblu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-coverage-report 3d537d1 link true /test pull-etcd-coverage-report

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@codecov

codecov bot commented Nov 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.07%. Comparing base (1fa64e5) to head (3d537d1).


see 72 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #20917      +/-   ##
==========================================
+ Coverage   66.22%   70.07%   +3.85%     
==========================================
  Files         422      406      -16     
  Lines       34839    34289     -550     
==========================================
+ Hits        23071    24029     +958     
+ Misses      10357     8867    -1490     
+ Partials     1411     1393      -18     

Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1fa64e5...3d537d1.


@ahrtr
Member

ahrtr commented Nov 14, 2025

  • etcdHighFsyncDurations (warning): Lowered threshold from 0.5s to 0.05s (alerts/alerts.libsonnet:47)
  • etcdHighFsyncDurations (critical): Lowered threshold from 1s to 0.07s (alerts/alerts.libsonnet:56)
  • etcdHighCommitDurations (warning): Lowered threshold from 0.25s to 0.08s (alerts/alerts.libsonnet:65)
  • etcdHighCommitDurations (critical): Added new critical alert at 0.1s threshold (alerts/alerts.libsonnet:74-87)

I am not sure whether this will increase alert noise with such low thresholds. What is the real-world effect in your production environment?

```diff
 alert: 'etcdHighNumberOfLeaderChanges',
 expr: |||
-  increase((max without (%(etcd_instance_labels)s) (etcd_server_leader_changes_seen_total{%(etcd_selector)s}) or 0*absent(etcd_server_leader_changes_seen_total{%(etcd_selector)s}))[15m:1m]) >= 4
+  avg by (job) (changes(etcd_server_is_leader{%(etcd_selector)s}[10m])) > 5
```

I think etcd_server_leader_changes_seen_total might be a little better; it is designed exactly for this alert.
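To make the trade-off concrete, the two candidate queries side by side (the job matcher is a placeholder; the expressions and thresholds are the pre- and post-change values from the diff above):

```
# Counter-based (current upstream): counts leader changes each member has observed
increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m]) >= 4

# Gauge-based (this PR): counts flips of each member's is-leader gauge
avg by (job) (changes(etcd_server_is_leader{job=~".*etcd.*"}[10m])) > 5
```

The counter records changes as seen by each member, so it keeps history across scrape gaps; the gauge-based form only catches leadership flips that are visible between scrapes, which is likely the reviewer's concern.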

