Conversation

@tjungblu
Contributor

With the help of @dgoodwin we were able to identify better thresholds based on our fleet telemetry. Along with that, I wanted to contribute some minor improvements we have made to our alerts over the years.

Here's a summary by Claude:

Alert Severity Changes

  • etcdMembersDown: Increased severity from warning to critical (alerts/alerts.libsonnet:10)

Improved Alert Descriptions

  • etcdInsufficientMembers: Enhanced description with detailed troubleshooting guidance about control plane nodes, network connectivity, and the impact on Kubernetes APIs (alerts/alerts.libsonnet:20-21)

Alert Query Improvements

  • etcdHighNumberOfLeaderChanges: Rewrote query to use changes(etcd_server_is_leader) instead of increase(etcd_server_leader_changes_seen_total), changed time window from 15m to 10m (alerts/alerts.libsonnet:30)
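For concreteness, a hedged sketch of what the rendered Prometheus rule could look like after this change. The new expression is taken from the diff discussed below; the job matcher, `for` duration, and annotation text are illustrative assumptions, not the mixin's actual rendered output.

```yaml
# Sketch of the rewritten rule after Jsonnet templating; %(etcd_selector)s
# is filled in with a placeholder job matcher here.
- alert: etcdHighNumberOfLeaderChanges
  expr: |
    avg by (job) (changes(etcd_server_is_leader{job=~".*etcd.*"}[10m])) > 5
  for: 5m
  labels:
    severity: warning
  annotations:
    description: 'etcd cluster "{{ $labels.job }}": high number of leader changes within the last 10 minutes.'
```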

More Aggressive Disk Performance Thresholds

  • etcdHighFsyncDurations (warning): Lowered threshold from 0.5s to 0.05s (alerts/alerts.libsonnet:47)
  • etcdHighFsyncDurations (critical): Lowered threshold from 1s to 0.07s (alerts/alerts.libsonnet:56)
  • etcdHighCommitDurations (warning): Lowered threshold from 0.25s to 0.08s (alerts/alerts.libsonnet:65)
  • etcdHighCommitDurations (critical): Added new critical alert at 0.1s threshold (alerts/alerts.libsonnet:74-87)
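For reference, a hedged sketch of rendered rules with the new thresholds. etcd exposes fsync and commit latency as histograms (`etcd_disk_wal_fsync_duration_seconds` and `etcd_disk_backend_commit_duration_seconds`), so the `histogram_quantile` form below is the natural shape; the job matcher and rule layout are illustrative assumptions, while the 0.05s and 0.1s thresholds come from the summary above.

```yaml
# Sketch only: warning-tier fsync alert and the new critical commit alert,
# with placeholder selectors. Actual expressions live in alerts/alerts.libsonnet.
- alert: etcdHighFsyncDurations
  expr: |
    histogram_quantile(0.99,
      rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.05
  labels:
    severity: warning
- alert: etcdHighCommitDurations
  expr: |
    histogram_quantile(0.99,
      rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m])) > 0.1
  labels:
    severity: critical
```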

Database Quota Alerts

  • etcdDatabaseQuotaLowSpace: Added tiered alerts at 65% (info), 75% (warning), and lowered critical from 95% to 85% (alerts/alerts.libsonnet:89-121)
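The tiered quota alerts can be sketched as percentage-of-quota rules built from the database-size and backend-quota gauges. The metric names are real etcd metrics, but the exact expression shape and selector below are assumptions; only the 65/75/85% tiers come from the summary above.

```yaml
# Sketch of the tiered quota alerts: same expression, three thresholds.
- alert: etcdDatabaseQuotaLowSpace
  expr: |
    (last_over_time(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[5m])
     / last_over_time(etcd_server_quota_backend_bytes{job=~".*etcd.*"}[5m])) * 100 > 65
  labels:
    severity: info
- alert: etcdDatabaseQuotaLowSpace
  expr: |
    (last_over_time(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[5m])
     / last_over_time(etcd_server_quota_backend_bytes{job=~".*etcd.*"}[5m])) * 100 > 85
  labels:
    severity: critical
```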

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tjungblu
Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

@tjungblu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-coverage-report 3d537d1 link true /test pull-etcd-coverage-report

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@codecov

codecov bot commented Nov 11, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.07%. Comparing base (1fa64e5) to head (3d537d1).


see 72 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #20917      +/-   ##
==========================================
+ Coverage   66.22%   70.07%   +3.85%     
==========================================
  Files         422      406      -16     
  Lines       34839    34289     -550     
==========================================
+ Hits        23071    24029     +958     
+ Misses      10357     8867    -1490     
+ Partials     1411     1393      -18     

Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1fa64e5...3d537d1.


@ahrtr
Member

ahrtr commented Nov 14, 2025

  • etcdHighFsyncDurations (warning): Lowered threshold from 0.5s to 0.05s (alerts/alerts.libsonnet:47)
  • etcdHighFsyncDurations (critical): Lowered threshold from 1s to 0.07s (alerts/alerts.libsonnet:56)
  • etcdHighCommitDurations (warning): Lowered threshold from 0.25s to 0.08s (alerts/alerts.libsonnet:65)
  • etcdHighCommitDurations (critical): Added new critical alert at 0.1s threshold (alerts/alerts.libsonnet:74-87)

I am not sure whether this will increase alert noise with such low thresholds. What is the real-world effect in your production environment?

```diff
 alert: 'etcdHighNumberOfLeaderChanges',
 expr: |||
-  increase((max without (%(etcd_instance_labels)s) (etcd_server_leader_changes_seen_total{%(etcd_selector)s}) or 0*absent(etcd_server_leader_changes_seen_total{%(etcd_selector)s}))[15m:1m]) >= 4
+  avg by (job) (changes(etcd_server_is_leader{%(etcd_selector)s}[10m])) > 5
```

I think etcd_server_leader_changes_seen_total might be a little better; it is designed exactly for this alert.
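To make the trade-off concrete, the two candidate queries side by side (the job matcher is a placeholder; the expressions and thresholds are the pre- and post-change values from the diff above):

```
# Counter-based (current upstream): counts leader changes each member has observed
increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}[15m]) >= 4

# Gauge-based (this PR): counts flips of each member's is-leader gauge
avg by (job) (changes(etcd_server_is_leader{job=~".*etcd.*"}[10m])) > 5
```

The counter records changes as seen by each member, so it keeps history across scrape gaps; the gauge-based form only catches leadership flips that are visible between scrapes, which is likely the reviewer's concern.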

