
Conversation

@guettli (Collaborator) commented Nov 17, 2025

Baremetal PR

Deprecated: This field is deprecated and is going to be removed when support for v1beta1 is dropped. Please see
https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md
for more details.

Related HCloud PR 1717

Summary

This PR removes all usage of the deprecated FailureReason and FailureMessage fields from the bare-metal (BM) provider. Status handling is now fully aligned with the newer CAPI conditions-based model.

Why

Upstream CAPI is deprecating these fields. Removing them keeps our provider consistent with the updated status API and prevents reliance on legacy status patterns.

Changes

  • BareMetalMachine.SetFailure() was replaced by s.scope.SetErrorAndRemediate(), which sets the clusterv1.RemediateMachineAnnotation on the CAPI Machine (a sketch follows this list).
  • During remediation, the controller checks whether the remediation was started by SetErrorAndRemediate(). If so, no reboot is performed and the remediation stops, so that the machine gets deleted.
  • Removed all references to FailureReason and FailureMessage in BM controllers and API types.
  • Updated reconciliation logic to use standard Conditions exclusively.
  • Adjusted tests to match the new status behavior.
  • Test code was adapted to no longer use a fixed bmMachineName. The CAPI Machine and the hbmm are expected to have the same name, so the test data structures were adapted to match the real structures.
  • The BM remediation does not take the Namespace from the InfrastructureRef. Cross-namespace references are not allowed: a user of namespace1 must not reference namespace2.
  • Improved logging of unusual cases (conflict errors, hbmm being deleted).
  • Duplicate code was removed, and getAssociatedHost() was refactored into getAssociatedHostAndPatchHelper().
  • The remediation code expects the CAPI Machine name and the hbmm name to be equal; test code was adapted accordingly.
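
For illustration, here is a minimal sketch of the shape SetErrorAndRemediate() could take, assuming the CAPI v1beta1 helpers. Only clusterv1.RemediateMachineAnnotation and infrav1.RemediationSucceededCondition appear in this PR; the scope type, the PatchObject() helper, and the import paths are assumptions:

package scope

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"

	infrav1 "github.com/syself/cluster-api-provider-hetzner/api/v1beta1"
)

// SetErrorAndRemediate records the error as a condition (replacing the
// removed FailureReason/FailureMessage fields) and asks CAPI to remediate
// the owning Machine by setting the remediate-machine annotation.
func (s *BareMetalMachineScope) SetErrorAndRemediate(ctx context.Context, reason string) error {
	conditions.MarkFalse(s.BareMetalMachine, infrav1.RemediationSucceededCondition,
		reason, clusterv1.ConditionSeverityError, "")

	if s.Machine.Annotations == nil {
		s.Machine.Annotations = map[string]string{}
	}
	// CAPI's remediation flow reacts to this annotation and deletes the Machine.
	s.Machine.Annotations[clusterv1.RemediateMachineAnnotation] = ""

	return s.PatchObject(ctx)
}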

@github-actions bot added labels size/M (changes 50-200 lines, ignoring generated files), area/code, and area/api on Nov 17, 2025
@github-actions bot added label size/L (changes 200-800 lines) and removed size/M on Nov 20, 2025
Base automatically changed from tg/fix-hbmm-delete-test to main November 27, 2025 06:28
@guettli changed the base branch from main to tg/pre--remove-failure-reason--baremetal--on-top-main on November 27, 2025 12:31
@github-actions bot added label size/L (changes 200-800 lines) and removed size/XL (changes 800-2000 lines) on Nov 27, 2025
Base automatically changed from tg/pre--remove-failure-reason--baremetal--on-top-main to main December 3, 2025 12:04
const (
// RemediationSucceededCondition is:
// - False when the corresponding CAPI Machine has the "cluster.x-k8s.io/remediate-machine" annotation set and will be remediated by CAPI soon.
// - True otherwise.
Contributor:

It shouldn't be true otherwise. IMO, it should only be true after the remediation succeeded. Otherwise it should not be set at all.

Collaborator Author:

OK, what about deleting the condition 6 hours after it got set to True?

Contributor:

I think setting it to true is fine once it actually got remediated. Then we can think about what you propose, but that would be an optimization we could do later, IMO. However, what I really would not like, and I think we currently do it, is to set the condition to true even though we never remediated the machine.
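
A minimal sketch of these semantics, assuming the CAPI v1beta1 condition helpers; remediationStarted, remediationDone, and the reason string are placeholders, not names from this PR:

// Only mark the condition True after remediation actually completed. While it
// runs, keep it False with an "ongoing" reason; if remediation never started,
// leave the condition unset.
if remediationStarted {
	if remediationDone {
		conditions.MarkTrue(hbmMachine, infrav1.RemediationSucceededCondition)
	} else {
		conditions.MarkFalse(hbmMachine, infrav1.RemediationSucceededCondition,
			"RemediationOngoing", clusterv1.ConditionSeverityInfo,
			"remediation is still in progress")
	}
}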

ctx = ctrl.LoggerInto(ctx, log)

remediateConditionOfHbmm := conditions.Get(hetznerBareMetalMachine, infrav1.NoRemediateMachineAnnotationCondition)
if remediateConditionOfHbmm != nil && remediateConditionOfHbmm.Status == corev1.ConditionFalse {
Contributor:

One question: does it work properly in case the hbmh is mirroring from hbmm? Because in the other direction, the hbmm is already mirroring from hbmh.

remediateConditionOfHbmm := conditions.Get(hbmMachine, infrav1.RemediationSucceededCondition)
if remediateConditionOfHbmm != nil && remediateConditionOfHbmm.Status == corev1.ConditionFalse {
// The hbmm will be deleted. Do not reconcile it.
log.Info("hbmm has RemediationSucceededCondition=False. Waiting for deletion")
Contributor:

The condition is also set to false if it is still in progress, no? I don't understand this code snippet.

Collaborator Author:

@janiskemper maybe we should rename it to DeleteInfraMachineSucceeded, because that is the intention: we do not want to reboot the machine.

This will be done via remediation, but that is just the path which needs to be followed, because the controller can't delete the CAPI machine itself. Remediation is not really the intention; the machine should get deleted.

Does this make more sense now?

Contributor:

I didn't understand it yet. We have the RemediationSucceeded condition, which is set to false with the reason "remediation ongoing"; at least this is how we do it on hbmh. Do we have some different logic here on hbmm?

return reconcile.Result{}, s.setOwnerRemediatedConditionToFailed(ctx,
"exit remediation because hbmm has no host annotation")
}
// if SetErrorAndRemediate() was used to stop provisioning, do not try to reboot server
Contributor:

I don't understand the name of the function. SetErrorAndRemediate sounds to me as if you explicitly want this remediation to trigger. However, you instead do the opposite. Shouldn't we rename SetErrorAndRemediate so that readers are not led in the wrong direction?

If I understand correctly, you want to set an error so that remediation via reboot is skipped, right?

Collaborator Author:

Yes, SetErrorAndDeleteMachine() might be a better name. Do you agree?

Contributor:

Yes. Just to confirm again: you want to use this function in the cases when a remediation via reboot should not be done?
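
To make the discussed behavior concrete, here is a sketch of the skip-reboot check during remediation. It assumes that a False RemediationSucceededCondition marks a machine that SetErrorAndRemediate() (or SetErrorAndDeleteMachine(), if renamed) flagged for deletion; the message string is illustrative:

// Provisioning was stopped deliberately: skip the reboot-based remediation
// and let CAPI delete the Machine instead.
c := conditions.Get(hbmMachine, infrav1.RemediationSucceededCondition)
if c != nil && c.Status == corev1.ConditionFalse {
	return reconcile.Result{}, s.setOwnerRemediatedConditionToFailed(ctx,
		"machine was flagged for deletion, skipping reboot remediation")
}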

@guettli (Collaborator Author) commented Dec 3, 2025

One question: does it work properly in case the hbmh is mirroring from hbmm? Because in the other direction, the hbmm is already mirroring from hbmh.

@janiskemper

This condition is only mirrored from hbmm to hbmh, so I think this is fine.
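
For context, one-directional mirroring could look like the following sketch; conditions.Get and conditions.Set are the CAPI helpers, and hbmHost is a placeholder for the HetznerBareMetalHost object:

// Mirror the condition only from hbmm to hbmh, never in the other direction,
// which avoids the feedback loop asked about above.
if c := conditions.Get(hbmMachine, infrav1.RemediationSucceededCondition); c != nil {
	conditions.Set(hbmHost, c)
}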

@guettli requested a review from janiskemper on December 3, 2025 13:43