Skip to content

Conversation

@michaelawyu
Copy link
Collaborator

Description of your changes

This PR fixes a variety of issues in the current option that enables usage of priority queue in the work applier, specifically:

  • priority queue is not registered with the controller
  • double enqueueing
  • priority settings are not applied for enqueued items
  • missing support for reportDiff mode
  • incorrent priority assigment

I have:

  • Run make reviewable to ensure this PR is ready for review.

How has this code been tested

  • Unit tests

Special notes for your reviewer

N/A

Signed-off-by: michaelawyu <chenyu1@microsoft.com>
@michaelawyu
Copy link
Collaborator Author

Closing #112 as this PR precedes the earlier draft.

@michaelawyu
Copy link
Collaborator Author

Note: currently the work applier test suite still uses the default work queue. To verify both setup, certain re-arrangement of the test specs is needed -> will submit a separate PR.

// Update implements the TypedEventHandler interface.
func (h *priorityBasedWorkObjEventHandler) Update(_ context.Context, updateEvent event.TypedUpdateEvent[client.Object], _ workqueue.TypedRateLimitingInterface[reconcile.Request]) {
// Do a sanity check.
if h.qm.PriorityQueue() == nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if priority queue is nil, it means somehow the controller is not even doing reconciling? I am wondering if we should just crash the controller given that its primary functionality is compromised

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Wei! The use of priority queues is optional; if not enabled the system will use a default, FIFO work queue. For this specific part, we only set up the priorityBasedWorkObjEventHandler when the use of priority queues is enabled, and since all of such setup are private variables, we do not do any additional checkup/validation.

Copy link
Collaborator Author

@michaelawyu michaelawyu Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sanity check is added because:

  • The Create, Update, Delete, and Generic event handlers are invoked by the controller-runtime package when the controller runs (i.e., it's something out of our control);
  • The priority queue is only initialized after the controller has been set up with the controller manager; if the event handlers are called before such setup is done, the queue variable would be nil.

Of course, logically speaking the event handlers can only run after the informer gets ready so the controller must have already been set up with the manager; so this is only added as sanity check. Normally the branch will never run.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally the branch will never run.

If something not normal happens, shouldn't we crash rather than not process events at all?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's true... Though for this case, if the event handler does get called before the controller finishes setup, there's no side effect and the situation can still recover (as soon as the controller gets back).

To me this is a little bit like the interface casting checks we do in event handlers (e.g., the one in the scheduler member cluster watcher, or the rollout controller), which are almost guaranteed to work, we just keep the checks there for completeness reasons 😂

func (h *priorityBasedWorkObjEventHandler) Create(_ context.Context, createEvent event.TypedCreateEvent[client.Object], _ workqueue.TypedRateLimitingInterface[reconcile.Request]) {
// Do a sanity check.
//
// Normally when this method is called, the priority queue has been initialized.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would the priority queue not be initialized when this method is called?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Wei! This is also connected to the reply above; if there's any concern, please let me know.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it depends on what "normal" means here:

  1. if "normal" means that if this happens, there is a bug, then I think we should crash
  2. if "normal" means that even if there is no bug, this can still happen from time to time (e.g. there is a race inherent to how controller runtime is implemented), then I think we should comment on under what situation this can happen and say this is the best we can do given how controller runtime is implemented

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah but I guess the "out of our control" argument is valid. Maybe just comment on exactly why you need the sanity check - because it is out of our control while admitting the risk of silently dropping events?

}

func (h *priorityBasedWorkObjEventHandler) determineUpdateEventPriority(oldWorkObj, newWorkObj *fleetv1beta1.Work) int {
// If the work object is recently created (its age is within the given threshold),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you comment on why we would want to process a new work object before an old work object?

Copy link
Collaborator Author

@michaelawyu michaelawyu Dec 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Wei! A bit background on this PR:

The use of priority queues is first proposed by one of our upstream users, whose major concern/complaint was that, when there are a lot of work objects (i.e., a lot of CRPs/RPs), the work applier will spend a lot of cycles re-processing existing work objects as we periodically requeues objects, and as a result it might take a long while for a newly created work object to be processed. This is esp. true when our agents restart; the user finds that sometimes it may take 30 minutes or more before the system becomes responsive to newly created CRPs/RPs again.

For obvious reasons most of the requeues for older work objects do not actually make any difference; we have to do them just to ensure correctness. The use of priority queues will allow the work applier for prioritize new objects first and only return to the mundance processing for requeues when there's no more work to do.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! I think this is good context worth a comment in code because a future contributor wouldn't understand why this threshold exists in the first place just by reading code


woAgeForPrioritizedProcessing := workObjAgeForPrioritizedProcessing
if usePriorityQueue && woAgeForPrioritizedProcessing < minWorkObjAgeForPrioritizedQueueing {
klog.V(2).InfoS("Work object age for prioritized processing is too short; set to the longer default", "workObjAgeForPrioritizedProcessing", woAgeForPrioritizedProcessing)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you comment on why 30-minute is a recommended minimum?

Also suggest just crash the controller if the user input is invalid. Users might not read logs and expect whatever they specify as workObjAgeForPrioritizedProcessing will be respected by the controller

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this is during startup, which is a good time to fail fast

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, the current value in use in the main branch is 1-hour, which was set by the user who originally proposed priority queues, as aforementioned (this PR does not change this default value).

The value has to strike a balance between responsiveness and fairness; if set too high, all requests will be assigned high priority, which defeats the purpose of having priority queues (this mostly applies to those users that use KubeFleet to run batch work), and if set too low new work objects might not get its fair chance to become ready.

The proper value is kind of user-dependent, which is why in this PR I am only blocking this value from being set too low.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this being said, admitted we do not have enough empirical data on what a good default value really is at the moment.

And for the check part, I do not have a strong preference on the subject (and I 100% agree that many do not actually read logs 😂). This is currently following the pattern with the work applier backoff rate limiter (which, too, will overwrite invalid values with defaults and emit an error log).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I do find it strange that if a user provides an invalid input, we just set it to a closest valid value rather than tell the user that your input is invalid please try again with a valid value because this gives the user the impression that their input is perfectly fine.

But I wouldn't block the PR if the team agree that this is best practice

Signed-off-by: michaelawyu <chenyu1@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants