fix: correctly set up priority queue in work applier #362

michaelawyu · 2025-12-04T10:00:29Z

Description of your changes

This PR fixes a variety of issues in the current option that enables usage of priority queue in the work applier, specifically:

priority queue is not registered with the controller
double enqueueing
priority settings are not applied for enqueued items
missing support for reportDiff mode
incorrent priority assigment

I have:

Run make reviewable to ensure this PR is ready for review.

How has this code been tested

Unit tests

Special notes for your reviewer

N/A

Signed-off-by: michaelawyu <chenyu1@microsoft.com>

michaelawyu · 2025-12-04T10:01:41Z

Closing #112 as this PR precedes the earlier draft.

michaelawyu · 2025-12-04T10:02:26Z

Note: currently the work applier test suite still uses the default work queue. To verify both setup, certain re-arrangement of the test specs is needed -> will submit a separate PR.

pkg/controllers/workapplier/pq.go

weng271190436 · 2025-12-04T15:57:33Z

pkg/controllers/workapplier/pq.go

+// Update implements the TypedEventHandler interface.
+func (h *priorityBasedWorkObjEventHandler) Update(_ context.Context, updateEvent event.TypedUpdateEvent[client.Object], _ workqueue.TypedRateLimitingInterface[reconcile.Request]) {
+	// Do a sanity check.
+	if h.qm.PriorityQueue() == nil {


if priority queue is nil, it means somehow the controller is not even doing reconciling? I am wondering if we should just crash the controller given that its primary functionality is compromised

Hi Wei! The use of priority queues is optional; if not enabled the system will use a default, FIFO work queue. For this specific part, we only set up the priorityBasedWorkObjEventHandler when the use of priority queues is enabled, and since all of such setup are private variables, we do not do any additional checkup/validation.

The sanity check is added because:

The Create, Update, Delete, and Generic event handlers are invoked by the controller-runtime package when the controller runs (i.e., it's something out of our control);

The priority queue is only initialized after the controller has been set up with the controller manager; if the event handlers are called before such setup is done, the queue variable would be nil.

Of course, logically speaking the event handlers can only run after the informer gets ready so the controller must have already been set up with the manager; so this is only added as sanity check. Normally the branch will never run.

Normally the branch will never run.

If something not normal happens, shouldn't we crash rather than not process events at all?

Yeah, it's true... Though for this case, if the event handler does get called before the controller finishes setup, there's no side effect and the situation can still recover (as soon as the controller gets back).

To me this is a little bit like the interface casting checks we do in event handlers (e.g., the one in the scheduler member cluster watcher, or the rollout controller), which are almost guaranteed to work, we just keep the checks there for completeness reasons 😂

weng271190436 · 2025-12-04T15:59:43Z

pkg/controllers/workapplier/pq.go

+func (h *priorityBasedWorkObjEventHandler) Create(_ context.Context, createEvent event.TypedCreateEvent[client.Object], _ workqueue.TypedRateLimitingInterface[reconcile.Request]) {
+	// Do a sanity check.
+	//
+	// Normally when this method is called, the priority queue has been initialized.


Why would the priority queue not be initialized when this method is called?

Hi Wei! This is also connected to the reply above; if there's any concern, please let me know.

I think it depends on what "normal" means here:

if "normal" means that if this happens, there is a bug, then I think we should crash

if "normal" means that even if there is no bug, this can still happen from time to time (e.g. there is a race inherent to how controller runtime is implemented), then I think we should comment on under what situation this can happen and say this is the best we can do given how controller runtime is implemented

yeah but I guess the "out of our control" argument is valid. Maybe just comment on exactly why you need the sanity check - because it is out of our control while admitting the risk of silently dropping events?

weng271190436 · 2025-12-04T16:03:13Z

pkg/controllers/workapplier/pq.go

+}
+
+func (h *priorityBasedWorkObjEventHandler) determineUpdateEventPriority(oldWorkObj, newWorkObj *fleetv1beta1.Work) int {
+	// If the work object is recently created (its age is within the given threshold),


Can you comment on why we would want to process a new work object before an old work object?

Hi Wei! A bit background on this PR:

The use of priority queues is first proposed by one of our upstream users, whose major concern/complaint was that, when there are a lot of work objects (i.e., a lot of CRPs/RPs), the work applier will spend a lot of cycles re-processing existing work objects as we periodically requeues objects, and as a result it might take a long while for a newly created work object to be processed. This is esp. true when our agents restart; the user finds that sometimes it may take 30 minutes or more before the system becomes responsive to newly created CRPs/RPs again.

For obvious reasons most of the requeues for older work objects do not actually make any difference; we have to do them just to ensure correctness. The use of priority queues will allow the work applier for prioritize new objects first and only return to the mundance processing for requeues when there's no more work to do.

Got it! I think this is good context worth a comment in code because a future contributor wouldn't understand why this threshold exists in the first place just by reading code

pkg/controllers/workapplier/pq_test.go

weng271190436 · 2025-12-04T16:22:10Z

pkg/controllers/workapplier/controller.go


+	woAgeForPrioritizedProcessing := workObjAgeForPrioritizedProcessing
+	if usePriorityQueue && woAgeForPrioritizedProcessing < minWorkObjAgeForPrioritizedQueueing {
+		klog.V(2).InfoS("Work object age for prioritized processing is too short; set to the longer default", "workObjAgeForPrioritizedProcessing", woAgeForPrioritizedProcessing)


Can you comment on why 30-minute is a recommended minimum?

Also suggest just crash the controller if the user input is invalid. Users might not read logs and expect whatever they specify as workObjAgeForPrioritizedProcessing will be respected by the controller

and this is during startup, which is a good time to fail fast

Ah, the current value in use in the main branch is 1-hour, which was set by the user who originally proposed priority queues, as aforementioned (this PR does not change this default value).

The value has to strike a balance between responsiveness and fairness; if set too high, all requests will be assigned high priority, which defeats the purpose of having priority queues (this mostly applies to those users that use KubeFleet to run batch work), and if set too low new work objects might not get its fair chance to become ready.

The proper value is kind of user-dependent, which is why in this PR I am only blocking this value from being set too low.

With this being said, admitted we do not have enough empirical data on what a good default value really is at the moment.

And for the check part, I do not have a strong preference on the subject (and I 100% agree that many do not actually read logs 😂). This is currently following the pattern with the work applier backoff rate limiter (which, too, will overwrite invalid values with defaults and emit an error log).

Got it. I do find it strange that if a user provides an invalid input, we just set it to a closest valid value rather than tell the user that your input is invalid please try again with a valid value because this gives the user the impression that their input is perfectly fine.

But I wouldn't block the PR if the team agree that this is best practice

Signed-off-by: michaelawyu <chenyu1@microsoft.com>

Fixed the priority queue implementation

3d06304

Signed-off-by: michaelawyu <chenyu1@microsoft.com>

weng271190436 reviewed Dec 4, 2025

View reviewed changes

Minor fixes

28fe672

Signed-off-by: michaelawyu <chenyu1@microsoft.com>

fix: correctly set up priority queue in work applier #362

Are you sure you want to change the base?

fix: correctly set up priority queue in work applier #362

Conversation

michaelawyu commented Dec 4, 2025

Description of your changes

How has this code been tested

Special notes for your reviewer

Uh oh!

michaelawyu commented Dec 4, 2025

Uh oh!

michaelawyu commented Dec 4, 2025

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michaelawyu Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

michaelawyu Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

michaelawyu Dec 4, 2025 •

edited

Loading

michaelawyu Dec 4, 2025 •

edited

Loading