-
Notifications
You must be signed in to change notification settings - Fork 20
fix: correctly set up priority queue in work applier #362
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: correctly set up priority queue in work applier #362
Conversation
Signed-off-by: michaelawyu <chenyu1@microsoft.com>
|
Closing #112 as this PR precedes the earlier draft. |
|
Note: currently the work applier test suite still uses the default work queue. To verify both setup, certain re-arrangement of the test specs is needed -> will submit a separate PR. |
| // Update implements the TypedEventHandler interface. | ||
| func (h *priorityBasedWorkObjEventHandler) Update(_ context.Context, updateEvent event.TypedUpdateEvent[client.Object], _ workqueue.TypedRateLimitingInterface[reconcile.Request]) { | ||
| // Do a sanity check. | ||
| if h.qm.PriorityQueue() == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if priority queue is nil, it means somehow the controller is not even doing reconciling? I am wondering if we should just crash the controller given that its primary functionality is compromised
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Wei! The use of priority queues is optional; if not enabled the system will use a default, FIFO work queue. For this specific part, we only set up the priorityBasedWorkObjEventHandler when the use of priority queues is enabled, and since all of such setup are private variables, we do not do any additional checkup/validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sanity check is added because:
- The
Create,Update,Delete, andGenericevent handlers are invoked by the controller-runtime package when the controller runs (i.e., it's something out of our control); - The priority queue is only initialized after the controller has been set up with the controller manager; if the event handlers are called before such setup is done, the queue variable would be nil.
Of course, logically speaking the event handlers can only run after the informer gets ready so the controller must have already been set up with the manager; so this is only added as sanity check. Normally the branch will never run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally the branch will never run.
If something not normal happens, shouldn't we crash rather than not process events at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's true... Though for this case, if the event handler does get called before the controller finishes setup, there's no side effect and the situation can still recover (as soon as the controller gets back).
To me this is a little bit like the interface casting checks we do in event handlers (e.g., the one in the scheduler member cluster watcher, or the rollout controller), which are almost guaranteed to work, we just keep the checks there for completeness reasons 😂
| func (h *priorityBasedWorkObjEventHandler) Create(_ context.Context, createEvent event.TypedCreateEvent[client.Object], _ workqueue.TypedRateLimitingInterface[reconcile.Request]) { | ||
| // Do a sanity check. | ||
| // | ||
| // Normally when this method is called, the priority queue has been initialized. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would the priority queue not be initialized when this method is called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Wei! This is also connected to the reply above; if there's any concern, please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it depends on what "normal" means here:
- if "normal" means that if this happens, there is a bug, then I think we should crash
- if "normal" means that even if there is no bug, this can still happen from time to time (e.g. there is a race inherent to how controller runtime is implemented), then I think we should comment on under what situation this can happen and say this is the best we can do given how controller runtime is implemented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah but I guess the "out of our control" argument is valid. Maybe just comment on exactly why you need the sanity check - because it is out of our control while admitting the risk of silently dropping events?
| } | ||
|
|
||
| func (h *priorityBasedWorkObjEventHandler) determineUpdateEventPriority(oldWorkObj, newWorkObj *fleetv1beta1.Work) int { | ||
| // If the work object is recently created (its age is within the given threshold), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment on why we would want to process a new work object before an old work object?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Wei! A bit background on this PR:
The use of priority queues is first proposed by one of our upstream users, whose major concern/complaint was that, when there are a lot of work objects (i.e., a lot of CRPs/RPs), the work applier will spend a lot of cycles re-processing existing work objects as we periodically requeues objects, and as a result it might take a long while for a newly created work object to be processed. This is esp. true when our agents restart; the user finds that sometimes it may take 30 minutes or more before the system becomes responsive to newly created CRPs/RPs again.
For obvious reasons most of the requeues for older work objects do not actually make any difference; we have to do them just to ensure correctness. The use of priority queues will allow the work applier for prioritize new objects first and only return to the mundance processing for requeues when there's no more work to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it! I think this is good context worth a comment in code because a future contributor wouldn't understand why this threshold exists in the first place just by reading code
|
|
||
| woAgeForPrioritizedProcessing := workObjAgeForPrioritizedProcessing | ||
| if usePriorityQueue && woAgeForPrioritizedProcessing < minWorkObjAgeForPrioritizedQueueing { | ||
| klog.V(2).InfoS("Work object age for prioritized processing is too short; set to the longer default", "workObjAgeForPrioritizedProcessing", woAgeForPrioritizedProcessing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment on why 30-minute is a recommended minimum?
Also suggest just crash the controller if the user input is invalid. Users might not read logs and expect whatever they specify as workObjAgeForPrioritizedProcessing will be respected by the controller
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and this is during startup, which is a good time to fail fast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, the current value in use in the main branch is 1-hour, which was set by the user who originally proposed priority queues, as aforementioned (this PR does not change this default value).
The value has to strike a balance between responsiveness and fairness; if set too high, all requests will be assigned high priority, which defeats the purpose of having priority queues (this mostly applies to those users that use KubeFleet to run batch work), and if set too low new work objects might not get its fair chance to become ready.
The proper value is kind of user-dependent, which is why in this PR I am only blocking this value from being set too low.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this being said, admitted we do not have enough empirical data on what a good default value really is at the moment.
And for the check part, I do not have a strong preference on the subject (and I 100% agree that many do not actually read logs 😂). This is currently following the pattern with the work applier backoff rate limiter (which, too, will overwrite invalid values with defaults and emit an error log).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I do find it strange that if a user provides an invalid input, we just set it to a closest valid value rather than tell the user that your input is invalid please try again with a valid value because this gives the user the impression that their input is perfectly fine.
But I wouldn't block the PR if the team agree that this is best practice
Signed-off-by: michaelawyu <chenyu1@microsoft.com>
Description of your changes
This PR fixes a variety of issues in the current option that enables usage of priority queue in the work applier, specifically:
I have:
make reviewableto ensure this PR is ready for review.How has this code been tested
Special notes for your reviewer
N/A