[train] De-flake Data Resource Cleanup Tests #59097
Conversation
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Code Review
This pull request aims to de-flake the data resource cleanup tests by making autoscaling requests explicit and waiting for them with a timeout. This is a good approach to improve test stability. My review identifies a potential race condition in the new waiting logic that could still lead to flakiness and provides a set of suggestions to make the test more robust. The changes also include some beneficial cleanups.
justinvyu left a comment
The autoscaling requester is not a threaded actor, so only one actor task runs at a time, and tasks execute in submission order. Is it possible to just make sure that the get_resources() remote call is submitted after the request_resources call?
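For illustration, here is a minimal sketch of the ordering guarantee being referenced, using a hypothetical stand-in actor (the real AutoscalingRequester and its internals may differ): because the actor is single-threaded and both tasks below are submitted by the driver, they execute in submission order, so the read observes the earlier request.

import ray

ray.init()

# Hypothetical stand-in for the autoscaling requester; names are illustrative.
@ray.remote
class FakeRequester:
    def __init__(self):
        self._resource_requests = []

    def request_resources(self, bundle):
        self._resource_requests.append(bundle)

requester = FakeRequester.remote()
# Both tasks are submitted from the driver (the same submitter), so they run
# on the actor in this order.
requester.request_resources.remote({"CPU": 1})
requests = ray.get(
    requester.__ray_call__.remote(lambda r: r._resource_requests)
)
assert requests == [{"CPU": 1}]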
# Explicitly trigger autoscaling
ray.get(
    coord.__ray_call__.remote(
        lambda coord: coord._executor._cluster_autoscaler.try_trigger_scaling()
    )
)
Maybe we only need to do this, and remove the other loop. See the top-level review comment.
This will force the request_resources task to get submitted before submitting the ray_call.
While this forces the remote call, it is still unknown how long this call will take, so the loop in get_resources is still useful. The loop is also used for the second set of assertions after shutdown. See my response to your comment below.
But we know that the request_resources task will be scheduled as long as we wait for try_trigger_scaling to finish.
Then, since the Actor's task queue is executed in order, we know that the task to get the current requests will happen after request_resources finishes.
See https://docs.ray.io/en/latest/ray-core/actors/task-orders.html#synchronous-single-threaded-actor
Ahh I see. So I tried removing the loop, and it is actually still flaky. I believe this is because "actors do not guarantee the execution order of the tasks from different submitters" (from the page you linked). Even though we call try_trigger_scaling() first, it is submitted from the cluster autoscaler, whereas the following __ray_call__ is submitted on the driver. Because these calls come from two different submitters, there is no guarantee on the ordering. So I believe we still need the loop; it is not flaky when I include it.
ah good catch, sounds good
- requests = ray.get(
-     requester.__ray_call__.remote(lambda requester: requester._resource_requests)
- )
+ requests = get_resources(requester)
This might introduce another flaky issue if the resource requests didn't get cleared in time after shutdown_executor? Since we break out of the loop if the length of the requests is 1.
Also, this will force the test to run for at least 3*0.05 seconds, right? Since we expect an empty list, it'll wait for the full deadline.
- Good catch: even though I'm using ray.get, it is possible that this remote call can take longer. It seems complicated to guarantee that the request is complete on the actor side, so instead I modified the get_resources helper to wait for up to timeout until the new request is different from the previous one. Also, for clarification, the first request won't be "cleared" but rather "replaced" ({"CPU": 1} -> {}).
- No, it's not expecting an empty list. There will still be a request made, just with a {} bundle, so it results in a non-empty request dict. Also, the max is ~3 seconds rather than 3*0.05 (which would be very short).
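For reference, here is a rough sketch of the kind of get_resources helper described above; the function name, attribute access, and timing values are assumptions based on this thread rather than the exact test code. It polls the requester until its recorded requests differ from a previously observed value, or until a ~3 second deadline passes, sleeping 0.05s between polls.

import time

import ray

def get_resources(requester, previous=None, timeout_s: float = 3.0):
    # Poll until the requester's requests differ from `previous` (for example,
    # a {"CPU": 1} bundle being replaced by {} after shutdown), returning
    # whatever is there once the deadline is hit.
    deadline = time.monotonic() + timeout_s
    while True:
        requests = ray.get(
            requester.__ray_call__.remote(lambda r: r._resource_requests)
        )
        if requests != previous or time.monotonic() >= deadline:
            return requests
        time.sleep(0.05)  # short poll interval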
I see. I think we can avoid get_resources here for the same reason as above, since shutdown queues the request_resources({}) task before our __ray_call__.
Responded above
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
justinvyu left a comment
Thanks!
The data resource cleanup tests experienced flaky behavior due to relying on SplitCoordinator.start_epoch.remote() to indirectly send resource requests to the AutoscalingRequester. There is no guarantee that a scaling request has been made by the time of the assertions.

This PR:
- Explicitly triggers autoscaling via try_trigger_scaling calls instead of relying on SplitCoordinator.start_epoch
- Waits for the expected resource requests with a timeout instead of asserting immediately
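Putting it together, a hedged sketch of the resulting test flow (assuming coord is the SplitCoordinator actor handle, requester is the AutoscalingRequester actor handle, and get_resources is the polling helper discussed in the review thread):

# Explicitly trigger autoscaling instead of relying on start_epoch side effects.
ray.get(
    coord.__ray_call__.remote(
        lambda coord: coord._executor._cluster_autoscaler.try_trigger_scaling()
    )
)

# Poll the requester (with a timeout) until the scaling request is visible.
requests = get_resources(requester)
assert len(requests) == 1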