Conversation

@JasonLi1909
Contributor

The data resource cleanup tests experienced flaky behavior because they relied on SplitCoordinator.start_epoch.remote() to indirectly send resource requests to the AutoscalingRequester, and there is no guarantee that a scaling request has been made by the time of the assertion.

This PR:

  • Implements a 3-second timeout when waiting for resource requests to be made (see the sketch after this list)
  • Makes explicit autoscaling requests via try_trigger_scaling calls instead of relying on SplitCoordinator.start_epoch
  • Basic cleanup
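
For reference, here is a minimal sketch of the kind of polling helper this describes. The helper name get_resources and the requester._resource_requests attribute appear later in the review thread; the exact signature, return handling, and the 0.05 s poll interval are illustrative assumptions rather than the PR's actual code.

import time

import ray


def get_resources(requester, previous=None, timeout_s=3.0, poll_interval_s=0.05):
    # Poll the AutoscalingRequester actor until its resource requests are
    # non-empty and differ from `previous` (pass the last observed value, or
    # None to wait for the first request), giving up after ~timeout_s seconds.
    requests = previous
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        requests = ray.get(
            requester.__ray_call__.remote(
                lambda requester: requester._resource_requests
            )
        )
        if requests and requests != previous:
            break
        time.sleep(poll_interval_s)
    return requests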

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
@JasonLi1909 JasonLi1909 requested a review from a team as a code owner December 1, 2025 23:35
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to de-flake the data resource cleanup tests by making autoscaling requests explicit and waiting for them with a timeout. This is a good approach to improve test stability. My review identifies a potential race condition in the new waiting logic that could still lead to flakiness and provides a set of suggestions to make the test more robust. The changes also include some beneficial cleanups.

@ray-gardener ray-gardener bot added the "train" (Ray Train Related Issue) label Dec 2, 2025
Contributor

@justinvyu justinvyu left a comment

The autoscaling requester is not a threaded actor, so only one actor task runs at a time, and tasks execute in submission order. Is it possible to just make sure that the get_resources() remote call is submitted after the request_resources call?

Comment on lines +126 to +130
# Explicitly trigger autoscaling
ray.get(
    coord.__ray_call__.remote(
        lambda coord: coord._executor._cluster_autoscaler.try_trigger_scaling()
    )
Contributor

Maybe we only need to do this, and remove the other loop. See the top-level review comment.

This will force the request_resources task to get submitted before submitting the ray_call.

Contributor Author

While this forces the request_resources task to be submitted first, it is still unknown how long that call will take, so the loop in get_resources is still useful. The loop is also used for the second set of assertions after shutdown. See my response to your comment below.

Contributor

@justinvyu justinvyu Dec 2, 2025

But we know that the request_resources task will be scheduled as long as we wait for try_trigger_scaling to finish.

Then, since the Actor's task queue is executed in order, we know that the task to get the current requests will happen after request_resources finishes.

See https://docs.ray.io/en/latest/ray-core/actors/task-orders.html#synchronous-single-threaded-actor
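
A sketch of the ordering being described here, using the coord and requester handles from the snippets in this thread; the replies below explain why this alone turned out not to be enough:

# Submitting try_trigger_scaling() and waiting on it means request_resources()
# has already been queued on the AutoscalingRequester actor when this returns.
ray.get(
    coord.__ray_call__.remote(
        lambda coord: coord._executor._cluster_autoscaler.try_trigger_scaling()
    )
)

# If this read and request_resources() were submitted by the same caller, a
# synchronous single-threaded actor would execute them in submission order,
# so the read would observe the request without any polling.
requests = ray.get(
    requester.__ray_call__.remote(lambda requester: requester._resource_requests)
)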

Contributor Author

@JasonLi1909 JasonLi1909 Dec 3, 2025

Ahh I see. So I tried removing the loop, and it is actually still flaky. I believe this is because "actors do not guarantee the execution order of the tasks from different submitters" (from the page you linked). Even though we call try_trigger_scaling() first, it is submitted from the Cluster Autoscaler, whereas the following __ray_call__ is submitted on the driver. Because these calls come from two different submitters, there is no guarantee on the ordering. So I believe we still need the loop; it is not flaky when I include it.

Contributor

ah good catch, sounds good

-    requests = ray.get(
-        requester.__ray_call__.remote(lambda requester: requester._resource_requests)
-    )
+    requests = get_resources(requester)
Contributor

This might introduce another flaky issue if the resource requests didn't get cleared in time after shutdown_executor?

Since we break out of the loop if the length of the requests is 1.

Contributor

Also, this will force the test to run for at least 3*0.05 seconds, right? Since we expect an empty list, it'll wait for the full deadline.

Contributor Author

@JasonLi1909 JasonLi1909 Dec 2, 2025

  1. Good catch: even though I'm using ray.get, it is possible that this remote call can take longer. It seems complicated to guarantee that the request is complete on the actor side, so instead I just modified the get_resources helper to wait (up to the timeout) until the new request is different from the previous one. Also, for clarification, the first request won't be "cleared" but rather "replaced" ({"CPU":1} -> {}).
  2. No, it's not expecting an empty list. There will be a request made, but with a {} bundle, so it results in a non-empty request dict. Also, the max is ~3 seconds rather than 3*0.05 (which would be very short).
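
For illustration, a rough sketch of how these two checks could look with such a helper; the exact assertions and the shape of the requests dict are assumptions, not the test's actual code.

# Before shutdown: wait until the request triggered by try_trigger_scaling()
# shows up, then check that it asks for the expected bundle.
requests = get_resources(requester, previous=None)
assert len(requests) == 1
assert {"CPU": 1} in requests.values()  # assumed key/bundle layout

# After shutdown_executor(): the request is replaced rather than cleared, so
# the dict stays non-empty but its bundle becomes empty ({"CPU": 1} -> {}).
requests = get_resources(requester, previous=requests)
assert len(requests) == 1
assert {} in requests.values()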

Contributor

I see. I think we can avoid get_resources here for the same reason as above, since shutdown queues the request_resources({}) task before our __ray_call__.

Contributor Author

@JasonLi1909 JasonLi1909 Dec 3, 2025

Responded above

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Contributor

@justinvyu justinvyu left a comment

Thanks!

@justinvyu justinvyu enabled auto-merge (squash) December 4, 2025 20:02
@github-actions github-actions bot added the "go" (add ONLY when ready to merge, run all tests) label Dec 4, 2025
@github-actions github-actions bot disabled auto-merge December 4, 2025 20:07
@justinvyu justinvyu merged commit bc1c27d into ray-project:master Dec 5, 2025
6 checks passed