[train] De-flake Data Resource Cleanup Tests #59097
Conversation
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Code Review
This pull request aims to de-flake the data resource cleanup tests by making autoscaling requests explicit and waiting for them with a timeout. This is a good approach to improve test stability. My review identifies a potential race condition in the new waiting logic that could still lead to flakiness and provides a set of suggestions to make the test more robust. The changes also include some beneficial cleanups.
justinvyu left a comment
The autoscaling requester is not a threaded actor, so only one actor task runs at a time, and tasks execute in submission order. Is it possible to just make sure that the get_resources() remote call is submitted after the request_resources call?
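For illustration, here is a minimal sketch of the ordering guarantee being referenced, using a hypothetical stand-in actor (the real AutoscalingRequester and its internals may differ): because the actor is single-threaded and both tasks below are submitted by the driver, they execute in submission order, so the read observes the earlier request.

import ray

ray.init()

# Hypothetical stand-in for the autoscaling requester; names are illustrative.
@ray.remote
class FakeRequester:
    def __init__(self):
        self._resource_requests = []

    def request_resources(self, bundle):
        self._resource_requests.append(bundle)

requester = FakeRequester.remote()
# Both tasks are submitted from the driver (the same submitter), so they run
# on the actor in this order.
requester.request_resources.remote({"CPU": 1})
requests = ray.get(
    requester.__ray_call__.remote(lambda r: r._resource_requests)
)
assert requests == [{"CPU": 1}]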
# Explicitly trigger autoscaling
ray.get(
    coord.__ray_call__.remote(
        lambda coord: coord._executor._cluster_autoscaler.try_trigger_scaling()
    )
)
Maybe we only need to do this, and remove the other loop. See the top-level review comment.
This will force the request_resources task to get submitted before submitting the ray_call.
While this forces the remote call, it is still unknown how long this call will take, so the loop in get_resources is still useful. The loop is also used for the second set of assertions after shutdown. See my response to your comment below.
But we know that the request_resources task will be scheduled as long as we wait for try_trigger_scaling to finish.
Then, since the Actor's task queue is executed in order, we know that the task to get the current requests will happen after request_resources finishes.
See https://docs.ray.io/en/latest/ray-core/actors/task-orders.html#synchronous-single-threaded-actor
Ahh I see. So I tried removing the loop, and it is actually still flaky. I believe this is because "actors do not guarantee the execution order of the tasks from different submitters" (from the page you linked). Even though we call try_trigger_scaling() first, it is submitted from the cluster autoscaler, whereas the following __ray_call__ is submitted on the driver. Because these calls come from two different submitters, there is no guarantee on the ordering. So I believe we still need the loop; it is not flaky when I include it.
ah good catch, sounds good
- requests = ray.get(
-     requester.__ray_call__.remote(lambda requester: requester._resource_requests)
- )
+ requests = get_resources(requester)
This might introduce another flaky issue if the resource requests didn't get cleared in time after shutdown_executor? Since we break out of the loop if the length of the requests is 1.
Also, this will force the test to run for at least 3*0.05 seconds, right? Since we expect an empty list, it'll wait for the full deadline.
- Good catch: even though I'm using ray.get, it is possible that this remote call can take longer. It seems complicated to guarantee that the request is complete on the actor side, so instead I modified the get_resources helper to wait for up to timeout until the new request is different from the previous one. Also, for clarification, the first request won't be "cleared" but rather "replaced" ({"CPU": 1} -> {}).
- No, it's not expecting an empty list. There will still be a request made, just with a {} bundle, so it results in a non-empty request dict. Also, the max is ~3 seconds rather than 3*0.05 (which would be very short).
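For reference, here is a rough sketch of the kind of get_resources helper described above; the function name, attribute access, and timing values are assumptions based on this thread rather than the exact test code. It polls the requester until its recorded requests differ from a previously observed value, or until a ~3 second deadline passes, sleeping 0.05s between polls.

import time

import ray

def get_resources(requester, previous=None, timeout_s: float = 3.0):
    # Poll until the requester's requests differ from `previous` (for example,
    # a {"CPU": 1} bundle being replaced by {} after shutdown), returning
    # whatever is there once the deadline is hit.
    deadline = time.monotonic() + timeout_s
    while True:
        requests = ray.get(
            requester.__ray_call__.remote(lambda r: r._resource_requests)
        )
        if requests != previous or time.monotonic() >= deadline:
            return requests
        time.sleep(0.05)  # short poll interval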
I see. I think we can avoid get_resources here for the same reason as above, since shutdown queues the request_resources({}) task before our __ray_call__.
Responded above
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
justinvyu left a comment
Thanks!
The data resource cleanup tests experienced flaky behavior due to relying on SplitCoordinator.start_epoch.remote() to indirectly send resource requests to the AutoscalingRequester. There is no guarantee that a scaling request has been made by the time of the assertions.

This PR:
- Explicitly triggers autoscaling via try_trigger_scaling calls instead of relying on SplitCoordinator.start_epoch
- Waits for the expected resource requests with a timeout instead of asserting immediately
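Putting it together, a hedged sketch of the resulting test flow (assuming coord is the SplitCoordinator actor handle, requester is the AutoscalingRequester actor handle, and get_resources is the polling helper discussed in the review thread):

# Explicitly trigger autoscaling instead of relying on start_epoch side effects.
ray.get(
    coord.__ray_call__.remote(
        lambda coord: coord._executor._cluster_autoscaler.try_trigger_scaling()
    )
)

# Poll the requester (with a timeout) until the scaling request is visible.
requests = get_resources(requester)
assert len(requests) == 1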