* Revert "Revert "[tune] Also interrupt training when SIGUSR1 received" (#24085)"
This reverts commit 00595653ed.
The failure on Windows has been addressed by registering the signal handler only if SIGUSR1 is available on the platform.
Ray Tune currently stops training gracefully on SIGINT. However, the Ray core worker prevents SIGINT (and SIGTERM) from being processed by child tasks, which means that Ray Tune runs that are started in remote tasks (e.g. via Ray client) cannot be gracefully interrupted.
In k8s-based cloud tests that used the Ray client to kick off a Ray Tune run, this led to test flakiness, as the final experiment state could not be gracefully persisted to cloud storage.
This PR adds support for SIGUSR1 in addition to SIGINT to interrupt training gracefully.
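A minimal sketch of conditional signal registration follows, assuming a stop flag that the training loop polls. The names `stop_event`, `_request_stop` and `register_stop_handlers` are hypothetical and are not taken from the Ray Tune code base.

```python
import signal
import threading

# Flag that a training loop could poll to decide when to stop gracefully.
stop_event = threading.Event()


def _request_stop(signum, frame):
    # Keep the handler tiny: just record that a stop was requested.
    stop_event.set()


def register_stop_handlers(names=("SIGINT", "SIGUSR1")):
    """Register the stop handler for every signal that exists on this platform.

    Returns the list of signal names that were actually registered.
    """
    # signal.signal() may only be called from the main thread.
    if threading.current_thread() is not threading.main_thread():
        return []

    registered = []
    for name in names:
        # SIGUSR1 is not defined on Windows, so look it up defensively.
        sig = getattr(signal, name, None)
        if sig is None:
            continue
        signal.signal(sig, _request_stop)
        registered.append(name)
    return registered
```

On Linux or macOS, `kill -USR1 <pid>` would then set `stop_event`. On Windows the SIGUSR1 entry is skipped and only SIGINT is registered.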
We use tarfile to pack/unpack directories in several locations. Instead of using temporary files, we can just use io.BytesIO to avoid unnecessary disk writes.
Note that this functionality is present in 3 different modules - in Ray (AIR), in the release test package, and in a specific release test. The implementations should live in the three modules independently, so we don't add a common utility for this (e.g. the ray_release package should be independent of the Ray package).
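The in-memory approach can be sketched as below. This is an illustration under assumptions, not the code from any of the three modules. The helper names `pack_dir` and `unpack_dir` are hypothetical.

```python
import io
import os
import tarfile


def pack_dir(source_dir: str) -> bytes:
    """Pack the contents of source_dir into a gzipped tarball held in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        # Add each top-level entry under its own name so that archive paths
        # are relative to source_dir. Subdirectories are added recursively.
        for entry in sorted(os.listdir(source_dir)):
            tar.add(os.path.join(source_dir, entry), arcname=entry)
    return buffer.getvalue()


def unpack_dir(data: bytes, target_dir: str) -> None:
    """Unpack a tarball produced by pack_dir into target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        # Only unpack archives from trusted sources.
        tar.extractall(target_dir)
```

The returned bytes can be passed directly to a remote task or upload call, so no temporary archive file is written to disk.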
This PR addresses recent failures in the tune cloud tests.
In particular, this PR changes the following:
The trial runner will now wait for potential previous syncs to finish before syncing once more if force=True is supplied. This is to make sure that the final experiment checkpoints exist in the most recent version on remote storage. This likely fixes some flakiness in the tests.
We switched to new cloud buckets that don't interfere with other tests (and are less likely to be garbage collected).
We're now using dated subdirectories in the cloud buckets so that we don't interfere if two tests are run in parallel. Objects are cleaned up afterwards. The buckets are configured to remove objects after 30 days.
Lastly, we fix an issue in the cloud tests where the RELEASE_TEST_OUTPUT file was unavailable when run in Ray client mode (as e.g. in Kubernetes).
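The forced-sync behavior described above (wait for a sync that is still in flight, then sync once more) can be sketched roughly as follows. `BackgroundSyncer` and its methods are hypothetical names for illustration and do not reflect the actual trial runner or syncer API.

```python
import threading
import time


class BackgroundSyncer:
    """Runs sync_fn in a background thread, at most once per period."""

    def __init__(self, sync_fn, period: float = 300.0):
        self._sync_fn = sync_fn
        self._period = period
        self._thread = None
        self._last_sync = float("-inf")

    def wait(self) -> None:
        # Block until a previously started sync (if any) has finished.
        if self._thread is not None:
            self._thread.join()
            self._thread = None

    def sync_up(self, force: bool = False) -> bool:
        if force:
            # A forced sync first waits for any sync still in flight, so the
            # new sync uploads the most recent state instead of being skipped.
            self.wait()
        elif self._thread is not None and self._thread.is_alive():
            return False  # A sync is already running; skip this one.
        elif time.monotonic() - self._last_sync < self._period:
            return False  # Synced recently; skip this one.

        self._last_sync = time.monotonic()
        self._thread = threading.Thread(target=self._sync_fn, daemon=True)
        self._thread.start()
        return True
```

At the end of an experiment, a caller would invoke `sync_up(force=True)` followed by `wait()` so that the final checkpoint upload has completed before the process exits.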
Local release test runs succeeded.
https://buildkite.com/ray-project/release-tests-branch/builds/189
https://buildkite.com/ray-project/release-tests-branch/builds/191