This PR exposes the new checkpoint interface, implemented in #22691, to end users. It does this by replacing the old external facing TrialCheckpoint class with a merged class that supports the old TrialCheckpoint API (upload, download, save) as well as the new Checkpoint API.
With this PR, users can use the new Checkpoint interface for downstream processing of their Ray Tune results. In a follow-up PR, the new Checkpoint interface will be used internally within Ray Tune and Train for bookkeeping, however, that is not required to unblock the Ray ML use case.
Horovod updated the attributes of DistributedTrainableCreator and args to create Horovod RayExecutor.
horovod/horovod@a729ba7
The major issue is Horovod deprecated "slot" concept, use "worker" instead, which is more consistent with Generic Ray worker. The issue is currently blocking Uber DL trainers to use raytune.
This commit updates the Horovod RayExecutor init args.
Co-authored-by: Kai Fricke <kai@anyscale.com>
This PR consists of the following clean-up items for KubeRay autoscaler integration:
Remove the docker/kuberay directory
Move the Python files formerly in docker/kuberay to the autoscaler directory.
Use a rayproject/ray image for the autoscaler.
Add an entry point for the kuberay autoscaler to scripts.py. Use the entry point in the example config.
Slightly simplify the code that starts the autoscaler.
Ray versions are updated to Ray 1.11.0, which will be officially released within the next couple of days.
By default, Ray >= 1.11.0 runs without Redis. References to Redis are removed from the example config.
Add the autoscaler configuration test to the CI.
Update development documentation to reflect the changes in this PR.
`test_deploy` has become [flakey](https://flakey-tests.ray.io/#) due to timeout. Since `test_deploy` is already a "large" test, this change splits it into two testing files instead of simply increasing the timeout.
This PR splits up the changes in #22393 and introduces an implementation of the ML Checkpoint interface used by Ray Tune.
This means, the TuneCheckpoint class implements the to/from_[bytes|dict|directory|object_ref|uri] conversion functions, as well as more high-level functions to transition between the different TuneCheckpoint classes. It also includes test cases for Tune's main conversion modes, i.e. dict - intermediate - dict and fs - intermediate - fs.
These changes will be the basis for refactoring the tune interface to use TuneCheckpoint objects instead of TrialCheckpoints (externally) and instead of paths/objects (internally).
The new buildkite pipeline prints out faulty results due to a confusion of -ge/-gt and -le/-lt in the retry script. This is a cosmetic error (so behavior was still correct) that is resolved with this PR.
* refactor resource data structure in gcs
* fix comment
* fix lint error
* fix
* DISABLED_TestRejectedRequestWorkerLeaseReply as it depends on the update of normal task
Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
This currently leads to failing builds for schema validation errors after #22901 was merged (the stable column was incorrectly not added to the schema before).
Follow-up to #22748, enabling tests in CI.
Conditions: A new RAY_CI_ML_AFFECTED condition is added for this test suite. The package currently depends on Ray Data, and will be triggered accordingly.
Dependencies: Adding DATA_PROCESSING_TESTING dependencies (set for install-dependencies.sh) for now.
To avoid breakage like in #22905, this PR adds schema validation to the release test package.
In a follow-up PR, we'll likely switch this to use pydantic instead.
test_plasma_unlimited::test_task_unlimited is flaky because one of the assertions is race-y and can trigger after the condition is no longer true (see #22883). This fixes the flake by:
- adding an assertion in between two object allocations to force the object store queue to flush
- keeping one of the ObjectRefs in scope to make sure that the object is still fallback-allocated by the time we reach the failing assertion
We essentially use a hack to determine whether shuffling should be enabled in prepare_data_loader. I've clarified the documentation so the hack is easier to understand.
Remove the experimental note from python 3.9 since it and its core dependencies have been stable for quite some time now.
Co-authored-by: Alex Wu <alex@anyscale.com>
This PR migrates scalability tests to the new infra.
I had to copy the benchmarks folder to the release folder to make it work. I will remove some unnecessary files (e.g., benchmark.yaml or wait_for_cluster file) Alternatively we can support a different path than /release from the tool, but I think this way is cleaner. I am open to suggestion though cc @krfricke
To support mixed precision (see #20643), we need to store a GradScaler instance that is accessibly by both prepare_optimizer and backward functions (these functions will be added later).
This PR introduces the Accelerator, an object that implements methods to perform backend-specific training optimizations.