This PR consists of the following clean-up items for KubeRay autoscaler integration:
Remove the docker/kuberay directory
Move the Python files formerly in docker/kuberay to the autoscaler directory.
Use a rayproject/ray image for the autoscaler.
Add an entry point for the kuberay autoscaler to scripts.py. Use the entry point in the example config.
Slightly simplify the code that starts the autoscaler.
Ray versions are updated to Ray 1.11.0, which will be officially released within the next couple of days.
By default, Ray >= 1.11.0 runs without Redis. References to Redis are removed from the example config.
Add the autoscaler configuration test to the CI.
Update development documentation to reflect the changes in this PR.
`test_deploy` has become [flakey](https://flakey-tests.ray.io/#) due to timeout. Since `test_deploy` is already a "large" test, this change splits it into two testing files instead of simply increasing the timeout.
This PR splits up the changes in #22393 and introduces an implementation of the ML Checkpoint interface used by Ray Tune.
This means, the TuneCheckpoint class implements the to/from_[bytes|dict|directory|object_ref|uri] conversion functions, as well as more high-level functions to transition between the different TuneCheckpoint classes. It also includes test cases for Tune's main conversion modes, i.e. dict - intermediate - dict and fs - intermediate - fs.
These changes will be the basis for refactoring the tune interface to use TuneCheckpoint objects instead of TrialCheckpoints (externally) and instead of paths/objects (internally).
The new buildkite pipeline prints out faulty results due to a confusion of -ge/-gt and -le/-lt in the retry script. This is a cosmetic error (so behavior was still correct) that is resolved with this PR.
* refactor resource data structure in gcs
* fix comment
* fix lint error
* fix
* DISABLED_TestRejectedRequestWorkerLeaseReply as it depends on the update of normal task
Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
This currently leads to failing builds for schema validation errors after #22901 was merged (the stable column was incorrectly not added to the schema before).
Follow-up to #22748, enabling tests in CI.
Conditions: A new RAY_CI_ML_AFFECTED condition is added for this test suite. The package currently depends on Ray Data, and will be triggered accordingly.
Dependencies: Adding DATA_PROCESSING_TESTING dependencies (set for install-dependencies.sh) for now.
To avoid breakage like in #22905, this PR adds schema validation to the release test package.
In a follow-up PR, we'll likely switch this to use pydantic instead.
test_plasma_unlimited::test_task_unlimited is flaky because one of the assertions is race-y and can trigger after the condition is no longer true (see #22883). This fixes the flake by:
- adding an assertion in between two object allocations to force the object store queue to flush
- keeping one of the ObjectRefs in scope to make sure that the object is still fallback-allocated by the time we reach the failing assertion
We essentially use a hack to determine whether shuffling should be enabled in prepare_data_loader. I've clarified the documentation so the hack is easier to understand.
Remove the experimental note from python 3.9 since it and its core dependencies have been stable for quite some time now.
Co-authored-by: Alex Wu <alex@anyscale.com>
This PR migrates scalability tests to the new infra.
I had to copy the benchmarks folder to the release folder to make it work. I will remove some unnecessary files (e.g., benchmark.yaml or wait_for_cluster file) Alternatively we can support a different path than /release from the tool, but I think this way is cleaner. I am open to suggestion though cc @krfricke
To support mixed precision (see #20643), we need to store a GradScaler instance that is accessibly by both prepare_optimizer and backward functions (these functions will be added later).
This PR introduces the Accelerator, an object that implements methods to perform backend-specific training optimizations.
If you don't add `ray.init("auto")` to your training script, then your training script might complain that there aren't enough resources, even if `ray status` shows that there are.
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Comments to be noted from the discussion below,
https://github.com/ray-project/ray/pull/22113#discussion_r802512907
> Problem - We cannot always delegate call to cls.__init__ or modified_cls.__init__. Because if always delegate call to cls.__init__ from here, then user defined class's __init__ method will be ignore leading to issues like, https://github.com/ray-project/ray/issues/21868. If we always delegate call to modified_cls.__init__ then it will allow inheriting from actor classes leading to failure of test_actor_inheritance. So, I have added this if-else check to figure out which __init__ method should be called. If "__module__", "__qualname__" and "__init__" are present in args[-1] then it would mean an actor class is being inherited so cls.__init__ should be called. However, if no such signal is received in args then user defined class's __init__ i.e., modified_class.__init__ should be called.
https://github.com/ray-project/ray/pull/22113#discussion_r808696261
> So I noted that ActorClass.__init__ will anyway raise a TypeError whenever it will be inherited. To exactly figure out whether the exception is due to inheritance of ActorClass, I created a new class ActorClassInheritanceException(TypeError). Now, whenever this will be raised, then DerivedActorClass will get a clear signal about inheritance of ActorClass. In other cases, it will be safe to conclude (AFAICT) that user called __init__ method of their class and we will proceed normally. IMHO, this is a better and more robust solution which just depends on a simple signal i.e., raising a particular exception in a specific event. It doesn't matter how inheritance is prevented as in the end we just need to raise ActorClassInheritanceException and all other code will be able to detect that easily.
https://github.com/ray-project/ray/pull/22113#issuecomment-1048527387