The new Buildkite pipeline prints incorrect results because the retry script confuses -ge/-gt and -le/-lt. The error is cosmetic (the behavior was still correct), and this PR resolves it.
* refactor resource data structure in gcs
* fix comment
* fix lint error
* fix
* disable TestRejectedRequestWorkerLeaseReply (now DISABLED_TestRejectedRequestWorkerLeaseReply), as it depends on the update of normal task
Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
Since #22901 was merged, builds fail with schema validation errors (the stable column had mistakenly not been added to the schema before).
Follow-up to #22748, enabling tests in CI.
Conditions: A new RAY_CI_ML_AFFECTED condition is added for this test suite. The package currently depends on Ray Data, and will be triggered accordingly.
Dependencies: Adding DATA_PROCESSING_TESTING dependencies (set for install-dependencies.sh) for now.
To avoid breakage like in #22905, this PR adds schema validation to the release test package.
In a follow-up PR, we'll likely switch this to use pydantic instead.
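As a rough illustration of the kind of check described above, here is a minimal sketch of schema validation for a single test definition. Everything in it is an assumption made for illustration: the `TEST_SCHEMA` mapping, the `validate_test` helper, and the `name` and `timeout` fields are hypothetical (only a `stable` field is mentioned in these notes), and the actual release test package may validate differently.

```python
from typing import Any, Dict, List

# Hypothetical schema: field name -> required Python type.
# Only "stable" is taken from the notes above; the rest is illustrative.
TEST_SCHEMA: Dict[str, type] = {
    "name": str,
    "stable": bool,
    "timeout": int,
}


def validate_test(test: Dict[str, Any],
                  schema: Dict[str, type] = TEST_SCHEMA) -> List[str]:
    """Return a list of human-readable schema errors (empty if valid)."""
    errors = []
    for key, expected in schema.items():
        if key not in test:
            errors.append(f"missing field: {key}")
        elif type(test[key]) is not expected:
            # Exact type check so that True/False is not accepted as an int.
            errors.append(
                f"wrong type for {key}: expected {expected.__name__}, "
                f"got {type(test[key]).__name__}"
            )
    for key in test:
        if key not in schema:
            # Unknown keys are reported so that the schema has to be
            # updated whenever a new field is introduced.
            errors.append(f"unknown field: {key}")
    return errors


if __name__ == "__main__":
    print(validate_test({"name": "example", "timeout": 600}))
```

Reporting unknown fields as errors is the design choice that would catch a new column being added to test definitions without a matching schema update.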
test_plasma_unlimited::test_task_unlimited is flaky because one of the assertions is racy and can trigger after the condition is no longer true (see #22883). This fixes the flake by:
- adding an assertion in between two object allocations to force the object store queue to flush
- keeping one of the ObjectRefs in scope to make sure that the object is still fallback-allocated by the time we reach the failing assertion
We essentially use a hack to determine whether shuffling should be enabled in prepare_data_loader. I've clarified the documentation so the hack is easier to understand.
Remove the experimental note from Python 3.9, since it and its core dependencies have been stable for quite some time now.
Co-authored-by: Alex Wu <alex@anyscale.com>
This PR migrates scalability tests to the new infra.
I had to copy the benchmarks folder to the release folder to make it work. I will remove some unnecessary files (e.g., benchmark.yaml or the wait_for_cluster file). Alternatively, we could support a path other than /release in the tool, but I think this way is cleaner. I am open to suggestions, though. cc @krfricke
To support mixed precision (see #20643), we need to store a GradScaler instance that is accessible to both the prepare_optimizer and backward functions (these functions will be added later).
This PR introduces the Accelerator, an object that implements methods to perform backend-specific training optimizations.
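The design idea (one object owning state that several training helpers share) can be sketched as follows. This is not the Accelerator from this PR: `StubGradScaler` is a hypothetical stand-in for a real loss scaler so the example runs without PyTorch, and the `prepare_optimizer` and `backward` signatures are assumptions, since the notes say those functions will only be added later.

```python
class StubGradScaler:
    """Hypothetical stand-in for a loss scaler (no PyTorch required)."""

    def __init__(self, init_scale: float = 2.0 ** 16):
        self.init_scale = init_scale

    def scale(self, loss: float) -> float:
        # Scale the loss up so small gradients survive reduced precision.
        return loss * self.init_scale

    def unscale(self, grad: float) -> float:
        # Undo the scaling before the optimizer consumes the gradient.
        return grad / self.init_scale


class Accelerator:
    """Holds backend-specific state shared by several helpers (sketch)."""

    def __init__(self, amp: bool = False):
        # A single scaler instance lives here, so both methods below
        # see the same scale factor.
        self.scaler = StubGradScaler() if amp else None

    def prepare_optimizer(self, step_fn):
        """Wrap a step function so it receives unscaled gradients."""
        if self.scaler is None:
            return step_fn
        scaler = self.scaler

        def wrapped_step(grad):
            return step_fn(scaler.unscale(grad))

        return wrapped_step

    def backward(self, loss: float) -> float:
        """Return the value that backpropagation would start from."""
        if self.scaler is None:
            return loss
        return self.scaler.scale(loss)


if __name__ == "__main__":
    accelerator = Accelerator(amp=True)
    step = accelerator.prepare_optimizer(lambda grad: grad)
    print(step(accelerator.backward(0.5)))
```

Keeping the scaler on the Accelerator rather than in a module-level variable means the two helpers cannot drift apart, which is the motivation stated above.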
If you don't add `ray.init("auto")` to your training script, then your training script might complain that there aren't enough resources, even if `ray status` shows that there are.
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
Comments worth noting from the discussion below:
https://github.com/ray-project/ray/pull/22113#discussion_r802512907
> Problem - We cannot always delegate the call to cls.__init__ or to modified_cls.__init__. If we always delegate to cls.__init__ from here, the user-defined class's __init__ method will be ignored, leading to issues like https://github.com/ray-project/ray/issues/21868. If we always delegate to modified_cls.__init__, it will allow inheriting from actor classes, causing test_actor_inheritance to fail. So I have added this if-else check to figure out which __init__ method should be called. If "__module__", "__qualname__" and "__init__" are present in args[-1], an actor class is being inherited, so cls.__init__ should be called. However, if no such signal is found in args, the user-defined class's __init__, i.e., modified_class.__init__, should be called.
https://github.com/ray-project/ray/pull/22113#discussion_r808696261
> So I noted that ActorClass.__init__ will anyway raise a TypeError whenever it will be inherited. To exactly figure out whether the exception is due to inheritance of ActorClass, I created a new class ActorClassInheritanceException(TypeError). Now, whenever this will be raised, then DerivedActorClass will get a clear signal about inheritance of ActorClass. In other cases, it will be safe to conclude (AFAICT) that user called __init__ method of their class and we will proceed normally. IMHO, this is a better and more robust solution which just depends on a simple signal i.e., raising a particular exception in a specific event. It doesn't matter how inheritance is prevented as in the end we just need to raise ActorClassInheritanceException and all other code will be able to detect that easily.
https://github.com/ray-project/ray/pull/22113#issuecomment-1048527387
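The exception-as-signal approach from the second comment can be sketched in plain Python. This is a simplified model, not Ray's code: `make_actor_class` and `_Counter` are hypothetical names, the real ActorClass carries much more state, and the fallback `TypeError` at the end of `ActorClass.__init__` is a simplification chosen for this sketch.

```python
class ActorClassInheritanceException(TypeError):
    """Raised only when a class statement lists an actor class as a base."""


class ActorClass:
    def __init__(cls, name, bases, attr):
        # Python reaches this with (name, bases, attr) when an ActorClass
        # *instance* is used as a base class in a class statement.
        for base in bases:
            if isinstance(base, ActorClass):
                raise ActorClassInheritanceException(
                    f"Cannot inherit from an actor class (in {name!r})."
                )
        # Simplification: any other call is not an inheritance attempt.
        raise TypeError("ActorClass.__init__ is not meant to be called.")


def make_actor_class(user_cls):
    """Hypothetical helper that wraps user_cls the way a decorator might."""

    class DerivedActorClass(ActorClass, user_cls):
        def __init__(self, *args, **kwargs):
            try:
                ActorClass.__init__(self, *args, **kwargs)
            except ActorClassInheritanceException:
                # Clear signal: someone tried to subclass the actor class.
                raise
            except TypeError:
                # Any other TypeError means the user is calling their own
                # __init__, so delegate to the user-defined class.
                user_cls.__init__(self, *args, **kwargs)

    # Create the wrapper object without running __init__.
    return DerivedActorClass.__new__(DerivedActorClass)


class _Counter:
    def __init__(self, start):
        self.value = start


Counter = make_actor_class(_Counter)

if __name__ == "__main__":
    Counter.__init__(5)  # falls through to _Counter.__init__
    print(Counter.value)
    try:
        class SubCounter(Counter):  # inheritance attempt
            pass
    except ActorClassInheritanceException as exc:
        print("rejected:", exc)
```

Compared with inspecting `args[-1]` for "__module__", "__qualname__" and "__init__", the dedicated exception keeps the detection logic in one place: whatever mechanism blocks inheritance only has to raise that one type.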
There is a bug in combining the results from map_batches: if we create two datasets from the same data but with different numbers of partitions, we may get different results when running the same map_batches() on them. That is, the number of partitions affects the map_batches() results, which it should not.
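The invariant described above can be written down as a small check. The sketch below is a toy model in plain Python, not Ray's implementation, and it does not reproduce the actual bug, whose cause is not described here; `split_into_blocks` and the toy `map_batches` are hypothetical helpers that only show what "independent of the number of partitions" means.

```python
from typing import Callable, List


def split_into_blocks(rows: List[int], num_blocks: int) -> List[List[int]]:
    """Split rows into at most num_blocks contiguous partitions."""
    size = max(1, -(-len(rows) // num_blocks))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def map_batches(blocks: List[List[int]],
                fn: Callable[[List[int]], List[int]],
                batch_size: int) -> List[int]:
    """Toy map_batches whose batch boundaries ignore block boundaries."""
    # Flatten first so batching depends only on the row order.
    rows = [row for block in blocks for row in block]
    out: List[int] = []
    for i in range(0, len(rows), batch_size):
        out.extend(fn(rows[i:i + batch_size]))
    return out


def batch_sum(batch: List[int]) -> List[int]:
    # A batch-dependent function: its output changes if batches change.
    return [sum(batch)]


if __name__ == "__main__":
    data = list(range(10))
    few = map_batches(split_into_blocks(data, 2), batch_sum, batch_size=4)
    many = map_batches(split_into_blocks(data, 5), batch_sum, batch_size=4)
    print(few == many)
```

A batch-dependent function such as `batch_sum` is a useful probe for this kind of regression test, because an element-wise function would hide any difference in how batches are formed.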
To prepare for additional changes in pubsub to fix #22339 and #22340:
- Use structs instead of std::pair to hold per-subscription data, in case we need to expand the data fields.
- Rename variables in tests to indicate non-object pubsub testing.
- Pass full request to long poll handler in Publisher.
- Simplify logic when possible.
There should be no behavior change. Most of the code changes are based on #20276.
Enables lineage reconstruction, which allows automatic recovery of task outputs, by default.
Also adds an info message to the driver whenever objects need to be reconstructed (not including recursive reconstruction).