Why are these changes needed?
The Parquet file sampling PR (#26868) caused a nightly test regression (#26995). This turns the feature off by default to unblock the nightly tests while we keep debugging the root cause.
This PR:
- Updates the KubeRay operator commit used in the Ray CI autoscaling test
- Uses the RayCluster autoscaling sample config from the KubeRay repo in place of a config from the Ray repo
- Turns the autoscaler RPC worker drain back on, as I saw some dead node messages from the GCS, and the RPC drain is supposed to avoid those.
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
Why are these changes needed?
The DLAMI moved underneath us and broke for two reasons:
- The AMI's snapshot size increased to 140 GB, which exceeded our hardcoded max EBS volume size of 100 GB.
- The AMI dropped support for Python 3.7 and only has 3.8 now.

The short-term solutions are simple:
- Allocate a bigger EBS volume.
- Use the TensorFlow 3.8 env.
Related issue number
Closes #26368
Co-authored-by: Alex <alex@anyscale.com>
Why are these changes needed?
This PR updates all autoscaler yaml examples/defaults to not use the legacy head_node and worker_node fields and deletes the explicit example-full-legacy yamls and the corresponding tests.
Related issue number
For ease of review, this PR is purely cosmetic/yaml editing (plus minor test changes to keep CI happy). It partially satisfies #20837. There will be two more follow-up PRs (one to make the schema change and update the configs baked into unit tests, and another to clean up the legacy code).
Co-authored-by: Alex <alex@anyscale.com>
If you read a folder with differently-sized images, `ImageFolderDatasource` errors. This PR fixes the issue by resizing images to a user-specified size.
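A hedged usage sketch of the resizing behavior described above; the import path and the `size` parameter name are assumptions based on this description, not a confirmed API:

```python
import ray
from ray.data.datasource import ImageFolderDatasource  # path may differ by Ray version

# Read a folder of differently-sized images and resize them to a common,
# user-specified size so they can share one tensor column.
ds = ray.data.read_datasource(
    ImageFolderDatasource(),
    root="/path/to/image-folder",  # placeholder path
    size=(64, 64),                 # assumed parameter: target (height, width)
)
print(ds.schema())
```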
## Why are these changes needed?
This PR does 2 things.
1. When `--detail` is specified, set the default formatting to YAML.
2. It seems to take 5 seconds to register the head node with the API server (because it fetches node info every 5 seconds, and when the API server has just started, the head node is not yet registered with the GCS). This PR shortens the node ping interval until the head node is registered with the API server.
## Related issue number
Closes https://github.com/ray-project/ray/issues/26939
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
* Revert "Revert "[Train] Add support for handling multiple batch data types for prepare_data_loader (#26386)" (#26483)"
This reverts commit e6c04031fd.
## Why are these changes needed?
This PR ensures that workflows can work properly with the Ray client.
Regular workflow tests will (also) run under client mode (as a pytest parameter). Some tests are moved and reorganized, because the Ray client tests require starting the cluster, so some tests require isolation or related changes.
Tests that literally take down the cluster are not run with the Ray client, since the Ray client would fail in that scenario.
Limitations of Ray Workflow under Ray client are noted in the doc.
## Related issue number
Closes #21595
Currently, restoring from cloud URIs does not work for Tuner() objects. With this PR, e.g. `Tuner.restore("s3://bucket/exp")` will work.
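A minimal usage sketch; the bucket and experiment names are placeholders:

```python
from ray.tune import Tuner

# Restore an experiment directly from a cloud URI and resume it.
tuner = Tuner.restore("s3://bucket/exp")
results = tuner.fit()
```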
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
For KubeRay, this PR:
- Disables the autoscaler's RPC drain of worker nodes prior to termination.
- Disables the autoscaler's termination of nodes disconnected from the GCS.
Signed-off-by: rickyyx <rickyx@anyscale.com>
# Why are these changes needed?
When we return fewer/incomplete results to users, there can be three reasons:
- Data being truncated at the data source (raylets -> API server)
- Data being filtered at the API server
- Data being limited at the API server

We are not distinguishing those three scenarios, but we should. This is why we thought data was being truncated when it was actually filtered/limited.
This PR distinguishes these scenarios and prompts warnings accordingly.
# Related issue number
Closes #26570. Closes #26923.
When a task fails, it waits for the death info and then fails. The wait is 1s and the check runs every 1s. This is good for usability, but it causes issues in some cases because it delays the task return by at least 1s and at most 2s.
This PR introduces an early cut: when the timeout is set to 0, it returns immediately. The semantics don't change, and most users will still get the message.
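A generic illustration of the timing behavior described above, sketched in Python rather than the actual core-worker code; the function name and signature are made up:

```python
import time

def wait_for_death_info(check, timeout_s=1.0, interval_s=1.0):
    """Illustrative sketch: poll `check` every `interval_s` up to `timeout_s`,
    with an early cut that skips the wait entirely when the timeout is zero."""
    if timeout_s == 0:
        return check()  # early cut: return immediately, no 1-2s delay
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    return check()

# With timeout_s=0 the caller gets an answer immediately,
# even if the death info has not arrived yet.
print(wait_for_death_info(lambda: False, timeout_s=0))
```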
## Why are these changes needed?
When the GCS restarts, the raylet sometimes needs a while to reconnect to the GCS; for example, in a k8s env, it takes a while to move the GCS behind the service. This PR tries to fix this by allowing a longer timeout for the first ping after the GCS restarts.
Once the GCS gets the first ping, it just uses the regular timeout instead.
Currently, trainables will try to sync temporary checkpoints up to/down from cloud storage, leading to errors. These errors come up e.g. with PBT, which heavily uses saving/restoring from objects.
Instead, we should not sync these temporary checkpoints up at all, and we should generally not sync down if a local checkpoint directory exists, which will also prevent us from trying to sync down non-existent temporary checkpoint directories.
See #26714
Signed-off-by: Kai Fricke <kai@anyscale.com>
Calling e.g. `os.path.exists(checkpoint)` currently raises a TypeError, but we should make it more explicit and guide users towards the correct API.
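A minimal sketch of the kind of guidance this could provide, using a stand-in class rather than the real Ray `Checkpoint`; the exact message and mechanism are assumptions:

```python
import os

class Checkpoint:  # stand-in for illustration, not the real Ray class
    def __fspath__(self):
        # Raise an explicit error that points users at the intended API
        # instead of letting a generic TypeError surface.
        raise TypeError(
            "Checkpoint objects cannot be used as filesystem paths. "
            "Materialize the checkpoint to a directory first, e.g. with "
            "Checkpoint.to_directory(), and pass that path instead."
        )

try:
    os.path.exists(Checkpoint())
except TypeError as e:
    print(e)
```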
Signed-off-by: Kai Fricke <kai@anyscale.com>
This PR just applies the changes from the following PRs:
- [Datasets] Automatically cast tensor columns when building Pandas blocks. #26684
  (reverted by Revert "[Datasets] Automatically cast tensor columns when building Pandas blocks." #26921)
- [AIR - Datasets] Fix TensorDtype construction from string and fix example. #26904
This fixes the test failures introduced in the originally reverted PRs.
- Update the cluster_activities endpoint to use pydantic so we have better data validation.
- Make timestamp a required field.
- Add pydantic to the ray[default] requirements.
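A minimal sketch of the validation this enables; field names other than `timestamp` are assumptions, not the actual response schema:

```python
from pydantic import BaseModel, ValidationError

class ClusterActivity(BaseModel):
    reason: str        # illustrative field, assumed
    timestamp: float   # now required: omitting it fails validation

try:
    ClusterActivity(reason="active drivers")  # missing timestamp
except ValidationError as err:
    print(err)  # reports that `timestamp` is a required field
```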
Why are these changes needed?
Resubmitting #26869.
This PR was reverted due to failing tests; however, those failures were actually due to a dependency: #26950
Signed-off-by: Matthew Deng <matt@anyscale.com>
Note: This aims to mitigate the errors of the failing tests, but a follow-up is needed for a long-term solution.
Why are these changes needed?
A bunch of CI tests started failing on 7/23.
A quick sanity check shows that only werkzeug was upgraded, from 2.1.2 to 2.2.0; the new version was released on 7/23.
Verified that running `pip install -U Werkzeug==2.1.2` fixes (at least) test_dataset_formats.
This PR does 3 things.
1. Warn if callsite recording is disabled when running `ray list objects` and `ray summary objects`
2. Decode owner_id for `ray list actors`
3. Support raise_on_missing_output (a usage sketch follows below)
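A hedged sketch of how `raise_on_missing_output` might be used from the Python state API; the import path and the exact way the flag is exposed are assumptions based on this description:

```python
from ray.experimental.state.api import list_objects  # path may differ by Ray version

# With raise_on_missing_output=True (assumed default behavior), an exception
# is raised when the API server cannot return complete results; with False,
# partial results are returned and a warning is printed instead.
objects = list_objects(raise_on_missing_output=False)
print(len(objects))
```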
Signed-off-by: Yi Cheng <chengyidna@gmail.com>
## Why are these changes needed?
Right now, only the C++ layer in Ray connects to Redis, which means we don't need the pip `redis` package to connect to a Redis DB.
The blocking part is that we currently do some sharding in Redis. But this feature is not actually used and the shard count is always 1. So to keep things simple, this feature is just disabled.
A test is added to make sure we can start Ray with a Redis DB without the pip `redis` package.
Splitting up #26884: This PR includes changes to use Tuner() instead of tune.run() for all examples included in python/ray/tune/examples
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: xwjiang2010 <xwjiang2010@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Kai Fricke <coding@kaifricke.com>
Why are these changes needed?
Splitting up #26884: This PR includes changes to use Tuner() instead of tune.run() for most docs files (rst and py), and a change to move reuse_actors to the TuneConfig.
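A minimal sketch of the migration pattern applied across the docs; the trainable and search space here are made up for illustration:

```python
from ray import tune

def objective(config):
    # Trivial placeholder trainable that reports a final result.
    return {"score": config["x"] ** 2}

# Before: tune.run(objective, config={"x": tune.grid_search([1, 2, 3])}, reuse_actors=True)
# After:
tuner = tune.Tuner(
    objective,
    param_space={"x": tune.grid_search([1, 2, 3])},
    tune_config=tune.TuneConfig(reuse_actors=True),  # reuse_actors moves into TuneConfig
)
results = tuner.fit()
```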
Signed-off-by: Yi Cheng <chengyidna@gmail.com>
## Why are these changes needed?
When an actor dies, it sends a notification to core workers. Right now, the core worker sometimes queues the task waiting for the actor death info and pops it later, for better usability. But in async cases, this causes issues.
The callback might submit tasks which require holding the lock, but the lock is already being held. This causes a deadlock.
This PR fixes this by moving the failure handling out of the lock.
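A generic illustration of the deadlock pattern described above and of the fix of invoking the callback outside the lock; this is not the actual core-worker code:

```python
import threading

lock = threading.Lock()
pending = []

def submit_task(task):
    # The submission path needs the same (non-reentrant) lock.
    with lock:
        pending.append(task)

def handle_actor_death_deadlocks(callback):
    with lock:
        callback()  # deadlock: the callback calls submit_task, which needs `lock`

def handle_actor_death_fixed(callback):
    with lock:
        pass        # update internal state while holding the lock
    callback()      # run the failure callback only after releasing the lock

handle_actor_death_fixed(lambda: submit_task("resubmitted task"))
print(pending)
```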