This PR:
- Updates the KubeRay operator commit used in the Ray CI autoscaling test
- Uses the RayCluster autoscaling sample config from the KubeRay repo in place of a config from the Ray repo
- Turns the autoscaler RPC worker drain back on, as I saw some dead node messages from the GCS, and the RPC drain is supposed to avoid those.
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
We encountered a SIGSEGV when running the Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is:
```
#0 0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) ()
from /lib64/libstdc++.so.6
#1 0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#2 0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#3 0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#4 0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] ()
from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#5 0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#6 0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#7 0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#8 0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#9 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2
#11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2
#12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6
#14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2
#15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>)
at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369
```
The root cause is that when loading `_raylet.so`, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. However, the static variables declared in `logging.cc` are not initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`).
It's better not to rely on the initialization order of static variables in different compilation units because it's not guaranteed. I propose to change all `RAY_LOG`s to `std::cerr` in `DelayManager::Init()`.
The crash happens in Ant's internal codebase. Not sure why this test case passes in the community version though.
BTW, I've tried different approaches:
1. Using a static local variable in `get_delay_us` and removing the global variable. This doesn't work because `init()` needs to access the variable as well.
2. Defining the global variable as type `std::unique_ptr<DelayManager>` and initializing it in `get_delay_us`. This works, but it requires a lock to be thread-safe.
Why are these changes needed?
The DLAMI changed underneath us and broke our setup for two reasons:
- The AMI's snapshot size increased to 140 GB, which is more than our hardcoded max EBS volume size of 100 GB.
- The AMI dropped support for Python 3.7 and only has 3.8 now.
The short-term solutions are simple:
- Allocate a bigger EBS volume.
- Use the TensorFlow env for Python 3.8.
Related issue number
Closes #26368
Co-authored-by: Alex <alex@anyscale.com>
Why are these changes needed?
This PR updates all autoscaler yaml examples/defaults to not use the legacy head_node and worker_node fields and deletes the explicit example-full-legacy yamls and the corresponding tests.
Related issue number
For ease of review, this PR is purely cosmetic/yaml editing (plus minor test changes to keep CI happy). It partially satisfies #20837. There will be 2 more follow up PRs (one to make the schema change and update the configs baked into unit tests, and another to clean up the legacy code).
Co-authored-by: Alex <alex@anyscale.com>
When cleaning up after the k8s operator tests, we should always delete the k8s cluster even if something went wrong (in fact, it's not clear we even need to clean up the resources within the cluster).
Signed-off-by: Alex Wu <itswu.alex@gmail.com>
This PR puts the Ray Clusters (under construction) docs section (see #26754) under Ray Clusters as a subpage.
This makes the master branch docs clean and presentable for users
Ray Clusters doc writers can use existing CI to iterate on the docs, without having a massive PR once we're done.
Signed-off-by: Cade Daniel <cade@anyscale.com>
If you read a folder with differently-sized images, `ImageFolderDatasource` errors. This PR fixes the issue by resizing images to a user-specified size.
## Why are these changes needed?
This PR does 2 things.
1. When `--detail` is specified, set the default formatting as yaml.
2. It seems to take 5 seconds to register the head node with the API server (the API server fetches node info every 5 seconds, and when it has just started, the head node is not yet registered with GCS). This PR polls node info more frequently until the head node is registered with the API server.
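
The second change can be sketched roughly as follows. This is only an illustration of the idea, not the dashboard's actual code: `next_poll_interval`, `simulate`, and the 0.1 s startup value are hypothetical. Only the 5-second regular interval comes from the description above.

```python
REGULAR_INTERVAL_S = 5.0  # regular polling period mentioned above
FAST_INTERVAL_S = 0.1     # hypothetical shorter period used during startup


def next_poll_interval(head_node_registered: bool) -> float:
    """Return how long to wait before fetching node info again."""
    return REGULAR_INTERVAL_S if head_node_registered else FAST_INTERVAL_S


def simulate(registered_after_polls: int, total_polls: int) -> list:
    """Simulate a run in which the head node appears after N polls."""
    intervals = []
    for poll in range(total_polls):
        registered = poll >= registered_after_polls
        intervals.append(next_poll_interval(registered))
    return intervals
```

With this shape, the fast interval only applies while the head node is missing, so steady-state polling load stays the same.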
## Related issue number
Closes https://github.com/ray-project/ray/issues/26939
This PR:
- Creates a new chapter in the docs titled "Ray Clusters (Under Construction)".
- The new chapter makes the Ray Clusters docs follow the same structure as the other docs (https://diataxis.fr/)
- The new chapter will eventually replace the old chapter.
I want to merge this now so that @DmitriGekhtman can put his KubeRay docs into the new structure.
Signed-off-by: Cade Daniel <cade@anyscale.com>
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
* Revert "Revert "[Train] Add support for handling multiple batch data types for prepare_data_loader (#26386)" (#26483)"
This reverts commit e6c04031fd.
## Why are these changes needed?
This PR ensures that workflow can work properly with Ray client.
Regular workflow tests will (also) run under client mode (as a pytest parameter). Some tests are moved and reorganized: because the Ray client tests require starting the cluster, some tests require isolation or related changes.
Tests that literally take down the cluster are not tested with Ray client, since Ray client would fail in this scenario.
Limitations of Ray Workflow under Ray client are noted in the doc.
## Related issue number
Closes #21595
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
The latest PyTorch version has wheels for CUDA 11.6. Per user request, this adds a CUDA 11.6 image to our build pipeline.
Currently, restoring from cloud URIs does not work for Tuner() objects. With this PR, e.g. `Tuner.restore("s3://bucket/exp")` will work.
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
## Why are these changes needed?
Reduces memory footprint of the dashboard.
Also adds some cleanup to the errors data.
Also cleans up actor cache by removing dead actors from the cache.
Dashboard UI no longer allows you to see logs for all workers in a node. You must click into each worker's logs individually.
<img width="1739" alt="Screen Shot 2022-07-20 at 9 13 00 PM" src="https://user-images.githubusercontent.com/711935/180128633-1633c187-39c9-493e-b694-009fbb27f73b.png">
## Related issue number
Fixes #23680
Fixes #22027
Fixes #24272
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
For KubeRay:
- Disables autoscaler's RPC drain of worker nodes prior to termination.
- Disables autoscaler's termination of nodes disconnected from the GCS.
# Why are these changes needed?
The dashboard can display the message `<actor> cannot be created because the Ray cluster cannot satisfy its resource requirements` in the case where the runtime env setup is stalled. This PR updates this message to include the possibility of the runtime env setup failing.
This PR adds a tip to the Job Submission doc saying that if a job is stalled in PENDING, the runtime env setup may have stalled. It adds a pointer to the log files which should have more information.
The runtime env cannot stall forever; it fails after 10 minutes. This is a new feature added after the Ray 1.13 branch cut. In Ray <= 1.13, the runtime env can still stall forever.
# Related issue number
Closes #26332
Signed-off-by: rickyyx <rickyx@anyscale.com>
# Why are these changes needed?
When we return fewer or incomplete results to users, there can be 3 reasons:
- Data being truncated at the data source (raylets -> API server)
- Data being filtered at the API server
- Data being limited at the API server
We do not distinguish these 3 scenarios, but we should. This is why we thought data was being truncated when it was actually filtered/limited.
This PR distinguishes these scenarios and prompts warnings accordingly.
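
As a rough illustration of how the three cases can be told apart, the sketch below tracks a count at each stage and reports which stages dropped entries. All names here (`ListCounts`, `incomplete_reasons`, and the field names) are hypothetical and are not the state API's actual types.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ListCounts:
    total: int                 # entries that exist at the data source
    num_after_truncation: int  # entries that reached the API server
    num_after_filter: int      # entries left after filters were applied
    num_returned: int          # entries left after the limit was applied


def incomplete_reasons(c: ListCounts) -> List[str]:
    """Report which stages dropped entries, in pipeline order."""
    reasons = []
    if c.num_after_truncation < c.total:
        reasons.append("truncated")
    if c.num_after_filter < c.num_after_truncation:
        reasons.append("filtered")
    if c.num_returned < c.num_after_filter:
        reasons.append("limited")
    return reasons
```

Keeping one count per stage means a caller can word its warning differently for truncation than for filtering or limiting.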
# Related issue number
Closes #26570
Closes #26923
When a task fails, it waits for the death info and then fails. The wait is 1s and the check runs every 1s. This is good for usability, but it causes issues in some cases because it delays the task return by at least 1s and at most 2s.
This PR introduces an early cut: when the timeout is set to 0, it returns immediately. The semantics don't change, and most users will still get the message.
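
A simplified sketch of the early cut, written in Python for illustration only. The real logic lives in Ray's core and is not shown here; `wait_for_death_info` and its parameters are hypothetical names, and the polling loop is a simplification of the behavior described above.

```python
import time
from typing import Callable, Optional


def wait_for_death_info(
    get_death_info: Callable[[], Optional[str]],
    timeout_s: float = 1.0,
    check_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll for death info until it is available or the timeout elapses."""
    if timeout_s <= 0:
        # Early cut: a zero timeout means "do not wait at all".
        return get_death_info()
    waited = 0.0
    while True:
        info = get_death_info()
        if info is not None or waited >= timeout_s:
            return info
        sleep(check_interval_s)
        waited += check_interval_s
```

The `sleep` parameter is injectable only so that the behavior can be checked without real delays.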
## Why are these changes needed?
When GCS restarts, the raylet sometimes needs a while to reconnect to it. For example, in a k8s env, it takes a while to move GCS to the service. This PR tries to fix this by allowing a longer timeout for the first ping after GCS restarts.
Once GCS gets the first ping, it uses the regular timeout again.
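
The idea can be sketched as a small per-node tracker. This is a hypothetical illustration in Python, not the GCS implementation: `PingTracker`, its method names, and the timeout values in the usage below are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class PingTracker:
    regular_timeout_s: float
    first_ping_timeout_s: float  # longer grace period after a GCS restart
    seen_first_ping: Dict[str, bool] = field(default_factory=dict)

    def on_gcs_restart(self, node_ids: Iterable[str]) -> None:
        """After a restart, no node has pinged the new GCS yet."""
        for node_id in node_ids:
            self.seen_first_ping[node_id] = False

    def on_ping(self, node_id: str) -> None:
        self.seen_first_ping[node_id] = True

    def timeout_for(self, node_id: str) -> float:
        """Use the longer timeout until the node's first ping arrives."""
        if self.seen_first_ping.get(node_id, False):
            return self.regular_timeout_s
        return self.first_ping_timeout_s
```

For example, with `PingTracker(regular_timeout_s=30.0, first_ping_timeout_s=120.0)`, a node gets the 120-second allowance after a restart and drops back to 30 seconds once its first ping is recorded.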
Currently, trainables will try to sync temporary checkpoints up to and down from cloud storage, leading to errors. These errors come up e.g. with PBT, which heavily uses saving/restoring from objects.
Instead, we should not sync these temporary checkpoints up at all, and we should generally not sync down if a local checkpoint directory exists, which will also prevent us from trying to sync down non-existent temporary checkpoint directories.
See #26714
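
The two rules above can be written as a pair of small predicates. This is a sketch only: `should_sync_up` and `should_sync_down` are hypothetical helper names, not functions in Ray Tune.

```python
import os


def should_sync_up(is_temporary_checkpoint: bool) -> bool:
    """Temporary checkpoints are never uploaded to cloud storage."""
    return not is_temporary_checkpoint


def should_sync_down(local_checkpoint_dir: str) -> bool:
    """Only download when no local checkpoint directory exists yet."""
    return not os.path.exists(local_checkpoint_dir)
```

Because temporary checkpoints are never uploaded, skipping the download whenever a local directory exists also avoids requesting remote directories that were never created.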
Signed-off-by: Kai Fricke <kai@anyscale.com>
Calling e.g. `os.path.exists(checkpoint)` currently raises a TypeError, but we should make it more explicit and guide users towards the correct API.
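
One way to produce a more explicit error is to define `__fspath__` so that it raises a `TypeError` with guidance. The sketch below uses a hypothetical `DemoCheckpoint` class with a `to_directory()` method. It is not Ray's checkpoint class, and whether this PR takes exactly this approach is an assumption.

```python
import os


class DemoCheckpoint:
    """Hypothetical stand-in for a checkpoint object (not Ray's class)."""

    def __init__(self, local_path: str):
        self._local_path = local_path

    def to_directory(self) -> str:
        """Return a filesystem path that holds the checkpoint data."""
        return self._local_path

    def __fspath__(self) -> str:
        # Called by os.fspath() and by path-accepting functions.
        raise TypeError(
            "Checkpoint objects cannot be used directly as paths. "
            "Call to_directory() to get a path instead."
        )
```

With this in place, `os.fspath(DemoCheckpoint("/tmp/ckpt"))` raises the custom message instead of a generic one, while `DemoCheckpoint("/tmp/ckpt").to_directory()` returns a usable path.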
Signed-off-by: Kai Fricke <kai@anyscale.com>
This PR just applies the changes from the following PRs:
[Datasets] Automatically cast tensor columns when building Pandas blocks. #26684
reverted by Revert "[Datasets] Automatically cast tensor columns when building Pandas blocks." #26921
[AIR - Datasets] Fix TensorDtype construction from string and fix example. #26904
This fixes the test failures introduced in the originally reverted PRs.