## Why are these changes needed?
- Fixes the jobs tab in the new dashboard. Previously it didn't load.
- Combines the old job concept ("driver jobs") and the new job submission concept into a single concept called "jobs". The Jobs tab shows information about both kinds of jobs.
- Updates all job APIs: they now return both submission jobs and driver jobs, include additional fields in the response ("id", "job_id", "submission_id", and "driver"), and accept either a job_id or a submission_id as input.
- Job ID is the same as the "ray core job id" concept. It has the form "0100000" and is the primary ID used to represent jobs.
- Submission ID is an ID generated for each Ray job submission. It has the form "raysubmit_12345...". It is a secondary ID that can be used if a client needs to provide a self-generated ID, or if no job ID exists (e.g., the submission job doesn't create a Ray driver).
**This PR has 2 deprecations:**
- The `submit_job` SDK now accepts a new kwarg, `submission_id`; `job_id` is deprecated (see the sketch after these lists).
- The `ray job submit` CLI now accepts `--submission-id`; `--job-id` is deprecated.
**This PR has 4 backwards incompatible changes:**
- The `list_jobs` SDK call now returns a list instead of a dictionary.
- The `ray job list` CLI now prints a list instead of a dictionary.
- The `/api/jobs` endpoint returns a list instead of a dictionary.
- The `POST /api/jobs` endpoint (submit job) now returns JSON with a `submission_id` field instead of `job_id`.
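A minimal sketch of the updated SDK surface described above (hedged: the kwarg and field names follow this PR's description and may differ slightly from the final API):

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

# Deprecation: pass submission_id instead of the deprecated job_id kwarg.
submission_id = client.submit_job(
    entrypoint="python my_script.py",
    submission_id="raysubmit_my_custom_id",  # optional, client-generated ID
)

# Backwards-incompatible change: list_jobs() now returns a list, not a dict.
for job in client.list_jobs():
    # Each entry carries both identifiers described above.
    print(job.job_id, job.submission_id)
```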
## Why are these changes needed?
`object_store_memory` as a task option has already been removed, and `object_store_memory` as an actor option has actually been broken since at least last year (#20344).
Given we now have object spilling and fallback allocation, this option is no longer useful.
## Why are these changes needed?
The Parquet file sampling PR (#26868) caused a nightly test regression (#26995). This PR turns the feature off by default to unblock the nightly test while we keep debugging the root cause.
This PR:
- Updates the KubeRay operator commit used in the Ray CI autoscaling test
- Uses the RayCluster autoscaling sample config from the KubeRay repo in place of a config from the Ray repo
- Turns the autoscaler RPC worker drain back on, as I saw some dead node messages from the GCS, and the RPC drain is supposed to avoid those.
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
We encountered SIGSEGV when running Python test `python/ray/tests/test_failure_2.py::test_list_named_actors_timeout`. The stack is:
```
#0 0x00007fffed30f393 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&) ()
from /lib64/libstdc++.so.6
#1 0x00007fffee707649 in ray::RayLog::GetLoggerName() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#2 0x00007fffee70aa90 in ray::SpdLogMessage::Flush() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#3 0x00007fffee70af28 in ray::RayLog::~RayLog() () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#4 0x00007fffee2b570d in ray::asio::testing::(anonymous namespace)::DelayManager::Init() [clone .constprop.0] ()
from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#5 0x00007fffedd0d95a in _GLOBAL__sub_I_asio_chaos.cc () from /home/admin/dev/Arc/merge/ray/python/ray/_raylet.so
#6 0x00007ffff7fe282a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#7 0x00007ffff7fe2931 in _dl_init () from /lib64/ld-linux-x86-64.so.2
#8 0x00007ffff7fe674c in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#9 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#10 0x00007ffff7fe5ffe in _dl_open () from /lib64/ld-linux-x86-64.so.2
#11 0x00007ffff7d5f39c in dlopen_doit () from /lib64/libdl.so.2
#12 0x00007ffff7b82e79 in _dl_catch_exception () from /lib64/libc.so.6
#13 0x00007ffff7b82f13 in _dl_catch_error () from /lib64/libc.so.6
#14 0x00007ffff7d5fb09 in _dlerror_run () from /lib64/libdl.so.2
#15 0x00007ffff7d5f42a in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2
#16 0x00007fffef04d330 in py_dl_open (self=<optimized out>, args=<optimized out>)
at /tmp/python-build.20220507135524.257789/Python-3.7.11/Modules/_ctypes/callproc.c:1369
```
The root cause is that when loading `_raylet.so`, `static DelayManager _delay_manager` is initialized and `RAY_LOG(ERROR) << "RAY_testing_asio_delay_us is set to " << delay_env;` is executed. However, the static variables declared in `logging.cc` are not initialized yet (in this case, `std::string RayLog::logger_name_ = "ray_log_sink"`).
It's better not to rely on the initialization order of static variables in different compilation units because it's not guaranteed. I propose to change all `RAY_LOG`s to `std::cerr` in `DelayManager::Init()`.
The crash happens in Ant's internal codebase. Not sure why this test case passes in the community version though.
BTW, I've tried different approaches:
1. Using a static local variable in `get_delay_us` and removing the global variable. This doesn't work because `Init()` needs to access the variable as well.
2. Defining the global variable as `std::unique_ptr<DelayManager>` and initializing it in `get_delay_us`. This works, but it requires a lock to be thread-safe.
## Why are these changes needed?
The DLAMI moved underneath us and broke for 2 reasons:
- The AMI's snapshot size increased to 140 GB, which was more than our hardcoded max EBS volume size of 100 GB.
- The AMI dropped support for Python 3.7 and only has 3.8 now.
The short-term solutions are simple:
- Allocate a bigger EBS volume.
- Use the TensorFlow 3.8 env.
## Related issue number
Closes #26368
Co-authored-by: Alex <alex@anyscale.com>
## Why are these changes needed?
This PR updates all autoscaler yaml examples/defaults to not use the legacy head_node and worker_node fields and deletes the explicit example-full-legacy yamls and the corresponding tests.
## Related issue number
For ease of review, this PR is purely cosmetic/yaml editing (plus minor test changes to keep CI happy). It partially satisfies #20837. There will be 2 more follow-up PRs (one to make the schema change and update the configs baked into unit tests, and another to clean up the legacy code).
Co-authored-by: Alex <alex@anyscale.com>
When cleaning up after the k8s operator tests, we should always delete the k8s cluster even if something went wrong (in fact, it's not clear we even need to clean up the resources within the cluster).
Signed-off-by: Alex Wu <itswu.alex@gmail.com>
This PR puts the Ray Clusters (under construction) docs section (see #26754) under Ray Clusters as a subpage.
- This makes the master branch docs clean and presentable for users.
- Ray Clusters doc writers can use existing CI to iterate on the docs, without having a massive PR once we're done.
Signed-off-by: Cade Daniel <cade@anyscale.com>
If you read a folder with differently-sized images, `ImageFolderDatasource` errors. This PR fixes the issue by resizing images to a user-specified size.
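A hedged sketch of how the user-specified size might be passed (the `size` parameter name and the `read_datasource` call here are assumptions for illustration, not the confirmed final API):

```python
import ray
from ray.data.datasource import ImageFolderDatasource

# Assumption: the new size argument resizes every image to a common shape so
# that differently-sized images can be read into one dataset without erroring.
ds = ray.data.read_datasource(
    ImageFolderDatasource(),
    root="/path/to/image-folder",
    size=(64, 64),
)
```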
## Why are these changes needed?
This PR does 2 things.
1. When `--detail` is specified, set the default output format to YAML.
2. It seems to take 5 seconds to register the head node with the API server (node info is fetched every 5 seconds, and when the API server has just started, the head node is not yet registered to the GCS). This PR decreases the node ping interval until the head node is registered with the API server.
## Related issue number
Closes https://github.com/ray-project/ray/issues/26939
This PR:
- Creates a new chapter in the docs titled "Ray Clusters (Under Construction)".
- The new chapter makes the Ray Clusters docs follow the same structure as the other docs (https://diataxis.fr/).
- The new chapter will eventually replace the old chapter.
I want to merge this now so that @DmitriGekhtman can put his Kuberay docs into the new structure.
Signed-off-by: Cade Daniel <cade@anyscale.com>
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
* Revert "Revert "[Train] Add support for handling multiple batch data types for prepare_data_loader (#26386)" (#26483)"
This reverts commit e6c04031fd.
## Why are these changes needed?
This PR ensures that workflow can work properly with Ray client.
Regular workflow tests will (also) run under client mode (as a pytest parameter). Some tests are moved and reorganized because the Ray client tests require starting the cluster, so some tests require isolation or related changes.
Tests that literally take down the cluster are not tested with Ray client, since Ray client would fail in this scenario.
Limitations of Ray Workflow under Ray client are noted in the doc.
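For illustration, running the same test body under both modes via a pytest parameter could look roughly like this (a sketch; the actual fixtures and addresses in the Ray test suite are assumptions here and are more involved in practice):

```python
import pytest
import ray

@pytest.fixture(params=[False, True], ids=["direct", "ray_client"])
def ray_session(request):
    # Assumption: a Ray client server is reachable at the default port when
    # the client-mode parameter is selected.
    if request.param:
        ray.init("ray://127.0.0.1:10001")
    else:
        ray.init()
    yield
    ray.shutdown()

def test_workflow_smoke(ray_session):
    @ray.remote
    def f():
        return 1

    assert ray.get(f.remote()) == 1
```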
## Related issue number
Closes #21595
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
The latest PyTorch version has wheels for CUDA 11.6. Per user request, adding an 11.6 image as part of our build pipeline.
Currently, restoring from cloud URIs does not work for Tuner() objects. With this PR, e.g. `Tuner.restore("s3://bucket/exp")` will work.
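For example (a minimal sketch; it assumes the experiment at the URI was created by a matching `Tuner` and that cloud credentials are configured):

```python
from ray.tune import Tuner

# Restore a Tuner directly from a cloud URI and continue the run.
tuner = Tuner.restore("s3://bucket/exp")
results = tuner.fit()
```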
Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
## Why are these changes needed?
Reduces memory footprint of the dashboard.
Also adds some cleanup to the errors data.
Also cleans up actor cache by removing dead actors from the cache.
The dashboard UI no longer allows you to see logs for all workers in a node; you must click into each worker's logs individually.
<img width="1739" alt="Screen Shot 2022-07-20 at 9 13 00 PM" src="https://user-images.githubusercontent.com/711935/180128633-1633c187-39c9-493e-b694-009fbb27f73b.png">
## Related issue number
Fixes #23680, fixes #22027, fixes #24272
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
For KubeRay, this PR:
- Disables the autoscaler's RPC drain of worker nodes prior to termination.
- Disables the autoscaler's termination of nodes disconnected from the GCS.
## Why are these changes needed?
The dashboard can display the message "<actor> cannot be created because the Ray cluster cannot satisfy its resource requirements" when the runtime env setup is stalled. This PR updates this message to include the possibility of the runtime env setup failing.
This PR also adds a tip to the Job Submission doc saying that if a job is stalled in PENDING, the runtime env setup may have stalled. It adds a pointer to the log files, which should have more information.
The runtime env cannot stall forever; it fails after 10 minutes. This is a new feature added after the Ray 1.13 branch cut. In Ray <= 1.13, the runtime env can still stall forever.
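A hedged sketch of what a user might check when a submitted job stays in PENDING (uses the public `JobSubmissionClient` methods; the entrypoint and runtime env contents are placeholders):

```python
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    entrypoint="python script.py",
    runtime_env={"pip": ["requests"]},
)

# If the job stays PENDING for a long time, the runtime env setup may be
# stalled or failing; the runtime env setup logs on the head node have details.
print(client.get_job_status(job_id))
print(client.get_job_logs(job_id))
```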
## Related issue number
Closes #26332
Signed-off-by: rickyyx <rickyx@anyscale.com>
## Why are these changes needed?
When we return fewer/incomplete results to users, there could be 3 reasons:
- Data being truncated at the data source (raylets -> API server)
- Data being filtered at the API server
- Data being limited at the API server
We are not distinguishing those 3 scenarios, but we should. This is why we thought data was being truncated when it was actually filtered/limited.
This PR distinguishes these scenarios and emits warnings accordingly.
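For illustration, a call like the following can return fewer results for any of the three reasons above (a sketch using the state API's Python surface; the module path and argument names are assumptions for this example):

```python
from ray.experimental.state.api import list_actors

# Fewer entries than expected can mean (1) truncation at the data source,
# (2) entries removed by the filter, or (3) the explicit limit being hit;
# the warnings added by this PR are meant to distinguish these cases.
actors = list_actors(filters=[("state", "=", "ALIVE")], limit=100)
print(len(actors))
```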
## Related issue number
Closes #26570, closes #26923
When a task fails, it waits for the death info and then fails. The wait is 1s and the check happens every 1s. This is good for usability, but it causes issues in some cases because it delays the task return by at least 1s and at most 2s.
This PR introduces an early cut: when the timeout is set to 0, it returns immediately. The semantics don't change, and most users will still get the message.
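If the timeout in question is the one a user passes to `ray.get`, the user-visible effect might look roughly like this (a hedged sketch; the change may only affect the internal wait for death info):

```python
import ray

ray.init()

@ray.remote
def fail():
    raise RuntimeError("boom")

ref = fail.remote()
try:
    # With timeout=0 the call returns immediately instead of waiting
    # up to ~1-2s for the worker death info to arrive.
    ray.get(ref, timeout=0)
except Exception as e:
    print(type(e).__name__, e)
```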
## Why are these changes needed?
When the GCS restarts, the raylet sometimes needs a while to reconnect to the GCS; for example, in a k8s env, it takes a while for the service to point to the new GCS. This PR tries to fix this by allowing a longer timeout for the first ping after the GCS restarts.
Once the GCS gets the first ping, the regular timeout is used instead.