It looks like hidden=True commands cannot be documented with Sphinx. I removed add_alias and used the standard Click API to give the command a name different from the method name.
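For reference, a minimal sketch of the standard Click pattern described here (the group and command names are illustrative, not the actual Ray CLI commands):
```
# Click lets a command be exposed under an explicit name, independent of the
# Python function name, so no hidden alias command is required.
import click


@click.group()
def cli():
    """Example CLI group."""


@cli.command(name="status")  # exposed as "status" regardless of the function name
def status_command():
    click.echo("cluster is healthy")


if __name__ == "__main__":
    cli()
```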
Adds validation for TrainingArguments.load_best_model_at_end (which will throw an error down the line if set to True), fixes validation for *_steps, and adds a test.
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
This PR restores notes for migration from the legacy Ray operator to the new KubeRay operator.
To avoid disrupting the flow of the Ray documentation, these notes are placed in a README accompanying the old operator's code.
These notes are linked from the new docs.
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
This change adds launch failures to the "Recent failures" section of ray status when a node provider supplies structured error information. For node providers that don't supply this optional information, there is no change in behavior.
For reference, when trying to launch a node type with a quota issue, the output looks like the following. InsufficientInstanceCapacity is the standard term for this issue.
```
======== Autoscaler status: 2022-08-11 22:22:10.735647 ========
Node status
---------------------------------------------------------------
Healthy:
1 cpu_4_ondemand
Pending:
quota, 1 launching
Recent failures:
quota: InsufficientInstanceCapacity (last_attempt: 22:22:00)
Resources
---------------------------------------------------------------
Usage:
0.0/4.0 CPU
0.00/9.079 GiB memory
0.00/4.539 GiB object_store_memory
Demands:
(no resource demands)
```
```
available_node_types:
  cpu_4_ondemand:
    node_config:
      InstanceType: m4.xlarge
      ImageId: latest_dlami
    resources: {}
    min_workers: 0
    max_workers: 0
  quota:
    node_config:
      InstanceType: p4d.24xlarge
      ImageId: latest_dlami
    resources: {}
    min_workers: 1
    max_workers: 1
```
Co-authored-by: Alex <alex@anyscale.com>
Tests the following failure scenarios:
- Fail to upload data in `ray.init()` (`working_dir`, `py_modules`)
- Eager install fails in `ray.init()` for some other reason (bad `pip` package)
- Fail to download data from GCS (`working_dir`)
Improves the following error message cases:
- Return RuntimeEnvSetupError on failure to upload working_dir or py_modules
- Return RuntimeEnvSetupError on failure to download files from GCS during runtime env setup
Not covered in this PR:
- RPC to agent fails (This is extremely rare because the Raylet and agent are on the same node.)
- Agent is not started or dead (We don't need to worry about this because the Raylet fate shares with the agent.)
The approach is to use environment variables to induce failures in various places. The alternative would be to refactor the packaging code to use dependency injection for the internal KV client so that we could pass in a fake. I'm not sure how much of an improvement that would be; I think we'd still have to set an environment variable to pass in the fake client, because these are essentially e2e tests of `ray.init()` and we don't have an API to pass it in.
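As an illustration of this approach, here is a minimal sketch of such a test, assuming a hypothetical environment variable name (the real flags added in this PR may differ):
```
import pytest
import ray
from ray.exceptions import RuntimeEnvSetupError


def test_working_dir_upload_failure(monkeypatch, tmp_path):
    # Hypothetical flag that makes the packaging code fail while uploading
    # the working_dir to GCS during ray.init().
    monkeypatch.setenv("RAY_RUNTIME_ENV_FAIL_UPLOAD_FOR_TESTING", "1")

    (tmp_path / "dummy.txt").write_text("hello")
    with pytest.raises(RuntimeEnvSetupError):
        ray.init(runtime_env={"working_dir": str(tmp_path)})
    ray.shutdown()
```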
The test was written incorrectly. The root cause was that the trainer and the worker each require 1 CPU, meaning the pg requires {CPU: 1} * 2 resources.
When the max fraction is 0.001, we only allow up to 1 CPU for the pg, so we cannot schedule the requested pgs in any case.
# Why are these changes needed?
- Promote APIs to PublicAPI(alpha)
- Change pre-alpha -> alpha
- Fix a bug where ray_logs is displayed in ray --help
Release test result: #26610
Some APIs are subject to change at the beta stage (e.g., ray list jobs or ray logs).
Why are these changes needed?
This PR fixes edge cases when the max_cpu_fraction argument is used by a placement group. Specifically, there was an edge case where the placement group could not be scheduled when a task or actor was already scheduled and occupying resources.
The original logic to decide whether the bundle scheduling exceeds the CPU fraction was as follows:
- Calculate max_reservable_cpus of the node.
- Calculate currently_used_cpus + bundle_cpu_request (per bundle) == total_allocation of the node.
- Don't schedule if total_allocation > max_reservable_cpus for the node.
However, this caused issues because currently_used_cpus can include resources that are not allocated by placement groups (e.g., actors). As a result, when an actor was already occupying resources, total_allocation was incorrect. For example:
- 4 CPUs
- 0.999 max fraction (so it can reserve up to 3 CPUs)
- 1 actor already created (1 CPU)
- PG with CPU: 3
Now the pg cannot be scheduled because total_allocation == 1 actor (1 CPU) + 3 bundles (3 CPUs) == 4 CPUs > 3 CPUs (max frac CPUs).
However, this should work because the pg can use up to 3 CPUs, and we have enough resources.
The root cause is that when we calculate the max fraction, we should only take into account resources allocated by bundles. To fix this, I changed the logic as follows:
- Calculate max_reservable_cpus of the node.
- Calculate **currently_used_cpus_by_pg_bundles** + **bundle_cpu_request (sum of all bundles)** == total_allocation_from_pgs_and_bundles of the node.
- Don't schedule if total_allocation_from_pgs_and_bundles > max_reservable_cpus for the node.
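A minimal reproduction sketch of the scenario above, assuming a 4-CPU local node and the experimental `_max_cpu_fraction_per_node` keyword of `ray.util.placement_group` (the exact argument name may differ across versions):
```
import ray
from ray.util.placement_group import placement_group

ray.init(num_cpus=4)


@ray.remote(num_cpus=1)
class Worker:
    def ping(self):
        return "ok"


# An actor already occupies 1 of the 4 CPUs.
actor = Worker.remote()
ray.get(actor.ping.remote())

# With max fraction 0.999, up to 3 CPUs are reservable by placement groups.
# Before this fix, the actor's 1 CPU was also counted against that budget,
# so this 3-CPU placement group could never be scheduled.
pg = placement_group(
    [{"CPU": 1}] * 3,
    strategy="PACK",
    _max_cpu_fraction_per_node=0.999,
)
ray.get(pg.ready())  # Succeeds after this fix.
```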
Serve relies on being able to do quiet application-level retries, and this info-level logging is resulting in log spam hitting users. This PR demotes this log statement to debug-level to prevent this log spam.
Co-authored-by: simon-mo <simon.mo@hey.com>
This PR improves the in-memory data size estimation of the image folder data source. Before this PR, we used the on-disk file size as the estimate of the in-memory data size, which can be inaccurate due to image compression and in-memory image resizing.
Since `size` and `mode` were made optional in https://github.com/ray-project/ray/pull/27295, this PR tackles the simple case where `size` and `mode` are both provided:
* `size` and `mode` are provided: just calculate the in-memory size based on the dimensions; there is no need to read any image (this PR, see the sketch below).
* `size` or `mode` is not provided: sampling is needed to determine the in-memory size (will be done in a follow-up PR).
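For intuition, a back-of-the-envelope sketch of the dimension-based estimate (the helper below is illustrative, not the exact code in `ImageFolderDatasource`):
```
# Rough per-image in-memory estimate when size and mode are both known.
def estimate_image_nbytes(size, mode="RGB"):
    height, width = size
    channels = len(mode)  # e.g. "RGB" -> 3, "L" -> 1
    return height * width * channels  # images decode to uint8, 1 byte per value


# A single 64x64 RGB image decodes to 64 * 64 * 3 = 12288 bytes in memory,
# regardless of how small the compressed file on disk is.
print(estimate_image_nbytes((64, 64)))  # 12288
```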
Here is an example of the estimated size for our test image folder:
```
>>> import ray
>>> from ray.data.datasource.image_folder_datasource import ImageFolderDatasource
>>> root = "example://image-folders/different-sizes"
>>> ds = ray.data.read_datasource(ImageFolderDatasource(), root=root, size=(64, 64), mode="RGB")
>>> ds.size_bytes()
40310
>>> ds.fully_executed().size_bytes()
37428
```
Without this PR:
```
>>> ds.size_bytes()
18978
```
The "Monitoring Ray Serve" page explains how to inspect your Ray Serve applications. This change updates the page to remove outdated metrics that Serve no longer exposes and to upgrade code samples to use 2.0 APIs. It also improves the content's readability and organization.
Link to updated "Monitoring Ray Serve" page: https://ray--27777.org.readthedocs.build/en/27777/serve/monitoring.html
The actor handle held by the Ray client becomes dangling if the Ray cluster is shut down; in that case, if the user tries to get the actor again, it results in a crash. This happened to a real user and blocked them from making progress.
This change makes the stats actor detached and, instead of keeping a handle, accesses it via its name. This way we make sure the actor is re-created if the cluster gets restarted.
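A minimal sketch of this named detached-actor pattern (the actor class, name, and namespace below are illustrative, not the exact ones used internally):
```
import ray


@ray.remote(num_cpus=0)
class StatsActor:
    def __init__(self):
        self.stats = {}

    def record(self, key, value):
        self.stats[key] = value


# get_if_exists returns the existing actor by name, or creates a new one if
# the cluster was restarted and the old handle is gone.
stats_actor = StatsActor.options(
    name="datasets_stats_actor",    # illustrative name
    namespace="datasets_internal",  # illustrative namespace
    lifetime="detached",
    get_if_exists=True,
).remote()
```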
Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
When the node the controller is on dies, GCS will try to reschedule the controller to the same node. But GCS will only mark the node as failed after 120s when GCS restarts (or 30s if only the raylet died).
This PR fixes it by no longer pinning the controller to the head node, so as long as GCS is alive, it reschedules the controller immediately. We can't turn this on by default, so we introduce an internal flag for it.
A user raised an issue in #26605, where they found the error message quite non-actionable when partition filtering input files and no files with the required extension were found.
Signed-off-by: Cheng Su <scnju13@gmail.com>
When we deserialize an actor handle via pickle, we register it with an outer object ref equal to itself, which is wrong. For out-of-band deserialization, there should be no outer object ref.
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
The cluster address is now written to a temp file. Previously, we raised an error if ray start --head tried to reuse the old cluster address in the temp file, even if Ray was no longer running. This PR allows ray start --head to continue if it can't find any GCS process associated with the recorded cluster address.
Related issue number
Closes #27021.
The tensor extension import is a bit expensive since it will go through Arrow's and Pandas' extension type registration logic. This PR delays the tensor extension type import until Parquet reading, which is the only case in which we need to explicitly register the type.
I have confirmed that the Parquet reading in doc/source/data/doc_code/tensor.py passes with this change.
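For illustration, the deferred-import pattern looks roughly like this (the helper name is hypothetical; `ray.data.extensions` is the module that exposes the tensor extension types):
```
# Deferring this import avoids paying the Arrow/Pandas extension-type
# registration cost unless Parquet reading actually needs the type.
def _lazy_tensor_extension_types():
    from ray.data.extensions import ArrowTensorType, TensorDtype

    return ArrowTensorType, TensorDtype
```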
This PR fixes several issues that block the Serve agent when GCS is down. We need to make sure the Serve agent is always alive so that external requests can be sent to the agent to check the status.
- The internal KV used in dashboard/agent blocks the agent; we use the async one instead.
- The Serve controller uses ray.nodes(), which is a blocking call that can block forever; changed it to use the GCS client with a timeout.
- The agent uses the Serve controller client, which is a blocking call with max retries = -1; this blocks until the controller is back.
To enable Serve HA, we also need to set:
- RAY_gcs_server_request_timeout_seconds=5
- RAY_SERVE_KV_TIMEOUT_S=5
These should be set in KubeRay.
JobCounter does not work with a storage namespace right now because the key is the same across namespaces.
This PR fixes it by adding the namespace to the key, which is the minimal and therefore safer change.
A follow-up PR is needed to clean up Redis storage in C++.
Objects freed by the manual and internal free call previously would not get reconstructed. This PR introduces the following semantics after a free call:
- If no failure occurs and the object is needed by a downstream task, an ObjectFreedError will be thrown.
- If a failure occurs, causing a downstream task to be re-executed, the freed object will get reconstructed as usual.
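A minimal sketch of the new semantics, assuming the internal free() call lives at `ray._private.internal_api.free` (the exact module path varies across Ray versions) and the error type is the ObjectFreedError described above:
```
import ray

ray.init()


@ray.remote
def produce():
    return [1, 2, 3]


@ray.remote
def consume(x):
    return sum(x)


ref = produce.remote()
ray.wait([ref])
ray._private.internal_api.free([ref])

# With no failures, a downstream task that needs the freed object now raises
# an error (ObjectFreedError) instead of failing in a confusing way.
try:
    ray.get(consume.remote(ref))
except Exception as e:
    print(type(e).__name__, e)
```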
Also fixes some incidental bugs:
- Don't crash on failure to contact the local raylet during object recovery. This will produce a nicer error message because we will instead throw an application-level error when someone tries to get an object.
- Fix a circular lock dependency between task failure <> task dependency resolution.
Related issue number
Closes #27265.
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Yi Cheng <chengyidna@gmail.com>
## Why are these changes needed?
This test times out. Move it to large.
```
WARNING: //python/ray/workflow:tests/test_error_handling: Test execution time (288.7s excluding execution overhead) outside of range for MODERATE tests. Consider setting timeout="long" or size="large".
```