According to https://peps.python.org/pep-0338/
> The -m switch provides a benefit here, as it inserts the current directory into sys.path, instead of the directory contain the main module.
We should follow this and not add the driver script directory to the worker's sys.path. I couldn't find a way to detect that the driver is run via `python -m`, so instead we skip adding the script directory to the worker's sys.path if it is not in the driver's sys.path.
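The check described above can be sketched as follows. This is a minimal illustration, not the actual change: `worker_sys_path_entries` and its parameters are hypothetical names, and the real code may compare paths differently.

```python
import os


def worker_sys_path_entries(driver_sys_path, script_path):
    """Return the entries to add to a worker's sys.path (hypothetical helper).

    The script directory is propagated only if the driver itself has it on
    sys.path. Under `python -m`, the driver has the current directory there
    instead, so the script directory is skipped.
    """
    script_dir = os.path.dirname(os.path.abspath(script_path))
    # Empty entries are ignored here for simplicity.
    driver_dirs = {os.path.abspath(p) for p in driver_sys_path if p}
    return [script_dir] if script_dir in driver_dirs else []
```

Passing the driver's `sys.path` in explicitly keeps the helper free of global state, which makes both cases easy to test.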
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
Fixes the "legacy operator" link to point to master, rather than the 2.0.0 branch. The migration README exists in master but not in the 2.0.0 branch.
Adds a sentence explaining that the Ray container has to go first in the container list.
Adds a sentence to the config guide mentioning min/max replicas and linking to autoscaling.
Documents a bug related to GPU auto-detection in KubeRay 0.3.0.
This is for people who want better control over node failures and want to handle errors such as RayActorError themselves. I think it's necessary to expose things like actor_id as attributes of the error.
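A rough sketch of why such an attribute helps a caller. `ActorDiedError` is a hypothetical stand-in defined here for illustration, not Ray's own error class, and `call_with_failure_tracking` is likewise an assumed helper name.

```python
class ActorDiedError(Exception):
    """Hypothetical stand-in for an actor failure error (not Ray's class)."""

    def __init__(self, actor_id, message="The actor died unexpectedly."):
        super().__init__(message)
        # Exposing the ID lets callers react without parsing the message.
        self.actor_id = actor_id


def call_with_failure_tracking(fn, failed_actor_ids):
    """Run fn(); on an actor failure, record which actor failed."""
    try:
        return fn()
    except ActorDiedError as err:
        failed_actor_ids.append(err.actor_id)
        return None
```

With the ID available on the exception, user code can decide per actor whether to restart, reschedule, or give up.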
Signed-off-by: Jiajie Li <ljjsalt@gmail.com>
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
For a more balanced table of contents, makes CloudWatch instructions a subsection of AWS instructions.
This is needed because we are stress-testing the State APIs in the release test and need a max limit larger than the system default; otherwise, the APIs would return an error.
Updates KubeRay version used in CI to v0.3.0-rc.2 (which we expect to be identical to the final v0.3.0).
Also removes a couple of old files.
Will open a corresponding cherry pick in the Ray 2.0.0 branch.
The key thing to verify is that the CI autoscaling test passes both in this PR and in the PR against the 2.0.0 branch.
This PR makes the autoscaler event system for node launches more detailed. In particular, it does 4 related things:
- Less verbose logging for node provider exceptions (printed to logs only, not to the driver).
- Don't print "adding 1 node(s) of type ..." to the driver when nodes don't launch (still print it if the node launch is successful).
- Print "Failed to launch ..." to the driver.
- Don't log a full exception to the driver.
The full driver event looks like this:
```
Failed to launch 1 node(s) of type quota. (InsufficientInstanceCapacity): We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-west-2a). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-2b, us-west-2c.
```
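The behavior described in this PR can be sketched as below. The function names and the `(code, message)` error tuple are assumptions made for illustration; only the failure message format is taken from the sample event, and the success wording is illustrative.

```python
def format_launch_failure(count, node_type, error_code, error_message):
    """Build the one-line driver event for a failed node launch."""
    return (
        f"Failed to launch {count} node(s) of type {node_type}. "
        f"({error_code}): {error_message}"
    )


def driver_event(count, node_type, error=None):
    """Return the driver-facing message for one launch attempt.

    `error` is a hypothetical (code, message) tuple; None means success.
    The full exception would go to the logs only, never to the driver.
    """
    if error is None:
        # Only successful launches announce that nodes are being added.
        return f"Adding {count} node(s) of type {node_type}."
    code, message = error
    return format_launch_failure(count, node_type, code, message)
```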
Co-authored-by: Alex <alex@anyscale.com>
Takes the CLI reference out of the core API subsection. It now follows the same CLI reference pattern as other libraries (e.g., Serve has the Serve CLI under the Serve API section).
There is a risk of using too much memory in StatsActor, because its lifetime is the same as the cluster's lifetime.
This puts a cap on how many stats to keep and purges stats in FIFO order if the cap is exceeded.
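A capped store with FIFO eviction can be sketched like this. `BoundedStats` and its methods are hypothetical names; the actual StatsActor may store and key its stats differently.

```python
from collections import OrderedDict


class BoundedStats:
    """Keep at most `max_entries` stats, evicting the oldest first (FIFO)."""

    def __init__(self, max_entries):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max_entries = max_entries
        self._stats = OrderedDict()  # preserves insertion order

    def record(self, key, value):
        # Updating an existing key keeps its original insertion position.
        self._stats[key] = value
        while len(self._stats) > self._max_entries:
            self._stats.popitem(last=False)  # drop the oldest entry

    def get(self, key, default=None):
        return self._stats.get(key, default)

    def __len__(self):
        return len(self._stats)
```

Because eviction happens on every write, memory use stays bounded by the cap no matter how long the owning process lives.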
This PR adds a custom serializer for Arrow JSON ParseOptions for read_json. We found that users wanted to read JSON files with ParseOptions, but this currently doesn't work due to a pickle issue (detail of post). So here we add a custom serializer for ParseOptions as a workaround for now, similar to #25821.
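The general shape of such a workaround is a serializer/deserializer pair that reduces the object to plain fields and rebuilds it on the other side. The sketch below uses a stand-in class so that it runs without Arrow; `FakeParseOptions`, its fields, and both function names are hypothetical, and this is not the serializer the PR adds.

```python
import json


class FakeParseOptions:
    """Stand-in for an options object that cannot be pickled directly."""

    def __init__(self, newlines_in_values=False, unexpected_field_behavior="infer"):
        self.newlines_in_values = newlines_in_values
        self.unexpected_field_behavior = unexpected_field_behavior


def serialize_parse_options(opts):
    """Reduce the options object to a plain, serializable dict."""
    return {
        "newlines_in_values": opts.newlines_in_values,
        "unexpected_field_behavior": opts.unexpected_field_behavior,
    }


def deserialize_parse_options(fields):
    """Rebuild the options object from the plain dict."""
    return FakeParseOptions(**fields)


# Round trip through a plain-data format to show nothing is lost.
original = FakeParseOptions(newlines_in_values=True)
payload = json.dumps(serialize_parse_options(original))
restored = deserialize_parse_options(json.loads(payload))
```

In Ray, a pair like this could plausibly be registered with `ray.util.register_serializer`; whether the PR does exactly that is an assumption.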
Signed-off-by: Cheng Su <scnju13@gmail.com>
Moves the code to doc_code.
Fixes the code example so that batching is faster than the serial run.
Related issue number
#27048
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
An attempt at making the docs shorter and sweeter including various small cleanup items.
- Reorder the TOC on the sidebar for the user guides to be more linear based on a user's journey.
- Put the batching content under the performance guide.
- Remove the AIR guide (AIR users already have a serving guide).
- Combine the `ServeHandle` and model composition pages into a single guide. We may want to revisit this in the future but for now better to have it in a single place instead of duplicated (with links going to both).
- Fix the index page for the user guides to match the TOC sidebar.
- Rename a few pages for clarity & consistency.
- Remove some now-redundant content (old ML models user guide).