A failed node launch can trigger an additional, unexpected error in the node launcher because of how a mock Prometheus metric method is defined.
This error leaves the autoscaler hanging permanently: the "launching nodes" count is never cleared, so the autoscaler cannot proceed to launch nodes.
This PR fixes the method signature that caused the unexpected failure.
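
A minimal sketch of this failure mode, not the actual autoscaler code: the class names, the `inc` method, and the `handle_failed_launch` helper are hypothetical stand-ins chosen for illustration. It shows how a mock metric with a narrower signature than the real metric can raise inside the failure handler, so the pending-launch count is never cleared.

```python
class NarrowNullMetric:
    """Hypothetical mock whose signature is narrower than the real metric's."""

    def inc(self):  # callers pass an amount, so calling this raises TypeError
        pass


class NullMetric:
    """Hypothetical fixed mock: accepts any arguments and does nothing."""

    def inc(self, *args, **kwargs):
        pass


def handle_failed_launch(metric, pending, node_type, count):
    """Hypothetical failure handler: record the failure, then clear pending launches."""
    metric.inc(count)            # raises if the mock's signature is too narrow
    pending[node_type] -= count  # only reached when the metric call succeeds


pending = {"worker": 2}
try:
    handle_failed_launch(NarrowNullMetric(), pending, "worker", 2)
except TypeError:
    pass
stuck = dict(pending)  # the launch count was never cleared

handle_failed_launch(NullMetric(), pending, "worker", 2)
cleared = dict(pending)  # the launch count is back to zero
```

With the narrow mock, the bookkeeping step is skipped because the metric call raises first; with the permissive mock, the handler runs to completion.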
Signed-off-by: Alex Wu <alex@anyscale.io>
This is a minor QoL improvement that bumps the hardcoded limit on the number of AWS keys per account. The limit is arbitrary and has been bumped before. AFAICT the fundamental AWS limit is 5000 keys per region, which we are not close to.
Root cause:
https://www.shell-tips.com/bash/source-dot-command/#gsc.tab=0
Using the . command executes a script in the current shell of a bash script. Removing the . command from ci.sh init means the set -eo call made inside ci.sh init no longer applies to the test commands that run afterwards: set -eo is then called within a child process, not the current shell, so later commands run without set -eo configured.
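
The difference can be sketched as follows, assuming a POSIX `sh` is available on the PATH. The `init.sh` file and the `run_after_init` helper are hypothetical, and the script uses a plain `set -e` rather than the exact flags in ci.sh.

```python
import os
import subprocess
import tempfile


def run_after_init(invoke: str) -> subprocess.CompletedProcess:
    """Run a hypothetical init script via `invoke`, then a failing command, then echo."""
    with tempfile.TemporaryDirectory() as tmp:
        init = os.path.join(tmp, "init.sh")
        with open(init, "w") as f:
            f.write("set -e\n")  # stand-in for the options set inside ci.sh init
        # `false` fails; whether `echo` still runs depends on where set -e took effect.
        script = f'{invoke} "{init}"; false; echo reached'
        return subprocess.run(["sh", "-c", script], capture_output=True, text=True)


sourced = run_after_init(".")  # set -e applies to the current shell
child = run_after_init("sh")   # set -e applies only inside the child process
```

When the script is sourced, the shell stops at the failing command; when it is run as a child process, the failure is ignored and the following command still runs.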
* Save work
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
* Update
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
* consistency
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
* update
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
* fixes
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
* simplify
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
* update
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
* fix
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
* update
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
* wording
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
* update
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
This PR adds a user guide to AIR for using Ray Train. It provides a high-level overview of the trainers and removes redundant sections.
The main file to review is here: doc/source/ray-air/trainer.rst.
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Currently, the C++ worker doesn't support parameters of type `ActorHandle`.
When we pass an `ActorHandle` object to a task, it fails with this error:

The reason is that the caller only deserializes the actor handle but doesn't register it with the core worker, so when we call tasks on the actor, the handle cannot be found locally.
1. If a user reads a folder with grayscale and color images, ImageFolderDatasource errors.
2. There's no way to retain image shapes.
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
This PR revamps and aligns the README and Ray intro doc page:
- New "What is Ray" diagram that introduces AIR vs Ray core (the diagram is still to be finalized; this is a working placeholder)
- Update the description of Ray
- Link out to the user guides for key libraries and key concepts
- Remove old / broken links, as well as the inline library descriptions from the README
- Move autoscaling architecture from autoscaling page to architecture page
- Update architecture page
- Remove "Router" actor
- Update description of ServeHandle
- Update the HTTPProxy defaults (one on each node by default -> just one per cluster by default, on the head node)
- Add note about fault tolerance in different failure scenarios
- Assorted typos/usage nits
Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
This PR migrates the old Community Supported Cluster Launcher docs to the new Ray Clusters doc structure.
Signed-off-by: Cade Daniel <cade@anyscale.com>
Removes deprecated APIs:
- serve.start()
- get_handle()
Rewrites the ServeHandle doc snippet to use the recommended workflow for ServeHandles (only access them from other deployments, pass Deployments in as input args to `.bind()`, which get resolved to ServeHandles at runtime)
Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
First of all, sorry, I messed up the previous PR (#27374) when syncing with master. This PR duplicates the previous PR and adds one change: a version check for ray_lightning, for compatibility. Also, apologies for the massive number of review requests on the previous PR.
- Currently not all code under ray-core/doc_code is covered by CI.
- tf_example.py and torch_example.py are not used anywhere.
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
This PR
- adds a page of guidance on GPU deployment with Ray/K8s. This page is a modified and slightly expanded version of the existing page https://docs.ray.io/en/latest/cluster/kubernetes-gpu.html
- moves managed K8s service intro links to their own page
Support a GPU column for the new dashboard
Have the first node expanded by default
Signed-off-by: Alan Guo <aguo@anyscale.com>
Fixes #13889
Addresses comment from #26996
Converting a Pandas DataFrame column to an ndarray (e.g. via df[col].values) can often result in a full copy of the column in order to construct the ndarray, due to Pandas' 2D block management. This PR switches tensor extension type checking to inspect the column's dtype instead, which is always an O(1) check.
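
A minimal sketch of the dtype-based check, assuming pandas is installed. `pd.CategoricalDtype` stands in for the tensor extension dtype here, and the `columns_of_dtype` helper is hypothetical, not part of this PR.

```python
import pandas as pd


def columns_of_dtype(df: pd.DataFrame, dtype_cls: type) -> list:
    """Return the columns whose dtype is an instance of dtype_cls.

    Only dtype metadata is inspected; no column is converted to an ndarray.
    """
    return [col for col, dtype in df.dtypes.items() if isinstance(dtype, dtype_cls)]


# CategoricalDtype is only a stand-in for a tensor extension dtype.
df = pd.DataFrame({"label": pd.Categorical(["cat", "dog"]), "score": [0.9, 0.1]})
matching = columns_of_dtype(df, pd.CategoricalDtype)
```

Because the check reads `df.dtypes` rather than `df[col].values`, it never needs to materialize the column data.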
Signed-off-by: Clark Zinzow <clarkzinzow@gmail.com>