Update autoscaler configuration docs for VM stack.
Removed the video; after reviewing it, it fits better in the overview and is possibly outdated.
Co-authored-by: Eric Liang <ekhliang@gmail.com>
This PR adds a guide on RayCluster configuration and a page of discussion about autoscaling.
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
* Save work
* Update
* consistency
* update
* fixes
* simplify
* update
* fix
* update
* wording
* update
Signed-off-by: Philipp Moritz <pcmoritz@gmail.com>
This PR migrates the old Community Supported Cluster Launcher docs to the new Ray Clusters doc structure.
Signed-off-by: Cade Daniel <cade@anyscale.com>
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
This PR:
- adds a page of guidance on GPU deployment with Ray/K8s. This page is a modified and slightly expanded version of the existing page https://docs.ray.io/en/latest/cluster/kubernetes-gpu.html
- moves managed K8s service intro links to their own page
This PR puts the Ray Clusters (under construction) docs section (see #26754) under Ray Clusters as a subpage.
- This makes the master branch docs clean and presentable for users.
- Ray Clusters doc writers can use existing CI to iterate on the docs, without having a massive PR once we're done.
Signed-off-by: Cade Daniel <cade@anyscale.com>
# Why are these changes needed?
The dashboard can display the message `<actor> cannot be created because the Ray cluster cannot satisfy its resource requirements` in the case where the runtime env setup is stalled. This PR updates this message to include the possibility of the runtime env setup failing.
This PR adds a tip to the Job Submission doc saying that if a job is stalled in PENDING, the runtime env setup may have stalled. It adds a pointer to the log files which should have more information.
The runtime env setup cannot stall forever; it fails after 10 minutes. This is a new feature added after the Ray 1.13 branch cut. In Ray <= 1.13, the runtime env setup can still stall forever.
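To make the tip concrete, here is a minimal sketch of watching for a stalled job through the Jobs SDK; it assumes a `JobSubmissionClient` pointed at a dashboard on the default local address, and the `echo hello` entrypoint is just a placeholder.
```python
# A minimal sketch, assuming the Jobs SDK's JobSubmissionClient and a
# dashboard reachable at the default local address.
import time

from ray.job_submission import JobSubmissionClient, JobStatus

client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(entrypoint="echo hello")

# If the job stays in PENDING, the runtime env setup may have stalled
# (or, after this change, failed); the job logs have more detail.
while client.get_job_status(job_id) == JobStatus.PENDING:
    time.sleep(5)

print(client.get_job_logs(job_id))
```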
# Related issue number
Closes #26332
`ray.init()` will currently start a new Ray instance even if one already exists, which is very confusing if you are a new user trying to go from local development to a cluster. This PR changes it so that, when no address is specified, we first try to find an existing Ray cluster that was created through `ray start`. If none is found, we will start a new one.
This makes two changes to the `ray.init()` resolution order:
1. When `ray start` is called, the started cluster address is written to a file called `/tmp/ray/ray_current_cluster`. For `ray.init()` and `ray.init(address="auto")`, we first check this local file for an existing cluster address. The file is deleted on `ray stop`. If the file is empty, we autodetect any running cluster (legacy behavior) if `address="auto"`, or start a new local Ray instance if `address=None`.
2. When `ray.init(address="local")` is called, we create a new local Ray instance, even if one already exists. This behavior seems to be necessary mainly for `ray.client` use cases.
This also surfaces the logs about which Ray instance we are connecting to. Previously these were hidden because we didn't set up logging until after connecting to Ray. Now Ray will log one of the following messages during `ray.init()`:
```
(Connecting to existing Ray cluster at address: <IP>...)
...connection...
(Started a local Ray cluster.| Connected to Ray Cluster.)( View the dashboard at <URL>)
```
Note that this changes the dashboard URL to be printed with `ray.init()` instead of when the dashboard is first started.
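As a hedged illustration of the resolution order above (not the exact code from this PR), the sketch below exercises the entry points; it assumes Ray is installed locally, and the `address="auto"` call additionally assumes some cluster is running to connect to.
```python
# A minimal sketch of the new ray.init() resolution order; output
# messages and addresses will vary by environment.
import ray

# No address: first check /tmp/ray/ray_current_cluster for a cluster
# started via `ray start`; if none is found, start a new local instance.
ray.init()
ray.shutdown()

# address="auto": check the file, then fall back to autodetecting any
# running cluster (legacy behavior); errors if no cluster is found.
ray.init(address="auto")
ray.shutdown()

# address="local": always create a fresh local Ray instance, even if a
# cluster is already running.
ray.init(address="local")
ray.shutdown()
```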
Co-authored-by: Eric Liang <ekhliang@gmail.com>
This PR:
- Adds a warning about a known issue to the KubeRay section of the Ray docs.
- Updates the description of the feature state of KubeRay integration.
- Adds some links to the KubeRay docs.
- Adds notes explaining that Ray's support on Azure, Aliyun, and SLURM is community-maintained.
- Rephrases the mention of K8s support in the intro.
This PR replaces https://github.com/ray-project/ray/pull/25504.
- Closes #23874 by fixing a typo ("num_gpus" -> "num-gpus").
- Adds end-to-end test logic confirming the fix.
- Adds end-to-end test logic confirming autoscaling with custom resources works (a sketch of the scenario follows this list).
- Slightly refines developer instructions.
- Deflakes test logic a bit by allowing for the event that the head pod changes its identity as the Ray cluster starts up.
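For reference, a hedged sketch of the custom-resource autoscaling scenario the new test covers; the resource name `CustomResource` is hypothetical and must match a resource declared in your cluster config.
```python
# A minimal sketch, assuming a running autoscaling cluster whose config
# declares a custom resource; "CustomResource" is a hypothetical name.
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")

# Ask the autoscaler directly for capacity to fit one bundle of the
# custom resource.
request_resources(bundles=[{"CustomResource": 1}])

# Submitting a task that requires the resource likewise triggers the
# autoscaler to add a node that provides it.
@ray.remote(resources={"CustomResource": 1})
def uses_custom_resource() -> str:
    return "scheduled on a node with CustomResource"

print(ray.get(uses_custom_resource.remote()))
```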
- Adds links to Job Submission from existing library tutorials where `ray submit` is used. When Jobs becomes GA, we should fully replace the uses of `ray submit` with Ray job submission and ensure this is tested.
- Adds docstrings for the Jobs SDK, which automatically show up in the API reference (see the usage sketch after this list).
- Improves the Job Submission main page.
- Adds a "Deployment Guide" landing page explaining when to use Ray Client vs Ray Jobs.
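A minimal usage sketch for the Jobs SDK covered by these docstrings; it assumes a cluster whose dashboard listens at the default local address, and the entrypoint and runtime_env values are placeholders.
```python
# A minimal sketch, assuming the dashboard at the default local address;
# the entrypoint is an arbitrary shell command.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    entrypoint="python -c 'import ray; ray.init()'",
    runtime_env={"pip": ["requests"]},
)
print(client.get_job_status(job_id))
```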
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
The include of content for .md files, like our central Getting Started page, didn't render. Fixed here.
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
Using Ray on a SLURM system is documented, but the docs are missing some pitfalls about networking. This PR adds some information about port binding and address binding (I will open a feature request with more details and link it here later).
I did not put any real recommendation on this last point since `--address` did not work: I hit a "cannot resolve" issue after setting an internal IP, although it is reachable.
Fixes a potential error when a function is not found in the Azure SDK while deploying a Ray cluster on Azure.
Adds to the docs an additional Python package needed to deploy a Ray cluster on Azure.
Co-authored-by: Scott Graham <scgraham@microsoft.com>
This PR consists of the following clean-up items for KubeRay autoscaler integration:
- Remove the docker/kuberay directory.
- Move the Python files formerly in docker/kuberay to the autoscaler directory.
- Use a rayproject/ray image for the autoscaler.
- Add an entry point for the kuberay autoscaler to scripts.py. Use the entry point in the example config.
- Slightly simplify the code that starts the autoscaler.
- Update Ray versions to Ray 1.11.0, which will be officially released within the next couple of days.
- Remove references to Redis from the example config, since Ray >= 1.11.0 runs without Redis by default.
- Add the autoscaler configuration test to the CI.
- Update development documentation to reflect the changes in this PR.
Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information.
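A hedged sketch of the SDK side of this listing API; it assumes `JobSubmissionClient.list_jobs()` as the method name and a dashboard at the default local address, both of which may differ from the final interface.
```python
# A minimal sketch, assuming list_jobs() on JobSubmissionClient and the
# dashboard at the default local address.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")

# Print each submitted job along with its information.
for job in client.list_jobs():
    print(job)
```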
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection).
In the new Job Submission protocol, a Job just specifies an entrypoint, which can be any shell command. As such, a Job can have zero, one, or multiple Ray drivers. This means we should add a new snapshot entry corresponding to new jobs. We'll leave the old snapshot in place for legacy jobs.
- Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID. It wasn't working before.
- This PR also unifies the datatype used by the `GET /jobs/` endpoint to be the same as the one used by the new jobs cluster snapshot. For backwards compatibility, the `status` and `message` fields are preserved.