Page structure changes:
- Deploying a Ray Cluster on Kubernetes
  - Getting Started -> links to jobs
- Deploying a Ray Cluster on VMs
  - Getting started -> links to jobs
- User Guides
  - Autoscaling (moved more content here in favor of the Getting started page)
- Running Applications on Ray Clusters
  - Ray Jobs
    - Quickstart Using the Ray Jobs CLI
    - Python SDK
    - REST API
    - Ray Job Submission API Reference
  - Ray Client
Content changes:
modified "Deploying a Ray Cluster ..." quickstart pages to briefly summarize ad-hoc command execution, then link to jobs
modified Ray Jobs example to be more incremental - start with a simple example, then show long-running script, then show example with a runtime env, instead of all of them at once
center Ray Jobs quickstart around using the CLI. Made some minor changes to the Python SDK page to match it
remove "Ray Jobs Architecture"
moved "Autoscaling" content away from Kubernetes "Getting started" page into its own user guide. I think it's too complicated for "Getting Started". No content cuts.
Cut "Viewing the dashboard" and "Ray Client" from Kubernetes "Getting started" page.
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
A new feature was recently added where Serve replicas are not restarted if only `num_replicas`, `autoscaling_config`, and/or `user_config` are updated in the redeployed config file. This PR updates the docs to describe this feature.
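For illustration, a minimal Python-API sketch of a deployment using the fields involved (the deployment name and config keys are made up); when a redeployed config changes only fields like `num_replicas` or `user_config`, the running replicas are updated in place rather than restarted:

```python
from ray import serve

@serve.deployment(num_replicas=2, user_config={"threshold": 0.5})
class Model:
    def __init__(self):
        self.threshold = None

    # Called on existing replicas when user_config changes,
    # instead of tearing the replicas down and restarting them.
    def reconfigure(self, config: dict):
        self.threshold = config["threshold"]

    def __call__(self, request):
        return {"threshold": self.threshold}
```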
Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
When the node the controller was on died, the GCS would try to reschedule the controller to the same node. But the GCS only marks the node as failed after 120s when the GCS restarts (or 30s if only the raylet died).
This PR fixes it by unpinning the controller from that node and pinning it to the head node instead, so as long as the GCS is alive, it reschedules the controller immediately. We can't turn this on by default, so we introduce an internal flag for it.
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Various cleanups around the Ray cluster "Monitoring and observability" docs. After #27723, we will move these to a common page outside of the VMs/k8s subsections:
- Add links to the more comprehensive observability section.
- Move and clean up cluster-specific content from the Prometheus metrics page to the new Ray Cluster page. I also modified a bunch of the text here because previously we were not very clear about what the recommended approach was.
- Include more specific instructions about setting up observability tools for VMs vs. k8s.
A user raised an issue in #26605, where they found the error message quite non-actionable when partition filtering the input files left no files with the required extension.
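For context, a rough sketch of the kind of call that hits this path, assuming the `partition_filter` argument and `FileExtensionFilter` from the Ray Data API of that era; the bucket path is purely illustrative:

```python
import ray
from ray.data.datasource import FileExtensionFilter

# If the directory contains no files ending in ".csv", every file is filtered
# out before reading, and the error message should point the user at the
# extension filter rather than failing cryptically.
ds = ray.data.read_csv(
    "s3://example-bucket/logs/",  # illustrative path with no .csv files
    partition_filter=FileExtensionFilter("csv"),
)
```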
Signed-off-by: Cheng Su <scnju13@gmail.com>
This PR adds the structure described here, namely a new section under Ray Clusters focused on running applications on Ray clusters.
Signed-off-by: Cade Daniel <cade@anyscale.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
This PR:
- Copies the existing clusters API reference to the new structure. The reference docs are split into Ray Clusters (common between VMs and k8s) and Ray Clusters on VMs (specific to VMs). Notably, there is also a reference section for k8s, but it is not in this PR.
- Moves the three job submission user guides back into a single one. Jules had suggested that we break them out into REST/SDK/CLI, but that's not P0 right now.
- Fixes some bugs in the left navigation bar; there should now be fewer duplicate TOC entries. I'll keep working on related fixes in a different PR.
Signed-off-by: Cade Daniel <cade@anyscale.com>
When we deserialize an actor handle via pickle, we register it with an outer object ref equal to itself, which is wrong. For out-of-band deserialization, there should be no outer object ref.
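For context, a small sketch of what "out-of-band" (de)serialization of an actor handle looks like, i.e. pickling the handle outside of Ray's normal task/object path; the actor class is made up for illustration:

```python
import ray
from ray import cloudpickle

@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

    def incr(self):
        self.n += 1
        return self.n

ray.init()
counter = Counter.remote()

# Out-of-band: the handle is serialized by the application itself rather than
# being passed through a task argument or ray.put, so it should not be
# registered with an outer object ref.
blob = cloudpickle.dumps(counter)
restored = cloudpickle.loads(blob)
print(ray.get(restored.incr.remote()))
```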
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
The cluster address is now written to a temp file. Previously, we raised an error if `ray start --head` tried to reuse the old cluster address in the temp file, even if Ray was no longer running. This PR allows `ray start --head` to continue if it can't find any GCS process associated with the recorded cluster address.
Related issue number
Closes #27021.
This PR is an edit pass on the Performance Tuning page after reading it with fresh eyes. None of the content was out of date, so it's mostly nits and rewording of some parts that were slightly confusing.