Adds a page describing a development workflow for Serve applications.
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
The "Monitoring Ray Serve" page explains how to inspect your Ray Serve applications. This change updates the page to remove outdated metrics that Serve no longer exposes and to upgrade code samples to use 2.0 APIs. It also improves the content's readability and organization.
Link to updated "Monitoring Ray Serve" page: https://ray--27777.org.readthedocs.build/en/27777/serve/monitoring.html
Refactor Datasets API docs for easier navigation: [Ray Datasets API](https://ray--27592.org.readthedocs.build/en/27592/data/api/api.html)
### Changes
1. Create a new Datasets API base page.
2. Split existing APIs into separate pages.
3. Split `Dataset` and `DatasetPipeline` methods into separate sections.
   1. Used `autosummary` to generate overview tables at the top of each of these pages. Open to other suggestions, e.g. moving the summary to the top of each section instead.
   2. **Note:** Every time we add a new method, we need to explicitly add it here as well.
4. Add Input/Output APIs.
   1. I chose to split these primarily by data format rather than data type, since it's easier to navigate, and the existing [Creating Datasets](https://docs.ray.io/en/master/data/creating-datasets.html) User Guide already does the latter.
5. Add `Block` and `DataBatch` (should we add these aliases?)
6. Remove the existing `package-ref`.
An actor handle held by the Ray client becomes dangling if the Ray cluster is shut down, and in that case trying to get the actor again results in a crash. This happened to a real user and blocked them from making progress.
This change makes the stats actor detached and, instead of keeping a handle, accesses it by name. This way we make sure the actor is re-created if the cluster gets restarted.
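As a sketch of the pattern (the actor, name, and namespace below are illustrative stand-ins for the internal Datasets stats actor):

```python
import ray


@ray.remote
class StatsActor:
    """Illustrative stand-in for the internal Datasets stats actor."""

    def __init__(self):
        self.stats = {}

    def record(self, dataset_id, stats):
        self.stats[dataset_id] = stats


def get_or_create_stats_actor():
    # Access the actor by name rather than holding onto a handle.
    # lifetime="detached" decouples it from the creating job, and
    # get_if_exists=True transparently re-creates it after a cluster restart.
    return StatsActor.options(
        name="datasets_stats_actor",  # hypothetical name
        namespace="datasets",         # hypothetical namespace
        lifetime="detached",
        get_if_exists=True,
    ).remote()
```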
Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
Page structure changes:
- Deploying a Ray Cluster on Kubernetes
  - Getting Started -> links to jobs
- Deploying a Ray Cluster on VMs
  - Getting Started -> links to jobs
- User Guides
  - Autoscaling (moved more content here in favor of the Getting Started page)
- Running Applications on Ray Clusters
  - Ray Jobs
    - Quickstart Using the Ray Jobs CLI
    - Python SDK
    - REST API
    - Ray Job Submission API Reference
  - Ray Client
Content changes:
modified "Deploying a Ray Cluster ..." quickstart pages to briefly summarize ad-hoc command execution, then link to jobs
modified Ray Jobs example to be more incremental - start with a simple example, then show long-running script, then show example with a runtime env, instead of all of them at once
center Ray Jobs quickstart around using the CLI. Made some minor changes to the Python SDK page to match it
remove "Ray Jobs Architecture"
moved "Autoscaling" content away from Kubernetes "Getting started" page into its own user guide. I think it's too complicated for "Getting Started". No content cuts.
Cut "Viewing the dashboard" and "Ray Client" from Kubernetes "Getting started" page.
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
A new feature was recently added where Serve replicas are not restarted if only `num_replicas`, `autoscaling_config`, and/or `user_config` are updated in the redeployed config file. This change updates the docs to describe this feature.
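As a rough sketch of what the feature enables (the deployment and field names below are illustrative): updating `user_config` in a redeployed config calls `reconfigure` on live replicas instead of restarting them.

```python
from ray import serve


@serve.deployment(num_replicas=2, user_config={"threshold": 0.5})
class Model:
    def reconfigure(self, config: dict):
        # Invoked on each running replica when user_config changes,
        # so the update applies without tearing the replica down.
        self.threshold = config["threshold"]

    async def __call__(self, request):
        return {"threshold": self.threshold}


app = Model.bind()
```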
Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
When the node running the controller dies, GCS tries to reschedule the controller to the same node. But GCS only marks the node as failed after 120s when GCS restarts (or 30s if only the raylet died).
This PR fixes this by unpinning the controller from the head node, so as long as GCS is alive, it reschedules the controller immediately. Since we can't turn this on by default, we introduce an internal flag for it.
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Various cleanups around the Ray cluster "Monitoring and observability" docs. After #27723, we will move these to a common page outside the VMs/k8s subsections:
- Add links to the more comprehensive observability section.
- Move and clean up cluster-specific content from Prometheus metrics to the new Ray Cluster page. I also modified a bunch of text here because we previously weren't very clear about the recommended approach.
- Include more specific instructions about setting up observability tools for VMs vs. k8s.
A user raised an issue in #26605: when partition filtering input files and no files with the required extension are found, the error message is quite non-actionable.
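For illustration, a hypothetical sketch of the kind of actionable error the fix aims for (this helper is not the actual Datasets code):

```python
def filter_paths_by_extension(paths, extensions):
    # Hypothetical helper: fail with an actionable message instead of a
    # generic error when partition filtering leaves no matching files.
    matched = [p for p in paths if p.split(".")[-1] in extensions]
    if not matched:
        raise ValueError(
            f"No input files found with extensions {extensions} after "
            f"partition filtering {len(paths)} file(s). Check that the "
            "dataset path, partition filter, and file extensions are correct."
        )
    return matched
```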
Signed-off-by: Cheng Su <scnju13@gmail.com>
This adds the structure described here, namely a new section under Ray Clusters focused on running applications on Ray clusters.
Signed-off-by: Cade Daniel <cade@anyscale.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
This PR:
- Copies the existing clusters API reference to the new structure. The reference docs are split out into Ray Clusters (common between VMs and k8s) and Ray Clusters on VMs (specific to VMs). Notably, there is also a reference section for k8s, but it is not in this PR.
- Moves the three job submission user guides back into a single one. Jules had suggested that we break them out into REST/SDK/CLI, but that's not a P0 right now.
- Fixes some bugs in the left navigation bar. There should be less duplication of TOC entries. I'll keep working on related fixes in a different PR.
Signed-off-by: Cade Daniel <cade@anyscale.com>