[kuberay][docs] Experimental features (#27898)

2025-03-04 17:41:43 -05:00 · 2022-08-18 11:37:06 -07:00 · 2022-08-18 11:37:06 -07:00 · f8c74c8de5
commit f8c74c8de5
parent 17a8db048f
4 changed files with 50 additions and 0 deletions
--- a/doc/source/_toc.yml
+++ b/doc/source/_toc.yml
@ -267,6 +267,7 @@ parts:
            - file: cluster/kubernetes/user-guides/configuring-autoscaling.md
            - file: cluster/kubernetes/user-guides/logging.md
            - file: cluster/kubernetes/user-guides/gpu.md
+            - file: cluster/kubernetes/user-guides/experimental.md
        - file: cluster/kubernetes/examples
          sections:
            - file: cluster/kubernetes/examples/ml-example.md
--- a/doc/source/cluster/kubernetes/user-guides.md
+++ b/doc/source/cluster/kubernetes/user-guides.md
@ -13,3 +13,4 @@ deployments of Ray on Kubernetes.
 * {ref}`kuberay-autoscaling`
 * {ref}`kuberay-gpu`
 * {ref}`kuberay-logging`
+* {ref}`kuberay-experimental`
--- a/doc/source/cluster/kubernetes/user-guides/experimental.md
+++ b/doc/source/cluster/kubernetes/user-guides/experimental.md
@ -0,0 +1,47 @@
+(kuberay-experimental)=
+
+# Experimental Features
+
+We provide an overview of new and experimental features available
+for deployments of Ray on Kubernetes.
+
+## RayServices
+
+The `RayService` controller enables fault-tolerant deployments of
+{ref}`Ray Serve <rayserve>` applications on Kubernetes.
+
+If your Ray Serve application enters an unhealthy state, the RayService controller will create a new Ray Cluster.
+Once the new cluster is ready, Ray Serve traffic will be re-routed to the new Ray cluster.
+
+For details, see the guide on {ref}`Kubernetes-based RayServe deployments <serve-in-production-kubernetes>`.
+
+## GCS Fault Tolerance
+
+In addition to the application-level fault-tolerance provided by the RayService controller,
+Ray now supports infrastructure-level fault tolerance for the Ray head pod.
+
+You can set up an external Redis instance as a data store for the Ray head. If the Ray head crashes,
+a new head will be created without restarting the Ray cluster.
+The Ray head's GCS will recover its state from the external Redis instance.
+
+See the {ref}`Ray Serve documentation <serve-head-node-failure>` for more information and
+the [KubeRay docs on GCS Fault Tolerance][KubeFT] for a detailed guide.
+
+## RayJobs
+
+The `RayJob` custom resource consists of two elements:
+1. Configuration for a Ray cluster.
+2. A job, i.e. a Ray program to be executed on the Ray cluster.
+
+To run a Ray job, you create a RayJob CR:
+```shell
+kubectl apply -f rayjob.yaml
+```
+The RayJob controller then creates the Ray cluster and runs the job.
+If you wish, you may configure the Ray cluster to be deleted when the job finishes.
+
+See the [KubeRay docs on RayJobs][KubeJob] for details.
+
+[KubeServe]: https://ray-project.github.io/kuberay/guidance/rayservice/
+[KubeFT]: https://ray-project.github.io/kuberay/guidance/gcs-ft/
+[KubeJob]: https://ray-project.github.io/kuberay/guidance/rayjob/
--- a/doc/source/serve/production-guide/failures.md
+++ b/doc/source/serve/production-guide/failures.md
@ -31,6 +31,7 @@ You can also customize how frequently the health check is run and the timeout af
 >             raise RuntimeError("uh-oh, DB connection is broken.")
 > ```

+(serve-head-node-failure)=
 ## Head node failures

 By default the Ray head node is a single point of failure: if it crashes, the entire cluster crashes and needs to be restarted.