[kuberay][docs] Experimental features (#27898)

Dmitri Gekhtman 2022-08-18 11:37:06 -07:00 committed by scv119
parent 17a8db048f
commit f8c74c8de5
4 changed files with 50 additions and 0 deletions


@@ -267,6 +267,7 @@ parts:
- file: cluster/kubernetes/user-guides/configuring-autoscaling.md
- file: cluster/kubernetes/user-guides/logging.md
- file: cluster/kubernetes/user-guides/gpu.md
- file: cluster/kubernetes/user-guides/experimental.md
- file: cluster/kubernetes/examples
sections:
- file: cluster/kubernetes/examples/ml-example.md


@@ -13,3 +13,4 @@ deployments of Ray on Kubernetes.
* {ref}`kuberay-autoscaling`
* {ref}`kuberay-gpu`
* {ref}`kuberay-logging`
* {ref}`kuberay-experimental`


@@ -0,0 +1,47 @@
(kuberay-experimental)=
# Experimental Features
This page gives an overview of new and experimental features available
for deployments of Ray on Kubernetes.
## RayServices
The `RayService` controller enables fault-tolerant deployments of
{ref}`Ray Serve <rayserve>` applications on Kubernetes.
If your Ray Serve application enters an unhealthy state, the RayService controller creates a new Ray cluster.
Once the new cluster is ready, Ray Serve traffic is rerouted to it.
For details, see the guide on {ref}`Kubernetes-based RayServe deployments <serve-in-production-kubernetes>`.
## GCS Fault Tolerance
In addition to the application-level fault tolerance provided by the RayService controller,
Ray now supports infrastructure-level fault tolerance for the Ray head pod.
You can set up an external Redis instance as a data store for the Ray head. If the Ray head crashes,
a new head will be created without restarting the Ray cluster.
The Ray head's GCS will recover its state from the external Redis instance.
See the {ref}`Ray Serve documentation <serve-head-node-failure>` for more information and
the [KubeRay docs on GCS Fault Tolerance][KubeFT] for a detailed guide.
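
As a rough illustration of the setup described above, the sketch below patches a RayCluster manifest (held as a Python dict) so that the head pod points at an external Redis instance. The helper name `enable_gcs_ft` and the Redis address are hypothetical. The annotation `ray.io/ft-enabled` and the environment variable `RAY_REDIS_ADDRESS` are assumptions based on the KubeRay GCS fault tolerance guide; check them against the KubeRay version you run.

```python
import json


def enable_gcs_ft(raycluster, redis_address):
    """Return a copy of a RayCluster manifest with GCS fault tolerance settings.

    Hypothetical helper: the annotation and env var names below are
    assumptions taken from the KubeRay GCS fault tolerance guide.
    """
    # Deep-copy via a JSON round-trip so the caller's manifest is not mutated.
    patched = json.loads(json.dumps(raycluster))

    # Mark the cluster as fault-tolerance enabled (assumed annotation name).
    annotations = patched.setdefault("metadata", {}).setdefault("annotations", {})
    annotations["ray.io/ft-enabled"] = "true"

    # Point the head container at the external Redis (assumed env var name).
    head = patched["spec"]["headGroupSpec"]["template"]["spec"]["containers"][0]
    head.setdefault("env", []).append(
        {"name": "RAY_REDIS_ADDRESS", "value": redis_address}
    )
    return patched


# Minimal placeholder manifest; a real RayCluster needs more fields.
base_cluster = {
    "apiVersion": "ray.io/v1alpha1",
    "kind": "RayCluster",
    "metadata": {"name": "raycluster-sample"},
    "spec": {
        "headGroupSpec": {
            "template": {"spec": {"containers": [{"name": "ray-head"}]}}
        }
    },
}

# "redis:6379" is a placeholder address for the external Redis instance.
ft_cluster = enable_gcs_ft(base_cluster, "redis:6379")
print(json.dumps(ft_cluster, indent=2))
```

The printed JSON can be saved to a file and passed to `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.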
## RayJobs
The `RayJob` custom resource consists of two elements:
1. Configuration for a Ray cluster.
2. A job, i.e. a Ray program to be executed on the Ray cluster.
To run a Ray job, create a RayJob custom resource (CR):
```shell
kubectl apply -f rayjob.yaml
```
The RayJob controller then creates the Ray cluster and runs the job.
Optionally, you can configure the Ray cluster to be deleted when the job finishes.
See the [KubeRay docs on RayJobs][KubeJob] for details.
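
To make the two elements of a RayJob concrete, here is a minimal sketch that builds a manifest as a Python dict and prints it as JSON. The helper `build_rayjob_manifest`, the job name, the entrypoint path, and the container image are hypothetical. The field names (`entrypoint`, `shutdownAfterJobFinishes`, `rayClusterSpec`) and the `ray.io/v1alpha1` API version are assumptions based on the KubeRay RayJob guide and may differ in your KubeRay version.

```python
import json


def build_rayjob_manifest(name, entrypoint, delete_cluster_when_done=True):
    """Return a RayJob manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "ray.io/v1alpha1",  # assumed CRD version
        "kind": "RayJob",
        "metadata": {"name": name},
        "spec": {
            # Element 2: the job, i.e. the Ray program to execute.
            "entrypoint": entrypoint,
            # Delete the Ray cluster once the job finishes (assumed field name).
            "shutdownAfterJobFinishes": delete_cluster_when_done,
            # Element 1: configuration for the Ray cluster.
            "rayClusterSpec": {
                "headGroupSpec": {
                    "rayStartParams": {"dashboard-host": "0.0.0.0"},
                    "template": {
                        "spec": {
                            "containers": [
                                # Placeholder image; use the one you deploy with.
                                {"name": "ray-head", "image": "rayproject/ray:2.0.0"}
                            ]
                        }
                    },
                },
            },
        },
    }


manifest = build_rayjob_manifest("rayjob-sample", "python /home/ray/samples/sample_code.py")
print(json.dumps(manifest, indent=2))
```

Saving the printed output as `rayjob.json` and running `kubectl apply -f rayjob.json` would then create the CR, since kubectl accepts JSON manifests as well as YAML.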
[KubeServe]: https://ray-project.github.io/kuberay/guidance/rayservice/
[KubeFT]: https://ray-project.github.io/kuberay/guidance/gcs-ft/
[KubeJob]: https://ray-project.github.io/kuberay/guidance/rayjob/


@@ -31,6 +31,7 @@ You can also customize how frequently the health check is run and the timeout af
> raise RuntimeError("uh-oh, DB connection is broken.")
> ```
(serve-head-node-failure)=
## Head node failures
By default, the Ray head node is a single point of failure: if it crashes, the entire cluster crashes and needs to be restarted.