[kuberay][docs] Experimental features (#27898)
parent 17a8db048f
commit f8c74c8de5
4 changed files with 50 additions and 0 deletions
@@ -267,6 +267,7 @@ parts:
 - file: cluster/kubernetes/user-guides/configuring-autoscaling.md
 - file: cluster/kubernetes/user-guides/logging.md
 - file: cluster/kubernetes/user-guides/gpu.md
+- file: cluster/kubernetes/user-guides/experimental.md
 - file: cluster/kubernetes/examples
   sections:
   - file: cluster/kubernetes/examples/ml-example.md
@@ -13,3 +13,4 @@ deployments of Ray on Kubernetes.
 * {ref}`kuberay-autoscaling`
 * {ref}`kuberay-gpu`
 * {ref}`kuberay-logging`
+* {ref}`kuberay-experimental`
doc/source/cluster/kubernetes/user-guides/experimental.md (new file, 47 additions)
@@ -0,0 +1,47 @@
(kuberay-experimental)=

# Experimental Features

We provide an overview of new and experimental features available
for deployments of Ray on Kubernetes.

## RayServices

The `RayService` controller enables fault-tolerant deployments of
{ref}`Ray Serve <rayserve>` applications on Kubernetes.

If your Ray Serve application enters an unhealthy state, the RayService controller will create a new Ray cluster.
Once the new cluster is ready, Ray Serve traffic will be re-routed to the new Ray cluster.

For details, see the guide on {ref}`Kubernetes-based Ray Serve deployments <serve-in-production-kubernetes>`.
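
For illustration, here is a minimal sketch of deploying and inspecting a RayService, assuming KubeRay is installed and your RayService manifest is saved as `rayservice.yaml` (a placeholder file name):

```shell
# Create the RayService custom resource; the controller then creates the
# underlying Ray cluster and deploys the Serve application onto it.
kubectl apply -f rayservice.yaml

# Inspect the RayService and the Ray cluster it manages.
kubectl get rayservices
kubectl get rayclusters
```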

## GCS Fault Tolerance

In addition to the application-level fault tolerance provided by the RayService controller,
Ray now supports infrastructure-level fault tolerance for the Ray head pod.

You can set up an external Redis instance as a data store for the Ray head. If the Ray head crashes,
a new head will be created without restarting the Ray cluster.
The Ray head's GCS will recover its state from the external Redis instance.

See the {ref}`Ray Serve documentation <serve-head-node-failure>` for more information and
the [KubeRay docs on GCS Fault Tolerance][KubeFT] for a detailed guide.
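
For illustration, here is a sketch of the RayCluster settings involved, assuming the `ray.io/ft-enabled` annotation and the `RAY_REDIS_ADDRESS` environment variable described in the KubeRay GCS fault tolerance guide; the Redis address is a placeholder, and a real manifest also needs the usual head and worker group configuration:

```yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-external-redis
  annotations:
    # Assumption: this annotation opts the cluster into GCS fault tolerance.
    ray.io/ft-enabled: "true"
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.0.0
            env:
              # Assumption: address of the external Redis that backs the GCS.
              - name: RAY_REDIS_ADDRESS
                value: "redis:6379"
```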

## RayJobs

The `RayJob` custom resource consists of two elements:
1. Configuration for a Ray cluster.
2. A job, i.e., a Ray program to be executed on the Ray cluster.

To run a Ray job, you create a RayJob CR:
```shell
kubectl apply -f rayjob.yaml
```
The RayJob controller then creates the Ray cluster and runs the job.
Optionally, you can configure the Ray cluster to be deleted when the job finishes.

See the [KubeRay docs on RayJobs][KubeJob] for details.
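
For illustration, a minimal sketch of what `rayjob.yaml` might contain, assuming the `entrypoint`, `shutdownAfterJobFinishes`, and `rayClusterSpec` fields described in the KubeRay RayJob guide; the entrypoint and image are placeholders:

```yaml
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: rayjob-sample
spec:
  # The Ray program to run on the cluster (placeholder path).
  entrypoint: python /home/ray/samples/my_script.py
  # Optionally tear the Ray cluster down once the job finishes.
  shutdownAfterJobFinishes: true
  # Configuration for the Ray cluster that runs the job.
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.0.0
```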

[KubeServe]: https://ray-project.github.io/kuberay/guidance/rayservice/
[KubeFT]: https://ray-project.github.io/kuberay/guidance/gcs-ft/
[KubeJob]: https://ray-project.github.io/kuberay/guidance/rayjob/
@@ -31,6 +31,7 @@ You can also customize how frequently the health check is run and the timeout af
 > raise RuntimeError("uh-oh, DB connection is broken.")
 > ```

+(serve-head-node-failure)=
 ## Head node failures

 By default the Ray head node is a single point of failure: if it crashes, the entire cluster crashes and needs to be restarted.