diff --git a/doc/source/_toc.yml b/doc/source/_toc.yml
index 3396fc461..1b37568c7 100644
--- a/doc/source/_toc.yml
+++ b/doc/source/_toc.yml
@@ -267,6 +267,7 @@ parts:
     - file: cluster/kubernetes/user-guides/configuring-autoscaling.md
     - file: cluster/kubernetes/user-guides/logging.md
     - file: cluster/kubernetes/user-guides/gpu.md
+    - file: cluster/kubernetes/user-guides/experimental.md
     - file: cluster/kubernetes/examples
       sections:
       - file: cluster/kubernetes/examples/ml-example.md
diff --git a/doc/source/cluster/kubernetes/user-guides.md b/doc/source/cluster/kubernetes/user-guides.md
index 3cba63bb5..fd7a7be6d 100644
--- a/doc/source/cluster/kubernetes/user-guides.md
+++ b/doc/source/cluster/kubernetes/user-guides.md
@@ -13,3 +13,4 @@ deployments of Ray on Kubernetes.
 * {ref}`kuberay-autoscaling`
 * {ref}`kuberay-gpu`
 * {ref}`kuberay-logging`
+* {ref}`kuberay-experimental`
diff --git a/doc/source/cluster/kubernetes/user-guides/experimental.md b/doc/source/cluster/kubernetes/user-guides/experimental.md
new file mode 100644
index 000000000..c2a0b7118
--- /dev/null
+++ b/doc/source/cluster/kubernetes/user-guides/experimental.md
@@ -0,0 +1,47 @@
+(kuberay-experimental)=
+
+# Experimental Features
+
+We provide an overview of new and experimental features available
+for deployments of Ray on Kubernetes.
+
+## RayServices
+
+The `RayService` controller enables fault-tolerant deployments of
+{ref}`Ray Serve ` applications on Kubernetes.
+
+If your Ray Serve application enters an unhealthy state, the RayService controller will create a new Ray cluster.
+Once the new cluster is ready, Ray Serve traffic will be re-routed to the new Ray cluster.
+
+For details, see the guide on [Kubernetes-based RayServe deployments][KubeServe].
+
+## GCS Fault Tolerance
+
+In addition to the application-level fault tolerance provided by the RayService controller,
+Ray now supports infrastructure-level fault tolerance for the Ray head pod.
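To make the head-pod fault tolerance introduced above concrete, here is a minimal sketch of a RayCluster manifest with GCS fault tolerance enabled. The `ray.io/ft-enabled` annotation and `RAY_REDIS_ADDRESS` environment variable follow the KubeRay GCS fault tolerance guide, but the resource name, Redis address, and image tag are placeholders; verify the exact schema against your KubeRay version.

```yaml
# Sketch only: a RayCluster wired to an external Redis instance for GCS state.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-gcs-ft          # illustrative name
  annotations:
    ray.io/ft-enabled: "true"      # enable GCS fault tolerance
spec:
  headGroupSpec:
    rayStartParams: {}
    template:
      spec:
        containers:
          - name: ray-head
            image: rayproject/ray:2.0.0   # placeholder image tag
            env:
              # Address of the external Redis instance backing the GCS.
              - name: RAY_REDIS_ADDRESS
                value: redis:6379
```

With a configuration along these lines, a replacement head pod recovers GCS state from Redis rather than starting from scratch.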
+
+You can set up an external Redis instance as a data store for the Ray head. If the Ray head crashes,
+a new head will be created without restarting the Ray cluster.
+The Ray head's GCS will recover its state from the external Redis instance.
+
+See the {ref}`Ray Serve documentation <serve-head-node-failure>` for more information and
+the [KubeRay docs on GCS Fault Tolerance][KubeFT] for a detailed guide.
+
+## RayJobs
+
+The `RayJob` custom resource consists of two elements:
+1. Configuration for a Ray cluster.
+2. A job, i.e. a Ray program to be executed on the Ray cluster.
+
+To run a Ray job, you create a RayJob CR:
+```shell
+kubectl apply -f rayjob.yaml
+```
+The RayJob controller then creates the Ray cluster and runs the job.
+If you wish, you may configure the Ray cluster to be deleted when the job finishes.
+
+See the [KubeRay docs on RayJobs][KubeJob] for details.
+
+[KubeServe]: https://ray-project.github.io/kuberay/guidance/rayservice/
+[KubeFT]: https://ray-project.github.io/kuberay/guidance/gcs-ft/
+[KubeJob]: https://ray-project.github.io/kuberay/guidance/rayjob/
diff --git a/doc/source/serve/production-guide/failures.md b/doc/source/serve/production-guide/failures.md
index f98eaa7b8..147b9d422 100644
--- a/doc/source/serve/production-guide/failures.md
+++ b/doc/source/serve/production-guide/failures.md
@@ -31,6 +31,7 @@ You can also customize how frequently the health check is run and the timeout af
 > raise RuntimeError("uh-oh, DB connection is broken.")
 > ```
 
+(serve-head-node-failure)=
 ## Head node failures
 
 By default the Ray head node is a single point of failure: if it crashes, the entire cluster crashes and needs to be restarted.
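As a sketch of the `rayjob.yaml` applied in the RayJobs section above, a RayJob CR bundles the two elements the docs describe: a Ray cluster configuration and the job to run on it. The field names follow the KubeRay RayJob CRD, but the entrypoint, image, and resource name here are illustrative; consult the KubeRay RayJob guide for the authoritative schema.

```yaml
# Sketch only: a RayJob pairing a cluster spec with a job entrypoint.
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: rayjob-sample              # illustrative name
spec:
  # Element 2: the Ray program to execute on the cluster.
  entrypoint: python /home/ray/samples/sample_code.py   # placeholder script
  # Optionally delete the Ray cluster once the job finishes.
  shutdownAfterJobFinishes: true
  # Element 1: the Ray cluster configuration.
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.0.0   # placeholder image tag
```

Applying a manifest like this with `kubectl apply -f rayjob.yaml` lets the RayJob controller create the cluster, run the entrypoint, and (with the flag above) tear the cluster down afterward.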