diff --git a/doc/source/cluster/kubernetes/index.md b/doc/source/cluster/kubernetes/index.md
index 66e2d7552..581570466 100644
--- a/doc/source/cluster/kubernetes/index.md
+++ b/doc/source/cluster/kubernetes/index.md
@@ -90,7 +90,7 @@ the project.
and discussion of new and upcoming features.

```{note}
-The KubeRay operator replaces the older Ray operator hosted in the [Ray repository](https://github.com/ray-project/ray/tree/releases/2.0.0/python/ray/ray_operator).
+The KubeRay operator replaces the older Ray operator hosted in the [Ray repository](https://github.com/ray-project/ray/tree/master/python/ray/ray_operator).
Check the linked README for migration notes.
If you have used the legacy Ray operator in the past,
diff --git a/doc/source/cluster/kubernetes/user-guides/config.md b/doc/source/cluster/kubernetes/user-guides/config.md
index 4a5edfae0..4fe223259 100644
--- a/doc/source/cluster/kubernetes/user-guides/config.md
+++ b/doc/source/cluster/kubernetes/user-guides/config.md
@@ -121,7 +121,8 @@ specified under `headGroupSpec`, while configuration for worker pods is
specified under `workerGroupSpecs`. There may be multiple worker groups,
each group with its own configuration. The `replicas` field
of a `workerGroupSpec` specifies the number of worker pods of that group to
-keep in the cluster.
+keep in the cluster. Each `workerGroupSpec` also has optional `minReplicas` and
+`maxReplicas` fields; these fields are important if you wish to enable {ref}`autoscaling `.

### Pod templates
The bulk of the configuration for a `headGroupSpec` or
@@ -129,6 +130,14 @@ The bulk of the configuration for a `headGroupSpec` or
template which determines the configuration for the pods in the group.
Here are some of the subfields of the pod `template` to pay attention to:

+#### containers
+A Ray pod template specifies at minimum one container, namely the container
+that runs the Ray processes. A Ray pod template may also specify additional sidecar
+containers, for purposes such as {ref}`log processing `.
+However, the KubeRay operator assumes that the first container in the `containers` list
+is the main Ray container. Therefore, make sure to specify any sidecar containers
+**after** the main Ray container; in other words, the Ray container should be the **first**
+entry in the `containers` list.

#### resources
It’s important to specify container CPU and memory requests and limits for
@@ -153,8 +162,20 @@ Note that CPU quantities will be rounded up to the nearest integer before being
relayed to Ray.
The resource capacities advertised to Ray may be overridden in the {ref}`rayStartParams`.

+:::{warning}
+Due to a [bug](https://github.com/ray-project/kuberay/pull/497) in KubeRay 0.3.0,
+the following piece of configuration is required to advertise the presence of GPUs
+to Ray:
+```yaml
+rayStartParams:
+  num-gpus: "1"
+```
+Future releases of KubeRay will not require this. (GPU quantities will be correctly auto-detected
+from container limits.)
+:::
+
On the other hand CPU, GPU, and memory **requests** will be ignored by Ray.
-For this reason, it is best when possible to set resource requests equal to resource limits.
+For this reason, it is best when possible to **set resource requests equal to resource limits**.

#### nodeSelector and tolerations
You can control the scheduling of worker groups' Ray pods by setting the `nodeSelector` and
@@ -209,9 +230,8 @@ Note that the values of all Ray start parameters, including `num-cpus`,
must be supplied as **strings**.

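+For illustration, here is a minimal `rayStartParams` snippet with every value quoted as a
+string; the quantities shown are arbitrary example values, not recommendations:
+```yaml
+rayStartParams:
+  num-cpus: "3"   # string, not the integer 3
+  num-gpus: "1"   # string, not the integer 1
+```
+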
### num-gpus
-This optional field specifies the number of GPUs available to the Ray container.
-In KubeRay versions since 0.3.0, the number of GPUs can be auto-detected from Ray container resource limits.
-For certain advanced use-cases, you may wish to use `num-gpus` to set an {ref}`override `.
+This field specifies the number of GPUs available to the Ray container.
+In future KubeRay versions, the number of GPUs will be auto-detected from Ray container resource limits.

Note that the values of all Ray start parameters, including `num-gpus`,
must be supplied as **strings**.
diff --git a/doc/source/cluster/kubernetes/user-guides/gpu.rst b/doc/source/cluster/kubernetes/user-guides/gpu.rst
index a9408688a..48d347d01 100644
--- a/doc/source/cluster/kubernetes/user-guides/gpu.rst
+++ b/doc/source/cluster/kubernetes/user-guides/gpu.rst
@@ -35,6 +35,8 @@ to 5 GPU workers.
.. code-block:: yaml

    groupName: gpu-group
+   rayStartParams:
+    num-gpus: "1" # Advertise GPUs to Ray.
    replicas: 0
    minReplicas: 0
    maxReplicas: 5
@@ -47,17 +49,30 @@ to 5 GPU workers.
       image: rayproject/ray-ml:2.0.0-gpu
       ...
       resources:
-       cpu: 3
-       memory: 50Gi
        nvidia.com/gpu: 1 # Optional, included just for documentation.
-      limits:
        cpu: 3
        memory: 50Gi
+      limits:
        nvidia.com/gpu: 1 # Required to use GPU.
+       cpu: 3
+       memory: 50Gi
       ...

Each of the Ray pods in the group can be scheduled on an AWS `p2.xlarge` instance (1 GPU, 4vCPU, 61Gi RAM).

+.. warning::
+
+  Note the following piece of required configuration:
+
+  .. code-block:: yaml
+
+    rayStartParams:
+      num-gpus: "1"
+
+  This extra configuration is required due to a `bug`_ in KubeRay 0.3.0.
+  KubeRay master does not require this configuration, nor will future KubeRay releases;
+  the number of GPUs will be auto-detected from container resource limits.
+
.. tip::

  GPU instances are expensive -- consider setting up autoscaling for your GPU Ray workers,
@@ -215,3 +230,4 @@ and about Nvidia's GPU plugin for Kubernetes `here