[Docs][Kubernetes] Fix link, add a bit of content (#28017)
Fixes the "legacy operator" link to point to master, rather than the 2.0.0 branch. The migration README exists in master but not in the 2.0.0 branch. Adds a sentence explaining that the Ray container has to go first in the container list. Adds a sentence to the config guide mentioning min/max replicas and linking to autoscaling. Documents a bug related to GPU auto-detection in KubeRay 0.3.0.

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
parent 96d579a4fe
commit ce99cf1b71
3 changed files with 45 additions and 9 deletions
@@ -90,7 +90,7 @@ the project.
 and discussion of new and upcoming features.
 
 ```{note}
-The KubeRay operator replaces the older Ray operator hosted in the [Ray repository](https://github.com/ray-project/ray/tree/releases/2.0.0/python/ray/ray_operator).
+The KubeRay operator replaces the older Ray operator hosted in the [Ray repository](https://github.com/ray-project/ray/tree/master/python/ray/ray_operator).
 Check the linked README for migration notes.
 If you have used the legacy Ray operator in the past,
@@ -121,7 +121,8 @@ specified under `headGroupSpec`, while configuration for worker pods is
 specified under `workerGroupSpecs`. There may be multiple worker groups,
 each group with its own configuration. The `replicas` field
 of a `workerGroupSpec` specifies the number of worker pods of that group to
-keep in the cluster.
+keep in the cluster. Each `workerGroupSpec` also has optional `minReplicas` and
+`maxReplicas` fields; these fields are important if you wish to enable {ref}`autoscaling <kuberay-autoscaling-config>`.
 
 ### Pod templates
 The bulk of the configuration for a `headGroupSpec` or
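For concreteness, here is a minimal sketch of how `replicas`, `minReplicas`, and `maxReplicas` fit together in a `workerGroupSpec`; the group name, counts, and image are illustrative rather than taken from the diff:

```yaml
workerGroupSpecs:
  - groupName: small-group   # illustrative name
    replicas: 1              # worker pods of this group to keep in the cluster
    minReplicas: 0           # lower bound used when autoscaling is enabled
    maxReplicas: 5           # upper bound used when autoscaling is enabled
    template:
      spec:
        containers:
          - name: ray-worker
            image: rayproject/ray:2.0.0
```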
@@ -129,6 +130,14 @@ The bulk of the configuration for a `headGroupSpec` or
 template which determines the configuration for the pods in the group.
 Here are some of the subfields of the pod `template` to pay attention to:
 
+#### containers
+A Ray pod template specifies at minimum one container, namely the container
+that runs the Ray processes. A Ray pod template may also specify additional sidecar
+containers, for purposes such as {ref}`log processing <kuberay-logging>`. However, the KubeRay operator assumes that
+the first container in the containers list is the main Ray container.
+Therefore, make sure to specify any sidecar containers
+**after** the main Ray container. In other words, the Ray container should be the **first**
+in the `containers` list.
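A sketch of the ordering requirement described above; the container names and the log-processing sidecar image are hypothetical, assuming a Fluent Bit sidecar:

```yaml
template:
  spec:
    containers:
      # The main Ray container must come first in the list.
      - name: ray-head
        image: rayproject/ray:2.0.0
      # Sidecars, e.g. for log processing, go after the Ray container.
      - name: fluentbit
        image: fluent/fluent-bit:1.9.6
```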
+
 #### resources
 It’s important to specify container CPU and memory requests and limits for
@@ -153,8 +162,20 @@ Note that CPU quantities will be rounded up to the nearest integer
 before being relayed to Ray.
 The resource capacities advertised to Ray may be overridden in the {ref}`rayStartParams`.
 
+:::{warning}
+Due to a [bug](https://github.com/ray-project/kuberay/pull/497) in KubeRay 0.3.0,
+the following piece of configuration is required to advertise the presence of GPUs
+to Ray.
+```yaml
+rayStartParams:
+    num-gpus: "1"
+```
+Future releases of KubeRay will not require this. (GPU quantities will be correctly auto-detected
+from container limits.)
+:::
+
 On the other hand CPU, GPU, and memory **requests** will be ignored by Ray.
-For this reason, it is best when possible to set resource requests equal to resource limits.
+For this reason, it is best when possible to **set resource requests equal to resource limits**.
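To illustrate the recommendation, a hedged sketch of a container `resources` block with requests set equal to limits; the quantities are placeholders:

```yaml
resources:
  requests:        # Ray ignores requests, so mirror the limits here
    cpu: 3
    memory: 50Gi
  limits:          # Ray reads its CPU/GPU/memory capacities from the limits
    cpu: 3
    memory: 50Gi
```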
 
 #### nodeSelector and tolerations
 You can control the scheduling of worker groups' Ray pods by setting the `nodeSelector` and
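A hedged sketch of `nodeSelector` and `tolerations` in a worker group's pod template; the node label and the taint key are hypothetical, assuming a tainted GPU node pool:

```yaml
template:
  spec:
    nodeSelector:
      gpu-node: "true"          # hypothetical node label
    tolerations:
      - key: nvidia.com/gpu     # tolerate a hypothetical GPU taint
        operator: Exists
        effect: NoSchedule
```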
@@ -209,9 +230,8 @@ Note that the values of all Ray start parameters, including `num-cpus`,
 must be supplied as **strings**.
 
 ### num-gpus
-This field specifies the number of GPUs available to the Ray container.
-In future KubeRay versions, the number of GPUs will be auto-detected from Ray container resource limits.
-Note that the values of all Ray start parameters, including `num-gpus`,
-must be supplied as **strings**.
+This optional field specifies the number of GPUs available to the Ray container.
+In KubeRay versions since 0.3.0, the number of GPUs can be auto-detected from Ray container resource limits.
+For certain advanced use-cases, you may wish to use `num-gpus` to set an {ref}`override <kuberay-gpu-override>`.
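Putting the start-parameter rules together, a minimal sketch (the CPU and GPU counts are placeholders) showing that every `rayStartParams` value is passed as a string:

```yaml
rayStartParams:
  num-cpus: "3"   # quoted: all Ray start parameter values must be strings
  num-gpus: "1"   # optional override of the auto-detected GPU count
```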
@@ -35,6 +35,8 @@ to 5 GPU workers.
 .. code-block:: yaml
 
     groupName: gpu-group
+    rayStartParams:
+        num-gpus: "1" # Advertise GPUs to Ray.
     replicas: 0
     minReplicas: 0
    maxReplicas: 5
@@ -47,17 +49,30 @@
              image: rayproject/ray-ml:2.0.0-gpu
              ...
              resources:
-              cpu: 3
-              memory: 50Gi
-              nvidia.com/gpu: 1 # Optional, included just for documentation.
-              limits:
-               cpu: 3
-               memory: 50Gi
+              limits:
+               nvidia.com/gpu: 1 # Required to use GPU.
+               cpu: 3
+               memory: 50Gi
              ...
 
 Each of the Ray pods in the group can be scheduled on an AWS `p2.xlarge` instance (1 GPU, 4vCPU, 61Gi RAM).
 
+.. warning::
+
+   Note the following piece of required configuration:
+
+   .. code-block:: yaml
+
+      rayStartParams:
+         num-gpus: "1"
+
+   This extra configuration is required due to a `bug`_ in KubeRay 0.3.0.
+   KubeRay master does not require this piece of configuration, nor will future KubeRay releases;
+   the GPU Ray start parameters will be auto-detected from container resource limits.
+
 .. tip::
 
    GPU instances are expensive -- consider setting up autoscaling for your GPU Ray workers,
@@ -215,3 +230,4 @@ and about Nvidia's GPU plugin for Kubernetes `here <https://github.com/NVIDIA/k8
 .. _`admission controller`: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/
 .. _`ExtendedResourceToleration`: https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#extendedresourcetoleration
 .. _`Kubernetes docs`: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
+.. _`bug`: https://github.com/ray-project/kuberay/pull/497/