the GPU request equal to the limit.) The configuration for a pod running a Ray GPU image and
using one Nvidia GPU looks like this:
.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: example-cluster-ray-worker
    spec:
      ...
      containers:
        - name: ray-node
          image: rayproject/ray:nightly-gpu
          ...
          resources:
            requests:
              cpu: 1000m
              memory: 512Mi
            limits:
              memory: 512Mi
              nvidia.com/gpu: 1
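Kubernetes treats ``nvidia.com/gpu`` as an extended resource: if you specify only a GPU limit, the GPU request defaults to that limit, and if you do write the request out explicitly, it must equal the limit. As a minimal sketch (reusing the placeholder CPU and memory values from the example above), the explicit form of the resources section looks like this:

.. code-block:: yaml

    # Equivalent resources section with the GPU request written out explicitly.
    # For extended resources like nvidia.com/gpu, the request must equal the limit,
    # so this is the same as specifying only the limit.
    resources:
      requests:
        cpu: 1000m
        memory: 512Mi
        nvidia.com/gpu: 1
      limits:
        memory: 512Mi
        nvidia.com/gpu: 1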
GPU taints and tolerations
--------------------------
.. note::

    If you are using a managed Kubernetes service, you probably don't need to worry about this section.
The `Nvidia gpu plugin`_ for Kubernetes applies `taints`_ to GPU nodes; these taints prevent non-GPU pods from being scheduled on GPU nodes.
Managed Kubernetes services like GKE, EKS, and AKS automatically apply matching `tolerations`_
to pods requesting GPU resources. Tolerations are applied by means of Kubernetes's `ExtendedResourceToleration`_ `admission controller`_.
If this admission controller is not enabled for your Kubernetes cluster, you may need to manually add a GPU toleration to each of your GPU pod configurations. For example:
.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: example-cluster-ray-worker
    spec:
      ...
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
      ...
      containers:
        - name: ray-node
          image: rayproject/ray:nightly-gpu
          ...
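Putting the two snippets together, a sketch of a GPU worker pod spec with both the GPU resource limit and a manual toleration might look like the following. The ``...`` lines stand in for the rest of your pod configuration, as in the examples above:

.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      generateName: example-cluster-ray-worker
    spec:
      ...
      # Allow this pod to be scheduled onto tainted GPU nodes.
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
      containers:
        - name: ray-node
          image: rayproject/ray:nightly-gpu
          ...
          resources:
            requests:
              cpu: 1000m
              memory: 512Mi
            limits:
              memory: 512Mi
              # Requesting the GPU via the limit; the request defaults to the limit.
              nvidia.com/gpu: 1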
Further reference and discussion
--------------------------------
Read about Kubernetes device plugins `here <https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/>`__,
about Kubernetes GPU plugins `here <https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus>`__,
and about Nvidia's GPU plugin for Kubernetes `here <https://github.com/NVIDIA/k8s-device-plugin>`__.
If you run into problems setting up GPUs for your Ray cluster on Kubernetes, please reach out to us at `<https://discuss.ray.io>`_.