
.. _k8s-operator:

The Ray Kubernetes Operator
=================================

Ray provides a `Kubernetes Operator`_ for managing autoscaling Ray clusters.
The operator provides functionality similar to deploying a Ray cluster with
the :ref:`Ray Cluster Launcher<ref-autoscaling>`. However, working with the operator does not require
running Ray locally -- all interactions with your Ray cluster are mediated by Kubernetes.

The operator makes use of a `Kubernetes Custom Resource`_ called a *RayCluster*.
A RayCluster is specified by a configuration similar to the ``yaml`` files used by the Ray Cluster Launcher.
Internally, the operator uses Ray's autoscaler to manage your Ray cluster. However, the autoscaler runs in a
separate operator pod, rather than on the Ray head node. Applying multiple RayCluster custom resources in the
operator's namespace allows the operator to manage several Ray clusters.

The rest of this document explains, step by step, how to use the Ray Kubernetes Operator to launch a Ray cluster on your existing Kubernetes cluster.

.. role:: bash(code)
  :language: bash

.. warning::

  The Ray Kubernetes Operator requires Kubernetes version at least ``v1.17.0``. Check your Kubernetes version with the command
  :bash:`kubectl version`.

.. note::

  The example commands in this document launch six Kubernetes pods, using a total of 6 CPU and 3.5Gi memory.
  If you are experimenting with a test Kubernetes environment such as `minikube`_, make sure to provision sufficient resources, e.g.
  :bash:`minikube start --cpus=6 --memory="4G"`.
  Alternatively, reduce resource usage by editing the ``yaml`` files referenced in this document; for example, reduce ``minWorkers``
  in ``example_cluster.yaml`` and ``example_cluster2.yaml``.

Applying the RayCluster Custom Resource Definition
--------------------------------------------------

First, we need to apply the `Kubernetes Custom Resource Definition`_ (CRD) that defines a RayCluster.

.. note::

  Creating a Custom Resource Definition requires the appropriate Kubernetes cluster-level privileges.

.. code-block:: shell

  $ kubectl apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/cluster_crd.yaml
  customresourcedefinition.apiextensions.k8s.io/rayclusters.cluster.ray.io created
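
Optionally, you can confirm that the CRD is now registered with your Kubernetes cluster; the CRD name below is the one created by the command above.

.. code-block:: shell

  $ kubectl get crd rayclusters.cluster.ray.io
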
Picking a Kubernetes Namespace
-------------------------------

The rest of the Kubernetes resources we will use are `namespaced`_.
You can use an existing namespace for your Ray clusters or create a new one if you have permissions.
For this example, we will create a namespace called ``ray``.

.. code-block:: shell

  $ kubectl create namespace ray
  namespace/ray created

Starting the Operator
----------------------

To launch the operator in our namespace, we execute the following command.

.. code-block:: shell

  $ kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/operator.yaml
  serviceaccount/ray-operator-serviceaccount created
  role.rbac.authorization.k8s.io/ray-operator-role created
  rolebinding.rbac.authorization.k8s.io/ray-operator-rolebinding created
  pod/ray-operator-pod created

The output shows that we've launched a Pod named ``ray-operator-pod``. This is the pod that runs the operator process.
The ServiceAccount, Role, and RoleBinding we have created grant the operator pod the `permissions`_ it needs to manage Ray clusters.
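
If you want to inspect what was just created, a single ``kubectl get`` can list the operator pod together with its RBAC resources; ``ray-operator-pod`` should reach the ``Running`` state once its image has been pulled.

.. code-block:: shell

  $ kubectl -n ray get pod,serviceaccount,role,rolebinding
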
Launching Ray Clusters
----------------------

Finally, to launch a Ray cluster, we create a RayCluster custom resource.

.. code-block:: shell

  $ kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml
  raycluster.cluster.ray.io/example-cluster created

The operator detects the RayCluster resource we've created and launches an autoscaling Ray cluster.
Our RayCluster configuration specifies ``minWorkers:2`` in the second entry of ``spec.podTypes``, so we get a head node and two workers upon launch.

.. note::

  For more details about RayCluster resources, we recommend taking a look at the annotated example ``example_cluster.yaml`` applied in the last command.
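
To give a sense of the shape of that file, here is a heavily abridged sketch of a RayCluster resource. Only ``apiVersion``, ``kind``, ``metadata.name``, ``spec.podTypes``, ``minWorkers``, and the pod type names are taken from this document; the comments stand in for fields we omit, so treat the annotated ``example_cluster.yaml`` as the authoritative reference.

.. code-block:: yaml

  # Abridged sketch only -- see example_cluster.yaml for the full set of fields.
  apiVersion: cluster.ray.io/v1
  kind: RayCluster
  metadata:
    name: example-cluster
  spec:
    podTypes:
      # First entry: the head node pod type.
      - name: head-node
        # ... Kubernetes pod template for the head node ...
      # Second entry: the worker pod type; the operator keeps at least
      # minWorkers pods of this type running and scales up as needed.
      - name: worker-nodes
        minWorkers: 2
        # ... Kubernetes pod template for worker nodes ...
    # ... commands to start Ray on the head and worker pods ...

Listing the pods in our namespace confirms that the head node and two workers have been created:
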
.. code-block:: shell

  $ kubectl -n ray get pods
  NAME                               READY   STATUS    RESTARTS   AGE
  example-cluster-ray-head-hbxvv     1/1     Running   0          72s
  example-cluster-ray-worker-4hvv6   1/1     Running   0          64s
  example-cluster-ray-worker-78kp5   1/1     Running   0          64s
  ray-operator-pod                   1/1     Running   0          2m33s

We see four pods: the operator, the Ray head node, and two Ray worker nodes.

Let's launch another cluster in the same namespace, this one specifying ``minWorkers:1``.

.. code-block:: shell

  $ kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster2.yaml

We confirm that both clusters are running in our namespace.

.. code-block:: shell

  $ kubectl -n ray get rayclusters
  NAME               AGE
  example-cluster    12m
  example-cluster2   114s

  $ kubectl -n ray get pods
  NAME                                READY   STATUS    RESTARTS   AGE
  example-cluster-ray-head-th4wv      1/1     Running   0          10m
  example-cluster-ray-worker-q9pjn    1/1     Running   0          10m
  example-cluster-ray-worker-qltnp    1/1     Running   0          10m
  example-cluster2-ray-head-kj5mg     1/1     Running   0          10s
  example-cluster2-ray-worker-qsgnd   1/1     Running   0          1s
  ray-operator-pod                    1/1     Running   0          10m

Now we can :ref:`run Ray programs<ray-k8s-run>` on our Ray clusters.
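
As a quick sanity check that a cluster is usable, one option (a sketch only; substitute the head pod name from your own ``kubectl get pods`` output) is to exec into the head node and run a small Ray script there:

.. code-block:: shell

  $ kubectl -n ray exec example-cluster-ray-head-th4wv -- python -c \
      "import ray; ray.init(address='auto'); print(ray.cluster_resources())"
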
Monitoring
----------

Autoscaling logs are written to the operator pod's ``stdout`` and can be accessed with :code:`kubectl logs`.
Each line of output is prefixed by the name of the cluster followed by a colon.
The following command gets the last hundred lines of autoscaling logs for our second cluster.

.. code-block:: shell

  $ kubectl -n ray logs ray-operator-pod | grep ^example-cluster2: | tail -n 100

The output should include monitoring updates that look like this:

.. code-block:: shell

  example-cluster2:2020-12-12 13:55:36,814 DEBUG autoscaler.py:693 -- Cluster status: 1 nodes
  example-cluster2: - MostDelayedHeartbeats: {'172.17.0.4': 0.04093289375305176, '172.17.0.5': 0.04084634780883789}
  example-cluster2: - NodeIdleSeconds: Min=36 Mean=38 Max=41
  example-cluster2: - ResourceUsage: 0.0/2.0 CPU, 0.0/1.0 Custom1, 0.0/1.0 is_spot, 0.0 GiB/0.58 GiB memory, 0.0 GiB/0.1 GiB object_store_memory
  example-cluster2: - TimeSinceLastHeartbeat: Min=0 Mean=0 Max=0
  example-cluster2:Worker node types:
  example-cluster2: - worker-nodes: 1
  example-cluster2:2020-12-12 13:55:36,870 INFO resource_demand_scheduler.py:148 -- Cluster resources: [{'object_store_memory': 1.0, 'node:172.17.0.4': 1.0, 'memory': 5.0, 'CPU': 1.0}, {'object_store_memory': 1.0, 'is_spot': 1.0, 'memory': 6.0, 'node:172.17.0.5': 1.0, 'Custom1': 1.0, 'CPU': 1.0}]
  example-cluster2:2020-12-12 13:55:36,870 INFO resource_demand_scheduler.py:149 -- Node counts: defaultdict(<class 'int'>, {'head-node': 1, 'worker-nodes': 1})
  example-cluster2:2020-12-12 13:55:36,870 INFO resource_demand_scheduler.py:159 -- Placement group demands: []
  example-cluster2:2020-12-12 13:55:36,870 INFO resource_demand_scheduler.py:186 -- Resource demands: []
  example-cluster2:2020-12-12 13:55:36,870 INFO resource_demand_scheduler.py:187 -- Unfulfilled demands: []
  example-cluster2:2020-12-12 13:55:36,891 INFO resource_demand_scheduler.py:209 -- Node requests: {}
  example-cluster2:2020-12-12 13:55:36,903 DEBUG autoscaler.py:654 -- example-cluster2-ray-worker-tdxdr is not being updated and passes config check (can_update=True).
  example-cluster2:2020-12-12 13:55:36,923 DEBUG autoscaler.py:654 -- example-cluster2-ray-worker-tdxdr is not being updated and passes config check (can_update=True).
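
To stream these updates as they are produced, rather than sampling them after the fact, the standard ``kubectl logs -f`` flag works here as well, e.g.

.. code-block:: shell

  $ kubectl -n ray logs -f ray-operator-pod | grep --line-buffered ^example-cluster2:
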
Updating and Retrying
---------------------

To update a Ray cluster's configuration, edit the ``yaml`` file of the corresponding RayCluster resource
and apply it again:

.. code-block:: shell

  $ kubectl -n ray apply -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster.yaml

To force a restart with the same configuration, you can add an `annotation`_ to the RayCluster resource's ``metadata.annotations`` field, e.g.

.. code-block:: yaml

  apiVersion: cluster.ray.io/v1
  kind: RayCluster
  metadata:
    name: example-cluster
    annotations:
      try: again
  spec:
    ...

Then reapply the RayCluster, as above.

Currently, editing and reapplying a RayCluster resource will stop and restart Ray processes running on the corresponding
Ray cluster. Similarly, deleting and relaunching the operator pod will stop and restart Ray processes on all Ray clusters in the operator's namespace.
This behavior may be modified in future releases.
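
Because a reapply restarts Ray processes, it can be useful to watch the pods cycle after an update. Assuming the ``ray`` namespace used throughout this document, the standard ``-w`` (watch) flag is one way to do this:

.. code-block:: shell

  $ kubectl -n ray get pods -w
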
Cleaning Up
-----------

We shut down a Ray cluster by deleting the associated RayCluster resource.
Either of the next two commands will delete our second cluster ``example-cluster2``.

.. code-block:: shell

  $ kubectl -n ray delete raycluster example-cluster2
  # OR
  $ kubectl -n ray delete -f ray/python/ray/autoscaler/kubernetes/operator_configs/example_cluster2.yaml

The pods associated with ``example-cluster2`` go into ``Terminating`` status. In a few moments, we check that these pods are gone:

.. code-block:: shell

  $ kubectl -n ray get pods
  NAME                               READY   STATUS    RESTARTS   AGE
  example-cluster-ray-head-th4wv     1/1     Running   0          57m
  example-cluster-ray-worker-q9pjn   1/1     Running   0          56m
  example-cluster-ray-worker-qltnp   1/1     Running   0          56m
  ray-operator-pod                   1/1     Running   0          57m

Only the operator pod and the first cluster, ``example-cluster``, remain.

To finish cleaning up, we delete the cluster ``example-cluster`` and then the operator's resources.

.. code-block:: shell

  $ kubectl -n ray delete raycluster example-cluster
  $ kubectl -n ray delete -f ray/python/ray/autoscaler/kubernetes/operator_configs/operator.yaml

If you like, you can also delete the RayCluster custom resource definition.
(Using the operator again will then require reapplying the CRD.)

.. code-block:: shell

  $ kubectl delete crd rayclusters.cluster.ray.io
  # OR
  $ kubectl delete -f ray/python/ray/autoscaler/kubernetes/operator_configs/cluster_crd.yaml
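
Finally, if the ``ray`` namespace was created only for this walkthrough (as in the example above) and holds nothing else, you can remove it as well:

.. code-block:: shell

  $ kubectl delete namespace ray
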
.. _`Kubernetes Operator`: https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
.. _`Kubernetes Custom Resource`: https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/
.. _`Kubernetes Custom Resource Definition`: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/
.. _`annotation`: https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/#attaching-metadata-to-objects
.. _`permissions`: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
.. _`minikube`: https://minikube.sigs.k8s.io/docs/start/
.. _`namespaced`: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/