ray/doc/source/cluster/cluster_under_construction/ray-clusters-on-kubernetes/user-guides/autoscaling.md

Ignoring revisions in .git-blame-ignore-revs. Click here to bypass and see the normal blame view.

95 lines
5.7 KiB
Markdown
Raw Normal View History

(kuberay-autoscaler-discussion)=
# Autoscaling
This page discusses autoscaling in the context of Ray on Kubernetes.
For details on autoscaler configuration see the {ref}`configuration guide<kuberay-autoscaling-config>`.
(autoscaler-pro-con)=
## Should I enable autoscaling?
Ray Autoscaler support is optional.
Here are some considerations to keep in mind when choosing whether to use autoscaling.
### Autoscaling: Pros
**Cope with unknown resource requirements.** If you don't know how much compute your Ray
workload will require, autoscaling can adjust your Ray cluster to the right size.
**Save on costs.** Idle compute is automatically scaled down, potentially leading to cost savings,
especially when you are using expensive resources like GPUs.
### Autoscaling: Cons
**Less predictable when resource requirements are known.** If you already know exactly
how much compute your workload requires, it could make sense to provision a static Ray cluster
of the appropriate fixed size.
**Longer end-to-end runtime.** Autoscaling entails provisioning compute for Ray workers
while the Ray application is running. On the other hand, if you pre-provision a fixed
number of Ray workers, all of the Ray workers can be started in parallel, potentially reducing your application's
runtime.
## Ray Autoscaler vs. other autoscalers
We describe the relationship between the Ray autoscaler and other autoscalers in the Kubernetes
ecosystem.
### Ray Autoscaler vs. Horizontal Pod Autoscaler
The Ray autoscaler adjusts the number of Ray nodes in a Ray cluster.
On Kubernetes, each Ray node is run as a Kubernetes pod. Thus in the context of Kubernetes,
the Ray autoscaler scales Ray **pod quantities**. In this sense, the Ray autoscaler
plays a role similar to that of the Kubernetes
[Horizontal Pod Autoscaler](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) (HPA).
However, the following features distinguish the Ray Autoscaler from the HPA.
#### Load metrics are based on application semantics
The Horizontal Pod Autoscaler determines scale based on physical usage metrics like CPU
and memory. By contrast, the Ray autoscaler uses the logical resources expressed in
task and actor annotations. For instance, if each Ray container spec in your RayCluster CR indicates
a limit of 10 CPUs, and you submit twenty tasks annotated with `@ray.remote(num_cpus=5)`,
10 Ray pods will be created to satisfy the 100-CPU resource demand.
In this respect, the Ray autoscaler is similar to the
[Kuberentes Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler),
which makes scaling decisions based on the logical resources expressed in container
resource requests.
#### Fine-grained control of scale-down
To accommodate the statefulness of Ray applications, the Ray autoscaler has more
fine-grained control over scale-down than the Horizontal Pod Autoscaler. In addition to
determining desired scale, the Ray Autoscaler is able to select precisely which pods
to scale down. The KubeRay operator then deletes that pod.
By contrast, the Horizontal Pod Autoscaler can only decrease a replica count, without much
control over which pods are deleted. For a Ray application, downscaling a random
pod could be dangerous.
#### Architecture: One Ray Autoscaler per Ray Cluster.
Horizontal Pod Autoscaling is centrally controlled by a manager in the Kubernetes control plane;
the manager controls the scale of many Kubernetes objects.
By contrast, each Ray cluster is managed by its own Ray autoscaler process,
running as a sidecar container in the Ray head pod. This design choice is motivated
by the following considerations:
- **Scalability.** Autoscaling each Ray cluster requires processing a significant volume of resource
data from that Ray cluster.
- **Simplified versioning and compatibility.** The autoscaler and Ray are both developed
as part of the Ray repository. The interface between the autoscaler and the Ray core is complex.
To support multiple Ray clusters running at different Ray versions, it is thus best to match
Ray and Autoscaler code versions. Running one autoscaler per Ray cluster and matching the code versions
ensures compatibility.
### Ray Autoscaler with Kubernetes Cluster Autoscaler
The Ray Autoscaler and the
[Kubernetes Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler)
complement each other.
After the Ray autoscaler decides to create a Ray pod, the Kubernetes Cluster Autoscaler
can provision a Kubernetes node so that the pod can be placed.
Similarly, after the Ray autoscaler decides to delete an idle pod, the Kubernetes
Cluster Autoscaler can clean up the idle Kubernetes node that remains.
It is recommended to configure your RayCluster so that only one Ray pod fits per Kubernetes node.
If you follow this pattern, Ray Autoscaler pod scaling events will correspond roughly one-to-one with cluster autoscaler
node scaling events. (We say "roughly" because it is possible for a Ray pod be deleted and replaced
with a new Ray pod before the underlying Kubernetes node is scaled down.)
### Vertical Pod Autoscaler
There is no relationship between the Ray Autoscaler and the Kubernetes
[Vertical Pod Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler) (VPA),
which is meant to size individual pods to the appropriate size based on current and past usage.
If you find that the load on your individual Ray pods is too high, there are a number
of manual techniques to decrease the load.
One method is to schedule fewer tasks/actors per node by increasing the resource
requirements specified in the `ray.remote` annotation.
For example, changing `@ray.remote(num_cpus=2)` to `@ray.remote(num_cpus=4)`
will halve the quantity of that task or actor that can fit in a given Ray pod.