.. _deployment-guide:

Ray Deployment Guide
====================

This page provides an overview of how to deploy a multi-node Ray cluster, including how to:

* Launch the cluster.
* Set up the autoscaler.
* Monitor a multi-node cluster.
* Follow best practices for setting up a Ray cluster.


Launching a Ray cluster
-----------------------

There are two recommended ways of launching a Ray cluster:

1. :ref:`The cluster launcher <cluster-cloud>`
2. :ref:`The Kubernetes operator <Ray-operator>`

Cluster Launcher
^^^^^^^^^^^^^^^^

The goal of :ref:`the cluster launcher <cluster-cloud>` is to make it easy to deploy a Ray cluster on
any cloud. It will:

* Provision a new instance/machine using the cloud provider's SDK.
* Execute shell commands to set up Ray with the provided options.
* (Optionally) run any custom, user-defined setup commands.
* Initialize the Ray cluster.
* Deploy an autoscaler process.


Kubernetes Operator
^^^^^^^^^^^^^^^^^^^

The goal of the :ref:`Ray Kubernetes Operator <Ray-operator>` is to make it easy
to deploy a Ray cluster on an existing Kubernetes cluster.

To simplify Operator configuration, Ray provides :ref:`a Helm chart <Ray-helm>`.
Installing the Helm chart will create an Operator Deployment.
The Operator manages autoscaling Ray clusters; each Ray node runs in its own K8s Pod.

Autoscaling with Ray
--------------------


Ray is designed to support highly elastic workloads which are most efficient on
an autoscaling cluster. At a high level, the autoscaler attempts to
launch/terminate nodes in order to ensure that workloads have sufficient
resources to run, while minimizing the idle resources.

It does this by taking into consideration:

* User-specified hard limits (min/max workers).
* User-specified node types (nodes in a Ray cluster do *not* have to be
  homogeneous).
* Information from the Ray core's scheduling layer about the current resource
  usage/demands of the cluster.
* Programmatic autoscaling hints (see the sketch below).

Take a look at :ref:`the cluster reference <cluster-config>` to learn more
about configuring the autoscaler.

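
One way to provide programmatic hints is ``ray.autoscaler.sdk.request_resources``.
The snippet below is a minimal sketch, assuming it runs from a node that is
already part of the cluster; the resource amounts are illustrative, not
recommendations:

.. code-block:: python

    import ray
    from ray.autoscaler.sdk import request_resources

    ray.init(address="auto")

    # Ask the autoscaler to scale the cluster to accommodate 64 CPUs worth of
    # work, independent of the currently queued tasks.
    request_resources(num_cpus=64)

    # ... submit a burst of work here ...

    # Clear the hint so the cluster can scale back down when idle.
    request_resources(num_cpus=0)
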

How does it work?
^^^^^^^^^^^^^^^^^

The Ray Cluster Launcher will automatically enable a load-based autoscaler. The
autoscaler resource demand scheduler will look at the resource demands of
pending tasks, actors, and placement groups in the cluster, and try to add the
minimum list of nodes that can fulfill these demands. When worker nodes are
idle for more than :ref:`idle_timeout_minutes
<cluster-configuration-idle-timeout-minutes>`, they will be removed (the head
node is never removed unless the cluster is torn down).

The autoscaler uses a simple bin-packing algorithm to pack the user demands into
the available cluster resources. The remaining unfulfilled demands are placed
on the smallest list of nodes that satisfies the demand while maximizing
utilization (starting from the smallest node).


**Here is "A Glimpse into the Ray Autoscaler" and how to debug/monitor your cluster:**

2021-19-01 by Ameer Haj-Ali, Anyscale Inc.

.. youtube:: BJ06eJasdu4

Deploying an application
------------------------

The recommended way of connecting to a Ray cluster is to use the
``ray.init("ray://<host>:<port>")`` API and connect via the Ray Client.


.. note::

  Using ``ray.init("ray://<host>:<port>")`` is generally a best practice because it allows
  you to test your code locally, and deploy to a cluster with **no code
  changes**.

To connect via Ray Client, set the ``RAY_ADDRESS`` environment variable to the
address of the Ray client server.

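
For example, a driver script can connect and run a remote task as follows.
This is a minimal sketch; replace ``<host>`` with your head node's address, and
note that ``10001`` is only the default Ray Client server port:

.. code-block:: python

    import ray

    # Connect via the Ray Client server running on the cluster's head node.
    ray.init("ray://<host>:10001")

    @ray.remote
    def square(x):
        return x * x

    # The tasks execute on the cluster, not on the local machine.
    print(ray.get([square.remote(i) for i in range(4)]))
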

:ref:`Learn more about setting up the Ray client server here <Ray-client>`.

.. note::

  When deploying an application, the job will be killed if the driver
  disconnects.

  :ref:`A detached actor <actor-lifetimes>` can be used to avoid having a long-running driver.

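
A rough sketch of that pattern (the actor name and workload below are
illustrative placeholders):

.. code-block:: python

    import ray

    ray.init("ray://<host>:10001")

    @ray.remote
    class JobRunner:
        def run(self):
            # Long-running work happens inside the actor, not the driver.
            return "done"

    # A named, detached actor outlives the driver that created it.
    runner = JobRunner.options(name="job_runner", lifetime="detached").remote()
    runner.run.remote()

    # The driver can now exit; the actor keeps running on the cluster.
    # Another driver (in the same namespace) can look it up later with
    # ray.get_actor("job_runner").
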

Monitoring and observability
----------------------------

Ray comes with 3 main observability features:

1. :ref:`The dashboard <Ray-dashboard>`
2. :ref:`ray status <monitor-cluster>`
3. :ref:`Prometheus metrics <multi-node-metrics>`

Monitoring the cluster via the dashboard
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:ref:`The dashboard provides detailed information about the state of the cluster <Ray-dashboard>`,
including the running jobs, actors, workers, nodes, etc.

By default, the cluster launcher and operator will launch the dashboard, but
not publicly expose it.

If you launch your application via the cluster launcher, you can securely
port-forward local traffic to the dashboard via the ``ray dashboard`` command
(which establishes an SSH tunnel). The dashboard will then be visible at
``http://localhost:8265``.


The Kubernetes Operator makes the dashboard available via a Service targeting the Ray head pod.
You can :ref:`access the dashboard <ray-k8s-dashboard>` using ``kubectl port-forward``.

Observing the autoscaler
^^^^^^^^^^^^^^^^^^^^^^^^

The autoscaler makes decisions based on scheduling information and programmatic
information from the cluster. This information, along with the status of
starting nodes, can be accessed via the ``ray status`` command.


To dump the current state of a cluster launched via the cluster launcher, you
can run ``ray exec cluster.yaml "ray status"``.

For a more "live" monitoring experience, it is recommended that you run ``ray
status`` in a watch loop: ``ray exec cluster.yaml "watch -n 1 ray status"``.

With the Kubernetes operator, you should replace ``ray exec cluster.yaml`` with
``kubectl exec <head node pod>``.

Prometheus metrics
^^^^^^^^^^^^^^^^^^

Ray is capable of producing Prometheus metrics. When enabled, Ray produces some
metrics about the Ray core and some internal metrics by default. It also
supports custom, user-defined metrics.

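
A sketch of what a custom, user-defined metric can look like, using
``ray.util.metrics`` (the metric names and the "work" are illustrative):

.. code-block:: python

    import time

    import ray
    from ray.util.metrics import Counter, Gauge

    ray.init(address="auto")

    @ray.remote
    class RequestHandler:
        def __init__(self):
            # These are exported alongside Ray's built-in metrics.
            self.num_requests = Counter(
                "num_requests_total",
                description="Total requests handled by this actor.",
            )
            self.last_latency = Gauge(
                "last_request_latency_s",
                description="Latency of the most recent request in seconds.",
            )

        def handle(self, payload):
            start = time.time()
            result = len(payload)  # placeholder work
            self.num_requests.inc()
            self.last_latency.set(time.time() - start)
            return result

    handler = RequestHandler.remote()
    ray.get(handler.handle.remote("hello"))
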

These metrics can be consumed by any metrics infrastructure which can ingest
metrics from the Prometheus server on the head node of the cluster.

:ref:`Learn more about setting up Prometheus here. <multi-node-metrics>`

Best practices for deploying large clusters
-------------------------------------------

This section aims to document best practices for deploying Ray clusters at
large scale.

Networking configuration
^^^^^^^^^^^^^^^^^^^^^^^^

End users should only need to directly interact with the head node of the
cluster. In particular, there are two services which should be exposed to users:

1. The dashboard
2. The Ray client server


.. note::

  While users only need two ports to connect to a cluster, the nodes within a
  cluster require a much wider range of ports to communicate.

  See :ref:`Ray port configuration <Ray-ports>` for a comprehensive list.

  Applications (such as :ref:`Ray Serve <Rayserve>`) may also require
  additional ports to work properly.

System configuration
^^^^^^^^^^^^^^^^^^^^

There are a few system-level configurations that should be set when using Ray
at a large scale.

* Make sure ``ulimit -n`` is set to at least 65535. Ray opens many direct
  connections between worker processes to avoid bottlenecks, so it can quickly
  use a large number of file descriptors.
* Make sure ``/dev/shm`` is sufficiently large. Most ML/RL applications rely
  heavily on the plasma store. By default, Ray will try to use ``/dev/shm`` for
  the object store, but if it is not large enough (i.e. ``--object-store-memory``
  > size of ``/dev/shm``), Ray will write the plasma store to disk instead, which
  may cause significant performance problems (see the sketch after this list).
* Use NVMe SSDs (or other high performance storage) if possible. If
  :ref:`object spilling <object-spilling>` is enabled, Ray will spill objects to
  disk if necessary. This is most commonly needed for data processing
  workloads.

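
A rough sketch of sizing the object store to fit inside ``/dev/shm`` when
starting Ray from Python (the 90% headroom factor is an illustrative
assumption, not a Ray default):

.. code-block:: python

    import os

    import ray

    # Total size of /dev/shm in bytes.
    stats = os.statvfs("/dev/shm")
    shm_bytes = stats.f_frsize * stats.f_blocks

    # Keep the object store below the shared-memory capacity so Ray does not
    # fall back to writing the plasma store to disk.
    ray.init(object_store_memory=int(shm_bytes * 0.9))
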

Configuring the head node
^^^^^^^^^^^^^^^^^^^^^^^^^

In addition to the above changes, when deploying a large cluster, Ray's
architecture means that the head node will have extra stress due to the GCS.

* Make sure the head node has sufficient bandwidth. The most heavily stressed
  resource on the head node is outbound bandwidth. For large clusters (see the
  scalability envelope), we recommend using machines with networking characteristics
  at least as good as an r5dn.16xlarge on AWS EC2.
* Set ``resources: {"CPU": 0}`` on the head node. (For Ray clusters deployed using Helm,
  set ``rayResources: {"CPU": 0}``.) Due to the heavy networking
  load (and the GCS and Redis processes), we recommend setting the number of
  CPUs to 0 on the head node to avoid scheduling additional tasks on it.


Configuring the autoscaler
^^^^^^^^^^^^^^^^^^^^^^^^^^

For large, long-running clusters, there are a few parameters that can be tuned.

* Ensure your quotas for node types are set correctly.
* For long-running clusters, set the ``AUTOSCALER_MAX_NUM_FAILURES`` environment
  variable to a large number (or ``inf``) to avoid unexpected autoscaler
  crashes. The variable can be set by prepending ``export AUTOSCALER_MAX_NUM_FAILURES=inf;``
  to the head node's Ray start command.
  (Note: you may want a separate mechanism to detect if the autoscaler
  errors too often.)
* For large clusters, consider tuning ``upscaling_speed`` for faster
  autoscaling.


Picking nodes
^^^^^^^^^^^^^

Here are some tips for how to set your ``available_node_types`` for a cluster,
using AWS instance types as a concrete example.

General recommendations with AWS instance types:

**When to use GPUs**

* If you’re using some RL/ML framework
* You’re doing something with TensorFlow/PyTorch/JAX (some framework that can
  leverage GPUs well)

**What type of GPU?**

* The latest-generation GPU is almost always the best bang for your buck (p3 > p2, g4
  > g3); for most well-designed applications the performance outweighs the
  price (the instance price may be higher, but you’ll use the instance for less
  time).
* You may want to consider using older instances if you’re doing dev work and
  won’t actually fully utilize the GPUs though.
* If you’re doing training (ML or RL), you should use a P instance. If you’re
  doing inference, you should use a G instance. The difference is the
  processing:VRAM ratio (training requires more memory).


**What type of CPU?**

* Again, stick to the latest generation; they’re typically cheaper and faster.
* When in doubt, use M instances; they typically have the highest
  availability.
* If you know your application is memory intensive (memory utilization is full,
  but CPU is not), go with an R instance.
* If you know your application is CPU intensive, go with a C instance.
* If you have a big cluster, make the head node an instance with an n (r5dn or
  c5n).

**How many CPUs/GPUs?**

* Focus on your CPU:GPU ratio first and look at the utilization (the Ray dashboard
  should help with this). If your CPU utilization is low, add GPUs, or vice
  versa.
* The exact ratio will be very dependent on your workload.
* Once you find a good ratio, you should be able to scale up and keep the
  same ratio.
* You can’t scale infinitely. Eventually, as you add more machines your
  performance improvements will become sub-linear/not worth it. There may not
  be a good one-size-fits-all strategy at this point.

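
The ratio your cluster needs is ultimately driven by the resources your tasks
and actors request. A minimal sketch of declaring those requests (the specific
numbers are illustrative, not recommendations):

.. code-block:: python

    import ray

    ray.init(address="auto")

    # CPU-only preprocessing task: contributes to the cluster's CPU demand.
    @ray.remote(num_cpus=1)
    def preprocess(batch):
        return batch

    # GPU-bound actor: contributes to the cluster's GPU demand.
    @ray.remote(num_cpus=2, num_gpus=1)
    class Trainer:
        def train_step(self, batch):
            return len(batch)

    # The dashboard and ray status output report utilization against these
    # requests, which is what to watch when tuning your CPU:GPU ratio.
    trainer = Trainer.remote()
    batches = [preprocess.remote(list(range(8))) for _ in range(4)]
    ray.get(trainer.train_step.remote(ray.get(batches[0])))
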

.. note::

  If you're using RLlib, check out :ref:`the RLlib scaling guide
  <rllib-scaling-guide>` for RLlib-specific recommendations.