.. include:: /_includes/clusters/we_are_hiring.rst

Best practices for deploying large clusters
-------------------------------------------

This section aims to document best practices for deploying Ray clusters at
large scale.

Networking configuration
^^^^^^^^^^^^^^^^^^^^^^^^

End users should only need to directly interact with the head node of the
cluster. In particular, there are 2 services which should be exposed to users:

1. The dashboard
2. The Ray client server

.. note::

  While users only need 2 ports to connect to a cluster, the nodes within a
  cluster require a much wider range of ports to communicate.

  See :ref:`Ray port configuration <Ray-ports>` for a comprehensive list.

Applications (such as :ref:`Ray Serve <Rayserve>`) may also require
additional ports to work properly.
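
The two user-facing services listen on well-known ports by default: the
dashboard on ``8265`` and the Ray client server on ``10001``. As a hedged
sketch (not part of the original guide), a cluster launcher config could pin
these ports explicitly via standard ``ray start`` flags so that only these two
ports need to be reachable from outside the cluster; adapt the flags to your
own environment.

.. code-block:: yaml

  # Hedged sketch: pin the two user-facing services to their default ports.
  head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --dashboard-host=0.0.0.0 --dashboard-port=8265 --ray-client-server-port=10001 --autoscaling-config=~/ray_bootstrap_config.yaml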

System configuration
^^^^^^^^^^^^^^^^^^^^

There are a few system-level configurations that should be set when using Ray
at a large scale.

* Make sure ``ulimit -n`` is set to at least 65535. Ray opens many direct
  connections between worker processes to avoid bottlenecks, so it can quickly
  use a large number of file descriptors (see the sketch after this list).
* Make sure ``/dev/shm`` is sufficiently large. Most ML/RL applications rely
  heavily on the plasma store. By default, Ray will try to use ``/dev/shm`` for
  the object store, but if it is not large enough (i.e. ``--object-store-memory``
  > size of ``/dev/shm``), Ray will write the plasma store to disk instead, which
  may cause significant performance problems.
* Use NVMe SSDs (or other high performance storage) if possible. If
  :ref:`object spilling <object-spilling>` is enabled, Ray will spill objects to
  disk if necessary. This is most commonly needed for data processing
  workloads.
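
For cluster-launcher deployments, one place to apply these settings is in the
node setup and start commands. The snippet below is a rough sketch, assuming a
YAML cluster launcher config; the exact commands and the object store size
depend on your machine images (the 100 GB value is purely illustrative).

.. code-block:: yaml

  # Hedged sketch: raise the file-descriptor limit and size the object store
  # explicitly so that it fits inside /dev/shm.
  setup_commands:
    # Persist a higher open-files limit for later sessions (assumes sudo).
    - sudo bash -c 'echo "* soft nofile 65535" >> /etc/security/limits.conf; echo "* hard nofile 65535" >> /etc/security/limits.conf'
  head_start_ray_commands:
    - ray stop
    # Apply the limit in the shell that launches Ray, and cap the object store
    # (illustrative value; keep it at or below the size of /dev/shm).
    - ulimit -n 65535; ray start --head --port=6379 --object-store-memory=100000000000 --autoscaling-config=~/ray_bootstrap_config.yaml
  worker_start_ray_commands:
    - ray stop
    - ulimit -n 65535; ray start --address=$RAY_HEAD_IP:6379 --object-store-memory=100000000000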

Configuring the head node
^^^^^^^^^^^^^^^^^^^^^^^^^

In addition to the above changes, when deploying a large cluster, Ray's
architecture means that the head node will be under extra stress due to the GCS.

* Make sure the head node has sufficient bandwidth. The most heavily stressed
  resource on the head node is outbound bandwidth. For large clusters (see the
  scalability envelope), we recommend using machines with networking characteristics
  at least as good as an r5dn.16xlarge on AWS EC2.
* Set ``resources: {"CPU": 0}`` on the head node, as shown in the sketch below.
  (For Ray clusters deployed using Helm, set ``rayResources: {"CPU": 0}``.)
  Due to the heavy networking load (and the GCS and dashboard processes), we
  recommend setting the number of CPUs to 0 on the head node to avoid
  scheduling additional tasks on it.
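
With the cluster launcher, ``resources`` are set per node type in
``available_node_types``. A hypothetical head node entry along these lines
(the node type name and instance type are illustrative):

.. code-block:: yaml

  # Hedged sketch: a network-optimized head node that advertises zero CPUs so
  # the scheduler never places tasks or actors on it.
  available_node_types:
    head_node_type:
      node_config:
        InstanceType: r5dn.16xlarge
      resources: {"CPU": 0}
      max_workers: 0
  head_node_type: head_node_type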

Configuring the autoscaler
^^^^^^^^^^^^^^^^^^^^^^^^^^

For large, long-running clusters, there are a few parameters that can be tuned.

* Ensure your quotas for node types are set correctly.
* For long-running clusters, set the ``AUTOSCALER_MAX_NUM_FAILURES`` environment
  variable to a large number (or ``inf``) to avoid unexpected autoscaler
  crashes. The variable can be set by prepending \ ``export AUTOSCALER_MAX_NUM_FAILURES=inf;``
  to the head node's Ray start command, as sketched below.
  (Note: you may want a separate mechanism to detect if the autoscaler
  errors too often.)
* For large clusters, consider tuning ``upscaling_speed`` for faster
  autoscaling.
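
Both settings live in the cluster config: ``upscaling_speed`` is a top-level
field, and the environment variable can be prepended to the head node's start
command. A hedged sketch (values are illustrative):

.. code-block:: yaml

  # Hedged sketch: faster upscaling and an effectively unlimited autoscaler
  # failure budget for a long-running cluster.
  upscaling_speed: 2.0
  head_start_ray_commands:
    - ray stop
    - export AUTOSCALER_MAX_NUM_FAILURES=inf; ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml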

Picking nodes
^^^^^^^^^^^^^

Here are some tips for how to set your ``available_node_types`` for a cluster,
using AWS instance types as a concrete example. A sketch of a config that
follows these tips appears at the end of this section.

General recommendations with AWS instance types:

**When to use GPUs**

* If you’re using some RL/ML framework
* If you’re doing something with tensorflow/pytorch/jax (some framework that can
  leverage GPUs well)

**What type of GPU?**

* The latest gen GPU is almost always the best bang for your buck (p3 > p2, g4
  > g3); for most well designed applications the performance outweighs the
  price (the instance price may be higher, but you’ll use the instance for less
  time).
* You may want to consider using older instances if you’re doing dev work and
  won’t actually fully utilize the GPUs though.
* If you’re doing training (ML or RL), you should use a P instance. If you’re
  doing inference, you should use a G instance. The difference is the
  processing:VRAM ratio (training requires more memory).

**What type of CPU?**

* Again, stick to the latest generation; they’re typically cheaper and faster.
* When in doubt, use M instances; they typically have the highest
  availability.
* If you know your application is memory intensive (memory utilization is full,
  but CPU is not), go with an R instance.
* If you know your application is CPU intensive, go with a C instance.
* If you have a big cluster, make the head node an instance with an n (r5dn or
  c5n).

**How many CPUs/GPUs?**

* Focus on your CPU:GPU ratio first and look at the utilization (the Ray dashboard
  should help with this). If your CPU utilization is low, add GPUs, or vice
  versa.
* The exact ratio will be very dependent on your workload.
* Once you find a good ratio, you should be able to scale up and keep the
  same ratio.
* You can’t scale infinitely. Eventually, as you add more machines, your
  performance improvements will become sub-linear/not worth it. There may not
  be a good one-size-fits-all strategy at this point.
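
Putting these tips together, a hypothetical ``available_node_types`` section
for an AWS cluster might mix a general-purpose CPU worker type with a GPU
worker type alongside a network-optimized head node; the instance choices and
worker counts below are illustrative only.

.. code-block:: yaml

  # Hedged sketch: illustrative AWS node types following the guidance above.
  available_node_types:
    head_node_type:
      node_config:
        InstanceType: r5dn.16xlarge   # network-optimized ("n") head node
      resources: {"CPU": 0}           # keep tasks off the head node
      max_workers: 0
    cpu_worker:
      node_config:
        InstanceType: m5.16xlarge     # general-purpose M instance
      min_workers: 0
      max_workers: 10
    gpu_worker:
      node_config:
        InstanceType: p3.8xlarge      # P instance, suited to training
      min_workers: 0
      max_workers: 4
  head_node_type: head_node_type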

.. note::

  If you're using RLlib, check out :ref:`the RLlib scaling guide
  <rllib-scaling-guide>` for RLlib-specific recommendations.