Resources (CPUs, GPUs)
======================

This document describes how resources are managed in Ray. Each node in a Ray
cluster knows its own resource capacities, and each task specifies its
resource requirements.

CPUs and GPUs
-------------

The Ray backend includes built-in support for CPUs and GPUs.

Specifying a node's resource capacities
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To specify a node's resource capacities from the command line, pass the
``--num-cpus`` and ``--num-gpus`` flags into ``ray start``.

.. code-block:: bash

    # To start a head node.
    ray start --head --num-cpus=8 --num-gpus=1

    # To start a non-head node.
    ray start --redis-address=<redis-address> --num-cpus=4 --num-gpus=2

To specify a node's resource capacities when the Ray processes are all started
through ``ray.init``, do the following.

.. code-block:: python

    ray.init(num_cpus=8, num_gpus=1)

If the number of CPUs is unspecified, Ray will automatically determine the
number by running ``psutil.cpu_count()``. If the number of GPUs is
unspecified, Ray will attempt to automatically detect the number of GPUs.

Specifying a task's CPU and GPU requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To specify a task's CPU and GPU requirements, pass the ``num_cpus`` and
``num_gpus`` arguments into the remote decorator.

.. code-block:: python

    @ray.remote(num_cpus=4, num_gpus=2)
    def f():
        return 1

The ``f`` tasks will be scheduled on machines that have at least 4 CPUs and
2 GPUs, and when one of the ``f`` tasks executes, 4 CPUs and 2 GPUs will be
reserved for that task. The IDs of the GPUs that are reserved for the task can
be accessed with ``ray.get_gpu_ids()``, and Ray will automatically set the
environment variable ``CUDA_VISIBLE_DEVICES`` for that process.

These resources will be released when the task finishes executing. However,
the CPU resources (but not the GPU resources) will be temporarily released if
the task gets blocked in a call to ``ray.get``. For example, consider the
following remote function.

.. code-block:: python

    @ray.remote(num_cpus=1, num_gpus=1)
    def g():
        return ray.get(f.remote())

When a ``g`` task is executing, it will release its CPU resources when it gets
blocked in the call to ``ray.get``. It will reacquire the CPU resources when
``ray.get`` returns. It will retain its GPU resources throughout the lifetime
of the task because the task will most likely continue to use GPU memory.

To specify that an **actor** requires GPUs, do the following.

.. code-block:: python

    @ray.remote(num_gpus=1)
    class Actor(object):
        pass

When an ``Actor`` instance is created, it will be placed on a node that has at
least 1 GPU, and the GPU will be reserved for the actor for the duration of
the actor's lifetime (even if the actor is not executing tasks). The GPU
resources will be released when the actor terminates. Note that currently
**only GPU resources are used for actor placement**.

Custom Resources
----------------

While Ray has built-in support for CPUs and GPUs, nodes can be started with
arbitrary custom resources. **All custom resources behave like GPUs.**

A node can be started with some custom resources as follows.

.. code-block:: bash

    ray start --head --resources='{"Resource1": 4, "Resource2": 16}'

When starting Ray through ``ray.init``, it can be done as follows.

.. code-block:: python

    ray.init(resources={'Resource1': 4, 'Resource2': 16})

To require custom resources in a task, specify the requirements in the remote
decorator.

.. code-block:: python

    @ray.remote(resources={'Resource2': 1})
    def f():
        return 1
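Putting these pieces together, the following is a minimal sketch (assuming a
single machine with at least one GPU, and using an arbitrary custom resource
name ``Resource1`` chosen here purely for illustration). It starts Ray through
``ray.init``, declares a task that requires one CPU, one GPU, and one unit of
``Resource1``, and inspects ``ray.get_gpu_ids()`` and ``CUDA_VISIBLE_DEVICES``
from inside that task.

.. code-block:: python

    import os

    import ray

    # Assumed setup for illustration: one machine with 4 CPUs, 1 GPU, and
    # 2 units of a custom resource named "Resource1".
    ray.init(num_cpus=4, num_gpus=1, resources={'Resource1': 2})

    # This task needs 1 CPU, 1 GPU, and 1 unit of "Resource1" to be scheduled.
    @ray.remote(num_cpus=1, num_gpus=1, resources={'Resource1': 1})
    def use_gpu():
        # The IDs of the GPUs reserved for this task.
        gpu_ids = ray.get_gpu_ids()
        # Ray sets CUDA_VISIBLE_DEVICES for the worker process so that
        # frameworks such as TensorFlow only see the reserved GPUs.
        visible = os.environ.get('CUDA_VISIBLE_DEVICES')
        return gpu_ids, visible

    print(ray.get(use_gpu.remote()))

The returned GPU IDs and the value of ``CUDA_VISIBLE_DEVICES`` should agree,
since Ray derives the environment variable from the IDs it reserves for the
task.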
Current Limitations
-------------------

We are working to remove the following limitations.

- **Actor Resource Requirements:** Currently only GPUs are used to determine
  actor placement.
- **Recovering from Bad Scheduling:** Currently Ray does not recover from poor
  scheduling decisions. For example, suppose there are two GPUs (on separate
  machines) in the cluster and we wish to run two GPU tasks. There are
  scenarios in which both tasks can be accidentally scheduled on the same
  machine, which will result in poor load balancing.