GPU Support
===========

GPUs are critical for many machine learning applications. Ray enables remote
functions and actors to specify their GPU requirements in the ``ray.remote``
decorator.

Starting Ray with GPUs
----------------------

Ray will automatically detect the number of GPUs available on a machine.
If you need to, you can override this by specifying ``ray.init(num_gpus=N)`` or
``ray start --num-gpus=N``.
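
As a minimal sketch of the override (the value ``num_gpus=2`` is arbitrary),
you can check what Ray believes the machine has via ``ray.cluster_resources()``:

.. code-block:: python

    import ray

    # Pretend this machine has two GPUs, regardless of what auto-detection
    # would report.
    ray.init(num_gpus=2)

    # The "GPU" entry now reflects the override, e.g. {..., 'GPU': 2.0}.
    print(ray.cluster_resources())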

**Note:** There is nothing preventing you from passing in a larger value of
``num_gpus`` than the true number of GPUs on the machine. In this case, Ray will
act as if the machine has the number of GPUs you specified for the purposes of
scheduling tasks that require GPUs. Trouble will only occur if those tasks
attempt to actually use GPUs that don't exist.

Using Remote Functions with GPUs
--------------------------------

If a remote function requires GPUs, indicate the number of required GPUs in the
remote decorator.

.. code-block:: python

    import os

    import ray

    @ray.remote(num_gpus=1)
    def use_gpu():
        print("ray.get_gpu_ids(): {}".format(ray.get_gpu_ids()))
        print("CUDA_VISIBLE_DEVICES: {}".format(os.environ["CUDA_VISIBLE_DEVICES"]))

Inside the remote function, a call to ``ray.get_gpu_ids()`` will return a
list of strings indicating which GPUs the remote function is allowed to use.
Typically, it is not necessary to call ``ray.get_gpu_ids()`` because Ray will
automatically set the ``CUDA_VISIBLE_DEVICES`` environment variable.
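
A brief sketch of driving the task above (assuming at least one GPU is
available, or that ``num_gpus`` was overridden as shown earlier):

.. code-block:: python

    ray.init(num_gpus=1)

    # Blocks until the task has run; the prints appear in the worker's
    # log output.
    ray.get(use_gpu.remote())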

**Note:** The function ``use_gpu`` defined above doesn't actually use any
GPUs. Ray will schedule it on a machine that has at least one GPU and will
reserve one GPU for it while it is being executed. However, it is up to the
function to actually make use of the GPU. This is typically done through an
external library like TensorFlow. Here is an example that actually uses GPUs.
Note that for this example to work, you will need to install the GPU version of
TensorFlow.

.. code-block:: python

    import ray
    import tensorflow as tf

    @ray.remote(num_gpus=1)
    def use_gpu():
        # Create a TensorFlow session. TensorFlow will restrict itself to the
        # GPUs specified by the CUDA_VISIBLE_DEVICES environment variable.
        tf.Session()

**Note:** It is certainly possible for the person implementing ``use_gpu`` to
ignore ``ray.get_gpu_ids()`` and use all of the GPUs on the machine. Ray does
not prevent this from happening, and it can lead to too many workers using the
same GPU at the same time. However, Ray does automatically set the
``CUDA_VISIBLE_DEVICES`` environment variable, which will restrict the GPUs used
by most deep learning frameworks.
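
If you call into code that does not respect ``CUDA_VISIBLE_DEVICES``, you can
verify the assignment yourself. A minimal sketch (the name ``well_behaved``
and the ``assert`` are purely illustrative):

.. code-block:: python

    import os

    import ray

    @ray.remote(num_gpus=1)
    def well_behaved():
        # Ray sets CUDA_VISIBLE_DEVICES to match ray.get_gpu_ids(), so the
        # two views of the GPU assignment should agree.
        assert os.environ["CUDA_VISIBLE_DEVICES"] == ",".join(ray.get_gpu_ids())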

Fractional GPUs
---------------

If you want two tasks to share the same GPU, then the tasks can each request
half (or some other fraction) of a GPU.

.. code-block:: python

    import time

    import ray

    ray.init(num_cpus=4, num_gpus=1)

    @ray.remote(num_gpus=0.25)
    def f():
        time.sleep(1)

    # The four tasks created here can execute concurrently.
    ray.get([f.remote() for _ in range(4)])

It is the developer's responsibility to make sure that the individual tasks
don't use more than their share of the GPU memory. TensorFlow can be configured
to limit its memory usage.
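
For example, with the TF1-style API used elsewhere on this page, one sketch of
capping a task at a quarter of the GPU's memory (matching its ``num_gpus=0.25``
request; the name ``fractional_task`` is hypothetical) is:

.. code-block:: python

    import tensorflow as tf

    @ray.remote(num_gpus=0.25)
    def fractional_task():
        # Cap this process at roughly a quarter of the GPU's memory so that
        # four such tasks can share one device.
        config = tf.ConfigProto()
        config.gpu_options.per_process_gpu_memory_fraction = 0.25
        sess = tf.Session(config=config)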

Using Actors with GPUs
----------------------

When defining an actor that uses GPUs, indicate the number of GPUs an actor
instance requires in the ``ray.remote`` decorator.

.. code-block:: python

    @ray.remote(num_gpus=1)
    class GPUActor(object):
        def __init__(self):
            print("This actor is allowed to use GPUs {}.".format(ray.get_gpu_ids()))

When the actor is created, GPUs will be reserved for that actor for the lifetime
of the actor. If sufficient GPU resources are not available, then the actor will
not be created.
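
A brief sketch of creating and releasing such an actor (the handle name
``gpu_actor`` is arbitrary):

.. code-block:: python

    ray.init(num_gpus=1)

    # Creating the actor reserves one GPU for as long as the actor lives.
    gpu_actor = GPUActor.remote()

    # Dropping the last handle lets Ray terminate the actor and release
    # the GPU.
    del gpu_actor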

The following is an example of how to use GPUs in an actor through TensorFlow.

.. code-block:: python

    import tensorflow as tf

    @ray.remote(num_gpus=1)
    class GPUActor(object):
        def __init__(self):
            # The call to tf.Session() will restrict TensorFlow to use the GPUs
            # specified in the CUDA_VISIBLE_DEVICES environment variable.
            self.sess = tf.Session()

Workers not Releasing GPU Resources
-----------------------------------

**Note:** Currently, when a worker executes a task that uses a GPU (e.g.,
through TensorFlow), the task may allocate memory on the GPU and may not release
it when the task finishes executing. This can lead to problems the next time a
task tries to use the same GPU. You can address this by setting ``max_calls=1``
in the remote decorator so that the worker automatically exits after executing
the task (thereby releasing the GPU resources).

.. code-block:: python

    import ray
    import tensorflow as tf

    @ray.remote(num_gpus=1, max_calls=1)
    def leak_gpus():
        # This task will allocate memory on the GPU and then never release it,
        # so we include the max_calls argument to kill the worker and release
        # the resources.
        sess = tf.Session()
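
With ``max_calls=1``, each invocation gets a fresh worker process, so any GPU
memory leaked by a previous call is reclaimed when that call's worker exits.
For example:

.. code-block:: python

    # Each of these calls runs in its own short-lived worker.
    ray.get([leak_gpus.remote() for _ in range(2)])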