[Serve] Add Instructions for GPU (#8495)

This commit is contained in:
Simon Mo 2020-05-19 18:33:58 -07:00 committed by GitHub
parent 1163ddbe45
commit c9c84c87f4
3 changed files with 27 additions and 3 deletions


@@ -64,6 +64,7 @@ Any method of the actor can return multiple object IDs with the ``ray.method`` d
assert ray.get(obj_id1) == 1
assert ray.get(obj_id2) == 2

.. _actor-resource-guide:

Resources with Actors
---------------------


@@ -50,14 +50,16 @@ If using the command line, connect to the Ray cluster as follows:
# Connect to Ray. Note that if connecting to an existing cluster, you don't specify resources.
ray.init(address=<address>)
.. _omp-num-thread-note:

.. note::
    Ray sets the environment variable ``OMP_NUM_THREADS=1`` by default. This is done
    to avoid performance degradation with many workers (issue #6998). You can
    override this by explicitly setting ``OMP_NUM_THREADS``. ``OMP_NUM_THREADS`` is
    commonly used by numpy, PyTorch, and TensorFlow to perform multi-threaded
    linear algebra. In a multi-worker setting, we want one thread per worker
    instead of many threads per worker to avoid contention.
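For instance, to restore multi-threaded linear algebra in a script, you can set the variable before the libraries that read it are imported. A minimal sketch; the thread count of ``8`` is an arbitrary illustration, not a recommendation:

```python
import os

# OMP_NUM_THREADS must be set before the libraries that read it (numpy,
# PyTorch, TensorFlow) initialize their threading layers, so set it at
# the very top of the script, before those imports.
os.environ["OMP_NUM_THREADS"] = "8"  # example value; pick per workload

# Any subsequent `import numpy` in this process will now use up to 8
# OpenMP threads for linear algebra instead of Ray's default of 1.
print(os.environ["OMP_NUM_THREADS"])
```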
Logging and Debugging
---------------------


@@ -192,6 +192,27 @@ To scale out a backend to multiple workers, simply configure the number of replicas.
This will scale out the number of workers that can accept requests.
Using Resources (CPUs, GPUs)
++++++++++++++++++++++++++++
To assign hardware resources per worker, you can pass resource requirements to
``ray_actor_options``. To learn about the options you can pass in, take a look at the
:ref:`Resources with Actors <actor-resource-guide>` guide.

For example, to create a backend where each replica uses a single GPU, you can do the
following:
.. code-block:: python

    options = {"num_gpus": 1}
    serve.create_backend("my_gpu_backend", handle_request, ray_actor_options=options)
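Inside a replica scheduled with ``num_gpus``, Ray restricts GPU visibility by setting ``CUDA_VISIBLE_DEVICES`` for that worker, so frameworks like PyTorch only see the assigned device. A sketch of what a backend function might check; the handler below is a hypothetical illustration, not part of the Serve API:

```python
import os

# Hypothetical backend function. When Ray schedules this replica with
# num_gpus=1, it sets CUDA_VISIBLE_DEVICES for the worker process, so
# deep learning frameworks only see the GPU assigned to this replica.
def handle_request(request):
    assigned = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return "replica sees GPUs: {}".format(assigned)

print(handle_request(None))
```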
.. note::
    Deep learning frameworks such as PyTorch and TensorFlow often use all the CPUs
    when performing inference. Ray sets the environment variable ``OMP_NUM_THREADS=1``
    to :ref:`avoid contention <omp-num-thread-note>`. This means each worker will only
    use one CPU instead of all of them.
Splitting Traffic
+++++++++++++++++