[Serve] Add Instructions for GPU (#8495)

Simon Mo 2020-05-19 18:33:58 -07:00 committed by GitHub
parent 1163ddbe45
commit c9c84c87f4
3 changed files with 27 additions and 3 deletions

@@ -64,6 +64,7 @@ Any method of the actor can return multiple object IDs with the ``ray.method`` decorator
    assert ray.get(obj_id1) == 1
    assert ray.get(obj_id2) == 2
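For context (a reconstruction, not part of this diff), the full docs example these assertions come from presumably resembles the sketch below; ``num_return_vals`` is assumed to be the decorator argument in this era of the API:

.. code-block:: python

    import ray

    @ray.remote
    class Foo:
        # Declare that this method returns two separate object IDs.
        @ray.method(num_return_vals=2)
        def bar(self):
            return 1, 2

    f = Foo.remote()
    obj_id1, obj_id2 = f.bar.remote()
    assert ray.get(obj_id1) == 1
    assert ray.get(obj_id2) == 2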
.. _actor-resource-guide:

Resources with Actors
---------------------

@@ -50,14 +50,16 @@ If using the command line, connect to the Ray cluster as follows:
    # Connect to Ray. Note that when connecting to an existing cluster, you don't specify resources.
    ray.init(address=<address>)
.. _omp-num-thread-note:

.. note::
    Ray sets the environment variable ``OMP_NUM_THREADS=1`` by default. This is done
    to avoid performance degradation with many workers (issue #6998). You can
    override this by explicitly setting ``OMP_NUM_THREADS``. ``OMP_NUM_THREADS`` is commonly
    used in numpy, PyTorch, and Tensorflow to perform multi-threaded linear algebra.
    In a multi-worker setting, we want one thread per worker instead of many threads
    per worker to avoid contention.
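For illustration (a minimal sketch, not part of this diff), overriding the default could look like the following; it assumes the variable is exported before the numerical libraries create their thread pools, and that ``ray.init()`` starts Ray locally so workers inherit this environment:

.. code-block:: python

    import os

    # Assumption: export the override before importing numpy so its
    # thread pool picks this value up instead of Ray's default of 1.
    os.environ["OMP_NUM_THREADS"] = "4"

    import numpy as np
    import ray

    # Per the note above, Ray only sets OMP_NUM_THREADS when it is not
    # already set, so this explicit value survives.
    ray.init()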
Logging and Debugging
---------------------

@@ -192,6 +192,27 @@ To scale out a backend to multiple workers, simply configure the number of replicas
This will scale out the number of workers that can accept requests.
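For reference, a sketch of the scale-out described above; the ``config`` dict and its ``num_replicas`` key are an assumption about the Serve API at the time of this commit, not something shown in the diff:

.. code-block:: python

    # Assumption: ``num_replicas`` in the backend config controls how many
    # replica workers serve this backend.
    config = {"num_replicas": 2}
    serve.create_backend("my_scaled_backend", handle_request, config=config)
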
Using Resources (CPUs, GPUs)
++++++++++++++++++++++++++++
To assign hardware resources per worker, you can pass resource requirements to
``ray_actor_options``. To learn about the options you can pass in, take a look at
the :ref:`Resources with Actors <actor-resource-guide>` guide.

For example, to create a backend where each replica uses a single GPU, you can do
the following:
.. code-block:: python

    options = {"num_gpus": 1}
    serve.create_backend("my_gpu_backend", handle_request, ray_actor_options=options)
.. note::
    Deep learning frameworks like PyTorch and Tensorflow often use all the CPUs when
    performing inference. Ray sets the environment variable ``OMP_NUM_THREADS=1`` to
    :ref:`avoid contention <omp-num-thread-note>`. This means each worker will only
    use one CPU instead of all of them.
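If a model genuinely benefits from intra-op parallelism, one possible workaround (an assumption on our part, not something this commit prescribes) is to raise the thread count inside the replica at runtime:

.. code-block:: python

    import torch

    def handle_request(flask_request):
        # Assumption: explicitly allow this replica four intra-op threads
        # for inference, overriding the effect of OMP_NUM_THREADS=1.
        torch.set_num_threads(4)
        with torch.no_grad():
            return {"threads": torch.get_num_threads()}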
Splitting Traffic
+++++++++++++++++