[Serve] Add Instructions for GPU (#8495)

This commit is contained in:
Simon Mo 2020-05-19 18:33:58 -07:00 committed by GitHub
parent 1163ddbe45
commit c9c84c87f4
3 changed files with 27 additions and 3 deletions


@@ -64,6 +64,7 @@ Any method of the actor can return multiple object IDs with the ``ray.method`` d
assert ray.get(obj_id1) == 1
assert ray.get(obj_id2) == 2

.. _actor-resource-guide:

Resources with Actors
---------------------


@@ -50,14 +50,16 @@ If using the command line, connect to the Ray cluster as follows:
# Connect to Ray. Note that if connecting to an existing cluster, you don't specify resources.
ray.init(address=<address>)
.. _omp-num-thread-note:

.. note::
    Ray sets the environment variable ``OMP_NUM_THREADS=1`` by default. This is done
    to avoid performance degradation with many workers (issue #6998). You can
    override this by explicitly setting ``OMP_NUM_THREADS``. ``OMP_NUM_THREADS`` is
    commonly used by numpy, PyTorch, and TensorFlow to perform multi-threaded
    linear algebra. In a multi-worker setting, we want one thread per worker
    instead of many threads per worker to avoid contention.
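For instance, to restore multi-threaded linear algebra in a script, you can set the variable before the libraries that read it are imported. A minimal sketch; the thread count of ``8`` is an arbitrary illustration, not a recommendation:

```python
import os

# OMP_NUM_THREADS must be set before the libraries that read it (numpy,
# PyTorch, TensorFlow) initialize their threading layers, so set it at
# the very top of the script, before those imports.
os.environ["OMP_NUM_THREADS"] = "8"  # example value; pick per workload

# Any subsequent `import numpy` in this process will now use up to 8
# OpenMP threads for linear algebra instead of Ray's default of 1.
print(os.environ["OMP_NUM_THREADS"])
```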
Logging and Debugging
---------------------


@@ -192,6 +192,27 @@ To scale out a backend to multiple workers, simply configure the number of replicas.
This will scale out the number of workers that can accept requests.
Using Resources (CPUs, GPUs)
++++++++++++++++++++++++++++
To assign hardware resources per worker, you can pass resource requirements to
``ray_actor_options``. To learn about the options you can pass in, take a look at the
:ref:`Resources with Actors <actor-resource-guide>` guide.

For example, to create a backend where each replica uses a single GPU, you can do the
following:
.. code-block:: python

    options = {"num_gpus": 1}
    serve.create_backend("my_gpu_backend", handle_request, ray_actor_options=options)
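Inside a replica scheduled with ``num_gpus``, Ray restricts GPU visibility by setting ``CUDA_VISIBLE_DEVICES`` for that worker, so frameworks like PyTorch only see the assigned device. A sketch of what a backend function might check; the handler below is a hypothetical illustration, not part of the Serve API:

```python
import os

# Hypothetical backend function. When Ray schedules this replica with
# num_gpus=1, it sets CUDA_VISIBLE_DEVICES for the worker process, so
# deep learning frameworks only see the GPU assigned to this replica.
def handle_request(request):
    assigned = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return "replica sees GPUs: {}".format(assigned)

print(handle_request(None))
```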
.. note::
    Deep learning frameworks such as PyTorch and TensorFlow often use all the CPUs
    when performing inference. Ray sets the environment variable ``OMP_NUM_THREADS=1``
    to :ref:`avoid contention <omp-num-thread-note>`. This means each worker will only
    use one CPU instead of all of them.
Splitting Traffic
+++++++++++++++++