[Serve] Add Instructions for GPU (#8495)

Simon Mo 2020-05-19 18:33:58 -07:00 committed by GitHub
parent 1163ddbe45
commit c9c84c87f4
3 changed files with 27 additions and 3 deletions

@@ -64,6 +64,7 @@ Any method of the actor can return multiple object IDs with the ``ray.method`` decorator
    assert ray.get(obj_id1) == 1
    assert ray.get(obj_id2) == 2
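For context (a reconstruction, not part of this diff), the full docs example these assertions come from presumably resembles the sketch below; ``num_return_vals`` is assumed to be the decorator argument in this era of the API:

.. code-block:: python

    import ray

    @ray.remote
    class Foo:
        # Declare that this method returns two separate object IDs.
        @ray.method(num_return_vals=2)
        def bar(self):
            return 1, 2

    f = Foo.remote()
    obj_id1, obj_id2 = f.bar.remote()
    assert ray.get(obj_id1) == 1
    assert ray.get(obj_id2) == 2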
.. _actor-resource-guide:

Resources with Actors
---------------------

@@ -50,14 +50,16 @@ If using the command line, connect to the Ray cluster as follows:
    # Connect to Ray. Note that when connecting to an existing cluster, you don't specify resources.
    ray.init(address=<address>)
.. _omp-num-thread-note:

.. note::
    Ray sets the environment variable ``OMP_NUM_THREADS=1`` by default. This is done
    to avoid performance degradation with many workers (issue #6998). You can
    override this by explicitly setting ``OMP_NUM_THREADS``. ``OMP_NUM_THREADS`` is commonly
    used in numpy, PyTorch, and Tensorflow to perform multi-threaded linear algebra.
    In a multi-worker setting, we want one thread per worker instead of many threads
    per worker to avoid contention.
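For illustration (a minimal sketch, not part of this diff), overriding the default could look like the following; it assumes the variable is exported before the numerical libraries create their thread pools, and that ``ray.init()`` starts Ray locally so workers inherit this environment:

.. code-block:: python

    import os

    # Assumption: export the override before importing numpy so its
    # thread pool picks this value up instead of Ray's default of 1.
    os.environ["OMP_NUM_THREADS"] = "4"

    import numpy as np
    import ray

    # Per the note above, Ray only sets OMP_NUM_THREADS when it is not
    # already set, so this explicit value survives.
    ray.init()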
Logging and Debugging
---------------------

@@ -192,6 +192,27 @@ To scale out a backend to multiple workers, simply configure the number of replicas
This will scale out the number of workers that can accept requests.
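For reference, a sketch of the scale-out described above; the ``config`` dict and its ``num_replicas`` key are an assumption about the Serve API at the time of this commit, not something shown in the diff:

.. code-block:: python

    # Assumption: ``num_replicas`` in the backend config controls how many
    # replica workers serve this backend.
    config = {"num_replicas": 2}
    serve.create_backend("my_scaled_backend", handle_request, config=config)
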
Using Resources (CPUs, GPUs)
++++++++++++++++++++++++++++
To assign hardware resources per worker, you can pass resource requirements to
``ray_actor_options``. To learn about the options you can pass in, take a look at
the :ref:`Resources with Actors <actor-resource-guide>` guide.

For example, to create a backend where each replica uses a single GPU, you can do
the following:
.. code-block:: python

    options = {"num_gpus": 1}
    serve.create_backend("my_gpu_backend", handle_request, ray_actor_options=options)
.. note::
    Deep learning frameworks like PyTorch and Tensorflow often use all the CPUs when
    performing inference. Ray sets the environment variable ``OMP_NUM_THREADS=1`` to
    :ref:`avoid contention <omp-num-thread-note>`. This means each worker will only
    use one CPU instead of all of them.
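If a model genuinely benefits from intra-op parallelism, one possible workaround (an assumption on our part, not something this commit prescribes) is to raise the thread count inside the replica at runtime:

.. code-block:: python

    import torch

    def handle_request(flask_request):
        # Assumption: explicitly allow this replica four intra-op threads
        # for inference, overriding the effect of OMP_NUM_THREADS=1.
        torch.set_num_threads(4)
        with torch.no_grad():
            return {"threads": torch.get_num_threads()}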
Splitting Traffic
+++++++++++++++++