[serve] Clarify OMP_NUM_THREADS behavior (#8740)

Edward Oakes 2020-06-05 15:39:37 -05:00 committed by GitHub
parent c0df913b19
commit 7bfce5c027


@@ -44,12 +44,27 @@ following:
   config = {"num_gpus": 1}
   serve.create_backend("my_gpu_backend", handle_request, ray_actor_options=config)
 
-.. note::
-  Deep learning models like PyTorch and Tensorflow often use all the CPUs when
-  performing inference. Ray sets the environment variable ``OMP_NUM_THREADS=1`` to
-  :ref:`avoid contention<omp-num-thread-note>`. This means each worker will only
-  use one CPU instead of all of them.
+Configuring Parallelism with OMP_NUM_THREADS
+--------------------------------------------
+
+Deep learning frameworks like PyTorch and TensorFlow often use multithreading when performing inference.
+The number of CPUs they use is controlled by the ``OMP_NUM_THREADS`` environment variable.
+Because Ray workers and actors use a single CPU by default, Ray sets ``OMP_NUM_THREADS=1`` to :ref:`avoid contention<omp-num-thread-note>`.
+If you *do* want to enable this parallelism in your Serve backend, set ``OMP_NUM_THREADS`` to the desired value either when starting Ray or in your function/class definition:
+
+.. code-block:: bash
+
+    OMP_NUM_THREADS=12 ray start --head
+    OMP_NUM_THREADS=12 ray start --address=$HEAD_NODE_ADDRESS
+
+.. code-block:: python
+
+    import os
+
+    class MyBackend:
+        def __init__(self, parallelism):
+            # Environment variable values must be strings, so convert the int.
+            os.environ["OMP_NUM_THREADS"] = str(parallelism)
+            # Download model weights, initialize model, etc.
+
+    serve.create_backend("parallel_backend", MyBackend, 12)
 
 .. _serve-batching:
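
The Python example in the diff uses a class-based backend, while the prose also mentions setting the variable in a function definition. A minimal sketch of the function-backend variant (mirroring the ``handle_request`` style and ``serve.create_backend`` call visible in the hunk's context lines; the backend name is hypothetical, and it assumes the model library reads ``OMP_NUM_THREADS`` once when it initializes its thread pool):

.. code-block:: python

    import os

    # Set the variable at import time, before the model framework starts its
    # thread pool (OpenMP reads OMP_NUM_THREADS when its runtime initializes).
    os.environ["OMP_NUM_THREADS"] = "12"

    def handle_request(request):
        # Inference here can now use up to 12 threads.
        ...

    # Hypothetical backend name; same create_backend API as shown above.
    serve.create_backend("parallel_function_backend", handle_request)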