.. _serve-batch-tutorial:

Batching Tutorial
=================

In this guide, we will deploy a simple vectorized adder that takes
a batch of queries and adds them at once. In particular, we show:

- How to implement and deploy a Ray Serve deployment that accepts batches.
- How to configure the batch size.
- How to query the model in Python.

This tutorial should help with the following use cases:

- You want to perform offline batch inference on a cluster of machines.
- You want to serve online queries, and your model can take advantage of batching.
  For example, linear regressions and neural networks use the vectorized
  instructions of CPUs and GPUs to perform computation in parallel. Performing
  inference with batching can increase the *throughput* of the model as well as
  the *utilization* of the hardware.

Let's import Ray Serve and some other helpers.

.. literalinclude:: ../../../../python/ray/serve/examples/doc/tutorial_batch.py
   :start-after: __doc_import_begin__
   :end-before: __doc_import_end__

You can use the ``@serve.batch`` decorator to annotate a function or a method.
This annotation will automatically cause calls to the function to be batched together.
The function must take a list of objects as input, but each caller invokes it with a
single object; Serve collects the individual calls into a batch behind the scenes.
The function must also be ``async def`` so that multiple queries can be handled concurrently:

.. code-block:: python

   @serve.batch
   async def my_batch_handler(self, requests: List):
       pass

This batch handler can then be called from another ``async def`` method in your deployment.
These calls will be batched and executed together, but each call returns an individual
result as if it were a normal function call:

.. code-block:: python

   class MyBackend:
       @serve.batch
       async def my_batch_handler(self, requests: List):
           results = []
           for request in requests:
               results.append(request.json())
           return results

       async def __call__(self, request):
           # Forward the single request into the shared batch handler and
           # return its individual result to the caller.
           return await self.my_batch_handler(request)

.. note::
   By default, Ray Serve performs *opportunistic batching*. This means that as
   soon as the batch handler is called, the method will be executed without
   waiting for a full batch. If there are more queries available after this call
   finishes, a larger batch may be executed. This behavior can be tuned using the
   ``batch_wait_timeout_s`` option to ``@serve.batch`` (defaults to 0). Increasing this
   timeout may improve throughput at the cost of latency under low load.
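
For example, to trade a little latency for larger batches, both options can be
passed to the decorator. This is a minimal sketch; the parameter values are
illustrative, not taken from the tutorial code:

.. code-block:: python

   @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
   async def my_batch_handler(self, requests: List):
       # Wait up to 100 ms for more queries to arrive before executing, and
       # never run more than 8 queries in a single batch.
       return [r.json() for r in requests]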

Let's define a deployment that takes in a list of requests, extracts each request's
input value, converts them into an array, and uses NumPy to add 1 to each element.

.. literalinclude:: ../../../../python/ray/serve/examples/doc/tutorial_batch.py
   :start-after: __doc_define_servable_begin__
   :end-before: __doc_define_servable_end__
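
The included code might look roughly like the following sketch. The class name,
route, and method names here are assumptions for illustration; the authoritative
version lives in ``tutorial_batch.py``:

.. code-block:: python

   from typing import List

   import numpy as np
   from ray import serve


   @serve.deployment(route_prefix="/adder")
   class BatchAdder:
       @serve.batch(max_batch_size=4)
       async def handle_batch(self, numbers: List[int]):
           # The whole batch arrives as a single list, so the addition can be
           # vectorized with NumPy.
           input_array = np.array(numbers)
           print("Our input array has shape:", input_array.shape)
           output_array = input_array + 1
           return output_array.astype(int).tolist()

       async def __call__(self, request):
           # Each HTTP request contributes one number to the batch.
           return await self.handle_batch(int(request.query_params["number"]))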

Let's deploy it. Note that in the ``@serve.batch`` decorator, we are specifying
the maximum batch size via ``max_batch_size=4``. This option limits
the maximum possible batch size that will be executed at once.

.. literalinclude:: ../../../../python/ray/serve/examples/doc/tutorial_batch.py
   :start-after: __doc_deploy_begin__
   :end-before: __doc_deploy_end__
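
Continuing the sketch above, deploying ``BatchAdder`` would look roughly like this:

.. code-block:: python

   import ray
   from ray import serve

   ray.init()
   serve.start()

   # Make the deployment available to serve HTTP traffic and handle queries.
   BatchAdder.deploy()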

Let's define a :ref:`Ray remote task<ray-remote-functions>` to send queries in
parallel. As you can see, the first batch has a batch size of 1, and the subsequent
queries have a batch size of 4. Even though each query is issued independently,
Ray Serve was able to evaluate them in batches.

.. literalinclude:: ../../../../python/ray/serve/examples/doc/tutorial_batch.py
   :start-after: __doc_query_begin__
   :end-before: __doc_query_end__
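
A sketch of such parallel querying, assuming the ``/adder`` route from the
servable sketch above:

.. code-block:: python

   import ray
   import requests


   @ray.remote
   def send_query(number: int) -> str:
       # Each task issues one independent HTTP query to the deployment.
       resp = requests.get("http://localhost:8000/adder", params={"number": number})
       return resp.text

   # Issue ten queries in parallel; Serve groups them into batches server-side.
   results = ray.get([send_query.remote(i) for i in range(10)])
   print("Results returned:", results)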

What if you want to evaluate a whole batch in Python? Ray Serve allows you to send
queries via the Python API. A batch of queries can come either from the web server
or from the Python API. Learn more :ref:`here<serve-handle-explainer>`.

To query the deployment via the Python API, we can use ``Deployment.get_handle`` to receive
a handle to the corresponding deployment. To enqueue a query, you can call
``handle.method.remote(data)``. This call returns immediately
with a :ref:`Ray ObjectRef<ray-object-refs>`. You can call ``ray.get`` to retrieve
the result.

.. literalinclude:: ../../../../python/ray/serve/examples/doc/tutorial_batch.py
   :start-after: __doc_query_handle_begin__
   :end-before: __doc_query_handle_end__
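
Under the same assumptions as the earlier sketches, handle-based querying might
look like this:

.. code-block:: python

   # Get a handle to the (already deployed) BatchAdder from the sketch above.
   handle = BatchAdder.get_handle()
   input_batch = list(range(10))
   print("Input batch is", input_batch)

   # Each .remote() call returns an ObjectRef immediately; ray.get waits for
   # all the results, which Serve evaluates in batches of up to 4.
   result_batch = ray.get([handle.handle_batch.remote(i) for i in input_batch])
   print("Result batch is", result_batch)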