- understand the performance characteristics of Ray Serve
- find ways to debug and tune the performance of your Serve deployment
.. note::

    While this section offers some tips and tricks to improve the performance of your Serve deployment,
    the :ref:`architecture doc <serve-architecture>` is helpful for gaining a deeper understanding of these components and parameters.
.. contents::
Performance and known benchmarks
--------------------------------
We are continuously benchmarking Ray Serve. The metrics we care about are latency, throughput, and scalability. We can confidently say:
- Ray Serve’s latency overhead is in the single-digit milliseconds, around 1-2 milliseconds on average.
- For throughput, Serve achieves about 3-4k queries per second on a single machine (8 cores) using one HTTP proxy and eight backend replicas serving no-op requests.
- It is horizontally scalable, so you can add more machines to increase the overall throughput. Ray Serve is built on top of Ray,
so its scalability is bounded by Ray’s scalability. Please check out Ray’s `scalability envelope <https://github.com/ray-project/ray/blob/master/benchmarks/README.md>`_
to learn more about the maximum number of nodes and other limitations.
You can follow the `microbenchmark instructions <https://github.com/ray-project/ray/blob/master/python/ray/serve/benchmarks/README.md>`_
to run these benchmarks on your own hardware.
Debugging performance issues
----------------------------
The performance issue you're most likely to encounter is high latency and/or low throughput for requests.
If you have set up :ref:`monitoring <serve-monitoring>` with Ray and Ray Serve, you will likely observe that:

- ``serve_num_router_requests`` is constant while your load increases
- ``serve_backend_queuing_latency_ms`` is spiking up as queries queue up in the background
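
One quick way to eyeball both signals, assuming your nodes export Prometheus metrics on a known port (for example, a cluster started with ``ray start --head --metrics-export-port=8080``; the address below is purely illustrative), is to scrape the metrics endpoint directly:

.. code-block:: python

    import requests

    # Hypothetical address: substitute the host and metrics export port of your node.
    METRICS_URL = "http://127.0.0.1:8080"

    # The endpoint returns plain-text Prometheus metrics, one sample per line.
    for line in requests.get(METRICS_URL).text.splitlines():
        # Keep only the router request counter and the backend queuing latency;
        # the exported names may carry an extra prefix depending on the Ray version.
        if "serve_num_router_requests" in line or "serve_backend_queuing_latency_ms" in line:
            print(line)
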
Given these symptoms, there are several ways to address them.
Choosing the right hardware
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Make sure you are using the right hardware and resources.
Are you using GPUs (``actor_init_options={"num_gpus": 1}``) or multiple CPU cores (``actor_init_options={"num_cpus": 2}``, together with setting ``OMP_NUM_THREADS``)
to increase the performance of your deep learning framework?
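
As a rough sketch of reserving resources for each replica (this uses the ``@serve.deployment`` API with ``ray_actor_options`` and a ``runtime_env``, which newer Serve versions accept; in your version the option may instead be spelled ``actor_init_options`` as above, and the model class here is purely illustrative):

.. code-block:: python

    from ray import serve

    # Reserve two CPU cores for each replica of a hypothetical CPU-bound model;
    # for a GPU model, use {"num_gpus": 1} instead.
    # OMP_NUM_THREADS controls how many intra-op threads frameworks such as
    # PyTorch or TensorFlow use, so keep it in sync with num_cpus.
    @serve.deployment(
        ray_actor_options={
            "num_cpus": 2,
            "runtime_env": {"env_vars": {"OMP_NUM_THREADS": "2"}},
        }
    )
    class CPUModel:
        def __call__(self, request) -> str:
            # Placeholder for the real model inference.
            return "prediction"
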
Batching
^^^^^^^^
If your backend can process a batch of queries at sublinear latency (for example, if it takes 1 ms to process one query but only 5 ms to process ten of them),
then batching is your best approach. Check out the :ref:`batching guide <serve-batching>` to
make your backend accept batches (especially for GPU-based ML inference). You might want to tune your ``max_batch_size`` and ``batch_wait_timeout`` in the ``@serve.batch`` decorator to maximize the benefits.
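
As a minimal sketch (the deployment class, handler names, and parameter values here are illustrative assumptions; note that recent Serve versions spell the timeout parameter ``batch_wait_timeout_s``, while older versions configure batching on the backend config instead):

.. code-block:: python

    from typing import List

    from ray import serve

    @serve.deployment
    class BatchedModel:
        # Individual calls to handle_batch are grouped automatically: up to 32
        # requests, or whatever has arrived after waiting 10 ms, come in as a list.
        @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.01)
        async def handle_batch(self, inputs: List[float]) -> List[float]:
            # One vectorized/batched model call instead of 32 separate ones.
            return [value * 2 for value in inputs]

        async def __call__(self, request) -> float:
            value = float((await request.json())["value"])
            # Each request awaits its slot in the next batch.
            return await self.handle_batch(value)

A larger ``max_batch_size`` generally improves hardware utilization at the cost of per-request latency, while the wait timeout bounds how long an individual request waits for a batch to fill, so tune both against your latency budget.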