# Performance Tuning
This section should help you:
- understand the performance characteristics of Ray Serve
- find ways to debug and tune the performance of your Serve deployment
:::{note}
While this section offers some tips and tricks to improve the performance of your Serve deployment,
the [architecture doc](serve-architecture) is helpful for context, including an overview of the HTTP proxy actor and replica actors.
:::
```{contents}
```
## Performance and known benchmarks
We are continuously benchmarking Ray Serve. The metrics we care about are latency, throughput, and scalability. We can confidently say:
- Ray Serve's latency overhead is single digit milliseconds, around 1-2 milliseconds on average.
- For throughput, Serve achieves about 3-4k queries per second on a single machine (8 cores) using 1 HTTP proxy actor and 8 replicas performing noop requests.
- It is horizontally scalable so you can add more machines to increase the overall throughput. Ray Serve is built on top of Ray,
so its scalability is bounded by Ray's scalability. Please check out Ray's [scalability envelope](https://github.com/ray-project/ray/blob/master/release/benchmarks/README.md)
to learn more about the maximum number of nodes and other limitations.

You can check out our [microbenchmark instructions](https://github.com/ray-project/ray/blob/master/python/ray/serve/benchmarks/README.md)
to benchmark on your hardware.
## Debugging performance issues
The performance issue you're most likely to encounter is high latency and/or low throughput for requests.
If you have set up [monitoring](serve-monitoring) with Ray and Ray Serve, you will likely observe the following:
- `serve_num_router_requests` is constant while your load increases
- `serve_deployment_queuing_latency_ms` is spiking up as queries queue up in the background

Given these symptoms, there are several ways to fix the issue.
### Choosing the right hardware
Make sure you are using the right hardware and resources.
Are you using GPUs (`ray_actor_options={"num_gpus": 1}`)? Are you using one or more cores (`ray_actor_options={"num_cpus": 2}`) and setting [`OMP_NUM_THREADS`](serve-omp-num-threads) to increase the performance of your deep learning framework?
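
For illustration, here is a minimal sketch of reserving resources for each replica; the deployment name, resource values, and environment-variable value are placeholders, not recommendations:

```python
from ray import serve

# Illustrative resource settings; size these to your own model and hardware.
@serve.deployment(
    ray_actor_options={
        "num_cpus": 2,
        "num_gpus": 1,
        # One possible way to raise OMP_NUM_THREADS for the replica process;
        # see the linked OMP_NUM_THREADS section for the recommended setup.
        "runtime_env": {"env_vars": {"OMP_NUM_THREADS": "2"}},
    }
)
class MyModel:
    def __call__(self, request):
        # Placeholder for your framework's inference call.
        return "ok"
```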
### `async` methods
Are you using `async def` in your callable? If you are using `asyncio` and
hitting the same queuing issue mentioned above, you might want to increase
`max_concurrent_queries`. Serve sets a low number (100) by default so the client gets
proper backpressure. You can increase the value in the deployment decorator; e.g.
`@serve.deployment(max_concurrent_queries=1000)`.
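
As a rough sketch, where the deployment and the `asyncio.sleep` are placeholders for your own non-blocking work:

```python
import asyncio

from ray import serve

# Allow up to 1000 in-flight requests per replica (illustrative value).
@serve.deployment(max_concurrent_queries=1000)
class AsyncModel:
    async def __call__(self, request):
        # Placeholder for non-blocking work, e.g. awaiting a downstream service.
        await asyncio.sleep(0.01)
        return "done"
```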
### Batching
If your deployment can process a batch at a time at a sublinear latency
(for example, if it takes 1ms to process 1 query and 5ms to process 10 of them)
then batching is your best approach. Check out the [batching guide](serve-batching) to
make your deployment accept batches (especially for GPU-based ML inference). You might want to tune `max_batch_size` and `batch_wait_timeout` in the `@serve.batch` decorator to maximize the benefits (see the sketch after this list):
- `max_batch_size` specifies how big the batch should be. Generally,
we recommend choosing the largest batch size your function can handle
for which the processing time still grows sublinearly with batch size. Take a dummy
example: suppose it takes 1ms to process 1 query, 5ms to process 10 queries,
and 6ms to process 11 queries. Here you should set the batch size to 10
because adding more queries won't improve the performance.
- `batch_wait_timeout` specifies the maximum amount of time to wait before
2022-05-10 14:04:17 -07:00
a batch should be processed, even if it's not full. It should be set according
to the equation `batch_wait_timeout + full batch processing time ~= expected latency`.
The larger `batch_wait_timeout` is, the more full the typical batch will be.
To maximize throughput, you should set `batch_wait_timeout` as large as possible without exceeding your desired expected latency in the equation above.
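
Putting the two parameters together, here is a minimal sketch following the pattern from the batching guide. The values are illustrative, and the timeout keyword is spelled `batch_wait_timeout_s` (in seconds) in recent Serve releases, so check the spelling for your installed version:

```python
from ray import serve

@serve.deployment
class BatchedModel:
    # Wait for up to 10 queries or 0.1 s, whichever comes first (illustrative values;
    # confirm the keyword name, e.g. `batch_wait_timeout_s`, for your Serve version).
    @serve.batch(max_batch_size=10, batch_wait_timeout_s=0.1)
    async def handle_batch(self, requests):
        # Placeholder for vectorized inference over the whole batch.
        return ["ok"] * len(requests)

    async def __call__(self, request):
        # Each call enqueues one request; Serve groups requests into batches.
        return await self.handle_batch(request)
```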