# Performance Tuning
This section should help you:
- understand the performance characteristics of Ray Serve
- find ways to debug and tune the performance of your Serve deployment
:::{note}
While this section offers some tips and tricks to improve the performance of your Serve deployment,
the [architecture doc](serve-architecture) is helpful for gaining a deeper understanding of the concepts and parameters involved.
:::
```{contents}
```
## Performance and known benchmarks
We are continuously benchmarking Ray Serve. The metrics we care about are latency, throughput, and scalability. We can confidently say:
- Ray Serve's latency overhead is in the single-digit milliseconds, around 1-2 milliseconds on average.
- For throughput, Serve achieves about 3-4k queries per second on a single machine (8 cores) using one HTTP proxy and eight replicas performing no-op requests.
- It is horizontally scalable, so you can add more machines to increase the overall throughput. Ray Serve is built on top of Ray,
so its scalability is bounded by Ray's scalability. Please check out Ray's [scalability envelope](https://github.com/ray-project/ray/blob/master/release/benchmarks/README.md)
to learn more about the maximum number of nodes and other limitations.
You can check out our [microbenchmark instructions](https://github.com/ray-project/ray/blob/master/python/ray/serve/benchmarks/README.md)
to benchmark Serve on your own hardware.
## Debugging performance issues
The performance issue you're most likely to encounter is high latency and/or low throughput for requests.
If you have set up [monitoring](serve-monitoring) with Ray and Ray Serve, you will likely observe that:
- `serve_num_router_requests` stays constant while your load increases.
- `serve_deployment_queuing_latency_ms` spikes up as queries queue up in the background.
Given these symptoms, there are several ways to address the issue.
### Choosing the right hardware
Make sure you are using the right hardware and resources.
Are you reserving GPUs (`ray_actor_options={"num_gpus": 1}`) or multiple cores (`ray_actor_options={"num_cpus": 2}`, together with setting `OMP_NUM_THREADS`)
to increase the performance of your deep learning framework?
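For example, a minimal sketch of reserving resources per replica might look like the following (`GPUModel` and `CPUModel` are hypothetical deployments; setting `OMP_NUM_THREADS` through a `runtime_env` is one option, exporting it in the replicas' environment also works):

```python
from ray import serve


# Sketch: reserve one GPU per replica (assumes a GPU-enabled cluster).
@serve.deployment(ray_actor_options={"num_gpus": 1})
class GPUModel:
    def __call__(self, request):
        ...  # run GPU inference here


# Sketch: reserve two CPU cores per replica and size the framework's thread pool
# to match by setting OMP_NUM_THREADS for the replica process.
@serve.deployment(
    ray_actor_options={
        "num_cpus": 2,
        "runtime_env": {"env_vars": {"OMP_NUM_THREADS": "2"}},
    }
)
class CPUModel:
    def __call__(self, request):
        ...  # run CPU inference here
```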
### Async functions
Are you using `async def` in your callable? If you are using `asyncio` and
hitting the same queuing issue mentioned above, you might want to increase
`max_concurrent_queries`. Serve sets a low number by default so that the client gets
proper backpressure. You can increase the value in the deployment decorator, as shown in the sketch below.
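For instance, a minimal sketch (with an illustrative concurrency limit and a stand-in for real async work) could look like:

```python
import asyncio

from ray import serve


# Sketch: an async deployment that allows more in-flight requests per replica.
# 200 is an illustrative value; tune it for your workload.
@serve.deployment(max_concurrent_queries=200)
class AsyncModel:
    async def __call__(self, request):
        # While one request awaits I/O, the replica can keep working on others.
        await asyncio.sleep(0.01)  # stand-in for an async model or remote call
        return "ok"
```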
### Batching
If your deployment can process a batch of queries with sublinear latency
(for example, if it takes 1ms to process 1 query and 5ms to process 10 of them),
then batching is your best approach. Check out the [batching guide](serve-batching) to
make your deployment accept batches (especially for GPU-based ML inference). You might want to tune `max_batch_size` and `batch_wait_timeout` in the `@serve.batch` decorator to maximize the benefits (see the sketch after this list):
- `max_batch_size` specifies how big the batch should be. Generally,
we recommend choosing the largest batch size your function can handle
for which the latency still grows sublinearly. Take a dummy
example: suppose it takes 1ms to process 1 query, 5ms to process 10 queries,
and 6ms to process 11 queries. Here you should set the batch size to 10
because adding more queries won't improve the performance.
- `batch_wait_timeout` specifies the maximum amount of time to wait before
a batch is processed, even if it's not full. It should be set according
to `batch_wait_timeout + full batch processing time ~= expected latency`. The idea
here is to let the first query in a batch wait as long as possible to achieve high throughput.
This means you should set `batch_wait_timeout` as large as possible without exceeding your desired expected latency in the equation above.
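Putting the two knobs together, here is a hedged sketch (recent Serve releases spell the timeout parameter `batch_wait_timeout_s`, in seconds; the values are illustrative):

```python
from ray import serve


@serve.deployment
class BatchedModel:
    # Collect up to 10 queries, waiting at most 5ms for a batch to fill.
    @serve.batch(max_batch_size=10, batch_wait_timeout_s=0.005)
    async def handle_batch(self, requests):
        # Process the whole batch in one (ideally vectorized) call.
        return [f"processed {r}" for r in requests]

    async def __call__(self, request):
        # Individual calls are transparently grouped into batches by @serve.batch.
        return await self.handle_batch(request)
```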
### Scaling HTTP servers
Sometimes it's not about your code: Serve's HTTP server can become the bottleneck.
If you observe that the CPU utilization of the HTTPProxy actor spikes up to 100%, the HTTP server is the bottleneck.
Serve only starts a single HTTP server on the Ray head node by default.
This single HTTP server can handle about 3k queries per second.
If your workload exceeds this number, you might want to consider starting one
HTTP server per Ray node to spread the load with `serve.start(http_options={"location": "EveryNode"})`.
This configuration tells Serve to spawn one HTTP server per node.
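As a minimal sketch (assuming you are connected to a running Ray cluster):

```python
from ray import serve

# Start Serve with one HTTP proxy per Ray node instead of only on the head node.
serve.start(http_options={"location": "EveryNode"})
```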
You should put an external load balancer in front of these HTTP proxies.