Each replica maintains a queue of requests and executes them one at a time, possibly
using `asyncio` to process them concurrently. If the handler (the deployment function or the `__call__` method of the deployment class) is declared with `async def`, the replica will not wait for the
handler to run; otherwise, the replica blocks until the handler returns.
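For example, the sketch below contrasts the two handler styles (the class names are invented for illustration):

```python
import asyncio

from ray import serve


@serve.deployment
class BlockingModel:
    def __call__(self, request):
        # Synchronous handler: the replica waits for this call to return
        # before picking up the next request from its queue.
        return "done"


@serve.deployment
class ConcurrentModel:
    async def __call__(self, request):
        # `async def` handler: the replica does not block on this call and
        # can interleave other requests while this one awaits I/O.
        await asyncio.sleep(0.1)  # stand-in for real async work
        return "done"
```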
When making a request via [ServeHandle](serve-handle-explainer) instead of HTTP, the request is placed on a queue in the ServeHandle, and we skip to step 3 above.
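As a minimal sketch of the handle path, assuming the Serve 2.x API where `serve.run` returns a handle and `.remote()` calls return Ray object refs:

```python
import ray
from ray import serve


@serve.deployment
def echo(message: str) -> str:
    return f"echo: {message}"


# Deploy the application; `serve.run` returns a ServeHandle to it.
handle = serve.run(echo.bind())

# The call is enqueued on the ServeHandle and forwarded directly to a
# replica, bypassing the HTTP proxy.
ref = handle.remote("hello")
print(ray.get(ref))  # "echo: hello"
```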
Machine errors and faults will be handled by Ray Serve as follows:
- When replica actors fail, the Controller actor will replace them with new ones.
- When the HTTP proxy actor fails, the Controller actor will restart it.
- When the Controller actor fails, Ray will restart it.
- When using the [KubeRay RayService](https://ray-project.github.io/kuberay/guidance/rayservice/), KubeRay will recover crashed nodes or a crashed cluster. Cluster crashes can be avoided using the [GCS HA feature](https://ray-project.github.io/kuberay/guidance/gcs-ha/).
- If you are not using KubeRay and the Ray cluster fails, Ray Serve cannot recover.
When a machine hosting any of the actors crashes, those actors will be automatically restarted on another available machine.
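Replica recovery is driven by health checking. As a hedged sketch, deployments accept `health_check_period_s` and `health_check_timeout_s` options and may define a `check_health` method (the values and the backend check below are illustrative):

```python
from ray import serve


@serve.deployment(health_check_period_s=10, health_check_timeout_s=30)
class Model:
    def check_health(self):
        # Called periodically by the controller. Raising here marks the
        # replica unhealthy, so the controller tears it down and replaces it.
        if not self._backend_reachable():  # hypothetical application check
            raise RuntimeError("backend unreachable")

    def _backend_reachable(self) -> bool:
        return True  # placeholder for a real connectivity check

    def __call__(self, request):
        return "ok"
```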
## Ray Serve Autoscaling

Ray Serve's autoscaling feature automatically adjusts the number of replicas of a deployment based on its load:

- The Serve Autoscaler runs in the Serve Controller actor.
- Each ServeHandle and each replica periodically pushes its metrics to the autoscaler.
- For each deployment, the autoscaler periodically checks ServeHandle queues and in-flight queries on replicas to decide whether or not to scale the number of replicas.
- Each ServeHandle continuously polls the controller to check for new deployment replicas. Whenever new replicas are discovered, it sends any buffered or new queries to them until `max_concurrent_queries` is reached. Queries are sent to replicas in round-robin fashion, subject to the constraint that no replica handles more than `max_concurrent_queries` requests at a time.
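Autoscaling is configured per deployment. A minimal sketch (the option names follow Serve's `autoscaling_config`; the values are illustrative):

```python
from ray import serve


@serve.deployment(
    max_concurrent_queries=16,  # per-replica cap referenced above
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,
        # The load the autoscaler tries to maintain on each replica.
        "target_num_ongoing_requests_per_replica": 8,
    },
)
class AutoscaledModel:
    def __call__(self, request):
        return "ok"
```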
:::{note}
When the controller dies, requests can still be sent via HTTP and ServeHandles, but autoscaling will be paused. When the controller recovers, autoscaling will resume, but all previously collected metrics will be lost.
:::
## Ray Serve API Server
Ray Serve provides a [CLI](serve-cli) for managing your Ray Serve instance, as well as a [REST API](serve-rest-api).
Each node in your Ray cluster provides a REST API server that can connect to Serve and respond to Serve REST requests.
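For instance, deploying a config file and checking application status from any node might look like this (assuming a Serve config file named `config.yaml`):

```console
$ serve deploy config.yaml   # sends the config to the REST API on this node
$ serve status               # queries the status of the running application
```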
Serve can be configured to start one HTTP proxy actor per node via the `location` field of [`http_options`](core-apis). Each proxy binds the same port, so you can reach Serve and send requests to any model via any of these servers.
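A minimal sketch of that configuration (assuming the `"EveryNode"` location value and Serve's default HTTP port of 8000):

```python
from ray import serve

# Start Serve with one HTTP proxy actor per node; "EveryNode" enables
# the per-node proxies, and 8000 is Serve's default HTTP port.
serve.start(http_options={"location": "EveryNode", "port": 8000})
```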
This architecture ensures horizontal scalability for Serve. You can scale your HTTP ingress by adding more nodes and scale your model inference by increasing the number
of replicas via the `num_replicas` option of your deployment.
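For example, a hedged sketch of scaling a deployment out (the class name is illustrative):

```python
from ray import serve


@serve.deployment(num_replicas=4)  # four replicas handle inference in parallel
class Model:
    def __call__(self, request):
        return "ok"


# Scale in place later by redeploying with a new replica count.
serve.run(Model.options(num_replicas=8).bind())
```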