Each replica maintains a queue of requests and executes them one at a time, possibly
using `asyncio` to process them concurrently. If the handler (the deployment function or the `__call__` method of the deployment class) is declared with `async def`, the replica will not wait for the
handler to run; otherwise, the replica blocks until the handler returns.
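For example, the sketch below contrasts the two handler styles (the class names are invented for illustration):

```python
import asyncio

from ray import serve


@serve.deployment
class BlockingModel:
    def __call__(self, request):
        # Synchronous handler: the replica waits for this call to return
        # before picking up the next request from its queue.
        return "done"


@serve.deployment
class ConcurrentModel:
    async def __call__(self, request):
        # `async def` handler: the replica does not block on this call and
        # can interleave other requests while this one awaits I/O.
        await asyncio.sleep(0.1)  # stand-in for real async work
        return "done"
```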
When making a request via [ServeHandle](serve-handle-explainer) instead of HTTP, the request is placed on a queue in the ServeHandle, and we skip to step 3 above.
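As a minimal sketch of the handle path, assuming the Serve 2.x API where `serve.run` returns a handle and `.remote()` calls return Ray object refs:

```python
import ray
from ray import serve


@serve.deployment
def echo(message: str) -> str:
    return f"echo: {message}"


# Deploy the application; `serve.run` returns a ServeHandle to it.
handle = serve.run(echo.bind())

# The call is enqueued on the ServeHandle and forwarded directly to a
# replica, bypassing the HTTP proxy.
ref = handle.remote("hello")
print(ray.get(ref))  # "echo: hello"
```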
Machine errors and faults will be handled by Ray Serve as follows:
- When replica actors fail, the Controller actor will replace them with new ones.
- When the HTTP proxy actor fails, the Controller actor will restart it.
- When the Controller actor fails, Ray will restart it.
- When using the [KubeRay RayService](https://ray-project.github.io/kuberay/guidance/rayservice/), KubeRay will recover crashed nodes or a crashed cluster. Cluster crashes can be avoided using the [GCS HA feature](https://ray-project.github.io/kuberay/guidance/gcs-ha/).
- If you are not using KubeRay and the Ray cluster fails, Ray Serve cannot recover.
When a machine hosting any of the actors crashes, those actors will be automatically restarted on another available machine.
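Replica recovery is driven by health checking. As a hedged sketch, deployments accept `health_check_period_s` and `health_check_timeout_s` options and may define a `check_health` method (the values and the backend check below are illustrative):

```python
from ray import serve


@serve.deployment(health_check_period_s=10, health_check_timeout_s=30)
class Model:
    def check_health(self):
        # Called periodically by the controller. Raising here marks the
        # replica unhealthy, so the controller tears it down and replaces it.
        if not self._backend_reachable():  # hypothetical application check
            raise RuntimeError("backend unreachable")

    def _backend_reachable(self) -> bool:
        return True  # placeholder for a real connectivity check

    def __call__(self, request):
        return "ok"
```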
## Ray Serve Autoscaling

Ray Serve's autoscaling feature automatically adjusts the number of replicas of a deployment based on its load:

- The Serve Autoscaler runs in the Serve Controller actor.
- Each ServeHandle and each replica periodically pushes its metrics to the autoscaler.
- For each deployment, the autoscaler periodically checks ServeHandle queues and in-flight queries on replicas to decide whether or not to scale the number of replicas.
- Each ServeHandle continuously polls the controller to check for new deployment replicas. Whenever new replicas are discovered, it sends any buffered or new queries to them until `max_concurrent_queries` is reached. Queries are sent to replicas in round-robin fashion, subject to the constraint that no replica handles more than `max_concurrent_queries` requests at a time.
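Autoscaling is configured per deployment. A minimal sketch (the option names follow Serve's `autoscaling_config`; the values are illustrative):

```python
from ray import serve


@serve.deployment(
    max_concurrent_queries=16,  # per-replica cap referenced above
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 8,
        # The load the autoscaler tries to maintain on each replica.
        "target_num_ongoing_requests_per_replica": 8,
    },
)
class AutoscaledModel:
    def __call__(self, request):
        return "ok"
```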
:::{note}
When the controller dies, requests can still be sent via HTTP and ServeHandles, but autoscaling will be paused. When the controller recovers, autoscaling will resume, but all previously collected metrics will be lost.
:::
## Ray Serve API Server
Ray Serve provides a [CLI](serve-cli) for managing your Ray Serve instance, as well as a [REST API](serve-rest-api).
Each node in your Ray cluster provides a REST API server that can connect to Serve and respond to Serve REST requests.
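For instance, deploying a config file and checking application status from any node might look like this (assuming a Serve config file named `config.yaml`):

```console
$ serve deploy config.yaml   # sends the config to the REST API on this node
$ serve status               # queries the status of the running application
```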
Serve can be configured to start one HTTP proxy actor per node via the `location` field of [`http_options`](core-apis). Each proxy binds the same port, so you can reach Serve and send requests to any model via any of these servers.
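A minimal sketch of that configuration (assuming the `"EveryNode"` location value and Serve's default HTTP port of 8000):

```python
from ray import serve

# Start Serve with one HTTP proxy actor per node; "EveryNode" enables
# the per-node proxies, and 8000 is Serve's default HTTP port.
serve.start(http_options={"location": "EveryNode", "port": 8000})
```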
This architecture ensures horizontal scalability for Serve. You can scale your HTTP ingress by adding more nodes and scale your model inference by increasing the number
of replicas via the `num_replicas` option of your deployment.
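For example, a hedged sketch of scaling a deployment out (the class name is illustrative):

```python
from ray import serve


@serve.deployment(num_replicas=4)  # four replicas handle inference in parallel
class Model:
    def __call__(self, request):
        return "ok"


# Scale in place later by redeploying with a new replica count.
serve.run(Model.options(num_replicas=8).bind())
```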