
.. _serve-architecture:

Serve Architecture
==================
This section should help you:

- understand, at a high level, how each component in Serve works
- understand the different types of actors that make up a Serve instance

.. Figure source: https://docs.google.com/drawings/d/1jSuBN5dkSj2s9-0eGzlU_ldsRa3TsswQUZM-cMQ29a0/edit?usp=sharing

.. image:: architecture.svg
    :align: center
    :width: 600px

High Level View
---------------

Serve runs on Ray and utilizes :ref:`Ray actors<actor-guide>`.

There are three kinds of actors that are created to make up a Serve instance:

- Controller: A global actor unique to each Serve instance that manages
  the control plane. The Controller is responsible for creating, updating, and
  destroying other actors. Serve API calls like creating or getting a deployment
  make remote calls to the Controller.
- Router: There is one router per node. Each router is a `Uvicorn <https://www.uvicorn.org/>`_ HTTP
  server that accepts incoming requests, forwards them to replicas, and
  responds once they are completed.
- Worker Replica: Worker replicas actually execute the code in response to a
  request. For example, they may contain an instantiation of an ML model. Each
  replica processes individual requests from the routers (they may be batched
  by the replica using ``@serve.batch``, see the :ref:`batching<serve-batching>` docs).


Lifetime of a Request
---------------------
When an HTTP request is sent to the router, the following things happen:

- The HTTP request is received and parsed.
- The deployment associated with the HTTP URL path is looked up. The
  request is placed on a queue.
- For each request in a deployment queue, an available replica is looked up
  and the request is sent to it. If there are no available replicas (that is,
  every replica already has ``max_concurrent_queries`` requests outstanding),
  the request is left in the queue until an outstanding request is finished.
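
The queueing behavior in the last step can be modeled with a small,
self-contained asyncio sketch. This is a toy model, not Serve's router code:
``ToyRouter``, ``ToyReplica``, and ``peak_in_flight`` are hypothetical names
made up for illustration, and only ``max_concurrent_queries`` comes from the
text above.

```python
import asyncio
from collections import deque


class ToyReplica:
    """Hypothetical stand-in for a replica; counts its outstanding requests."""

    def __init__(self, name):
        self.name = name
        self.in_flight = 0

    async def handle(self, request):
        await asyncio.sleep(0.01)  # pretend to do some work
        return f"{self.name} handled {request}"


class ToyRouter:
    """Hypothetical router: queues requests until a replica has capacity."""

    def __init__(self, replicas, max_concurrent_queries):
        self.replicas = replicas
        self.max_concurrent_queries = max_concurrent_queries  # assumed >= 1
        self.peak_in_flight = 0  # highest per-replica load observed

    def _pick_available(self):
        # A replica is "available" while it is below its concurrency limit.
        for replica in self.replicas:
            if replica.in_flight < self.max_concurrent_queries:
                return replica
        return None

    async def _send(self, replica, request):
        try:
            return await replica.handle(request)
        finally:
            replica.in_flight -= 1  # frees a slot for a queued request

    async def route_all(self, requests):
        queue = deque(requests)
        pending, results = set(), []
        while queue or pending:
            # Assign queued requests while some replica has a free slot.
            while queue:
                replica = self._pick_available()
                if replica is None:
                    break  # leave the rest queued until a request finishes
                replica.in_flight += 1
                self.peak_in_flight = max(self.peak_in_flight, replica.in_flight)
                pending.add(asyncio.create_task(self._send(replica, queue.popleft())))
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            results.extend(task.result() for task in done)
        return results
```

In this sketch, routing ten requests across two replicas with
``max_concurrent_queries=2`` never puts more than two requests on a replica at
once; the remaining requests wait in the queue until a slot frees up.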

Each replica maintains a queue of requests and pulls them off one at a time,
possibly using asyncio to process them concurrently. If the handler (the
deployment function or the ``__call__`` method) is ``async``, the replica does
not wait for one call to finish before starting the next; otherwise, the
replica blocks until the handler returns.
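
The difference between the two handler styles can be seen with plain asyncio,
without Serve. The sketch below is an illustration under assumptions, not the
replica's actual code: ``run_replica``, ``async_handler``, and
``blocking_handler`` are hypothetical names, and the sleeps stand in for real
work.

```python
import asyncio
import time


async def async_handler(name, log):
    log.append(f"start {name}")
    await asyncio.sleep(0.01)  # yields control, so other requests can start
    log.append(f"end {name}")


def blocking_handler(name, log):
    log.append(f"start {name}")
    time.sleep(0.01)  # blocks the event loop until it returns
    log.append(f"end {name}")


async def run_replica(handler, names):
    """Hypothetical replica loop: invoke the handler once per request."""
    log = []

    async def invoke(name):
        result = handler(name, log)
        if asyncio.iscoroutine(result):
            await result  # async handlers are awaited; others already ran

    await asyncio.gather(*(invoke(name) for name in names))
    return log
```

With ``async_handler``, both requests start before either one finishes. With
``blocking_handler``, the second request cannot start until the first has
returned.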

FAQ
---

.. _serve-ft-detail:

How does Serve handle fault tolerance?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Application errors like exceptions in your model evaluation code are caught and
wrapped. A 500 status code will be returned with the traceback information. The
replica will be able to continue to handle requests.
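
As a rough illustration of this catch-and-wrap pattern (not Serve's actual
implementation), the sketch below turns a handler exception into a 500
response that carries the traceback. ``call_handler_safely`` and the
``(status, body)`` return shape are assumptions made for this example.

```python
import traceback


def call_handler_safely(handler, request):
    """Hypothetical wrapper: return (status_code, body) for one request."""
    try:
        return 200, handler(request)
    except Exception:
        # The error is reported to the caller; the wrapper itself keeps
        # working, so later requests can still be served.
        return 500, traceback.format_exc()


def flaky_model(request):
    if request < 0:
        raise ValueError("negative input")
    return request * 2
```

Calling ``call_handler_safely(flaky_model, -1)`` yields a 500 with the
traceback text, and a following call with a valid input still succeeds.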

Machine errors and faults will be handled by Ray. Serve utilizes the :ref:`actor
reconstruction <actor-fault-tolerance>` capability. For example, when a machine hosting any of the
actors crashes, those actors will be automatically restarted on another
available machine. All data in the Controller (routing policies, deployment
configurations, etc.) is checkpointed to Ray. Transient data in the
router and the replica (like network connections and internal request
queues) will be lost upon failure.
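
The split between checkpointed and transient state can be pictured with a toy
model. Everything here is hypothetical: ``ToyController``, the ``store``
dictionary standing in for durable storage, and the ``"checkpoint"`` key are
illustrative names, and the sketch says nothing about how Ray actually
persists the data.

```python
import json


class ToyController:
    """Hypothetical controller that checkpoints its config on every change."""

    def __init__(self, store):
        self.store = store  # stands in for storage that outlives the actor
        # On (re)start, recover whatever was last checkpointed.
        self.deployments = json.loads(store.get("checkpoint", "{}"))

    def deploy(self, name, config):
        self.deployments[name] = config
        self.store["checkpoint"] = json.dumps(self.deployments)


class ToyRouterState:
    """Hypothetical router state: nothing here is checkpointed."""

    def __init__(self):
        self.request_queue = []  # transient; starts empty after a restart
```

Creating a second ``ToyController`` on the same ``store`` models a restart: it
comes back with the deployment configurations, whereas a new
``ToyRouterState`` starts with an empty queue.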

How does Serve ensure horizontal scalability and availability?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Serve starts one router per node. Each router binds to the same port, so you
can reach Serve and send requests to any model through any of these HTTP
servers.

This architecture ensures horizontal scalability for Serve. You can scale the
router by adding more nodes and scale the model by increasing the number
of replicas.

How do ServeHandles work?
^^^^^^^^^^^^^^^^^^^^^^^^^

:mod:`ServeHandles <ray.serve.handle.RayServeHandle>` wrap a handle to the router actor on the same node. When a
request is sent from one replica to another via the handle, it
goes through the same data path as an incoming HTTP request. This enables
the same deployment selection and batching procedures to happen. ServeHandles are
often used to implement :ref:`model composition <serve-model-composition>`.


What happens to large requests?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Serve utilizes Ray’s :ref:`shared memory object store <plasma-store>` and in-process memory
store. Small request objects are sent directly between actors via a network
call. Larger request objects (100KiB+) are written to a distributed shared
memory store, and the replica can read them via zero-copy read.
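
The size-based choice can be summarized in a few lines. This is only a sketch
of the rule as described above: ``choose_transport`` and
``INLINE_THRESHOLD_BYTES`` are hypothetical names, and the real decision is
made inside Ray rather than by user code.

```python
# 100 KiB, the size quoted in the text above.
INLINE_THRESHOLD_BYTES = 100 * 1024


def choose_transport(payload: bytes) -> str:
    """Hypothetical helper: say which path a request of this size would take."""
    if len(payload) >= INLINE_THRESHOLD_BYTES:
        return "shared-memory object store"
    return "inline network call"
```

For example, a 1 KiB payload would travel inline, while a 1 MiB payload would
go through the shared-memory store.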