.. _serve-architecture:

Serve Architecture
==================

This section should help you:

- get an overview of how each component in Serve works
- understand the different types of actors that make up a Serve instance

.. Figure source: https://docs.google.com/drawings/d/1jSuBN5dkSj2s9-0eGzlU_ldsRa3TsswQUZM-cMQ29a0/edit?usp=sharing

.. image:: architecture.svg
    :align: center
    :width: 600px

High Level View
---------------

Serve runs on Ray and utilizes :ref:`Ray actors<actor-guide>`.

A Serve instance is made up of three kinds of actors:

- Controller: A global actor unique to each Serve instance that manages
  the control plane. The Controller is responsible for creating, updating, and
  destroying other actors. Serve API calls, such as creating or getting a
  deployment, make remote calls to the Controller.
- Router: There is one router per node. Each router is a `Uvicorn <https://www.uvicorn.org/>`_ HTTP
  server that accepts incoming requests, forwards them to replicas, and
  responds once they are completed.
- Worker Replica: Worker replicas execute the code in response to a
  request. For example, they may contain an instantiation of an ML model. Each
  replica processes individual requests from the routers (the replica may
  batch them using ``@serve.batch``; see the :ref:`batching<serve-batching>` docs).

Lifetime of a Request
---------------------

When an HTTP request is sent to the router, the following things happen:

- The HTTP request is received and parsed.
- The deployment associated with the HTTP URL path is looked up. The
  request is placed on a queue.
- For each request in a deployment queue, an available replica is looked up
  and the request is sent to it. If there are no available replicas (that is,
  each replica already has ``max_concurrent_queries`` requests outstanding),
  the request is left in the queue until an outstanding request is finished.
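
The routing steps above can be sketched in plain Python. This is a toy model, not Serve's implementation: ``ToyRouter``, ``ToyReplica``, and the least-loaded replica choice are assumptions made for illustration, and only ``max_concurrent_queries`` comes from the text.

```python
from collections import deque


class ToyReplica:
    """Stand-in for a worker replica; it only tracks in-flight requests."""

    def __init__(self, name):
        self.name = name
        self.in_flight = 0


class ToyRouter:
    """Hypothetical, simplified model of the routing steps described above."""

    def __init__(self, routes, replicas, max_concurrent_queries):
        self.routes = routes        # URL path -> deployment name
        self.replicas = replicas    # deployment name -> list of ToyReplica
        self.limit = max_concurrent_queries
        self.queues = {name: deque() for name in replicas}
        self.assigned = []          # (request, replica name), in send order

    def receive(self, path, request):
        deployment = self.routes[path]            # look up the deployment
        self.queues[deployment].append(request)   # place the request on its queue
        self._drain(deployment)

    def finish(self, deployment, replica_name):
        # A replica reports that one outstanding request has completed.
        for replica in self.replicas[deployment]:
            if replica.name == replica_name:
                replica.in_flight -= 1
        self._drain(deployment)

    def _drain(self, deployment):
        queue = self.queues[deployment]
        while queue:
            available = [
                r for r in self.replicas[deployment] if r.in_flight < self.limit
            ]
            if not available:
                return  # no capacity: the request stays in the queue
            # Assumption: pick the least-loaded replica.
            replica = min(available, key=lambda r: r.in_flight)
            replica.in_flight += 1
            self.assigned.append((queue.popleft(), replica.name))


router = ToyRouter(
    routes={"/model": "model"},
    replicas={"model": [ToyReplica("r0"), ToyReplica("r1")]},
    max_concurrent_queries=1,
)
for i in range(3):
    router.receive("/model", f"req{i}")
pending_before = list(router.queues["model"])  # req2 waits for capacity
router.finish("model", "r0")                   # frees r0, so req2 is sent
```

With two replicas and a limit of one outstanding request each, the third request waits in the queue until ``finish`` frees a replica.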

Each replica maintains a queue of requests and takes them off the queue one at
a time, possibly using asyncio to process them concurrently. If the handler
(the function for the deployment or ``__call__``) is ``async``, the replica
does not wait for the handler to finish before starting the next request;
otherwise, the replica blocks until the handler returns.
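
The difference between ``async`` and regular handlers can be illustrated with a small asyncio sketch. ``toy_replica`` and both handlers are hypothetical names, and the loop only approximates the behavior described above.

```python
import asyncio
import time


async def async_handler(request, log):
    log.append(f"start {request}")
    await asyncio.sleep(0.01)  # yields to the event loop while "working"
    log.append(f"end {request}")


def sync_handler(request, log):
    log.append(f"start {request}")
    time.sleep(0.01)           # blocks the event loop while "working"
    log.append(f"end {request}")


async def toy_replica(handler, requests):
    """Hypothetical replica loop: takes queued requests one at a time."""
    log, tasks = [], []
    for request in requests:
        if asyncio.iscoroutinefunction(handler):
            # Start the handler and move on without waiting for it.
            tasks.append(asyncio.create_task(handler(request, log)))
        else:
            # Nothing else runs until the handler returns.
            handler(request, log)
    await asyncio.gather(*tasks)
    return log


async_log = asyncio.run(toy_replica(async_handler, ["a", "b"]))
sync_log = asyncio.run(toy_replica(sync_handler, ["a", "b"]))
```

In this sketch, ``async_log`` begins with both ``start`` entries because the two handlers overlap, while ``sync_log`` strictly alternates ``start`` and ``end`` for each request.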

FAQ
---

.. _serve-ft-detail:

How does Serve handle fault tolerance?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Application errors, such as exceptions raised in your model evaluation code,
are caught and wrapped. Serve returns a 500 status code along with the
traceback information, and the replica continues to handle requests.
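
As a rough illustration of catching and wrapping an application error, the sketch below turns a handler exception into a ``(status, body)`` pair. ``call_handler`` and ``model`` are hypothetical names, and the tuple is a stand-in for a real HTTP response, not Serve's internal format.

```python
import traceback


def call_handler(handler, request):
    """Hypothetical wrapper: turn handler exceptions into a 500 response."""
    try:
        return 200, handler(request)
    except Exception:
        # The traceback text becomes the response body. The caller (the
        # "replica") keeps running and can serve the next request.
        return 500, traceback.format_exc()


def model(request):
    return 1 / request  # raises ZeroDivisionError when request == 0


bad_status, bad_body = call_handler(model, 0)
ok_status, ok_body = call_handler(model, 4)
```

The failing call produces status 500 with the traceback as its body, and the next call through the same wrapper still succeeds.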

Ray handles machine errors and faults. Serve utilizes the :ref:`actor
reconstruction <actor-fault-tolerance>` capability. For example, when a machine
hosting any of the actors crashes, those actors are automatically restarted on
another available machine. All data in the Controller (routing policies,
deployment configurations, etc.) is checkpointed to Ray. Transient data in the
router and the replica (like network connections and internal request
queues) is lost upon failure.
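
The checkpoint-and-recover idea can be sketched with a toy controller. Everything here is hypothetical: the ``ToyController`` class, the key name, the JSON encoding, and the plain ``dict`` standing in for durable storage are illustrative assumptions, not Serve's actual mechanism.

```python
import json


class ToyController:
    """Hypothetical controller that checkpoints its state on every change."""

    KEY = "serve-controller-checkpoint"  # made-up key name

    def __init__(self, kv_store):
        self.kv = kv_store
        # On (re)start, recover whatever the previous incarnation saved.
        saved = self.kv.get(self.KEY)
        self.deployments = json.loads(saved) if saved else {}

    def deploy(self, name, num_replicas):
        self.deployments[name] = {"num_replicas": num_replicas}
        # Persist the full state before returning.
        self.kv[self.KEY] = json.dumps(self.deployments)


durable_kv = {}  # stands in for storage that outlives the actor
controller = ToyController(durable_kv)
controller.deploy("model", num_replicas=2)
del controller   # simulate the controller crashing
restarted = ToyController(durable_kv)
```

Because the state is written on every change, the restarted controller sees the same deployment configuration. State that is never checkpointed, such as an in-memory request queue, would not survive the restart.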

How does Serve ensure horizontal scalability and availability?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Serve starts one router per node. Each router binds the same port, so you can
reach Serve and send requests to any model via any of these servers.

This architecture ensures horizontal scalability for Serve. You can scale the
router by adding more nodes and scale the model by increasing the number
of replicas.

How do ServeHandles work?
^^^^^^^^^^^^^^^^^^^^^^^^^

:mod:`ServeHandles <ray.serve.handle.RayServeHandle>` wrap a handle to the router actor on the same node. When a
request is sent from one replica to another via the handle, it goes through
the same data path as an incoming HTTP request, so the same deployment
selection and batching procedures apply. ServeHandles are
often used to implement :ref:`model composition <serve-model-composition>`.

What happens to large requests?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Serve utilizes Ray’s :ref:`shared memory object store <plasma-store>` and in-process memory
store. Small request objects are sent directly between actors via a network
call. Larger request objects (100 KiB or more) are written to a distributed shared
memory store, and the replica can read them with a zero-copy read.
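
The size-based dispatch can be sketched as follows. ``ToyObjectStore``, ``send``, and ``receive`` are hypothetical names, and the use of ``pickle`` and ``memoryview`` is only an assumption to make the example runnable. The 100 KiB threshold is the one figure taken from the text.

```python
import pickle

INLINE_LIMIT = 100 * 1024  # 100 KiB, the threshold mentioned above


class ToyObjectStore:
    """Hypothetical stand-in for a shared memory store, keyed by integer id."""

    def __init__(self):
        self._objects = {}

    def put(self, payload):
        ref = len(self._objects)
        self._objects[ref] = payload
        return ref

    def get(self, ref):
        # A memoryview exposes the stored bytes without copying them.
        return memoryview(self._objects[ref])


def send(request, store):
    payload = pickle.dumps(request)
    if len(payload) < INLINE_LIMIT:
        return ("inline", payload)      # small: ship the bytes directly
    return ("ref", store.put(payload))  # large: ship only a reference


def receive(message, store):
    kind, value = message
    data = value if kind == "inline" else store.get(value)
    return pickle.loads(data)


store = ToyObjectStore()
small_msg = send("hi", store)
big_request = b"x" * (200 * 1024)
big_msg = send(big_request, store)
```

Here the small request travels inline, while the 200 KiB request is stored once and only its reference is passed to the receiver.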