Serve Architecture
==================

This document provides an overview of how each component in Serve works.

.. Figure source: https://docs.google.com/drawings/d/1jSuBN5dkSj2s9-0eGzlU_ldsRa3TsswQUZM-cMQ29a0/edit?usp=sharing

.. image:: architecture.svg
    :align: center
    :width: 600px

High Level View
---------------

Serve runs on Ray and uses :ref:`Ray actors<actor-guide>`.

Three kinds of actors make up a Serve instance:

- Controller: A global actor unique to each Serve instance that manages
  the control plane. The Controller is responsible for creating, updating, and
  destroying other actors. Serve API calls like :mod:`client.create_backend <ray.serve.api.Client.create_backend>` and
  :mod:`client.create_endpoint <ray.serve.api.Client.create_endpoint>` make remote calls to the Controller.
- Router: There is one router per node. Each router is a `Uvicorn <https://www.uvicorn.org/>`_ HTTP
  server that accepts incoming requests, forwards them to replicas, and
  responds once they are completed.
- Worker Replica: Worker replicas execute the code in response to a
  request. For example, they may contain an instantiation of an ML model. Each
  replica processes individual requests or batches of requests from the routers.

Lifetime of a Request
---------------------

When an HTTP request is sent to the router, the following things happen:

- The HTTP request is received and parsed.
- The endpoint associated with the HTTP URL path is looked up.
- One or more backends are selected to handle the request given the :ref:`traffic
  splitting <serve-split-traffic>` and :ref:`shadow testing <serve-shadow-testing>` rules. The requests for each backend
  are placed on a queue.
- For each request in a backend queue, an available replica is looked up
  and the request is sent to it. If no replica is available (because
  ``max_concurrent_queries`` requests are already outstanding), the request
  is left in the queue until an outstanding request is finished.

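The routing steps above can be sketched with a toy, in-memory model. This is
only an illustration and not Serve's implementation: ``ToyRouter`` and all of
its methods are hypothetical names, the sketch assumes ``max_concurrent_queries``
applies per replica, and shadow testing is left out.

.. code-block:: python

    import random
    from collections import deque


    class ToyRouter:
        """Toy model of the routing steps; not Serve's actual router."""

        def __init__(self, routes, traffic, replicas, max_concurrent_queries, seed=0):
            self.routes = routes    # URL path -> endpoint name
            self.traffic = traffic  # endpoint name -> {backend: weight}
            self.max_concurrent_queries = max_concurrent_queries
            # backend -> {replica name: number of outstanding requests}
            self.outstanding = {
                backend: {name: 0 for name in names}
                for backend, names in replicas.items()
            }
            self.queues = {backend: deque() for backend in replicas}
            self.rng = random.Random(seed)

        def receive(self, path, body):
            """Look up the endpoint, pick a backend by weight, enqueue the request."""
            endpoint = self.routes[path]
            backends = list(self.traffic[endpoint])
            weights = [self.traffic[endpoint][b] for b in backends]
            backend = self.rng.choices(backends, weights=weights)[0]
            self.queues[backend].append(body)
            return backend

        def dispatch(self, backend):
            """Send queued requests to replicas that still have spare capacity."""
            assigned = []
            queue = self.queues[backend]
            while queue:
                free = [
                    name
                    for name, count in self.outstanding[backend].items()
                    if count < self.max_concurrent_queries
                ]
                if not free:
                    break  # leave the rest queued until a request finishes
                self.outstanding[backend][free[0]] += 1
                assigned.append((free[0], queue.popleft()))
            return assigned

        def complete(self, backend, replica):
            """Mark one outstanding request on a replica as finished."""
            self.outstanding[backend][replica] -= 1


    router = ToyRouter(
        routes={"/predict": "predict"},
        traffic={"predict": {"model:v1": 1.0}},
        replicas={"model:v1": ["replica-0", "replica-1"]},
        max_concurrent_queries=1,
    )
    for i in range(3):
        router.receive("/predict", f"request-{i}")

    first = router.dispatch("model:v1")       # two replicas, so two requests go out
    router.complete("model:v1", "replica-0")  # one request finishes
    second = router.dispatch("model:v1")      # the queued request is sent now

With two replicas and a limit of one outstanding request each, the third
request stays in the backend queue until ``complete`` frees a replica.
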
Each replica maintains a queue of requests and processes one batch of requests at
a time. By default the batch size is 1; you can increase the batch size <ref> to
increase throughput. If the handler (the function for the backend or
``__call__``) is ``async``, the replica will not wait for the handler to run;
otherwise, the replica will block until the handler returns.

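The difference between ``async`` and blocking handlers can be illustrated with
plain ``asyncio``. This is a minimal sketch under stated assumptions, not the
replica's real loop: ``replica_loop`` and ``InFlightCounter`` are hypothetical
helpers, and batching is ignored.

.. code-block:: python

    import asyncio
    import inspect


    class InFlightCounter:
        """Tracks how many handler calls are running at the same time."""

        def __init__(self):
            self.current = 0
            self.peak = 0

        def enter(self):
            self.current += 1
            self.peak = max(self.peak, self.current)

        def exit(self):
            self.current -= 1


    async def replica_loop(handler, requests):
        """Toy loop: schedule async handlers, call blocking handlers in place."""
        is_async = inspect.iscoroutinefunction(handler)
        pending = []
        for request in requests:
            if is_async:
                # Schedule the coroutine and move on to the next request.
                pending.append(asyncio.ensure_future(handler(request)))
            else:
                # A regular function blocks the loop until it returns.
                pending.append(handler(request))
        if is_async:
            return list(await asyncio.gather(*pending))
        return pending


    async_counter = InFlightCounter()
    sync_counter = InFlightCounter()


    async def async_handler(request):
        async_counter.enter()
        await asyncio.sleep(0.01)  # stands in for awaiting I/O
        async_counter.exit()
        return request * 2


    def sync_handler(request):
        sync_counter.enter()
        sync_counter.exit()
        return request * 2


    async_results = asyncio.run(replica_loop(async_handler, [1, 2, 3]))
    sync_results = asyncio.run(replica_loop(sync_handler, [1, 2, 3]))

Both runs return the same results, but ``async_counter.peak`` shows all three
calls in flight at once, while ``sync_counter.peak`` never exceeds one.
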
FAQ
---

How does Serve handle fault tolerance?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Application errors, such as exceptions in your model evaluation code, are caught and
wrapped. A 500 status code is returned with the traceback information. The
replica can continue to handle requests.

Machine errors and faults are handled by Ray. Serve utilizes the
:ref:`actor reconstruction <actor-fault-tolerance>` capability. For example, when a machine hosting any of the
actors crashes, those actors will be automatically restarted on another
available machine. All data in the Controller (routing policies, backend
configurations, etc.) is checkpointed to Ray. Transient data in the
router and the replica (like network connections and internal request
queues) will be lost upon failure.

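The handling of application errors described above can be sketched as follows.
This is a simplified illustration, not Serve's code: ``call_handler`` and
``model`` are hypothetical names, and the ``(status, body)`` tuple is an
assumed stand-in for an HTTP response.

.. code-block:: python

    import traceback


    def call_handler(handler, request):
        """Run a handler and turn any exception into a 500 with the traceback."""
        try:
            return 200, handler(request)
        except Exception:
            # The error is wrapped rather than propagated, so the caller
            # (the replica in this toy model) keeps running.
            return 500, traceback.format_exc()


    def model(request):
        if request < 0:
            raise ValueError("negative input")
        return request + 1


    failed = call_handler(model, -1)    # status 500, body holds the traceback
    succeeded = call_handler(model, 1)  # the next request is served normally
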
How does Serve ensure horizontal scalability and availability?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Serve starts one router per node. Each router binds to the same port. You
should be able to reach Serve and send requests to any model via any of the
servers.

This architecture ensures horizontal scalability for Serve. You can scale the
routers by adding more nodes and scale a model by increasing its number
of replicas.

How do ServeHandles work?
^^^^^^^^^^^^^^^^^^^^^^^^^

:mod:`ServeHandles <ray.serve.handle.RayServeHandle>` wrap a handle to the router actor on the same node. When a
request is sent from one replica to another via the handle, it
goes through the same data path as incoming HTTP requests. This enables
the same backend selection and batching procedures to happen. ServeHandles are
often used to implement :ref:`model composition <serve-model-composition>`.

What happens to large requests?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Serve utilizes Ray’s :ref:`shared memory object store <plasma-store>` and in-process memory
store. Small request objects are sent directly between actors via a network
call. Larger request objects (100 KiB+) are written to a distributed shared
memory store, and the replica can read them via a zero-copy read.

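The size-based split can be pictured with a small stand-in for the object
store. Everything here is hypothetical and only meant as an illustration:
``ToySharedStore``, ``send``, and ``receive`` are not Ray APIs, the exact
cutoff comparison is an assumption, and ``memoryview`` merely mimics a
zero-copy read within a single process.

.. code-block:: python

    import itertools

    INLINE_LIMIT_BYTES = 100 * 1024  # assumed cutoff, taken from "100 KiB+"


    class ToySharedStore:
        """In-process stand-in for a shared memory object store."""

        def __init__(self):
            self._objects = {}
            self._ids = itertools.count()

        def put(self, payload):
            object_id = next(self._ids)
            self._objects[object_id] = payload
            return object_id

        def get(self, object_id):
            # A memoryview exposes the stored buffer without copying it.
            return memoryview(self._objects[object_id])


    def send(payload, store):
        """Send small payloads inline and large ones as a store reference."""
        if len(payload) >= INLINE_LIMIT_BYTES:
            return ("ref", store.put(payload))
        return ("inline", payload)


    def receive(message, store):
        """Return the payload, reading from the store when given a reference."""
        kind, value = message
        return store.get(value) if kind == "ref" else value


    store = ToySharedStore()
    small = send(b"x" * 10, store)
    large_payload = b"x" * (200 * 1024)
    large = send(large_payload, store)
    view = receive(large, store)

Here ``small`` carries its bytes inline, while ``large`` carries only an
identifier and ``view`` reads the original buffer in place.
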