[Doc] Add Architecture Doc for Ray Serve (#10204)

Simon Mo 2020-08-20 11:40:47 -07:00 committed by GitHub
parent a462ae2747
commit 6b93ad11d0
7 changed files with 105 additions and 1 deletion


@@ -41,6 +41,8 @@ You can experiment with this behavior by running the following code.
except ray.exceptions.RayWorkerError:
print('FAILURE')
.. _actor-fault-tolerance:
Actors
------


@@ -153,6 +153,7 @@ Academic Papers
serve/tutorials/index.rst
serve/deployment.rst
serve/advanced.rst
serve/architecture.rst
serve/package-ref.rst
.. toctree::


@@ -5,6 +5,8 @@ Serialization
Since Ray processes do not share memory space, data transferred between workers and nodes will need to be **serialized** and **deserialized**. Ray uses the `Plasma object store <https://arrow.apache.org/docs/python/plasma.html>`_ to efficiently transfer objects across different processes and different nodes. Numpy arrays in the object store are shared between workers on the same node (zero-copy deserialization).
.. _plasma-store:
Plasma Object Store
-------------------


@@ -188,6 +188,8 @@ The shard key can either be specified via the X-SERVE-SHARD-KEY HTTP header or `
handle = serve.get_handle("api_endpoint")
handle.options(shard_key=session_id).remote(args)
.. _serve-shadow-testing:
Shadow Testing
--------------
@@ -220,6 +222,8 @@ This is demonstrated in the example below, where we create an endpoint serviced
serve.shadow_traffic("shadowed_endpoint", "new_backend_1", 0)
serve.shadow_traffic("shadowed_endpoint", "new_backend_2", 0)
.. _serve-model-composition:
Composing Multiple Models
=========================
Ray Serve supports composing individually scalable models into a single model
@@ -286,7 +290,7 @@ How do I call an endpoint from Python code?
use the following to get a "handle" to that endpoint.
.. code-block:: python
handle = serve.get_handle("api_endpoint")


@@ -0,0 +1,92 @@
Serve Architecture
==================
This document provides an overview of how each component in Serve works.
.. image:: architecture.svg
:align: center
:width: 600px
High Level View
---------------
Serve runs on Ray and utilizes :ref:`Ray actors<actor-guide>`.
There are three kinds of actors that are created to make up a Serve instance:
- Controller: A global actor unique to each :ref:`Serve instance <serve-instance>` that manages
the control plane. The Controller is responsible for creating, updating, and
destroying other actors. Serve API calls like :mod:`serve.create_backend <ray.serve.create_backend>`
and :mod:`serve.create_endpoint <ray.serve.create_endpoint>` make remote calls to the Controller
(a minimal sketch follows this list).
- Router: There is one router per node. Each router is a `Uvicorn <https://www.uvicorn.org/>`_ HTTP
server that accepts incoming requests, forwards them to the worker replicas, and
responds once they are completed.
- Worker Replica: Worker replicas actually execute the code in response to a
request. For example, they may contain an instantiation of an ML model. Each
replica processes individual requests or batches of requests from the routers.
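As a minimal sketch of how these actors are exercised, consider the module-level
API used elsewhere in these docs (``serve.init`` starting the instance is an
assumption of this API version; backend and endpoint names are placeholders):

.. code-block:: python

    from ray import serve

    serve.init()  # start a Serve instance: the Controller and per-node routers

    # The function that worker replicas will run for each request.
    def echo(flask_request):
        return "hello"

    # Both calls below are remote calls to the Controller, which creates
    # and updates the underlying actors.
    serve.create_backend("echo_backend", echo)
    serve.create_endpoint("echo_endpoint", backend="echo_backend", route="/echo")
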
Lifetime of a Request
---------------------
When an HTTP request is sent to the router, the following things happen:
- The HTTP request is received and parsed.
- The correct :ref:`endpoint <serve-endpoint>` associated with the HTTP url path is looked up.
- One or more :ref:`backends <serve-backend>` are selected to handle the request given the :ref:`traffic
splitting <serve-split-traffic>` and :ref:`shadow testing <serve-shadow-testing>` rules (a simplified
sketch of this selection follows the list). The requests for each backend are placed on a queue.
- For each request in a backend queue, an available worker replica is looked up
and the request is sent to it. If there are no available worker replicas (every
replica already has ``max_concurrent_queries`` requests outstanding), the request
is left in the queue until an outstanding request finishes.
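To make the backend selection step concrete, here is a simplified, purely
illustrative sketch of weighted selection under a traffic-splitting policy.
This is not Serve's actual implementation, and all names are hypothetical:

.. code-block:: python

    import random

    # Hypothetical policy: send 90% of traffic to the new backend.
    traffic_policy = {"new_backend": 0.9, "old_backend": 0.1}

    def select_backend(policy):
        # Weighted random choice over the backends in the policy.
        r = random.random()
        cumulative = 0.0
        for backend, weight in policy.items():
            cumulative += weight
            if r < cumulative:
                return backend
        return backend  # guard against floating-point rounding
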
Each worker maintains a queue of requests and processes one batch of requests at
a time. By default the batch size is 1; you can increase the batch size <ref> to
increase throughput. If the handler (the function for the backend or its
``__call__`` method) is ``async``, the worker will not wait for the handler to
finish; otherwise, the worker will block until the handler returns.
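For illustration, a backend can opt into larger batches. This sketch assumes
the ``serve.accept_batch`` decorator and the dict-based backend config of this
version of the API; the names are placeholders:

.. code-block:: python

    from ray import serve

    @serve.accept_batch  # the handler now receives a list of requests
    def batched_handler(flask_requests):
        # Handle the whole batch at once, e.g. one vectorized model call.
        return ["ok"] * len(flask_requests)

    # Allow each replica to pull up to 4 queued requests per batch.
    serve.create_backend("batched_backend", batched_handler,
                         config={"max_batch_size": 4})
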
FAQ
---
How does Serve handle fault tolerance?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Application errors like exceptions in your model evaluation code are caught and
wrapped. A 500 status code will be returned with the traceback information, and
the worker replica will continue to handle requests.
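As a concrete illustration (the backend, endpoint, and route names are
hypothetical, and Serve's default HTTP port 8000 is assumed), an exception
raised in a handler surfaces to the client as a 500 response while the replica
keeps serving:

.. code-block:: python

    import requests
    from ray import serve

    def failing_handler(flask_request):
        raise ValueError("something went wrong during model evaluation")

    serve.create_backend("failing_backend", failing_handler)
    serve.create_endpoint("failing_endpoint", backend="failing_backend",
                          route="/fail")

    response = requests.get("http://127.0.0.1:8000/fail")
    assert response.status_code == 500  # body contains the wrapped traceback

    # The worker replica is still healthy and continues to handle requests.
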
Machine errors and faults are handled by Ray. Serve utilizes the :ref:`actor
reconstruction <actor-fault-tolerance>` capability: when a machine hosting any of the
actors crashes, those actors will be automatically restarted on another
available machine. All data in the Controller (routing policies, backend
configurations, etc.) is checkpointed to Ray. Transient data in the
routers and the worker replicas (like network connections and internal request
queues) will be lost upon failure.
How does Serve ensure horizontal scalability and availability?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Serve starts one router per node. Each router binds to the same port, so you
can reach Serve and send requests to any model via any of the
servers.
This architecture ensures horizontal scalability for Serve. You can scale the
router by adding more nodes and scale the model workers by increasing the number
of replicas.
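For instance, scaling out the workers is just a config change. This is a
sketch assuming the ``serve.update_backend_config`` call and dict-based config
of this API version, applied to a hypothetical backend:

.. code-block:: python

    from ray import serve

    # Ask the Controller for 4 replicas of this backend; it spawns the
    # additional worker actors across the cluster.
    serve.update_backend_config("echo_backend", {"num_replicas": 4})
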
How do ServeHandles work?
^^^^^^^^^^^^^^^^^^^^^^^^^
:mod:`ServeHandles <ray.serve.handle.RayServeHandle>` wrap a handle to the router actor on the same node. When a
request is sent from one worker replica to another via the handle, the
request goes through the same data path as incoming HTTP requests. This enables
the same backend selection and batching procedures to apply. ServeHandles are
often used to implement :ref:`model composition <serve-model-composition>`.
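A minimal composition sketch (the downstream endpoint name is hypothetical):
the handle call below takes the same data path, and is subject to the same
backend selection and batching, as an external HTTP request:

.. code-block:: python

    import ray
    from ray import serve

    def composed_model(flask_request):
        # Get a handle to another endpoint and call into it. The call is
        # routed through the router actor like any incoming request.
        handle = serve.get_handle("downstream_endpoint")
        result = ray.get(handle.remote(flask_request.data))
        return {"downstream_result": result}
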
What happens to large requests?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Serve utilizes Ray's :ref:`shared memory object store <plasma-store>` and in-process memory
store. Small request objects are sent directly between actors via network
calls. Larger request objects (100KiB+) are written to the distributed shared-memory
store, and workers can read them via zero-copy reads.
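For example (this sketches the underlying Ray core behavior, not a
Serve-specific API), a large numpy array put in the object store is shared
rather than copied:

.. code-block:: python

    import numpy as np
    import ray

    large_array = np.zeros((1000, 1000))  # ~8 MB, well above the ~100KiB threshold

    # Stored once in the node's shared-memory object store.
    obj_ref = ray.put(large_array)

    # Workers on the same node read it back with zero-copy deserialization.
    result = ray.get(obj_ref)
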

(Diff suppressed: architecture.svg, image, 70 KiB.)


@@ -278,6 +278,8 @@ opt for launching a Ray Cluster locally. Specify a Ray cluster like we did in :r
To learn more, in general, about Ray Clusters see :ref:`cluster-index`.
.. _serve-instance:
Deploying Multiple Serve Instances on a Single Ray Cluster
----------------------------------------------------------