[Doc] Add Architecture Doc for Ray Serve (#10204)
This commit is contained in:
parent
a462ae2747
commit
6b93ad11d0
7 changed files with 105 additions and 1 deletion
@ -41,6 +41,8 @@ You can experiment with this behavior by running the following code.

    except ray.exceptions.RayWorkerError:
        print('FAILURE')

.. _actor-fault-tolerance:

Actors
------
@ -153,6 +153,7 @@ Academic Papers

   serve/tutorials/index.rst
   serve/deployment.rst
   serve/advanced.rst
   serve/architecture.rst
   serve/package-ref.rst

.. toctree::
@ -5,6 +5,8 @@ Serialization

Since Ray processes do not share memory space, data transferred between workers and nodes will need to be **serialized** and **deserialized**. Ray uses the `Plasma object store <https://arrow.apache.org/docs/python/plasma.html>`_ to efficiently transfer objects across different processes and different nodes. Numpy arrays in the object store are shared between workers on the same node (zero-copy deserialization).

.. _plasma-store:

Plasma Object Store
-------------------
@ -188,6 +188,8 @@ The shard key can either be specified via the X-SERVE-SHARD-KEY HTTP header or `

    handle = serve.get_handle("api_endpoint")
    handle.options(shard_key=session_id).remote(args)

.. _serve-shadow-testing:

Shadow Testing
--------------
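A hypothetical client-side sketch of the header-based variant; the route and session ID here are assumptions, only the ``X-SERVE-SHARD-KEY`` header name comes from the docs above:

.. code-block:: python

    import requests

    # Requests carrying the same shard key are consistently routed to the
    # same subset of replicas.
    session_id = "session-abc123"
    response = requests.get(
        "http://127.0.0.1:8000/api",  # hypothetical Serve route
        headers={"X-SERVE-SHARD-KEY": session_id})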
@ -220,6 +222,8 @@ This is demonstrated in the example below, where we create an endpoint serviced

    serve.shadow_traffic("shadowed_endpoint", "new_backend_1", 0)
    serve.shadow_traffic("shadowed_endpoint", "new_backend_2", 0)

.. _serve-model-composition:

Composing Multiple Models
=========================

Ray Serve supports composing individually scalable models into a single model
@ -286,7 +290,7 @@ How do I call an endpoint from Python code?

use the following to get a "handle" to that endpoint.

.. code-block:: python

    handle = serve.get_handle("api_endpoint")
92 doc/source/serve/architecture.rst Normal file

@ -0,0 +1,92 @@
Serve Architecture
==================

This document provides an overview of how each component in Serve works.

.. image:: architecture.svg
    :align: center
    :width: 600px

High Level View
---------------

Serve runs on Ray and utilizes :ref:`Ray actors<actor-guide>`.

There are three kinds of actors that are created to make up a Serve instance:

- Controller: A global actor unique to each :ref:`Serve instance <serve-instance>` that manages
  the control plane. The Controller is responsible for creating, updating, and
  destroying other actors. Serve API calls like :mod:`serve.create_backend <ray.serve.create_backend>`
  and :mod:`serve.create_endpoint <ray.serve.create_endpoint>` make remote calls to the Controller
  (see the sketch after this list).
- Router: There is one router per node. Each router is a `Uvicorn <https://www.uvicorn.org/>`_ HTTP
  server that accepts incoming requests, forwards them to the worker replicas, and
  responds once they are completed.
- Worker Replica: Worker replicas actually execute the code in response to a
  request. For example, they may contain an instantiation of an ML model. Each
  replica processes individual requests or batches of requests from the routers.
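To make this concrete, here is a minimal sketch of driving these actors through the Serve API of this era; the backend and endpoint names are illustrative, and the exact signatures are assumptions based on the calls referenced above:

.. code-block:: python

    from ray import serve

    serve.init()  # starts the Controller and the per-node routers

    def hello(request):
        return "hello"

    # Both calls below are remote calls handled by the Controller, which
    # creates and tracks the corresponding actors.
    serve.create_backend("hello_backend", hello)
    serve.create_endpoint("hello_endpoint", backend="hello_backend", route="/hello")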
Lifetime of a Request
---------------------

When an HTTP request is sent to the router, the following things happen:

- The HTTP request is received and parsed.
- The correct :ref:`endpoint <serve-endpoint>` associated with the HTTP URL path is looked up.
- One or more :ref:`backends <serve-backend>` are selected to handle the request given the :ref:`traffic
  splitting <serve-split-traffic>` and :ref:`shadow testing <serve-shadow-testing>` rules. The requests for each backend
  are placed on a queue.
- For each request in a backend queue, an available worker replica is looked up
  and the request is sent to it. If there are no available worker replicas (there
  are more than ``max_concurrent_queries`` requests outstanding), the request
  is left in the queue until an outstanding request is finished (see the
  configuration sketch after this list).
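As a sketch, the outstanding-request limit is part of the backend configuration; the backend name and the dict-style config below are assumptions based on the Serve docs of this era:

.. code-block:: python

    from ray import serve

    # Allow at most 10 in-flight requests per replica of this backend;
    # additional requests wait in the router-side queue.
    serve.update_backend_config("my_backend", {"max_concurrent_queries": 10})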
Each worker maintains a queue of requests and processes one batch of requests at
a time. By default the batch size is 1; you can increase the batch size to
increase throughput. If the handler (the function for the backend or
``__call__``) is ``async``, the worker will not wait for the handler to run;
otherwise, the worker will block until the handler returns.
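For example, a backend whose handler is ``async`` lets the worker interleave queued requests while the handler awaits I/O. A minimal sketch, where the sleep stands in for real asynchronous work:

.. code-block:: python

    import asyncio

    from ray import serve

    class AsyncBackend:
        async def __call__(self, request):
            # The worker does not block here; while this coroutine awaits,
            # it can start processing other queued requests.
            await asyncio.sleep(0.1)  # stand-in for async I/O, e.g. a DB call
            return "done"

    serve.create_backend("async_backend", AsyncBackend)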
FAQ
---

How does Serve handle fault tolerance?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Application errors like exceptions in your model evaluation code are caught and
wrapped. A 500 status code will be returned with the traceback information. The
worker replica will be able to continue to handle requests.

Machine errors and faults will be handled by Ray. Serve utilizes the :ref:`actor
reconstruction <actor-fault-tolerance>` capability. For example, when a machine hosting any of the
actors crashes, those actors will be automatically restarted on another
available machine. All data in the Controller (routing policies, backend
configurations, etc.) is checkpointed to Ray. Transient data in the
router and the worker replica (like network connections and internal request
queues) will be lost upon failure.
How does Serve ensure horizontal scalability and availability?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Serve starts one router per node. Each router binds the same port. You
should be able to reach Serve and send requests to any models via any of the
servers.

This architecture ensures horizontal scalability for Serve. You can scale the
routers by adding more nodes and scale the model workers by increasing the
number of replicas, as sketched below.
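As a sketch, replica counts are controlled through the backend configuration; the backend name and dict-style config are assumptions:

.. code-block:: python

    from ray import serve

    # Scale the model workers for this backend out to 4 replicas; the
    # Controller creates or destroys worker actors to match.
    serve.update_backend_config("my_backend", {"num_replicas": 4})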
How do ServeHandles work?
^^^^^^^^^^^^^^^^^^^^^^^^^

:mod:`ServeHandles <ray.serve.handle.RayServeHandle>` wrap a handle to the router actor on the same node. When a
request is sent from one worker replica to another via the handle, the
requests go through the same data path as incoming HTTP requests. This enables
the same backend selection and batching procedures to happen. ServeHandles are
often used to implement :ref:`model composition <serve-model-composition>`, as
sketched below.
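A minimal composition sketch under these assumptions; the two upstream endpoints ``model_a`` and ``model_b`` are hypothetical:

.. code-block:: python

    import ray
    from ray import serve

    class ComposedModel:
        def __init__(self):
            # Handles to two hypothetical upstream endpoints.
            self.model_a = serve.get_handle("model_a")
            self.model_b = serve.get_handle("model_b")

        def __call__(self, request):
            # Each handle call goes through the local router, so the usual
            # backend selection and batching rules still apply.
            result_a = ray.get(self.model_a.remote(request))
            result_b = ray.get(self.model_b.remote(request))
            return {"a": result_a, "b": result_b}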
What happens to large requests?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Serve utilizes Ray’s :ref:`shared memory object store <plasma-store>` and in-process memory
store. Small request objects are directly sent between actors via a network
call. Larger request objects (100KiB+) are written to a distributed shared
memory store, and workers can read them via a zero-copy read.
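The zero-copy path can be observed with plain Ray (not Serve-specific); a small sketch of a large numpy array moving through the object store:

.. code-block:: python

    import numpy as np
    import ray

    ray.init()

    # ~4 MiB array: large enough to go through the shared-memory object store.
    big = np.zeros((1024, 1024), dtype=np.float32)
    ref = ray.put(big)

    # Workers on the same node map the buffer directly; the returned array
    # is a read-only, zero-copy view.
    arr = ray.get(ref)
    assert not arr.flags["WRITEABLE"]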
1 doc/source/serve/architecture.svg Normal file

File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 70 KiB
@ -278,6 +278,8 @@ opt for launching a Ray Cluster locally. Specify a Ray cluster like we did in :r

To learn more about Ray Clusters in general, see :ref:`cluster-index`.

.. _serve-instance:

Deploying Multiple Serve Instances on a Single Ray Cluster
----------------------------------------------------------