[Doc] Add Architecture Doc for Ray Serve (#10204)

Simon Mo 2020-08-20 11:40:47 -07:00 committed by GitHub
parent a462ae2747
commit 6b93ad11d0
7 changed files with 105 additions and 1 deletion


@@ -41,6 +41,8 @@ You can experiment with this behavior by running the following code.
except ray.exceptions.RayWorkerError:
print('FAILURE')
.. _actor-fault-tolerance:
Actors
------


@@ -153,6 +153,7 @@ Academic Papers
serve/tutorials/index.rst
serve/deployment.rst
serve/advanced.rst
serve/architecture.rst
serve/package-ref.rst
.. toctree::


@@ -5,6 +5,8 @@ Serialization
Since Ray processes do not share memory space, data transferred between workers and nodes will need to be **serialized** and **deserialized**. Ray uses the `Plasma object store <https://arrow.apache.org/docs/python/plasma.html>`_ to efficiently transfer objects across different processes and different nodes. Numpy arrays in the object store are shared between workers on the same node (zero-copy deserialization).
.. _plasma-store:
Plasma Object Store
-------------------


@@ -188,6 +188,8 @@ The shard key can either be specified via the X-SERVE-SHARD-KEY HTTP header or `
handle = serve.get_handle("api_endpoint")
handle.options(shard_key=session_id).remote(args)
.. _serve-shadow-testing:
Shadow Testing
--------------
@@ -220,6 +222,8 @@ This is demonstrated in the example below, where we create an endpoint serviced
serve.shadow_traffic("shadowed_endpoint", "new_backend_1", 0)
serve.shadow_traffic("shadowed_endpoint", "new_backend_2", 0)
.. _serve-model-composition:
Composing Multiple Models
=========================
Ray Serve supports composing individually scalable models into a single model
@@ -286,7 +290,7 @@ How do I call an endpoint from Python code?
use the following to get a "handle" to that endpoint.
.. code-block:: python
handle = serve.get_handle("api_endpoint")


@@ -0,0 +1,92 @@
Serve Architecture
==================
This document provides an overview of how each component in Serve works.
.. image:: architecture.svg
:align: center
:width: 600px
High Level View
---------------
Serve runs on Ray and utilizes :ref:`Ray actors<actor-guide>`.
There are three kinds of actors that are created to make up a Serve instance:
- Controller: A global actor unique to each :ref:`Serve instance <serve-instance>` that manages
the control plane. The Controller is responsible for creating, updating, and
destroying other actors. Serve API calls like :mod:`serve.create_backend <ray.serve.create_backend>`
and :mod:`serve.create_endpoint <ray.serve.create_endpoint>` make remote calls to the Controller
(a minimal sketch follows this list).
- Router: There is one router per node. Each router is a `Uvicorn <https://www.uvicorn.org/>`_ HTTP
server that accepts incoming requests, forwards them to the worker replicas, and
responds once they are completed.
- Worker Replica: Worker replicas actually execute the code in response to a
request. For example, they may contain an instantiation of an ML model. Each
replica processes individual requests or batches of requests from the routers.
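As a minimal sketch of how these actors are exercised, consider the module-level
API used elsewhere in these docs (``serve.init`` starting the instance is an
assumption of this API version; backend and endpoint names are placeholders):

.. code-block:: python

    from ray import serve

    serve.init()  # start a Serve instance: the Controller and per-node routers

    # The function that worker replicas will run for each request.
    def echo(flask_request):
        return "hello"

    # Both calls below are remote calls to the Controller, which creates
    # and updates the underlying actors.
    serve.create_backend("echo_backend", echo)
    serve.create_endpoint("echo_endpoint", backend="echo_backend", route="/echo")
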
Lifetime of a Request
---------------------
When an HTTP request is sent to the router, the following things happen:
- The HTTP request is received and parsed.
- The correct :ref:`endpoint <serve-endpoint>` associated with the HTTP url path is looked up.
- One or more :ref:`backends <serve-backend>` are selected to handle the request given the :ref:`traffic
splitting <serve-split-traffic>` and :ref:`shadow testing <serve-shadow-testing>` rules (a simplified
sketch of this selection follows the list). The requests for each backend are placed on a queue.
- For each request in a backend queue, an available worker replica is looked up
and the request is sent to it. If there are no available worker replicas (every
replica already has ``max_concurrent_queries`` requests outstanding), the request
is left in the queue until an outstanding request finishes.
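To make the backend selection step concrete, here is a simplified, purely
illustrative sketch of weighted selection under a traffic-splitting policy.
This is not Serve's actual implementation, and all names are hypothetical:

.. code-block:: python

    import random

    # Hypothetical policy: send 90% of traffic to the new backend.
    traffic_policy = {"new_backend": 0.9, "old_backend": 0.1}

    def select_backend(policy):
        # Weighted random choice over the backends in the policy.
        r = random.random()
        cumulative = 0.0
        for backend, weight in policy.items():
            cumulative += weight
            if r < cumulative:
                return backend
        return backend  # guard against floating-point rounding
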
Each worker maintains a queue of requests and processes one batch of requests at
a time. By default the batch size is 1; you can increase the batch size <ref> to
increase throughput. If the handler (the function for the backend or its
``__call__`` method) is ``async``, the worker will not wait for the handler to
finish; otherwise, the worker will block until the handler returns.
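For illustration, a backend can opt into larger batches. This sketch assumes
the ``serve.accept_batch`` decorator and the dict-based backend config of this
version of the API; the names are placeholders:

.. code-block:: python

    from ray import serve

    @serve.accept_batch  # the handler now receives a list of requests
    def batched_handler(flask_requests):
        # Handle the whole batch at once, e.g. one vectorized model call.
        return ["ok"] * len(flask_requests)

    # Allow each replica to pull up to 4 queued requests per batch.
    serve.create_backend("batched_backend", batched_handler,
                         config={"max_batch_size": 4})
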
FAQ
---
How does Serve handle fault tolerance?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Application errors like exceptions in your model evaluation code are caught and
wrapped. A 500 status code will be returned with the traceback information, and
the worker replica will continue to handle requests.
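As a concrete illustration (the backend, endpoint, and route names are
hypothetical, and Serve's default HTTP port 8000 is assumed), an exception
raised in a handler surfaces to the client as a 500 response while the replica
keeps serving:

.. code-block:: python

    import requests
    from ray import serve

    def failing_handler(flask_request):
        raise ValueError("something went wrong during model evaluation")

    serve.create_backend("failing_backend", failing_handler)
    serve.create_endpoint("failing_endpoint", backend="failing_backend",
                          route="/fail")

    response = requests.get("http://127.0.0.1:8000/fail")
    assert response.status_code == 500  # body contains the wrapped traceback

    # The worker replica is still healthy and continues to handle requests.
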
Machine errors and faults are handled by Ray. Serve utilizes the :ref:`actor
reconstruction <actor-fault-tolerance>` capability: when a machine hosting any of the
actors crashes, those actors will be automatically restarted on another
available machine. All data in the Controller (routing policies, backend
configurations, etc.) is checkpointed to Ray. Transient data in the
routers and the worker replicas (like network connections and internal request
queues) will be lost upon failure.
How does Serve ensure horizontal scalability and availability?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Serve starts one router per node. Each router binds to the same port, so you
can reach Serve and send requests to any model via any of the
servers.
This architecture ensures horizontal scalability for Serve. You can scale the
router by adding more nodes and scale the model workers by increasing the number
of replicas.
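For instance, scaling out the workers is just a config change. This is a
sketch assuming the ``serve.update_backend_config`` call and dict-based config
of this API version, applied to a hypothetical backend:

.. code-block:: python

    from ray import serve

    # Ask the Controller for 4 replicas of this backend; it spawns the
    # additional worker actors across the cluster.
    serve.update_backend_config("echo_backend", {"num_replicas": 4})
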
How do ServeHandles work?
^^^^^^^^^^^^^^^^^^^^^^^^^
:mod:`ServeHandles <ray.serve.handle.RayServeHandle>` wrap a handle to the router actor on the same node. When a
request is sent from one worker replica to another via the handle, the
request goes through the same data path as incoming HTTP requests. This enables
the same backend selection and batching procedures to apply. ServeHandles are
often used to implement :ref:`model composition <serve-model-composition>`.
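A minimal composition sketch (the downstream endpoint name is hypothetical):
the handle call below takes the same data path, and is subject to the same
backend selection and batching, as an external HTTP request:

.. code-block:: python

    import ray
    from ray import serve

    def composed_model(flask_request):
        # Get a handle to another endpoint and call into it. The call is
        # routed through the router actor like any incoming request.
        handle = serve.get_handle("downstream_endpoint")
        result = ray.get(handle.remote(flask_request.data))
        return {"downstream_result": result}
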
What happens to large requests?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Serve utilizes Ray's :ref:`shared memory object store <plasma-store>` and in-process memory
store. Small request objects are sent directly between actors via network
calls. Larger request objects (100KiB+) are written to the distributed shared-memory
store, and workers can read them via zero-copy reads.
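For example (this sketches the underlying Ray core behavior, not a
Serve-specific API), a large numpy array put in the object store is shared
rather than copied:

.. code-block:: python

    import numpy as np
    import ray

    large_array = np.zeros((1000, 1000))  # ~8 MB, well above the ~100KiB threshold

    # Stored once in the node's shared-memory object store.
    obj_ref = ray.put(large_array)

    # Workers on the same node read it back with zero-copy deserialization.
    result = ray.get(obj_ref)
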

(Diff suppressed: architecture.svg, image, 70 KiB.)


@@ -278,6 +278,8 @@ opt for launching a Ray Cluster locally. Specify a Ray cluster like we did in :r
To learn more, in general, about Ray Clusters see :ref:`cluster-index`.
.. _serve-instance:
Deploying Multiple Serve Instances on a Single Ray Cluster
----------------------------------------------------------