[serve] Replace "backend" with "deployment" in metrics & logging (#17434)
parent 05b0da94b7
commit 839ceba6db
18 changed files with 92 additions and 99 deletions

@@ -37,14 +37,14 @@ When an HTTP request is sent to the router, the following things happen:
 - The HTTP request is received and parsed.
 - The correct deployment associated with the HTTP url path is looked up. The
 request is placed on a queue.
-- For each request in a backend queue, an available replica is looked up
+- For each request in a deployment queue, an available replica is looked up
 and the request is sent to it. If there are no available replicas (there
 are more than ``max_concurrent_queries`` requests outstanding), the request
 is left in the queue until an outstanding request is finished.

 Each replica maintains a queue of requests and executes one at a time, possibly
 using asyncio to process them concurrently. If the handler (the function for the
-backend or ``__call__``) is ``async``, the replica will not wait for the
+deployment or ``__call__``) is ``async``, the replica will not wait for the
 handler to run; otherwise, the replica will block until the handler returns.

 FAQ

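Commentary on the hunk above, not part of the diff: the ``async`` handler behaviour it describes can be sketched against the new deployment terminology roughly as follows. The class name and the sleep are invented for illustration; ``max_concurrent_queries`` is the option the hunk refers to.

.. code-block:: python

    import asyncio

    import ray
    from ray import serve

    ray.init()
    serve.start()

    # Hypothetical deployment: because __call__ is async, the replica keeps
    # accepting work, up to max_concurrent_queries outstanding requests,
    # instead of blocking until each handler invocation returns.
    @serve.deployment(max_concurrent_queries=8)
    class SlowEcho:
        async def __call__(self, request):
            await asyncio.sleep(0.1)
            return "done"

    SlowEcho.deploy()
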
@@ -59,7 +59,7 @@ replica will be able to continue to handle requests.
 Machine errors and faults will be handled by Ray. Serve utilizes the :ref:`actor
 reconstruction <actor-fault-tolerance>` capability. For example, when a machine hosting any of the
 actors crashes, those actors will be automatically restarted on another
-available machine. All data in the Controller (routing policies, backend
+available machine. All data in the Controller (routing policies, deployment
 configurations, etc) is checkpointed to Ray. Transient data in the
 router and the replica (like network connections and internal request
 queues) will be lost upon failure.

@@ -81,7 +81,7 @@ How do ServeHandles work?
 :mod:`ServeHandles <ray.serve.handle.RayServeHandle>` wrap a handle to the router actor on the same node. When a
 request is sent from one replica to another via the handle, the
 request goes through the same data path as incoming HTTP requests. This enables
-the same backend selection and batching procedures to happen. ServeHandles are
+the same deployment selection and batching procedures to happen. ServeHandles are
 often used to implement :ref:`model composition <serve-model-composition>`.

@@ -28,7 +28,7 @@ Deploying on a Single Node

 While Ray Serve makes it easy to scale out on a multi-node Ray cluster, in some scenarios a single node may suit your needs.
 There are two ways you can run Ray Serve on a single node, shown below.
-In general, **Option 2 is recommended for most users** because it allows you to fully make use of Serve's ability to dynamically update running backends.
+In general, **Option 2 is recommended for most users** because it allows you to fully make use of Serve's ability to dynamically update running deployments.

 1. Start Ray and deploy with Ray Serve all in a single Python file.

@@ -157,7 +157,7 @@ Now, we just need to start the cluster:
 Session Affinity: None
 Events: <none>

-With the cluster now running, we can run a simple script to start Ray Serve and deploy a "hello world" backend:
+With the cluster now running, we can run a simple script to start Ray Serve and deploy a "hello world" deployment:

 .. code-block:: python

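The script's body lies outside this hunk's context, so it is not shown here. Purely as a sketch, and not necessarily the script the documentation ships, a "hello world" deployment under the deployment API generally looks like this:

.. code-block:: python

    import ray
    from ray import serve

    # Connect to the cluster started above with `ray start --head`.
    ray.init(address="auto")
    serve.start(detached=True)

    @serve.deployment
    def hello(request):
        return "hello world"

    hello.deploy()
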
@@ -219,7 +219,7 @@ Below is an example of what the Ray Dashboard might look like for a Serve deploy
 .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/dashboard/serve-dashboard.png
 :align: center

-Here you can see the Serve controller actor, an HTTP proxy actor, and all of the replicas for each Serve backend in the deployment.
+Here you can see the Serve controller actor, an HTTP proxy actor, and all of the replicas for each Serve deployment.
 To learn about the function of the controller and proxy actors, see the `Serve Architecture page <architecture.html>`__.
 In this example pictured above, we have a single-node cluster with a deployment named Counter with ``num_replicas=2``.

@@ -235,18 +235,18 @@ Logging in Ray Serve uses Python's standard logging facility.
 Tracing Backends and Replicas
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-When looking through log files of your Ray Serve application, it is useful to know which backend and replica each log line originated from.
-To automatically include the current backend tag and replica tag in your logs, simply call
-``logger = logging.getLogger("ray")``, and use ``logger`` within your backend code:
+When looking through log files of your Ray Serve application, it is useful to know which deployment and replica each log line originated from.
+To automatically include the current deployment and replica in your logs, simply call
+``logger = logging.getLogger("ray")``, and use ``logger`` within your deployment code:

 .. literalinclude:: ../../../python/ray/serve/examples/doc/snippet_logger.py
 :lines: 1, 9, 11-13, 15-16

-Querying a Serve endpoint with the above backend will produce a log line like the following:
+Querying a Serve endpoint with the above deployment will produce a log line like the following:

 .. code-block:: bash

-(pid=42161) 2021-02-26 11:05:21,709 INFO snippet_logger.py:13 -- Some info! component=serve backend=f replica=f#jZlnUI
+(pid=42161) 2021-02-26 11:05:21,709 INFO snippet_logger.py:13 -- Some info! component=serve deployment=f replica=f#jZlnUI

 To write your own custom logger using Python's ``logging`` package, use the following method:

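Neither ``snippet_logger.py`` nor the custom-logger method mentioned in the last line above is reproduced in this hunk. A minimal sketch of the ``logging.getLogger("ray")`` pattern the hunk describes, using a hypothetical deployment named ``f``, might be:

.. code-block:: python

    import logging

    from ray import serve

    logger = logging.getLogger("ray")

    @serve.deployment
    def f(*args):
        # Lines routed through the "ray" logger are tagged automatically,
        # e.g. "... component=serve deployment=f replica=f#jZlnUI".
        logger.info("Some info!")
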
@@ -319,20 +319,20 @@ Now we are ready to start our Ray Serve deployment. Start a long-running Ray cl
 ray start --head
 serve start

-Now run the following Python script to deploy a basic Serve backend with a Serve backend logger:
+Now run the following Python script to deploy a basic Serve deployment with a Serve deployment logger:

-.. literalinclude:: ../../../python/ray/serve/examples/doc/backend_logger.py
+.. literalinclude:: ../../../python/ray/serve/examples/doc/deployment_logger.py

 Now `install and run Grafana <https://grafana.com/docs/grafana/latest/installation/>`__ and navigate to ``http://localhost:3000``, where you can log in with the default username "admin" and default password "admin".
 On the welcome page, click "Add your first data source" and click "Loki" to add Loki as a data source.

 Now click "Explore" in the left-side panel. You are ready to run some queries!

-To filter all these Ray logs for the ones relevant to our backend, use the following `LogQL <https://grafana.com/docs/loki/latest/logql/>`__ query:
+To filter all these Ray logs for the ones relevant to our deployment, use the following `LogQL <https://grafana.com/docs/loki/latest/logql/>`__ query:

 .. code-block:: shell

-{job="ray"} |= "backend=Counter"
+{job="ray"} |= "deployment=Counter"

 You should see something similar to the following:

@@ -353,18 +353,18 @@ The following metrics are exposed by Ray Serve:

 * - Name
 - Description
-* - ``serve_backend_request_counter``
+* - ``serve_deployment_request_counter``
 - The number of queries that have been processed in this replica.
-* - ``serve_backend_error_counter``
-- The number of exceptions that have occurred in the backend.
-* - ``serve_backend_replica_starts``
+* - ``serve_deployment_error_counter``
+- The number of exceptions that have occurred in the deployment.
+* - ``serve_deployment_replica_starts``
 - The number of times this replica has been restarted due to failure.
-* - ``serve_backend_queuing_latency_ms``
+* - ``serve_deployment_queuing_latency_ms``
 - The latency for queries in the replica's queue waiting to be processed.
-* - ``serve_backend_processing_latency_ms``
+* - ``serve_deployment_processing_latency_ms``
 - The latency for queries to be processed.
 * - ``serve_replica_queued_queries``
-- The current number of queries queued in the backend replicas.
+- The current number of queries queued in the deployment replicas.
 * - ``serve_replica_processing_queries``
 - The current number of queries being processed.
 * - ``serve_num_http_requests``

@@ -373,8 +373,8 @@ The following metrics are exposed by Ray Serve:
 - The number of requests processed by the router.
 * - ``serve_handle_request_counter``
 - The number of requests processed by this ServeHandle.
-* - ``backend_queued_queries``
-- The number of queries for this backend waiting to be assigned to a replica.
+* - ``serve_deployment_queued_queries``
+- The number of queries for this deployment waiting to be assigned to a replica.

 To see this in action, run ``ray start --head --metrics-export-port=8080`` in your terminal, and then run the following script:

@@ -386,12 +386,12 @@ The metrics are updated once every ten seconds, and you will need to refresh the

 For example, after running the script for some time and refreshing ``localhost:8080`` you might see something that looks like::

-ray_serve_backend_processing_latency_ms_count{...,backend="f",...} 99.0
-ray_serve_backend_processing_latency_ms_sum{...,backend="f",...} 99279.30498123169
+ray_serve_deployment_processing_latency_ms_count{...,deployment="f",...} 99.0
+ray_serve_deployment_processing_latency_ms_sum{...,deployment="f",...} 99279.30498123169

 which indicates that the average processing latency is just over one second, as expected.

-You can even define a `custom metric <..ray-metrics.html#custom-metrics>`__ to use in your backend, and tag it with the current backend or replica.
+You can even define a `custom metric <..ray-metrics.html#custom-metrics>`__ to use in your deployment, and tag it with the current deployment or replica.
 Here's an example:

 .. literalinclude:: ../../../python/ray/serve/examples/doc/snippet_custom_metric.py

@@ -77,4 +77,4 @@ Is Ray Serve only for ML models?
 --------------------------------
 Nope! Ray Serve can be used to build any type of Python microservices
 application. You can also use the full power of Ray within your Ray Serve
-programs, so it's easy to run parallel computations within your backends.
+programs, so it's easy to run parallel computations within your deployments.

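As an illustration of that last claim only, and not code taken from the repository: a deployment can fan work out to ordinary Ray tasks. The task and deployment names below are hypothetical.

.. code-block:: python

    import ray
    from ray import serve

    @ray.remote
    def square(x):
        return x * x

    @serve.deployment
    class ParallelSquares:
        def __call__(self, request):
            # Fan out to plain Ray tasks from inside the replica and
            # gather the results before responding.
            refs = [square.remote(i) for i in range(10)]
            return sum(ray.get(refs))
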
@@ -76,7 +76,7 @@ lack of flexibility.
 Ray Serve solves these problems by giving you a simple web server (and the ability to :ref:`use your own <serve-web-server-integration-tutorial>`) while still handling the complex routing, scaling, and testing logic
 necessary for production deployments.

-Beyond scaling up your backends with multiple replicas, Ray Serve also enables:
+Beyond scaling up your deployments with multiple replicas, Ray Serve also enables:

 - :ref:`serve-model-composition`---ability to flexibly compose multiple models and independently scale and update each.
 - :ref:`serve-batching`---built in request batching to help you meet your performance objectives.

@@ -51,7 +51,7 @@ stacking or ensembles.
 To define a higher-level composed model you need to do three things:

 1. Define your underlying models (the ones that you will compose together) as
-Ray Serve backends
+Ray Serve deployments.
 2. Define your composed model, using the handles of the underlying models
 (see the example below).
 3. Define an endpoint representing this composed model and query it!

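The example that step 2 points to is outside this hunk. Purely as a sketch of the handle-based composition pattern under the deployment API, with invented model names and no claim to match the documentation's example:

.. code-block:: python

    import ray
    from ray import serve

    ray.init()
    serve.start()

    @serve.deployment
    def model_a(request):
        return "a"

    @serve.deployment
    def model_b(request):
        return "b"

    @serve.deployment
    class ComposedModel:
        def __init__(self):
            # Async handles to the underlying deployments, which must
            # already be deployed when this replica starts.
            self.a = model_a.get_handle(sync=False)
            self.b = model_b.get_handle(sync=False)

        async def __call__(self, request):
            # With async handles, .remote() is awaited to get an ObjectRef,
            # and the ObjectRef is awaited to get the result.
            ref_a = await self.a.remote(request)
            ref_b = await self.b.remote(request)
            return [await ref_a, await ref_b]

    model_a.deploy()
    model_b.deploy()
    ComposedModel.deploy()
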
@@ -17,7 +17,7 @@ Performance and known benchmarks
 We are continuously benchmarking Ray Serve. The metrics we care about are latency, throughput, and scalability. We can confidently say:

 - Ray Serve’s latency overhead is single digit milliseconds, around 1-2 milliseconds on average.
-- For throughput, Serve achieves about 3-4k queries per second on a single machine (8 cores) using 1 http proxy and 8 backend replicas performing noop requests.
+- For throughput, Serve achieves about 3-4k queries per second on a single machine (8 cores) using 1 http proxy and 8 replicas performing noop requests.
 - It is horizontally scalable so you can add more machines to increase the overall throughput. Ray Serve is built on top of Ray,
 so its scalability is bounded by Ray’s scalability. Please check out Ray’s `scalability envelope <https://github.com/ray-project/ray/blob/master/benchmarks/README.md>`_
 to learn more about the maximum number of nodes and other limitations.

@@ -31,7 +31,7 @@ The performance issue you're most likely to encounter is high latency and/or low

 If you have set up :ref:`monitoring <serve-monitoring>` with Ray and Ray Serve, you will likely observe that
 ``serve_num_router_requests`` is constant while your load increases
-``serve_backend_queuing_latency_ms`` is spiking up as queries queue up in the background
+``serve_deployment_queuing_latency_ms`` is spiking up as queries queue up in the background

 Given the symptom, there are several ways to fix it.

@@ -46,15 +46,14 @@ Async functions
 Are you using ``async def`` in your callable? If you are using asyncio and
 hitting the same queuing issue mentioned above, you might want to increase
 ``max_concurrent_queries``. Serve sets a low number by default so the client gets
-proper backpressure. You can increase the value in the :mod:`backend config <ray.serve.config.BackendConfig>`
-to allow more coroutines running in the same replica.
+proper backpressure. You can increase the value in the Deployment decorator.

 Batching
 ^^^^^^^^
-If your backend can process a batch at a time at a sublinear latency
+If your deployment can process a batch at a time at a sublinear latency
 (for example, if it takes 1ms to process 1 query and 5ms to process 10 of them)
 then batching is your best approach. Check out the :ref:`batching guide <serve-batching>` to
-make your backend accept batches (especially for GPU-based ML inference). You might want to tune your ``max_batch_size`` and ``batch_wait_timeout`` in the ``@serve.batch`` decorator to maximize the benefits:
+make your deployment accept batches (especially for GPU-based ML inference). You might want to tune your ``max_batch_size`` and ``batch_wait_timeout`` in the ``@serve.batch`` decorator to maximize the benefits:

 - ``max_batch_size`` specifies how big the batch should be. Generally,
 we recommend choosing the largest batch size your function can handle

@@ -6,7 +6,7 @@ Batching Tutorial
 In this guide, we will deploy a simple vectorized adder that takes
 a batch of queries and adds them at once. In particular, we show:

-- How to implement and deploy a Ray Serve backend that accepts batches.
+- How to implement and deploy a Ray Serve deployment that accepts batches.
 - How to configure the batch size.
 - How to query the model in Python.

@@ -37,7 +37,7 @@ This function must also be ``async def`` so that you can handle multiple queries
 async def my_batch_handler(self, requests: List):
 pass

-This batch handler can then be called from another ``async def`` method in your backend.
+This batch handler can then be called from another ``async def`` method in your deployment.
 These calls will be batched and executed together, but return an individual result as if
 they were a normal function call:

@@ -62,7 +62,7 @@ they were a normal function call:
 ``batch_wait_timeout_s`` option to ``@serve.batch`` (defaults to 0). Increasing this
 timeout may improve throughput at the cost of latency under low load.

-Let's define a backend that takes in a list of requests, extracts the input value,
+Let's define a deployment that takes in a list of requests, extracts the input value,
 converts them into an array, and uses NumPy to add 1 to each element.

 .. literalinclude:: ../../../../python/ray/serve/examples/doc/tutorial_batch.py

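``tutorial_batch.py`` itself is not reproduced in this diff. A rough sketch of such a vectorized adder, written against the ``@serve.batch`` options discussed above and not necessarily matching the tutorial file, could be:

.. code-block:: python

    from typing import List

    import numpy as np
    from ray import serve

    @serve.deployment
    class BatchAdder:
        @serve.batch(max_batch_size=4)
        async def handle_batch(self, numbers: List[int]):
            # The whole batch is handled in one vectorized NumPy call;
            # one result is returned per query in the batch.
            return (np.array(numbers) + 1).tolist()

        async def __call__(self, request):
            return await self.handle_batch(int(request.query_params["number"]))
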
@@ -90,7 +90,7 @@ What if you want to evaluate a whole batch in Python? Ray Serve allows you to se
 queries via the Python API. A batch of queries can either come from the web server
 or the Python API. Learn more :ref:`here<serve-handle-explainer>`.

-To query the backend via the Python API, we can use ``Deployment.get_handle`` to receive
+To query the deployment via the Python API, we can use ``Deployment.get_handle`` to receive
 a handle to the corresponding deployment. To enqueue a query, you can call
 ``handle.method.remote(data)``. This call returns immediately
 with a :ref:`Ray ObjectRef<ray-object-refs>`. You can call `ray.get` to retrieve

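To make the ``get_handle`` / ``remote`` / ``ray.get`` flow above concrete, here is a hedged sketch with a made-up deployment and method name:

.. code-block:: python

    import ray
    from ray import serve

    ray.init()
    serve.start()

    @serve.deployment
    class Adder:
        def add(self, number: int) -> int:
            return number + 1

    Adder.deploy()

    handle = Adder.get_handle()
    # .remote() enqueues the query and returns a Ray ObjectRef immediately;
    # ray.get() blocks until the result is available.
    ref = handle.add.remote(41)
    print(ray.get(ref))  # 42
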
@@ -33,8 +33,7 @@ The ``__call__`` method will be invoked per request.
 :end-before: __doc_define_servable_end__

 Now that we've defined our services, let's deploy the model to Ray Serve. We will
-define an endpoint for the route representing the digit classifier task, a
-backend correspond the physical implementation, and connect them together.
+define a Serve deployment that will be exposed over an HTTP route.

 .. literalinclude:: ../../../../python/ray/serve/examples/doc/tutorial_pytorch.py
 :start-after: __doc_deploy_begin__

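The deploy step itself sits in the elided ``__doc_deploy_begin__`` block of ``tutorial_pytorch.py``. In general terms only, the new wording above corresponds to replacing the old endpoint/backend pair with a single decorated class; the ``route_prefix`` value and class name here are illustrative, not taken from the tutorial.

.. code-block:: python

    import ray
    from ray import serve

    ray.init()
    serve.start()

    @serve.deployment(route_prefix="/image_predict")
    class ImageModel:
        async def __call__(self, request):
            # Model loading and inference for the tutorial's task would go here.
            return {"status": "ok"}

    ImageModel.deploy()
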
@@ -42,8 +42,7 @@ retrieves the ``request.json()["observation"]`` as input.
 :end-before: __doc_define_servable_end__

 Now that we've defined our services, let's deploy the model to Ray Serve. We will
-define an endpoint for the route representing the ppo model, a
-backend correspond the physical implementation, and connect them together.
+define a Serve deployment that will be exposed over an HTTP route.

 .. literalinclude:: ../../../../python/ray/serve/examples/doc/tutorial_rllib.py
 :start-after: __doc_deploy_begin__

@@ -37,8 +37,7 @@ The ``__call__`` method will be invoked per request.
 :end-before: __doc_define_servable_end__

 Now that we've defined our services, let's deploy the model to Ray Serve. We will
-define an endpoint for the route representing the classifier task, a
-backend correspond the physical implementation, and connect them together.
+define a Serve deployment that will be exposed over an HTTP route.

 .. literalinclude:: ../../../../python/ray/serve/examples/doc/tutorial_sklearn.py
 :start-after: __doc_deploy_begin__

@@ -40,8 +40,7 @@ The ``__call__`` method will be invoked per request.
 :end-before: __doc_define_servable_end__

 Now that we've defined our services, let's deploy the model to Ray Serve. We will
-define an endpoint for the route representing the digit classifier task, a
-backend correspond the physical implementation, and connect them together.
+define a Serve deployment that will be exposed over an HTTP route.

 .. literalinclude:: ../../../../python/ray/serve/examples/doc/tutorial_tensorflow.py
 :start-after: __doc_deploy_begin__

@@ -26,7 +26,7 @@ Here’s a simple FastAPI web server. It uses Huggingface Transformers to auto-g

 .. literalinclude:: ../../../../python/ray/serve/examples/doc/fastapi/fastapi_simple.py

-To scale this up, we define a Ray Serve backend containing our text model and call it from Python using a ServeHandle:
+To scale this up, we define a Ray Serve deployment containing our text model and call it from Python:

 .. literalinclude:: ../../../../python/ray/serve/examples/doc/fastapi/servehandle_fastapi.py

@@ -52,7 +52,7 @@ The terminal should then print the generated text:
 To clean up the Ray cluster, run ``ray stop`` in the terminal.

 .. tip::
-According to the backend configuration parameter ``num_replicas``, Ray Serve will place multiple replicas of your model across multiple CPU cores and multiple machines (provided you have :ref:`started a multi-node Ray cluster <cluster-index>`), which will correspondingly multiply your throughput.
+According to the deployment configuration parameter ``num_replicas``, Ray Serve will place multiple replicas of your model across multiple CPU cores and multiple machines (provided you have :ref:`started a multi-node Ray cluster <cluster-index>`), which will correspondingly multiply your throughput.

 Scaling Up an AIOHTTP Application
 ---------------------------------

@@ -25,7 +25,7 @@ from ray.serve.common import BackendInfo, GoalId
 from ray.serve.config import (BackendConfig, HTTPOptions, ReplicaConfig)
 from ray.serve.constants import (DEFAULT_HTTP_HOST, DEFAULT_HTTP_PORT,
 HTTP_PROXY_TIMEOUT, SERVE_CONTROLLER_NAME)
-from ray.serve.controller import BackendTag, ReplicaTag, ServeController
+from ray.serve.controller import ReplicaTag, ServeController
 from ray.serve.exceptions import RayServeException
 from ray.serve.handle import RayServeHandle, RayServeSyncHandle
 from ray.serve.http_util import (ASGIHTTPSender, make_fastapi_class_based_view)

@@ -60,21 +60,21 @@ def _set_global_client(client):
 @dataclass
 class ReplicaContext:
 """Stores data for Serve API calls from within the user's backend code."""
-backend_tag: BackendTag
+deployment: str
 replica_tag: ReplicaTag
 _internal_controller_name: str
 servable_object: Callable


 def _set_internal_replica_context(
-backend_tag: BackendTag,
+deployment: str,
 replica_tag: ReplicaTag,
 controller_name: str,
 servable_object: Callable,
 ):
 global _INTERNAL_REPLICA_CONTEXT
 _INTERNAL_REPLICA_CONTEXT = ReplicaContext(
-backend_tag, replica_tag, controller_name, servable_object)
+deployment, replica_tag, controller_name, servable_object)


 def _ensure_connected(f: Callable) -> Callable:

@@ -987,19 +987,17 @@ def get_handle(

 @PublicAPI
 def get_replica_context() -> ReplicaContext:
-"""When called from a backend, returns the backend tag and replica tag.
-
-When not called from a backend, returns None.
+"""If called from a deployment, returns the deployment and replica tag.

 A replica tag uniquely identifies a single replica for a Ray Serve
-backend at runtime. Replica tags are of the form
-`<backend tag>#<random letters>`.
+deployment at runtime. Replica tags are of the form
+`<deployment_name>#<random letters>`.

 Raises:
-RayServeException: if not called from within a Ray Serve backend
+RayServeException: if not called from within a Ray Serve deployment.
 Example:
->>> serve.get_replica_context().backend_tag # my_backend
->>> serve.get_replica_context().replica_tag # my_backend#krcwoa
+>>> serve.get_replica_context().deployment # deployment_name
+>>> serve.get_replica_context().replica_tag # deployment_name#krcwoa
 """
 if _INTERNAL_REPLICA_CONTEXT is None:
 raise RayServeException("`serve.get_replica_context()` "

@@ -131,7 +131,7 @@ class RayServeReplica:

 def __init__(self, _callable: Callable, backend_config: BackendConfig,
 is_function: bool, controller_handle: ActorHandle) -> None:
-self.backend_tag = ray.serve.api.get_replica_context().backend_tag
+self.backend_tag = ray.serve.api.get_replica_context().deployment
 self.replica_tag = ray.serve.api.get_replica_context().replica_tag
 self.callable = _callable
 self.is_function = is_function

@@ -141,11 +141,11 @@ class RayServeReplica:
 self.num_ongoing_requests = 0

 self.request_counter = metrics.Counter(
-"serve_backend_request_counter",
+"serve_deployment_request_counter",
 description=("The number of queries that have been "
 "processed in this replica."),
-tag_keys=("backend", ))
-self.request_counter.set_default_tags({"backend": self.backend_tag})
+tag_keys=("deployment", ))
+self.request_counter.set_default_tags({"deployment": self.backend_tag})

 self.loop = asyncio.get_event_loop()
 self.long_poll_client = LongPollClient(

@@ -158,38 +158,38 @@ class RayServeReplica:
 )

 self.error_counter = metrics.Counter(
-"serve_backend_error_counter",
+"serve_deployment_error_counter",
 description=("The number of exceptions that have "
-"occurred in the backend."),
-tag_keys=("backend", ))
-self.error_counter.set_default_tags({"backend": self.backend_tag})
+"occurred in the deployment."),
+tag_keys=("deployment", ))
+self.error_counter.set_default_tags({"deployment": self.backend_tag})

 self.restart_counter = metrics.Counter(
-"serve_backend_replica_starts",
+"serve_deployment_replica_starts",
 description=("The number of times this replica "
 "has been restarted due to failure."),
-tag_keys=("backend", "replica"))
+tag_keys=("deployment", "replica"))
 self.restart_counter.set_default_tags({
-"backend": self.backend_tag,
+"deployment": self.backend_tag,
 "replica": self.replica_tag
 })

 self.processing_latency_tracker = metrics.Histogram(
-"serve_backend_processing_latency_ms",
+"serve_deployment_processing_latency_ms",
 description="The latency for queries to be processed.",
 boundaries=DEFAULT_LATENCY_BUCKET_MS,
-tag_keys=("backend", "replica"))
+tag_keys=("deployment", "replica"))
 self.processing_latency_tracker.set_default_tags({
-"backend": self.backend_tag,
+"deployment": self.backend_tag,
 "replica": self.replica_tag
 })

 self.num_processing_items = metrics.Gauge(
 "serve_replica_processing_queries",
 description="The current number of queries being processed.",
-tag_keys=("backend", "replica"))
+tag_keys=("deployment", "replica"))
 self.num_processing_items.set_default_tags({
-"backend": self.backend_tag,
+"deployment": self.backend_tag,
 "replica": self.replica_tag
 })

@@ -200,7 +200,7 @@ class RayServeReplica:
 handler.setFormatter(
 logging.Formatter(
 handler.formatter._fmt +
-f" component=serve backend={self.backend_tag} "
+f" component=serve deployment={self.backend_tag} "
 f"replica={self.replica_tag}"))

 def get_runner_method(self, request_item: Query) -> Callable:

@@ -9,14 +9,14 @@ serve.start()


 @serve.deployment
-class MyBackend:
+class MyDeployment:
 def __init__(self):
 self.my_counter = metrics.Counter(
 "my_counter",
 description=("The number of excellent requests to this backend."),
-tag_keys=("backend", ))
+tag_keys=("deployment", ))
 self.my_counter.set_default_tags({
-"backend": serve.get_current_backend_tag()
+"deployment": serve.get_current_deployment()
 })

 def call(self, excellent=False):

@@ -24,9 +24,9 @@ class MyBackend:
 self.my_counter.inc()


-MyBackend.deploy()
+MyDeployment.deploy()

-handle = MyBackend.get_handle()
+handle = MyDeployment.get_handle()
 while True:
 ray.get(handle.call.remote(excellent=True))
 time.sleep(1)

@@ -82,13 +82,13 @@ class ReplicaSet:
 self.config_updated_event = asyncio.Event(loop=event_loop)
 self.num_queued_queries = 0
 self.num_queued_queries_gauge = metrics.Gauge(
-"serve_backend_queued_queries",
+"serve_deployment_queued_queries",
 description=(
-"The current number of queries to this backend waiting"
+"The current number of queries to this deployment waiting"
 " to be assigned to a replica."),
-tag_keys=("backend", "endpoint"))
+tag_keys=("deployment", "endpoint"))
 self.num_queued_queries_gauge.set_default_tags({
-"backend": self.backend_tag
+"deployment": self.backend_tag
 })

 self.long_poll_client = LongPollClient(

@@ -33,19 +33,19 @@ def test_serve_metrics(serve_instance):
 # counter
 "num_router_requests_total",
 "num_http_requests_total",
-"backend_queued_queries_total",
-"backend_request_counter_requests_total",
-"backend_worker_starts_restarts_total",
+"deployment_queued_queries_total",
+"deployment_request_counter_requests_total",
+"deployment_worker_starts_restarts_total",
 # histogram
-"backend_processing_latency_ms_bucket",
-"backend_processing_latency_ms_count",
-"backend_processing_latency_ms_sum",
+"deployment_processing_latency_ms_bucket",
+"deployment_processing_latency_ms_count",
+"deployment_processing_latency_ms_sum",
 # gauge
 "replica_processing_queries",
 # handle
 "serve_handle_request_counter",
 # ReplicaSet
-"backend_queued_queries"
+"deployment_queued_queries"
 ]
 for metric in expected_metrics:
 # For the final error round

@@ -63,8 +63,8 @@ def test_serve_metrics(serve_instance):
 verify_metrics()


-def test_backend_logger(serve_instance):
-# Tests that backend tag and replica tag appear in Serve log output.
+def test_deployment_logger(serve_instance):
+# Tests that deployment tag and replica tag appear in Serve log output.
 logger = logging.getLogger("ray")

 @serve.deployment(name="counter")

@@ -83,7 +83,7 @@ def test_backend_logger(serve_instance):

 def counter_log_success():
 s = f.getvalue()
-return "backend" in s and "replica" in s and "count" in s
+return "deployment" in s and "replica" in s and "count" in s

 wait_for_condition(counter_log_success)
