[serve] Replace "backend" with "deployment" in metrics & logging (#17434)
parent 05b0da94b7
commit 839ceba6db
18 changed files with 92 additions and 99 deletions

@@ -37,14 +37,14 @@ When an HTTP request is sent to the router, the following things happen:
 - The HTTP request is received and parsed.
 - The correct deployment associated with the HTTP url path is looked up. The
 request is placed on a queue.
-- For each request in a backend queue, an available replica is looked up
+- For each request in a deployment queue, an available replica is looked up
 and the request is sent to it. If there are no available replicas (there
 are more than ``max_concurrent_queries`` requests outstanding), the request
 is left in the queue until an outstanding request is finished.

 Each replica maintains a queue of requests and executes one at a time, possibly
 using asyncio to process them concurrently. If the handler (the function for the
-backend or ``__call__``) is ``async``, the replica will not wait for the
+deployment or ``__call__``) is ``async``, the replica will not wait for the
 handler to run; otherwise, the replica will block until the handler returns.

 FAQ

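Commentary on the hunk above, not part of the diff: the ``async`` handler behaviour it describes can be sketched against the new deployment terminology roughly as follows. The class name and the sleep are invented for illustration; ``max_concurrent_queries`` is the option the hunk refers to.

.. code-block:: python

    import asyncio

    import ray
    from ray import serve

    ray.init()
    serve.start()

    # Hypothetical deployment: because __call__ is async, the replica keeps
    # accepting work, up to max_concurrent_queries outstanding requests,
    # instead of blocking until each handler invocation returns.
    @serve.deployment(max_concurrent_queries=8)
    class SlowEcho:
        async def __call__(self, request):
            await asyncio.sleep(0.1)
            return "done"

    SlowEcho.deploy()
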
@@ -59,7 +59,7 @@ replica will be able to continue to handle requests.
 Machine errors and faults will be handled by Ray. Serve utilizes the :ref:`actor
 reconstruction <actor-fault-tolerance>` capability. For example, when a machine hosting any of the
 actors crashes, those actors will be automatically restarted on another
-available machine. All data in the Controller (routing policies, backend
+available machine. All data in the Controller (routing policies, deployment
 configurations, etc) is checkpointed to Ray. Transient data in the
 router and the replica (like network connections and internal request
 queues) will be lost upon failure.

@@ -81,7 +81,7 @@ How do ServeHandles work?
 :mod:`ServeHandles <ray.serve.handle.RayServeHandle>` wrap a handle to the router actor on the same node. When a
 request is sent from one replica to another via the handle, the
 request goes through the same data path as incoming HTTP requests. This enables
-the same backend selection and batching procedures to happen. ServeHandles are
+the same deployment selection and batching procedures to happen. ServeHandles are
 often used to implement :ref:`model composition <serve-model-composition>`.

@@ -28,7 +28,7 @@ Deploying on a Single Node

 While Ray Serve makes it easy to scale out on a multi-node Ray cluster, in some scenarios a single node may suit your needs.
 There are two ways you can run Ray Serve on a single node, shown below.
-In general, **Option 2 is recommended for most users** because it allows you to fully make use of Serve's ability to dynamically update running backends.
+In general, **Option 2 is recommended for most users** because it allows you to fully make use of Serve's ability to dynamically update running deployments.

 1. Start Ray and deploy with Ray Serve all in a single Python file.

@@ -157,7 +157,7 @@ Now, we just need to start the cluster:
 Session Affinity: None
 Events: <none>

-With the cluster now running, we can run a simple script to start Ray Serve and deploy a "hello world" backend:
+With the cluster now running, we can run a simple script to start Ray Serve and deploy a "hello world" deployment:

 .. code-block:: python

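The script's body lies outside this hunk's context, so it is not shown here. Purely as a sketch, and not necessarily the script the documentation ships, a "hello world" deployment under the deployment API generally looks like this:

.. code-block:: python

    import ray
    from ray import serve

    # Connect to the cluster started above with `ray start --head`.
    ray.init(address="auto")
    serve.start(detached=True)

    @serve.deployment
    def hello(request):
        return "hello world"

    hello.deploy()
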
@@ -219,7 +219,7 @@ Below is an example of what the Ray Dashboard might look like for a Serve deploy
 .. image:: https://raw.githubusercontent.com/ray-project/Images/master/docs/dashboard/serve-dashboard.png
 :align: center

-Here you can see the Serve controller actor, an HTTP proxy actor, and all of the replicas for each Serve backend in the deployment.
+Here you can see the Serve controller actor, an HTTP proxy actor, and all of the replicas for each Serve deployment.
 To learn about the function of the controller and proxy actors, see the `Serve Architecture page <architecture.html>`__.
 In this example pictured above, we have a single-node cluster with a deployment named Counter with ``num_replicas=2``.

@@ -235,18 +235,18 @@ Logging in Ray Serve uses Python's standard logging facility.
 Tracing Backends and Replicas
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-When looking through log files of your Ray Serve application, it is useful to know which backend and replica each log line originated from.
-To automatically include the current backend tag and replica tag in your logs, simply call
-``logger = logging.getLogger("ray")``, and use ``logger`` within your backend code:
+When looking through log files of your Ray Serve application, it is useful to know which deployment and replica each log line originated from.
+To automatically include the current deployment and replica in your logs, simply call
+``logger = logging.getLogger("ray")``, and use ``logger`` within your deployment code:

 .. literalinclude:: ../../../python/ray/serve/examples/doc/snippet_logger.py
 :lines: 1, 9, 11-13, 15-16

-Querying a Serve endpoint with the above backend will produce a log line like the following:
+Querying a Serve endpoint with the above deployment will produce a log line like the following:

 .. code-block:: bash

-(pid=42161) 2021-02-26 11:05:21,709 INFO snippet_logger.py:13 -- Some info! component=serve backend=f replica=f#jZlnUI
+(pid=42161) 2021-02-26 11:05:21,709 INFO snippet_logger.py:13 -- Some info! component=serve deployment=f replica=f#jZlnUI

 To write your own custom logger using Python's ``logging`` package, use the following method:

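Neither ``snippet_logger.py`` nor the custom-logger method mentioned in the last line above is reproduced in this hunk. A minimal sketch of the ``logging.getLogger("ray")`` pattern the hunk describes, using a hypothetical deployment named ``f``, might be:

.. code-block:: python

    import logging

    from ray import serve

    logger = logging.getLogger("ray")

    @serve.deployment
    def f(*args):
        # Lines routed through the "ray" logger are tagged automatically,
        # e.g. "... component=serve deployment=f replica=f#jZlnUI".
        logger.info("Some info!")
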
@@ -319,20 +319,20 @@ Now we are ready to start our Ray Serve deployment. Start a long-running Ray cl
 ray start --head
 serve start

-Now run the following Python script to deploy a basic Serve backend with a Serve backend logger:
+Now run the following Python script to deploy a basic Serve deployment with a Serve deployment logger:

-.. literalinclude:: ../../../python/ray/serve/examples/doc/backend_logger.py
+.. literalinclude:: ../../../python/ray/serve/examples/doc/deployment_logger.py

 Now `install and run Grafana <https://grafana.com/docs/grafana/latest/installation/>`__ and navigate to ``http://localhost:3000``, where you can log in with the default username "admin" and default password "admin".
 On the welcome page, click "Add your first data source" and click "Loki" to add Loki as a data source.

 Now click "Explore" in the left-side panel. You are ready to run some queries!

-To filter all these Ray logs for the ones relevant to our backend, use the following `LogQL <https://grafana.com/docs/loki/latest/logql/>`__ query:
+To filter all these Ray logs for the ones relevant to our deployment, use the following `LogQL <https://grafana.com/docs/loki/latest/logql/>`__ query:

 .. code-block:: shell

-{job="ray"} |= "backend=Counter"
+{job="ray"} |= "deployment=Counter"

 You should see something similar to the following:

@@ -353,18 +353,18 @@ The following metrics are exposed by Ray Serve:

 * - Name
 - Description
-* - ``serve_backend_request_counter``
+* - ``serve_deployment_request_counter``
 - The number of queries that have been processed in this replica.
-* - ``serve_backend_error_counter``
-- The number of exceptions that have occurred in the backend.
-* - ``serve_backend_replica_starts``
+* - ``serve_deployment_error_counter``
+- The number of exceptions that have occurred in the deployment.
+* - ``serve_deployment_replica_starts``
 - The number of times this replica has been restarted due to failure.
-* - ``serve_backend_queuing_latency_ms``
+* - ``serve_deployment_queuing_latency_ms``
 - The latency for queries in the replica's queue waiting to be processed.
-* - ``serve_backend_processing_latency_ms``
+* - ``serve_deployment_processing_latency_ms``
 - The latency for queries to be processed.
 * - ``serve_replica_queued_queries``
-- The current number of queries queued in the backend replicas.
+- The current number of queries queued in the deployment replicas.
 * - ``serve_replica_processing_queries``
 - The current number of queries being processed.
 * - ``serve_num_http_requests``

@@ -373,8 +373,8 @@ The following metrics are exposed by Ray Serve:
 - The number of requests processed by the router.
 * - ``serve_handle_request_counter``
 - The number of requests processed by this ServeHandle.
-* - ``backend_queued_queries``
-- The number of queries for this backend waiting to be assigned to a replica.
+* - ``serve_deployment_queued_queries``
+- The number of queries for this deployment waiting to be assigned to a replica.

 To see this in action, run ``ray start --head --metrics-export-port=8080`` in your terminal, and then run the following script:

@@ -386,12 +386,12 @@ The metrics are updated once every ten seconds, and you will need to refresh the

 For example, after running the script for some time and refreshing ``localhost:8080`` you might see something that looks like::

-ray_serve_backend_processing_latency_ms_count{...,backend="f",...} 99.0
-ray_serve_backend_processing_latency_ms_sum{...,backend="f",...} 99279.30498123169
+ray_serve_deployment_processing_latency_ms_count{...,deployment="f",...} 99.0
+ray_serve_deployment_processing_latency_ms_sum{...,deployment="f",...} 99279.30498123169

 which indicates that the average processing latency is just over one second, as expected.

-You can even define a `custom metric <..ray-metrics.html#custom-metrics>`__ to use in your backend, and tag it with the current backend or replica.
+You can even define a `custom metric <..ray-metrics.html#custom-metrics>`__ to use in your deployment, and tag it with the current deployment or replica.
 Here's an example:

 .. literalinclude:: ../../../python/ray/serve/examples/doc/snippet_custom_metric.py

@@ -77,4 +77,4 @@ Is Ray Serve only for ML models?
 --------------------------------
 Nope! Ray Serve can be used to build any type of Python microservices
 application. You can also use the full power of Ray within your Ray Serve
-programs, so it's easy to run parallel computations within your backends.
+programs, so it's easy to run parallel computations within your deployments.

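As an illustration of that last claim only, and not code taken from the repository: a deployment can fan work out to ordinary Ray tasks. The task and deployment names below are hypothetical.

.. code-block:: python

    import ray
    from ray import serve

    @ray.remote
    def square(x):
        return x * x

    @serve.deployment
    class ParallelSquares:
        def __call__(self, request):
            # Fan out to plain Ray tasks from inside the replica and
            # gather the results before responding.
            refs = [square.remote(i) for i in range(10)]
            return sum(ray.get(refs))
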
@@ -76,7 +76,7 @@ lack of flexibility.
 Ray Serve solves these problems by giving you a simple web server (and the ability to :ref:`use your own <serve-web-server-integration-tutorial>`) while still handling the complex routing, scaling, and testing logic
 necessary for production deployments.

-Beyond scaling up your backends with multiple replicas, Ray Serve also enables:
+Beyond scaling up your deployments with multiple replicas, Ray Serve also enables:

 - :ref:`serve-model-composition`---ability to flexibly compose multiple models and independently scale and update each.
 - :ref:`serve-batching`---built in request batching to help you meet your performance objectives.

@@ -51,7 +51,7 @@ stacking or ensembles.
 To define a higher-level composed model you need to do three things:

 1. Define your underlying models (the ones that you will compose together) as
-Ray Serve backends
+Ray Serve deployments.
 2. Define your composed model, using the handles of the underlying models
 (see the example below).
 3. Define an endpoint representing this composed model and query it!

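The example that step 2 points to is outside this hunk. Purely as a sketch of the handle-based composition pattern under the deployment API, with invented model names and no claim to match the documentation's example:

.. code-block:: python

    import ray
    from ray import serve

    ray.init()
    serve.start()

    @serve.deployment
    def model_a(request):
        return "a"

    @serve.deployment
    def model_b(request):
        return "b"

    @serve.deployment
    class ComposedModel:
        def __init__(self):
            # Async handles to the underlying deployments, which must
            # already be deployed when this replica starts.
            self.a = model_a.get_handle(sync=False)
            self.b = model_b.get_handle(sync=False)

        async def __call__(self, request):
            # With async handles, .remote() is awaited to get an ObjectRef,
            # and the ObjectRef is awaited to get the result.
            ref_a = await self.a.remote(request)
            ref_b = await self.b.remote(request)
            return [await ref_a, await ref_b]

    model_a.deploy()
    model_b.deploy()
    ComposedModel.deploy()
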
@@ -17,7 +17,7 @@ Performance and known benchmarks
 We are continuously benchmarking Ray Serve. The metrics we care about are latency, throughput, and scalability. We can confidently say:

 - Ray Serve’s latency overhead is single digit milliseconds, around 1-2 milliseconds on average.
-- For throughput, Serve achieves about 3-4k queries per second on a single machine (8 cores) using 1 http proxy and 8 backend replicas performing noop requests.
+- For throughput, Serve achieves about 3-4k queries per second on a single machine (8 cores) using 1 http proxy and 8 replicas performing noop requests.
 - It is horizontally scalable so you can add more machines to increase the overall throughput. Ray Serve is built on top of Ray,
 so its scalability is bounded by Ray’s scalability. Please check out Ray’s `scalability envelope <https://github.com/ray-project/ray/blob/master/benchmarks/README.md>`_
 to learn more about the maximum number of nodes and other limitations.

@@ -31,7 +31,7 @@ The performance issue you're most likely to encounter is high latency and/or low

 If you have set up :ref:`monitoring <serve-monitoring>` with Ray and Ray Serve, you will likely observe that
 ``serve_num_router_requests`` is constant while your load increases
-``serve_backend_queuing_latency_ms`` is spiking up as queries queue up in the background
+``serve_deployment_queuing_latency_ms`` is spiking up as queries queue up in the background

 Given the symptom, there are several ways to fix it.

@@ -46,15 +46,14 @@ Async functions
 Are you using ``async def`` in your callable? If you are using asyncio and
 hitting the same queuing issue mentioned above, you might want to increase
 ``max_concurrent_queries``. Serve sets a low number by default so the client gets
-proper backpressure. You can increase the value in the :mod:`backend config <ray.serve.config.BackendConfig>`
-to allow more coroutines running in the same replica.
+proper backpressure. You can increase the value in the Deployment decorator.

 Batching
 ^^^^^^^^
-If your backend can process a batch at a time at a sublinear latency
+If your deployment can process a batch at a time at a sublinear latency
 (for example, if it takes 1ms to process 1 query and 5ms to process 10 of them)
 then batching is your best approach. Check out the :ref:`batching guide <serve-batching>` to
-make your backend accept batches (especially for GPU-based ML inference). You might want to tune your ``max_batch_size`` and ``batch_wait_timeout`` in the ``@serve.batch`` decorator to maximize the benefits:
+make your deployment accept batches (especially for GPU-based ML inference). You might want to tune your ``max_batch_size`` and ``batch_wait_timeout`` in the ``@serve.batch`` decorator to maximize the benefits:

 - ``max_batch_size`` specifies how big the batch should be. Generally,
 we recommend choosing the largest batch size your function can handle

@@ -6,7 +6,7 @@ Batching Tutorial
 In this guide, we will deploy a simple vectorized adder that takes
 a batch of queries and adds them at once. In particular, we show:

-- How to implement and deploy a Ray Serve backend that accepts batches.
+- How to implement and deploy a Ray Serve deployment that accepts batches.
 - How to configure the batch size.
 - How to query the model in Python.

@@ -37,7 +37,7 @@ This function must also be ``async def`` so that you can handle multiple queries
 async def my_batch_handler(self, requests: List):
 pass

-This batch handler can then be called from another ``async def`` method in your backend.
+This batch handler can then be called from another ``async def`` method in your deployment.
 These calls will be batched and executed together, but return an individual result as if
 they were a normal function call:

@@ -62,7 +62,7 @@ they were a normal function call:
 ``batch_wait_timeout_s`` option to ``@serve.batch`` (defaults to 0). Increasing this
 timeout may improve throughput at the cost of latency under low load.

-Let's define a backend that takes in a list of requests, extracts the input value,
+Let's define a deployment that takes in a list of requests, extracts the input value,
 converts them into an array, and uses NumPy to add 1 to each element.

 .. literalinclude:: ../../../../python/ray/serve/examples/doc/tutorial_batch.py

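``tutorial_batch.py`` itself is not reproduced in this diff. A rough sketch of such a vectorized adder, written against the ``@serve.batch`` options discussed above and not necessarily matching the tutorial file, could be:

.. code-block:: python

    from typing import List

    import numpy as np
    from ray import serve

    @serve.deployment
    class BatchAdder:
        @serve.batch(max_batch_size=4)
        async def handle_batch(self, numbers: List[int]):
            # The whole batch is handled in one vectorized NumPy call;
            # one result is returned per query in the batch.
            return (np.array(numbers) + 1).tolist()

        async def __call__(self, request):
            return await self.handle_batch(int(request.query_params["number"]))
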
@@ -90,7 +90,7 @@ What if you want to evaluate a whole batch in Python? Ray Serve allows you to se
 queries via the Python API. A batch of queries can either come from the web server
 or the Python API. Learn more :ref:`here<serve-handle-explainer>`.

-To query the backend via the Python API, we can use ``Deployment.get_handle`` to receive
+To query the deployment via the Python API, we can use ``Deployment.get_handle`` to receive
 a handle to the corresponding deployment. To enqueue a query, you can call
 ``handle.method.remote(data)``. This call returns immediately
 with a :ref:`Ray ObjectRef<ray-object-refs>`. You can call `ray.get` to retrieve

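To make the ``get_handle`` / ``remote`` / ``ray.get`` flow above concrete, here is a hedged sketch with a made-up deployment and method name:

.. code-block:: python

    import ray
    from ray import serve

    ray.init()
    serve.start()

    @serve.deployment
    class Adder:
        def add(self, number: int) -> int:
            return number + 1

    Adder.deploy()

    handle = Adder.get_handle()
    # .remote() enqueues the query and returns a Ray ObjectRef immediately;
    # ray.get() blocks until the result is available.
    ref = handle.add.remote(41)
    print(ray.get(ref))  # 42
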
@@ -33,8 +33,7 @@ The ``__call__`` method will be invoked per request.
 :end-before: __doc_define_servable_end__

 Now that we've defined our services, let's deploy the model to Ray Serve. We will
-define an endpoint for the route representing the digit classifier task, a
-backend correspond the physical implementation, and connect them together.
+define a Serve deployment that will be exposed over an HTTP route.

 .. literalinclude:: ../../../../python/ray/serve/examples/doc/tutorial_pytorch.py
 :start-after: __doc_deploy_begin__

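The deploy step itself sits in the elided ``__doc_deploy_begin__`` block of ``tutorial_pytorch.py``. In general terms only, the new wording above corresponds to replacing the old endpoint/backend pair with a single decorated class; the ``route_prefix`` value and class name here are illustrative, not taken from the tutorial.

.. code-block:: python

    import ray
    from ray import serve

    ray.init()
    serve.start()

    @serve.deployment(route_prefix="/image_predict")
    class ImageModel:
        async def __call__(self, request):
            # Model loading and inference for the tutorial's task would go here.
            return {"status": "ok"}

    ImageModel.deploy()
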
@@ -42,8 +42,7 @@ retrieves the ``request.json()["observation"]`` as input.
 :end-before: __doc_define_servable_end__

 Now that we've defined our services, let's deploy the model to Ray Serve. We will
-define an endpoint for the route representing the ppo model, a
-backend correspond the physical implementation, and connect them together.
+define a Serve deployment that will be exposed over an HTTP route.

 .. literalinclude:: ../../../../python/ray/serve/examples/doc/tutorial_rllib.py
 :start-after: __doc_deploy_begin__

@@ -37,8 +37,7 @@ The ``__call__`` method will be invoked per request.
 :end-before: __doc_define_servable_end__

 Now that we've defined our services, let's deploy the model to Ray Serve. We will
-define an endpoint for the route representing the classifier task, a
-backend correspond the physical implementation, and connect them together.
+define a Serve deployment that will be exposed over an HTTP route.

 .. literalinclude:: ../../../../python/ray/serve/examples/doc/tutorial_sklearn.py
 :start-after: __doc_deploy_begin__

@@ -40,8 +40,7 @@ The ``__call__`` method will be invoked per request.
 :end-before: __doc_define_servable_end__

 Now that we've defined our services, let's deploy the model to Ray Serve. We will
-define an endpoint for the route representing the digit classifier task, a
-backend correspond the physical implementation, and connect them together.
+define a Serve deployment that will be exposed over an HTTP route.

 .. literalinclude:: ../../../../python/ray/serve/examples/doc/tutorial_tensorflow.py
 :start-after: __doc_deploy_begin__

@@ -26,7 +26,7 @@ Here’s a simple FastAPI web server. It uses Huggingface Transformers to auto-g

 .. literalinclude:: ../../../../python/ray/serve/examples/doc/fastapi/fastapi_simple.py

-To scale this up, we define a Ray Serve backend containing our text model and call it from Python using a ServeHandle:
+To scale this up, we define a Ray Serve deployment containing our text model and call it from Python:

 .. literalinclude:: ../../../../python/ray/serve/examples/doc/fastapi/servehandle_fastapi.py

@@ -52,7 +52,7 @@ The terminal should then print the generated text:
 To clean up the Ray cluster, run ``ray stop`` in the terminal.

 .. tip::
-According to the backend configuration parameter ``num_replicas``, Ray Serve will place multiple replicas of your model across multiple CPU cores and multiple machines (provided you have :ref:`started a multi-node Ray cluster <cluster-index>`), which will correspondingly multiply your throughput.
+According to the deployment configuration parameter ``num_replicas``, Ray Serve will place multiple replicas of your model across multiple CPU cores and multiple machines (provided you have :ref:`started a multi-node Ray cluster <cluster-index>`), which will correspondingly multiply your throughput.

 Scaling Up an AIOHTTP Application
 ---------------------------------

@@ -25,7 +25,7 @@ from ray.serve.common import BackendInfo, GoalId
 from ray.serve.config import (BackendConfig, HTTPOptions, ReplicaConfig)
 from ray.serve.constants import (DEFAULT_HTTP_HOST, DEFAULT_HTTP_PORT,
 HTTP_PROXY_TIMEOUT, SERVE_CONTROLLER_NAME)
-from ray.serve.controller import BackendTag, ReplicaTag, ServeController
+from ray.serve.controller import ReplicaTag, ServeController
 from ray.serve.exceptions import RayServeException
 from ray.serve.handle import RayServeHandle, RayServeSyncHandle
 from ray.serve.http_util import (ASGIHTTPSender, make_fastapi_class_based_view)

@@ -60,21 +60,21 @@ def _set_global_client(client):
 @dataclass
 class ReplicaContext:
 """Stores data for Serve API calls from within the user's backend code."""
-backend_tag: BackendTag
+deployment: str
 replica_tag: ReplicaTag
 _internal_controller_name: str
 servable_object: Callable


 def _set_internal_replica_context(
-backend_tag: BackendTag,
+deployment: str,
 replica_tag: ReplicaTag,
 controller_name: str,
 servable_object: Callable,
 ):
 global _INTERNAL_REPLICA_CONTEXT
 _INTERNAL_REPLICA_CONTEXT = ReplicaContext(
-backend_tag, replica_tag, controller_name, servable_object)
+deployment, replica_tag, controller_name, servable_object)


 def _ensure_connected(f: Callable) -> Callable:

@@ -987,19 +987,17 @@ def get_handle(

 @PublicAPI
 def get_replica_context() -> ReplicaContext:
-"""When called from a backend, returns the backend tag and replica tag.
-
-When not called from a backend, returns None.
+"""If called from a deployment, returns the deployment and replica tag.

 A replica tag uniquely identifies a single replica for a Ray Serve
-backend at runtime. Replica tags are of the form
-`<backend tag>#<random letters>`.
+deployment at runtime. Replica tags are of the form
+`<deployment_name>#<random letters>`.

 Raises:
-RayServeException: if not called from within a Ray Serve backend
+RayServeException: if not called from within a Ray Serve deployment.
 Example:
->>> serve.get_replica_context().backend_tag # my_backend
->>> serve.get_replica_context().replica_tag # my_backend#krcwoa
+>>> serve.get_replica_context().deployment # deployment_name
+>>> serve.get_replica_context().replica_tag # deployment_name#krcwoa
 """
 if _INTERNAL_REPLICA_CONTEXT is None:
 raise RayServeException("`serve.get_replica_context()` "

@@ -131,7 +131,7 @@ class RayServeReplica:

 def __init__(self, _callable: Callable, backend_config: BackendConfig,
 is_function: bool, controller_handle: ActorHandle) -> None:
-self.backend_tag = ray.serve.api.get_replica_context().backend_tag
+self.backend_tag = ray.serve.api.get_replica_context().deployment
 self.replica_tag = ray.serve.api.get_replica_context().replica_tag
 self.callable = _callable
 self.is_function = is_function

@@ -141,11 +141,11 @@ class RayServeReplica:
 self.num_ongoing_requests = 0

 self.request_counter = metrics.Counter(
-"serve_backend_request_counter",
+"serve_deployment_request_counter",
 description=("The number of queries that have been "
 "processed in this replica."),
-tag_keys=("backend", ))
-self.request_counter.set_default_tags({"backend": self.backend_tag})
+tag_keys=("deployment", ))
+self.request_counter.set_default_tags({"deployment": self.backend_tag})

 self.loop = asyncio.get_event_loop()
 self.long_poll_client = LongPollClient(

@@ -158,38 +158,38 @@ class RayServeReplica:
 )

 self.error_counter = metrics.Counter(
-"serve_backend_error_counter",
+"serve_deployment_error_counter",
 description=("The number of exceptions that have "
-"occurred in the backend."),
-tag_keys=("backend", ))
-self.error_counter.set_default_tags({"backend": self.backend_tag})
+"occurred in the deployment."),
+tag_keys=("deployment", ))
+self.error_counter.set_default_tags({"deployment": self.backend_tag})

 self.restart_counter = metrics.Counter(
-"serve_backend_replica_starts",
+"serve_deployment_replica_starts",
 description=("The number of times this replica "
 "has been restarted due to failure."),
-tag_keys=("backend", "replica"))
+tag_keys=("deployment", "replica"))
 self.restart_counter.set_default_tags({
-"backend": self.backend_tag,
+"deployment": self.backend_tag,
 "replica": self.replica_tag
 })

 self.processing_latency_tracker = metrics.Histogram(
-"serve_backend_processing_latency_ms",
+"serve_deployment_processing_latency_ms",
 description="The latency for queries to be processed.",
 boundaries=DEFAULT_LATENCY_BUCKET_MS,
-tag_keys=("backend", "replica"))
+tag_keys=("deployment", "replica"))
 self.processing_latency_tracker.set_default_tags({
-"backend": self.backend_tag,
+"deployment": self.backend_tag,
 "replica": self.replica_tag
 })

 self.num_processing_items = metrics.Gauge(
 "serve_replica_processing_queries",
 description="The current number of queries being processed.",
-tag_keys=("backend", "replica"))
+tag_keys=("deployment", "replica"))
 self.num_processing_items.set_default_tags({
-"backend": self.backend_tag,
+"deployment": self.backend_tag,
 "replica": self.replica_tag
 })

@@ -200,7 +200,7 @@ class RayServeReplica:
 handler.setFormatter(
 logging.Formatter(
 handler.formatter._fmt +
-f" component=serve backend={self.backend_tag} "
+f" component=serve deployment={self.backend_tag} "
 f"replica={self.replica_tag}"))

 def get_runner_method(self, request_item: Query) -> Callable:

@@ -9,14 +9,14 @@ serve.start()


 @serve.deployment
-class MyBackend:
+class MyDeployment:
 def __init__(self):
 self.my_counter = metrics.Counter(
 "my_counter",
 description=("The number of excellent requests to this backend."),
-tag_keys=("backend", ))
+tag_keys=("deployment", ))
 self.my_counter.set_default_tags({
-"backend": serve.get_current_backend_tag()
+"deployment": serve.get_current_deployment()
 })

 def call(self, excellent=False):

@@ -24,9 +24,9 @@ class MyBackend:
 self.my_counter.inc()


-MyBackend.deploy()
+MyDeployment.deploy()

-handle = MyBackend.get_handle()
+handle = MyDeployment.get_handle()
 while True:
 ray.get(handle.call.remote(excellent=True))
 time.sleep(1)

@@ -82,13 +82,13 @@ class ReplicaSet:
 self.config_updated_event = asyncio.Event(loop=event_loop)
 self.num_queued_queries = 0
 self.num_queued_queries_gauge = metrics.Gauge(
-"serve_backend_queued_queries",
+"serve_deployment_queued_queries",
 description=(
-"The current number of queries to this backend waiting"
+"The current number of queries to this deployment waiting"
 " to be assigned to a replica."),
-tag_keys=("backend", "endpoint"))
+tag_keys=("deployment", "endpoint"))
 self.num_queued_queries_gauge.set_default_tags({
-"backend": self.backend_tag
+"deployment": self.backend_tag
 })

 self.long_poll_client = LongPollClient(

@@ -33,19 +33,19 @@ def test_serve_metrics(serve_instance):
 # counter
 "num_router_requests_total",
 "num_http_requests_total",
-"backend_queued_queries_total",
-"backend_request_counter_requests_total",
-"backend_worker_starts_restarts_total",
+"deployment_queued_queries_total",
+"deployment_request_counter_requests_total",
+"deployment_worker_starts_restarts_total",
 # histogram
-"backend_processing_latency_ms_bucket",
-"backend_processing_latency_ms_count",
-"backend_processing_latency_ms_sum",
+"deployment_processing_latency_ms_bucket",
+"deployment_processing_latency_ms_count",
+"deployment_processing_latency_ms_sum",
 # gauge
 "replica_processing_queries",
 # handle
 "serve_handle_request_counter",
 # ReplicaSet
-"backend_queued_queries"
+"deployment_queued_queries"
 ]
 for metric in expected_metrics:
 # For the final error round

@@ -63,8 +63,8 @@ def test_serve_metrics(serve_instance):
 verify_metrics()


-def test_backend_logger(serve_instance):
-# Tests that backend tag and replica tag appear in Serve log output.
+def test_deployment_logger(serve_instance):
+# Tests that deployment tag and replica tag appear in Serve log output.
 logger = logging.getLogger("ray")

 @serve.deployment(name="counter")

@@ -83,7 +83,7 @@ def test_backend_logger(serve_instance):

 def counter_log_success():
 s = f.getvalue()
-return "backend" in s and "replica" in s and "count" in s
+return "deployment" in s and "replica" in s and "count" in s

 wait_for_condition(counter_log_success)
