ray/doc/source/serve/end_to_end_tutorial.rst

.. _end_to_end_tutorial:

===================
End-to-End Tutorial
===================

By the end of this tutorial you will have learned how to deploy a machine
learning model locally via Ray Serve.

First, install Ray Serve and all of its dependencies by running the following
command in your terminal:

.. code-block:: bash

  $ pip install "ray[serve]"

For this tutorial, we'll use `HuggingFace's SummarizationPipeline <https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.SummarizationPipeline>`_
to access a model that summarizes text.


Example Model
=============

Let's first take a look at how the model works, without using Ray Serve.
This is the code for the model:

.. literalinclude:: ../../../python/ray/serve/examples/doc/e2e_local.py
  :linenos:
  :language: python
  :start-after: __local_model_start__
  :end-before: __local_model_end__

The Python file, called ``local_model.py`` uses the ``summarize`` function to
generate summaries of text.

- The ``summarizer`` variable on line 7 inside ``summarize`` points to a
  function that uses the `t5-small <https://huggingface.co/t5-small>`_
  model to summarize text.
- When ``summarizer`` is called on a Python String, it returns summarized text
  inside a dictionary formatted as ``[{"summary_text": "...", ...}, ...]``.
- ``summarize`` then extracts the summarized text on line 13 by indexing into
  the dictionary.

The file can be run locally by executing the Python script, which uses the
model to summarize an article about the Apollo 11 moon landing [#f1]_.

.. code-block:: bash

  $ python local_model.py

  "two astronauts steered their fragile lunar module safely and smoothly to the
  historic landing . the first men to reach the moon -- Armstrong and his
  co-pilot, col. Edwin E. Aldrin Jr. of the air force -- brought their ship to
  rest on a level, rock-strewn plain ."


Keep in mind that the ``SummarizationPipeline`` is an example machine learning
model for this tutorial. You can follow along using arbitrary models in any
framework that has a Python API. Check out our tutorials on sckit-learn,
PyTorch, and Tensorflow for more info and examples:

- :ref:`serve-sklearn-tutorial`
- :ref:`serve-pytorch-tutorial`
- :ref:`serve-tensorflow-tutorial`

Converting to Ray Serve Deployment
==================================

This tutorial's goal is to deploy this model using Ray Serve, so it can be
scaled up and queried over HTTP. We'll start by converting the above Python
function into a Ray Serve deployment that can be launched locally on a laptop.

We start by opening a new Python file. First, we need to import ``ray`` and
``ray serve``, to use features in Ray Serve such as ``deployments``, which
provide HTTP access to our model.

.. literalinclude:: ../../../python/ray/serve/examples/doc/e2e_deployment.py
    :language: python
    :start-after: __import_start__
    :end-before: __import_end__

After these imports, we can include our model code from above.
We won't call our ``summarize`` function just yet though!
We will soon add logic to handle HTTP requests, so the ``summarize`` function
can operate on article text sent via HTTP request.

.. literalinclude:: ../../../python/ray/serve/examples/doc/e2e_deployment.py
  :language: python
  :start-after: __local_model_start__
  :end-before: __local_model_end__

Ray Serve needs to run on top of a Ray cluster, so we connect to a local one.
See :ref:`serve-deploy-tutorial` to learn more about starting a Ray Serve
instance and deploying to a Ray cluster.

.. literalinclude:: ../../../python/ray/serve/examples/doc/e2e_deployment.py
  :language: python
  :start-after: __start_ray_cluster_start__
  :end-before: __start_ray_cluster_end__

The ``address`` parameter in ``ray.init()`` connects your Serve script to a
running local Ray cluster. Later, we'll discuss how to start a local Ray
cluster.

.. note::

  ``ray.init()`` connects to or starts a single-node Ray cluster on your
  local machine,  which allows you to use all your CPU cores to serve
  requests in parallel. To start a multi-node cluster, see
  :ref:`serve-deploy-tutorial`.

Next, we start the Ray Serve runtime:

.. literalinclude:: ../../../python/ray/serve/examples/doc/e2e_deployment.py
  :language: python
  :start-after: __start_serve_start__
  :end-before: __start_serve_end__

.. note::

  ``detached=True`` means Ray Serve will continue running even when the Python
  script exits. If you would rather stop Ray Serve after the script exits, use
  ``serve.start()`` instead (see :ref:`ray-serve-instance-lifetime` for
  details).

Now that we have defined our ``summarize`` function, connected to a Ray
Cluster, and started the Ray Serve runtime, we can define a function that
accepts HTTP requests and routes them to the ``summarize`` function. We
define a function called ``router`` that takes in a Starlette ``request``
object [#f2]_:

.. literalinclude:: ../../../python/ray/serve/examples/doc/e2e_deployment.py
  :linenos:
  :language: python
  :start-after: __router_start__
  :end-before: __router_end__

- In line 1, we add the decorator ``@serve.deployment``
  to the ``router`` function to turn the function into a Serve ``Deployment``
  object.
- In line 3, ``router`` uses the ``"txt"`` query parameter in the ``request``
  to get the article text to summarize.
- In line 4, it then passes this article text into the ``summarize`` function
  and returns the value.

.. note::

  Lines 3 and 4 define our HTTP request schema. The HTTP requests sent to this
  endpoint must have a ``"txt"`` query parameter that contains a string.
  In general, you can accept HTTP data using query parameters or the
  request body. Additionally, you can add other Serve deployments with
  different names to create more endpoints that can accept different schemas.
  For more complex validation, you can also use FastAPI (see
  :ref:`serve-fastapi-http` for more info).

.. tip::
  This routing function's name doesn't have to be ``router``.
  It can be any function name as long as the corresponding name is present in
  the HTTP request. If you want the function name to be different than the name
  in the HTTP request, you can add the ``name`` keyword parameter to the
  ``@serve.deployment`` decorator to specify the name sent in the HTTP request.

  For example, if the decorator is ``@serve.deployment(name="responder")`` and
  the function signature is ``def request_manager(request)``, the HTTP request
  should use ``responder``, not ``request_manager``. If no ``name`` is passed
  into ``@serve.deployment``, the ``request`` uses the function's name by
  default. For example, if the decorator is ``@serve.deployment`` and the
  function's signature is ``def manager(request)``, the HTTP request should use
  ``manager``.

Since ``@serve.deployment`` makes ``router`` a ``Deployment`` object, it can be
deployed using ``router.deploy()``:

.. literalinclude:: ../../../python/ray/serve/examples/doc/e2e_deployment.py
  :language: python
  :start-after: __router_deploy_start__
  :end-before: __router_deploy_end__

Once we deploy ``router``, we can query the model over HTTP.
With that, we can run our model on Ray Serve!
Here's the full Ray Serve deployment script that we built for our model:

.. literalinclude:: ../../../python/ray/serve/examples/doc/e2e_deployment_full.py
  :linenos:
  :language: python
  :start-after: __deployment_full_start__
  :end-before: __deployment_full_end__

To deploy ``router``, we first start a local Ray cluster:

.. code-block:: bash

  $ ray start --head

The Ray cluster that this command launches is the same Ray cluster that the
Python code connects to using ``ray.init(address="auto", namespace="serve")``.
It is also the same Ray cluster that keeps Ray Serve (and any deployments on
it, such as ``router``) alive even after the Python script exits as long as
``detached=True`` inside ``serve.start()``.

.. tip::
  To stop the Ray cluster, run the command ``ray stop``.

After starting the Ray cluster, we can run the Python file to deploy ``router``
and begin accepting HTTP requests:

.. code-block:: bash

  $ python model_on_ray_serve.py


Testing the Ray Serve Deployment
================================

We can now test our model over HTTP. The structure of our HTTP query is:

``http://127.0.0.1:8000/[Deployment Name]?[Parameter Name-1]=[Parameter Value-1]&[Parameter Name-2]=[Parameter Value-2]&...&[Parameter Name-n]=[Parameter Value-n]``

Since the cluster is deployed locally in this tutorial, the ``127.0.0.1:8000``
refers to a localhost with port 8000. The ``[Deployment Name]`` refers to
either the name of the function that we called ``.deploy()`` on (in our case,
this is ``router``), or the ``name`` keyword parameter's value in
``@serve.deployment`` (see the Tip under the ``router`` function definition
above for more info).

Each ``[Parameter Name]`` refers to a field's name in the
request's ``query_params`` dictionary for our deployed function. In our
example, the only parameter we need to pass in is ``txt``. This parameter is
referenced in the ``txt = request.query_params["txt"]`` line in the ``router``
function. Each [Parameter Name] object has a corresponding [Parameter Value]
object. The ``txt``'s [Parameter Value] is a string containing the article
text to summarize. We can chain together any number of the name-value pairs
using the ``&`` symbol in the request URL.

Now that the ``summarize`` function is deployed on Ray Serve, we can make HTTP
requests to it. Here's a client script that requests a summary from the same
article as the original Python script:

.. literalinclude:: ../../../python/ray/serve/examples/doc/e2e_router_client.py
  :language: python
  :start-after: __client_function_start__
  :end-before: __client_function_end__

We can run this script while the model is deployed to get a response over HTTP:

.. code-block:: bash

  $ python router_client.py

  "two astronauts steered their fragile lunar module safely and smoothly to the
  historic landing . the first men to reach the moon -- Armstrong and his
  co-pilot, col. Edwin E. Aldrin Jr. of the air force -- brought their ship to
  rest on a level, rock-strewn plain ."


Using Classes in the Ray Serve Deployment
=========================================

Our application is still a bit inefficient though. In particular, the
``summarize`` function loads the model on each call when it sets the
``summarizer`` variable. However, the model never changes, so it would be more
efficient to define ``summarizer`` only once and keep its value in memory
instead of reloading it for each HTTP query.

We can achieve this by converting our ``summarize`` function into a class:

.. literalinclude:: ../../../python/ray/serve/examples/doc/e2e_class_deployment.py
  :linenos:
  :language: python
  :start-after: __deployment_class_start__
  :end-before: __deployment_class_end__

In this configuration, we can query the ``Summarizer`` class directly.
The ``Summarizer`` is initialized once (after calling ``Summarizer.deploy()``).
In line 13, its ``__init__`` function loads and stores the model in
``self.summarize``. HTTP queries for the ``Summarizer`` class are routed to its
``__call__`` method by default, which takes in the Starlette ``request``
object. The ``Summarizer`` class can then take the request's ``txt`` data and
call the ``self.summarize`` function on it without loading the model on each
query.

.. tip::
  Instance variables can also store state. For example, to
  count the number of requests served, a ``@serve.deployment`` class can define
  a ``self.counter`` instance variable in its ``__init__`` function and set it
  to 0. When the class is queried, it can increment the ``self.counter``
  variable inside of the function responding to the query. The ``self.counter``
  will keep track of the number of requests served across requests.

HTTP queries for the Ray Serve class deployments follow a similar format to Ray
Serve function deployments. Here's an example client script for the
``Summarizer`` class. Notice that the only difference from the ``router``'s
client script is that the URL uses the ``Summarizer`` path instead of
``router``.

.. literalinclude:: ../../../python/ray/serve/examples/doc/e2e_summarizer_client.py
  :language: python
  :start-after: __client_class_start__
  :end-before: __client_class_end__

We can deploy the class-based model on Serve without stopping the Ray cluster.
However, for the purposes of this tutorial, let's restart the cluster, deploy
the model, and query it over HTTP:

.. code-block:: bash

  $ ray stop
  $ ray start --head
  $ python summarizer_on_ray_serve.py
  $ python summarizer_client.py

  "two astronauts steered their fragile lunar module safely and smoothly to the
  historic landing . the first men to reach the moon -- Armstrong and his
  co-pilot, col. Edwin E. Aldrin Jr. of the air force -- brought their ship to
  rest on a level, rock-strewn plain ."


Adding Functionality with FastAPI
=================================

Now suppose we want to expose additional functionality in our model. In
particular, the ``summarize`` function also has ``min_length`` and
``max_length`` parameters. Although we could expose these options as additional
parameters in URL, Ray Serve also allows us to add more route options to the
URL itself and handle each route separately.

Because this logic can get complex, Serve integrates with
`FastAPI <https://fastapi.tiangolo.com/>`_. This allows us to define a Serve
deployment by adding the ``@serve.ingress`` decorator to a FastAPI app. For
more info about FastAPI with Serve, please see :ref:`serve-fastapi-http`.

As an example of FastAPI, here's a modified version of our ``Summarizer`` class
with route options to request a minimum or maximum length of ten words in the
summaries:

.. literalinclude:: ../../../python/ray/serve/examples/doc/e2e_fastapi_deployment.py
  :linenos:
  :language: python
  :start-after: __fastapi_start__
  :end-before: __fastapi_end__

The class now exposes three routes:

- ``/Summarizer``: As before, this route takes in article text and returns
  a summary.
- ``/Summarizer/min10``: This route takes in article text and returns a
  summary with at least 10 words.
- ``/Summarizer/max10``: This route takes in article text and returns a
  summary with at most 10 words.

Notice that ``Summarizer``'s methods no longer take in a Starlette ``request``
object. Instead, they take in the URL's `txt` parameter directly with FastAPI's
`query parameter <https://fastapi.tiangolo.com/tutorial/query-params/>`_
feature.

Since we still deploy our model locally, the full URL still uses the
localhost IP. This means each of our three routes comes after the
``http://127.0.0.1:8000`` IP and port address. As an example, we can make
requests to the ``max10`` route using this client script:

.. literalinclude:: ../../../python/ray/serve/examples/doc/e2e_fastapi_client.py
  :language: python
  :start-after: __client_fastapi_start__
  :end-before: __client_fastapi_end__

.. code-block:: bash

  $ ray stop
  $ ray start --head
  $ python serve_with_fastapi.py
  $ python fastapi_client.py

  "two astronauts steered their fragile lunar"

Congratulations! You just built and deployed a machine learning model on Ray
Serve! You should now have enough context to dive into the :doc:`core-apis` to
get a deeper understanding of Ray Serve.

To learn more about how to start a multi-node cluster for your Ray Serve
deployments, see :ref:`serve-deploy-tutorial`. For more interesting example
applications, including integrations with popular machine learning frameworks
and Python web servers, be sure to check out :doc:`tutorials/index`.

.. rubric:: Footnotes

.. [#f1] The article text comes from the New York Times article "Astronauts
   Land on Plain; Collect Rocks, Plant Flag" archived
   `here <https://archive.nytimes.com/www.nytimes.com/library/national/science/nasa/072169sci-nasa.html>`_.

.. [#f2] `Starlette <https://www.starlette.io/>`_ is a web server framework
   used by Ray Serve. Its `Request <https://www.starlette.io/requests/>`_ class
   provides a nice interface for incoming HTTP requests.