(getting-started)=

# Getting Started

This tutorial will walk you through the process of using Ray Serve to deploy a single model behind HTTP locally.

We'll be using HuggingFace's `SummarizationPipeline` to deploy a model that summarizes text.

:::{tip}
If you have suggestions on how to improve this tutorial, please let us know!
:::

To run this example, you will need to install the following:

```console
$ pip install "ray[serve]" transformers
```

## Example Model

Let's first take a look at how the model works without using Ray Serve. This is the code for the model:

```{literalinclude} local_model.py
:end-before: __local_model_end__
:language: python
:linenos: true
:start-after: __local_model_start__
```
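
For reference, here's a minimal sketch of what `local_model.py` looks like. Treat it as an approximation rather than an exact copy of the included file: the full Apollo 11 article text is abbreviated to a placeholder, and the layout is arranged so that lines 7 and 13 line up with the walkthrough below.

```python
# local_model.py (sketch)
from transformers import pipeline


def summarize(text):
    # Load the t5-small summarization model.
    summarizer = pipeline("summarization", model="t5-small")

    # The pipeline returns a list like [{"summary_text": "..."}].
    summary_list = summarizer(text)

    # Extract the summary string by indexing into the list and dictionary.
    summary = summary_list[0]["summary_text"]

    return summary


# Placeholder: the tutorial uses the full text of a New York Times
# article about the Apollo 11 moon landing here.
article_text = "HOUSTON -- Men have landed and walked on the moon. ..."

print(summarize(article_text))
```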

The Python file, called `local_model.py`, uses the `summarize` function to generate summaries of text.

- The `summarizer` variable on line 7 inside `summarize` points to a callable that uses the `t5-small` model to summarize text.
- When `summarizer` is called on a Python string, it returns the summarized text inside a list of dictionaries formatted as `[{"summary_text": "...", ...}, ...]`.
- `summarize` then extracts the summarized text on line 13 by indexing into the list and dictionary.

The file can be run locally by executing the Python script, which uses the model to summarize an article about the Apollo 11 moon landing[^1].

```console
$ python local_model.py

"two astronauts steered their fragile lunar module safely and smoothly to the
historic landing . the first men to reach the moon -- Armstrong and his
co-pilot, col. Edwin E. Aldrin Jr. of the air force -- brought their ship to
rest on a level, rock-strewn plain ."
```

Keep in mind that the `SummarizationPipeline` is an example machine learning model for this tutorial. You can follow along using arbitrary models in any framework that has a Python API. Check out our tutorials on scikit-learn, PyTorch, and TensorFlow for more info and examples:

- {ref}`serve-sklearn-tutorial`
- {ref}`serve-pytorch-tutorial`
- {ref}`serve-tensorflow-tutorial`

(converting-to-ray-serve-deployment)=

## Converting to a Ray Serve Deployment

This tutorial's goal is to deploy this model using Ray Serve, so it can be scaled up and queried over HTTP. We'll start by converting the above Python function into a Ray Serve deployment that can be launched locally on a laptop.

We start by opening a new Python file. First, we import `ray` and `serve` to use features in Ray Serve such as deployments, which provide HTTP access to our model.

```{literalinclude} model_on_ray_serve.py
:end-before: __import_end__
:language: python
:start-after: __import_start__
```
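
Concretely, the two imports look like this:

```python
import ray
from ray import serve
```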

After these imports, we can include our model code from above. We won't call our `summarize` function just yet, though! We will soon add logic to handle HTTP requests, so the `summarize` function can operate on article text sent via HTTP request.

```{literalinclude} local_model.py
:end-before: __local_model_end__
:language: python
:start-after: __local_model_start__
```

Ray Serve needs to run on top of a Ray cluster, so we connect to a local one. See {ref}`serve-deploy-tutorial` to learn more about starting a Ray Serve instance and deploying to a Ray cluster.

```{literalinclude} model_on_ray_serve.py
:end-before: __start_ray_cluster_end__
:language: python
:start-after: __start_ray_cluster_start__
```
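
As a sketch, connecting is a single call (the `address` parameter is explained next):

```python
# Connect to the running local Ray cluster.
ray.init(address="auto")
```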

The `address` parameter in `ray.init()` connects your Serve script to a running local Ray cluster. Later, we'll discuss how to start a local Ray cluster.

:::{note}
`ray.init()` connects to or starts a single-node Ray cluster on your local machine, which allows you to use all your CPU cores to serve requests in parallel. To start a multi-node cluster, see {ref}`serve-deploy-tutorial`.
:::

Next, we start the Ray Serve runtime:

```{literalinclude} model_on_ray_serve.py
:end-before: __start_serve_end__
:language: python
:start-after: __start_serve_start__
```
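
This is another one-line call; `detached=True` is explained in the note below:

```python
# Start the Ray Serve runtime on the cluster.
serve.start(detached=True)
```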

:::{note}
`detached=True` means Ray Serve will continue running even when the Python script exits. If you would rather stop Ray Serve after the script exits, use `serve.start()` without `detached=True` instead (see {ref}`ray-serve-instance-lifetime` for details).
:::

Now that we have defined our `summarize` function, connected to a Ray cluster, and started the Ray Serve runtime, we can define a function that accepts HTTP requests and routes them to the `summarize` function. We define a function called `router` that takes in a Starlette request object[^2]:

```{literalinclude} model_on_ray_serve.py
:end-before: __router_end__
:language: python
:linenos: true
:start-after: __router_start__
```
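
As a sketch, `router` looks like this (its line numbers match the walkthrough below):

```python
@serve.deployment
def router(request):
    txt = request.query_params["txt"]
    return summarize(txt)
```
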
- In line 1, we add the decorator `@serve.deployment` to the `router` function to turn the function into a Serve `Deployment` object.
- In line 3, `router` uses the `"txt"` query parameter in the request to get the article text to summarize.
- In line 4, it then passes this article text into the `summarize` function and returns the value.

:::{note}
Lines 3 and 4 define our HTTP request schema. The HTTP requests sent to this endpoint must have a `"txt"` query parameter that contains a string. In general, you can accept HTTP data using query parameters or the request body. Additionally, you can add other Serve deployments with different names to create more endpoints that can accept different schemas. For more complex validation, you can also use FastAPI (see {ref}`serve-fastapi-http` for more info).
:::

:::{tip}
This routing function's name doesn't have to be `router`. It can be any function name, as long as HTTP requests use that same name in their path. If you want the function name to be different from the name in the HTTP request, you can add the `name` keyword parameter to the `@serve.deployment` decorator to specify the name sent in the HTTP request.

For example, if the decorator is `@serve.deployment(name="responder")` and the function signature is `def request_manager(request)`, the HTTP request should use `responder`, not `request_manager`. If no `name` is passed into `@serve.deployment`, the request uses the function's name by default. For example, if the decorator is `@serve.deployment` and the function's signature is `def manager(request)`, the HTTP request should use `manager`.
:::
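
For instance, here's a short sketch of the tip's hypothetical `responder` example:

```python
# HTTP requests are routed to /responder, not /request_manager.
@serve.deployment(name="responder")
def request_manager(request):
    txt = request.query_params["txt"]
    return summarize(txt)
```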

Since `@serve.deployment` makes `router` a `Deployment` object, it can be deployed using `router.deploy()`:

```{literalinclude} model_on_ray_serve.py
:end-before: __router_deploy_end__
:language: python
:start-after: __router_deploy_start__
```
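
As a sketch, deployment is one call:

```python
# Deploy router so it can begin serving HTTP requests.
router.deploy()
```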

Once we deploy `router`, we can query the model over HTTP. With that, we can run our model on Ray Serve! Here's the full Ray Serve deployment script that we built for our model:

```{literalinclude} model_on_ray_serve.py
:end-before: __deployment_full_end__
:language: python
:linenos: true
:start-after: __deployment_full_start__
```
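
Putting the pieces above together, a sketch of `model_on_ray_serve.py`:

```python
# model_on_ray_serve.py (sketch)
from transformers import pipeline

import ray
from ray import serve

ray.init(address="auto")
serve.start(detached=True)


def summarize(text):
    summarizer = pipeline("summarization", model="t5-small")
    summary_list = summarizer(text)
    return summary_list[0]["summary_text"]


@serve.deployment
def router(request):
    txt = request.query_params["txt"]
    return summarize(txt)


router.deploy()
```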

To deploy `router`, we first start a local Ray cluster:

```console
$ ray start --head
```

The Ray cluster that this command launches is the same Ray cluster that the Python code connects to using `ray.init(address="auto")`. It is also the same Ray cluster that keeps Ray Serve (and any deployments on it, such as `router`) alive even after the Python script exits, as long as `detached=True` is passed to `serve.start()`.

:::{tip}
To stop the Ray cluster, run the command `ray stop`.
:::

After starting the Ray cluster, we can run the Python file to deploy `router` and begin accepting HTTP requests:

```console
$ python model_on_ray_serve.py
```

## Testing the Ray Serve Deployment

We can now test our model over HTTP. The structure of our HTTP query is:

```
http://127.0.0.1:8000/[Deployment Name]?[Parameter Name-1]=[Parameter Value-1]&[Parameter Name-2]=[Parameter Value-2]&...&[Parameter Name-n]=[Parameter Value-n]
```

Since the cluster is deployed locally in this tutorial, `127.0.0.1:8000` refers to localhost, port 8000. The `[Deployment Name]` refers to either the name of the function that we called `.deploy()` on (in our case, this is `router`), or the `name` keyword parameter's value in `@serve.deployment` (see the Tip under the `router` function definition above for more info).

Each `[Parameter Name]` refers to a field's name in the request's `query_params` dictionary for our deployed function. In our example, the only parameter we need to pass in is `txt`. This parameter is referenced in the `txt = request.query_params["txt"]` line in the `router` function. Each `[Parameter Name]` has a corresponding `[Parameter Value]`. The `[Parameter Value]` for `txt` is a string containing the article text to summarize. We can chain together any number of name-value pairs using the `&` symbol in the request URL.

Now that the `summarize` function is deployed on Ray Serve, we can make HTTP requests to it. Here's a client script that requests a summary of the same article as the original Python script:

```{literalinclude} router_client.py
:end-before: __client_function_end__
:language: python
:start-after: __client_function_start__
```
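
A sketch of `router_client.py`, assuming the `requests` library and abbreviating the article text to a placeholder:

```python
# router_client.py (sketch)
import requests

# Placeholder for the same Apollo 11 article text used in local_model.py.
article_text = "HOUSTON -- Men have landed and walked on the moon. ..."

# Send the article text as the "txt" query parameter.
response = requests.get("http://127.0.0.1:8000/router", params={"txt": article_text})
print(response.text)
```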

We can run this script while the model is deployed to get a response over HTTP:

```console
$ python router_client.py

"two astronauts steered their fragile lunar module safely and smoothly to the
historic landing . the first men to reach the moon -- Armstrong and his
co-pilot, col. Edwin E. Aldrin Jr. of the air force -- brought their ship to
rest on a level, rock-strewn plain ."
```

## Using Classes in the Ray Serve Deployment

Our application is still a bit inefficient, though. In particular, the `summarize` function loads the model on each call when it sets the `summarizer` variable. However, the model never changes, so it would be more efficient to define `summarizer` only once and keep its value in memory instead of reloading it for each HTTP query.

We can achieve this by converting our `summarize` function into a class:

```{literalinclude} summarizer_on_ray_serve.py
:end-before: __deployment_class_end__
:language: python
:linenos: true
:start-after: __deployment_class_start__
```
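
A sketch of `summarizer_on_ray_serve.py`, laid out so that line 13 is the `__init__` method referenced in the walkthrough below:

```python
# summarizer_on_ray_serve.py (sketch)
from transformers import pipeline

import ray
from ray import serve

ray.init(address="auto")
serve.start(detached=True)


@serve.deployment
class Summarizer:
    def __init__(self):
        # Load the model once and keep it in memory.
        self.summarize = pipeline("summarization", model="t5-small")

    def __call__(self, request):
        txt = request.query_params["txt"]
        summary_list = self.summarize(txt)
        return summary_list[0]["summary_text"]


Summarizer.deploy()
```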

In this configuration, we can query the `Summarizer` class directly. The `Summarizer` is initialized once (after calling `Summarizer.deploy()`). In line 13, its `__init__` function loads and stores the model in `self.summarize`. HTTP queries for the `Summarizer` class are routed to its `__call__` method by default, which takes in the Starlette `request` object. The `Summarizer` class can then take the request's `txt` data and call the `self.summarize` function on it without loading the model on each query.

:::{tip}
Instance variables can also store state. For example, to count the number of requests served, a `@serve.deployment` class can define a `self.counter` instance variable in its `__init__` function and set it to 0. When the class is queried, it can increment the `self.counter` variable inside of the function responding to the query. The `self.counter` will then keep track of the number of requests served, as sketched below.
:::
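
A sketch of that pattern, using a hypothetical `Counter` deployment:

```python
@serve.deployment
class Counter:
    def __init__(self):
        # Per-replica request counter.
        self.counter = 0

    def __call__(self, request):
        self.counter += 1
        return {"requests_served": self.counter}
```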

HTTP queries for Ray Serve class deployments follow a similar format to Ray Serve function deployments. Here's an example client script for the `Summarizer` class. Notice that the only difference from the `router`'s client script is that the URL uses the `Summarizer` path instead of `router`.

```{literalinclude} summarizer_client.py
:end-before: __client_class_end__
:language: python
:start-after: __client_class_start__
```
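
A sketch of `summarizer_client.py`; note the `/Summarizer` path:

```python
# summarizer_client.py (sketch)
import requests

article_text = "HOUSTON -- Men have landed and walked on the moon. ..."

response = requests.get(
    "http://127.0.0.1:8000/Summarizer", params={"txt": article_text}
)
print(response.text)
```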

We can deploy the class-based model on Serve without stopping the Ray cluster. However, for the purposes of this tutorial, let's restart the cluster, deploy the model, and query it over HTTP:

```console
$ ray stop
$ ray start --head
$ python summarizer_on_ray_serve.py
$ python summarizer_client.py

"two astronauts steered their fragile lunar module safely and smoothly to the
historic landing . the first men to reach the moon -- Armstrong and his
co-pilot, col. Edwin E. Aldrin Jr. of the air force -- brought their ship to
rest on a level, rock-strewn plain ."
```

## Advanced HTTP Functionality with FastAPI

Now suppose we want to expose additional functionality in our model. In particular, the `summarize` function also has `min_length` and `max_length` parameters. Although we could expose these options as additional parameters in the URL, Ray Serve also allows us to add more route options to the URL itself and handle each route separately.

Because this logic can get complex, Serve integrates with [FastAPI](https://fastapi.tiangolo.com/). This allows us to define a Serve deployment by adding the `@serve.ingress` decorator to a FastAPI app. For more info about FastAPI with Serve, please see {ref}`serve-fastapi-http`.

As an example of FastAPI, here's a modified version of our `Summarizer` class with route options to request a minimum or maximum length of ten words in the summaries:

```{literalinclude} serve_with_fastapi.py
:end-before: __fastapi_end__
:language: python
:linenos: true
:start-after: __fastapi_start__
```
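
A sketch of `serve_with_fastapi.py` along these lines:

```python
# serve_with_fastapi.py (sketch)
from fastapi import FastAPI
from transformers import pipeline

import ray
from ray import serve

app = FastAPI()

ray.init(address="auto")
serve.start(detached=True)


@serve.deployment
@serve.ingress(app)
class Summarizer:
    def __init__(self):
        self.summarize = pipeline("summarization", model="t5-small")

    @app.get("/")
    def get_summary(self, txt: str):
        summary_list = self.summarize(txt)
        return summary_list[0]["summary_text"]

    @app.get("/min10")
    def get_summary_min10(self, txt: str):
        # Request a summary of at least ten words.
        summary_list = self.summarize(txt, min_length=10)
        return summary_list[0]["summary_text"]

    @app.get("/max10")
    def get_summary_max10(self, txt: str):
        # Request a summary of at most ten words.
        summary_list = self.summarize(txt, max_length=10)
        return summary_list[0]["summary_text"]


Summarizer.deploy()
```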

The class now exposes three routes:

- `/Summarizer`: As before, this route takes in article text and returns a summary.
- `/Summarizer/min10`: This route takes in article text and returns a summary with at least 10 words.
- `/Summarizer/max10`: This route takes in article text and returns a summary with at most 10 words.

Notice that `Summarizer`'s methods no longer take in a Starlette `request` object. Instead, they take in the URL's `txt` parameter directly with FastAPI's query parameter feature.

Since we still deploy our model locally, the full URL still uses the localhost IP. This means each of our three routes comes after the `http://127.0.0.1:8000` IP and port address. As an example, we can make requests to the `max10` route using this client script:

```{literalinclude} fastapi_client.py
:end-before: __client_fastapi_end__
:language: python
:start-after: __client_fastapi_start__
```
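
A sketch of `fastapi_client.py`, targeting the `max10` route:

```python
# fastapi_client.py (sketch)
import requests

article_text = "HOUSTON -- Men have landed and walked on the moon. ..."

response = requests.get(
    "http://127.0.0.1:8000/Summarizer/max10", params={"txt": article_text}
)
print(response.text)
```
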
```console
$ ray stop
$ ray start --head
$ python serve_with_fastapi.py
$ python fastapi_client.py

"two astronauts steered their fragile lunar"
```

Congratulations! You just built and deployed a machine learning model on Ray Serve!

## Next Steps

- Dive into the {doc}`key-concepts` to get a deeper understanding of Ray Serve.
- Learn more about how to deploy your Ray Serve application to a multi-node cluster: {ref}`serve-deploy-tutorial`.
- Check out more in-depth tutorials for popular machine learning frameworks: {doc}`tutorials/index`.

[^1]: The article text comes from the New York Times article "Astronauts Land on Plain; Collect Rocks, Plant Flag" archived here.

[^2]: [Starlette](https://www.starlette.io/) is a web server framework used by Ray Serve. Its `Request` class provides a nice interface for incoming HTTP requests.