# You can also use Deployment.options() to change options without redefining
# the class. This is useful for programmatically updating deployments.
SimpleDeployment.options(num_replicas=2).deploy()
```
By default, each call to `.deploy()` will cause a redeployment, even if the underlying code and options didn't change.
This could be detrimental if you have many deployments in a script and only want to update one: if you re-run the script, all of the deployments will be redeployed, not just the one you updated.
To prevent this, you may provide a `version` string for the deployment as a keyword argument in the decorator or `Deployment.options()`.
If provided, the replicas will only be updated when the value of `version` changes; if it is unchanged, the call to `.deploy()` is a no-op.
When a redeployment happens, Serve will perform a rolling update, bringing down at most 20% of the replicas at any given time.
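For example, here's a minimal sketch of version pinning (the deployment body is a placeholder):
```python
from ray import serve

serve.start()

@serve.deployment(version="v1")
def versioned(request):
    return "hello"

versioned.deploy()  # Deploys replicas at version "v1".
versioned.deploy()  # No-op: the version is unchanged.

# Changing the version triggers a rolling update of the replicas.
versioned.options(version="v2").deploy()
```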
(configuring-a-deployment)=
## Configuring a Deployment
There are a number of things you'll likely want to configure for your serving application, such as scaling out a deployment or capping the number of in-flight requests it handles.
All of these options can be specified either in {mod}`@serve.deployment <ray.serve.api.deployment>` or in `Deployment.options()`.
To update the config options for a running deployment, simply redeploy it with the new options set.
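Here's a minimal sketch combining both, assuming `max_concurrent_queries` as the cap on in-flight requests per replica (the function body is a placeholder):
```python
from ray import serve

# Three replicas, each handling at most 10 in-flight requests.
@serve.deployment(num_replicas=3, max_concurrent_queries=10)
def my_model(request):
    return "prediction"

my_model.deploy()

# Update a running deployment by redeploying with new options.
my_model.options(num_replicas=5).deploy()
```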
:::{note}
When Ray Serve scales down and terminates some replica actors, some nodes may become empty, at which point the Ray autoscaler will remove those nodes.
:::
(serve-cpus-gpus)=
### Resource Management (CPUs, GPUs)
To assign hardware resources per replica, you can pass resource requirements to
`ray_actor_options`.
By default, each replica requires one CPU.
To learn about options to pass in, take a look at the [Resources with Actors](actor-resource-guide) guide.
For example, to create a deployment where each replica uses a single GPU, you can do the following (`do_something_with_my_gpu` is a placeholder for your inference code):
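```python
from ray import serve

@serve.deployment(ray_actor_options={"num_gpus": 1})
def gpu_func(*args):
    # `do_something_with_my_gpu` is a placeholder for your GPU inference code.
    return do_something_with_my_gpu()
```
Fractional values (for example, `"num_gpus": 0.5`) let multiple replicas share a single GPU.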
Deep learning frameworks such as PyTorch and TensorFlow often use multithreading when performing inference.
The number of threads they use is controlled by the `OMP_NUM_THREADS` environment variable.
To [avoid contention](omp-num-thread-note), Ray sets `OMP_NUM_THREADS=1` by default because Ray workers and actors use a single CPU by default.
If you *do* want to enable this parallelism in your Serve deployment, set `OMP_NUM_THREADS` to the desired value either when starting Ray or in your function/class definition:
```bash
# On the head node:
OMP_NUM_THREADS=12 ray start --head
# On each worker node:
OMP_NUM_THREADS=12 ray start --address=$HEAD_NODE_ADDRESS
```
```python
import os
from ray import serve

@serve.deployment
class MyDeployment:
    def __init__(self, parallelism):
        # Environment variable values must be strings.
        os.environ["OMP_NUM_THREADS"] = str(parallelism)
        # Download model weights, initialize model, etc.

MyDeployment.deploy(12)  # deploy() forwards init args to __init__.
```
:::{note}
Some other libraries may not respect `OMP_NUM_THREADS` and have their own way to configure parallelism.
For example, if you're using OpenCV, you'll need to manually set the number of threads using `cv2.setNumThreads(num_threads)` (set to 0 to disable multi-threading).
You can check the configuration using `cv2.getNumThreads()` and `cv2.getNumberOfCPUs()`.
:::
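For instance, a minimal sketch of that OpenCV configuration:
```python
import cv2

cv2.setNumThreads(0)  # 0 disables OpenCV's internal multithreading.
print(cv2.getNumThreads(), cv2.getNumberOfCPUs())
```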