mirror of
https://github.com/vale981/ray
synced 2025-03-06 18:41:40 -05:00

* Install start_ray and stop_ray scripts in setup.py. * Update documentation. * Fix docker tests. * Implement stop_ray script in python. * Fix linting.
236 lines
7.5 KiB
Markdown
236 lines
7.5 KiB
Markdown
# Using Ray and Docker on a Cluster (EXPERIMENTAL)
|
|
|
|
Packaging and deploying an application using Docker can provide certain advantages. It can make managing dependencies easier, help ensure that each cluster node receives a uniform configuration, and facilitate swapping hardware resources between applications.
|
|
|
|
|
|
## Create your Docker image
|
|
|
|
First build a Ray Docker image by following the instructions for [Installation on Docker](install-on-docker.md).
|
|
This will allow you to create the `ray-project/deploy` image that serves as a basis for using Ray on a cluster with Docker.
|
|
|
|
Docker images encapsulate the system state that will be used to run nodes in the cluster.
|
|
We recommend building on top of the Ray-provided Docker images to add your application code and dependencies.
|
|
|
|
You can do this in one of two ways: by building from a customized Dockerfile or by saving an image after entering commands manually into a running container.
|
|
We describe both approaches below.
|
|
|
|
### Creating a customized Dockerfile
|
|
|
|
We recommend that you read the official Docker documentation for [Building your own image](https://docs.docker.com/engine/getstarted/step_four/) ahead of starting this section.
|
|
Your customized Dockerfile is a script of commands needed to set up your application,
|
|
possibly packaged in a folder with related resources.
|
|
|
|
A simple template Dockerfile for a Ray application looks like this:
|
|
|
|
```
|
|
# Application Dockerfile template
|
|
FROM ray-project/deploy
|
|
RUN git clone <my-project-url>
|
|
RUN <my-project-installation-script>
|
|
```
|
|
|
|
This file instructs Docker to load the image tagged `ray-project/deploy`, check out the git
|
|
repository at `<my-project-url>`, and then run the script `<my-project-installation-script>`.
|
|
|
|
Build the image by running something like:
|
|
```
|
|
docker build -t <my-app> .
|
|
```
|
|
Replace `<app-tag>` with a tag of your choice.
|
|
|
|
|
|
### Creating a Docker image manually
|
|
|
|
Launch the `ray-project/deploy` image interactively
|
|
|
|
```
|
|
docker run -t -i ray-project/deploy
|
|
```
|
|
|
|
Next, run whatever commands are needed to install your application.
|
|
When you are finished type `exit` to stop the container.
|
|
|
|
Run
|
|
```
|
|
docker ps -a
|
|
```
|
|
to identify the id of the container you just exited.
|
|
|
|
Next, commit the container
|
|
```
|
|
docker commit -t <app-tag> <container-id>
|
|
```
|
|
|
|
Replace `<app-tag>` with a name for your container and replace `<container-id>` id with the hash id of the container used in configuration.
|
|
|
|
## Publishing your Docker image to a repository
|
|
|
|
When using Amazon EC2 it can be practical to publish images using the Repositories feature of Elastic Container Service.
|
|
Follow the steps below and see [documentation for creating a repository](http://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-create.html) for additional context.
|
|
|
|
First ensure that the AWS command-line interface is installed.
|
|
|
|
```
|
|
sudo apt-get install -y awscli
|
|
```
|
|
|
|
Next create a repository in Amazon's Elastic Container Registry.
|
|
This results in a shared resource for storing Docker images that will be accessible from all nodes.
|
|
|
|
|
|
```
|
|
aws ecr create-repository --repository-name <repository-name> --region=<region>
|
|
```
|
|
|
|
Replace `<repository-name>` with a string describing the application.
|
|
Replace `<region>` with the AWS region string, e.g., `us-west-2`.
|
|
This should produce output like the following:
|
|
|
|
```
|
|
{
|
|
"repository": {
|
|
"repositoryUri": "123456789012.dkr.ecr.us-west-2.amazonaws.com/my-app",
|
|
"createdAt": 1487227244.0,
|
|
"repositoryArn": "arn:aws:ecr:us-west-2:123456789012:repository/my-app",
|
|
"registryId": "123456789012",
|
|
"repositoryName": "my-app"
|
|
}
|
|
}
|
|
```
|
|
|
|
Take note of the `repositoryUri` string, in this example `123456789012.dkr.ecr.us-west-2.amazonaws.com/my-app`.
|
|
|
|
|
|
Tag the Docker image with the repository URI.
|
|
|
|
```
|
|
docker tag <app-tag> <repository-uri>
|
|
```
|
|
|
|
Replace the `<app-tag>` with the container name used previously and replace `<repository-uri>` with URI returned by the command used to create the repository.
|
|
|
|
Log into the repository:
|
|
|
|
```
|
|
eval $(aws ecr get-login --region <region>)
|
|
```
|
|
|
|
Replace `<region>` with your selected AWS region.
|
|
|
|
Push the image to the repository:
|
|
```
|
|
docker push <repository-uri>
|
|
```
|
|
Replace `<repository-uri>` with the URI of your repository. Now other hosts will be able to access your application Docker image.
|
|
|
|
|
|
## Starting a cluster
|
|
|
|
We assume a cluster configuration like that described in instructions for [using Ray on a large cluster](using-ray-on-a-large-cluster.md).
|
|
In particular, we assume that there is a head node that has ssh access to all of the worker nodes, and that there is a file `workers.txt` listing the IP addresses of all worker nodes.
|
|
|
|
### Install the Docker image on all nodes
|
|
|
|
Create a script called `setup-docker.sh` on the head node.
|
|
```
|
|
# setup-docker.sh
|
|
sudo apt-get install -y docker.io
|
|
sudo service docker start
|
|
sudo usermod -a -G docker ubuntu
|
|
exec sudo su -l ubuntu
|
|
eval $(aws ecr get-login --region <region>)
|
|
docker pull <repository-uri>
|
|
```
|
|
|
|
Replace `<repository-uri>` with the URI of the repository created in the previous section.
|
|
Replace `<region>` with the AWS region in which you created that repository.
|
|
This script will install Docker, authenticate the session with the container registry, and download the container image from that registry.
|
|
|
|
Run `setup-docker.sh` on the head node (if you used the head node to build the Docker image then you can skip this step):
|
|
```
|
|
bash setup-docker.sh
|
|
```
|
|
|
|
Run `setup-docker.sh` on the worker nodes:
|
|
```
|
|
parallel-ssh -h workers.txt -P -t 0 -I < setup-docker.sh
|
|
```
|
|
|
|
### Launch Ray cluster using Docker
|
|
|
|
To start Ray on the head node run the following command:
|
|
|
|
```
|
|
eval $(aws ecr get-login --region <region>)
|
|
docker run \
|
|
-d --shm-size=<shm-size> --net=host \
|
|
<repository-uri> \
|
|
ray start --head \
|
|
--object-manager-port=8076 \
|
|
--redis-port=6379 \
|
|
--num-workers=<num-workers>
|
|
```
|
|
|
|
Replace `<repository-uri>` with the URI of the repository.
|
|
Replace `<region>` with the region of the repository.
|
|
Replace `<num-workers>` with the number of workers, e.g., typically a number similar to the number of cores in the system.
|
|
Replace `<shm-size>` with the the amount of shared memory to make available within the Docker container, e.g., `8G`.
|
|
|
|
|
|
To start Ray on the worker nodes create a script `start-worker-docker.sh` with content like the following:
|
|
```
|
|
eval $(aws ecr get-login --region <region>)
|
|
docker run -d --shm-size=<shm-size> --net=host \
|
|
<repository-uri> \
|
|
ray start \
|
|
--object-manager-port=8076 \
|
|
--redis-address=<redis-address> \
|
|
--num-workers=<num-workers>
|
|
|
|
```
|
|
|
|
Replace `<redis-address>` with the string `<head-node-private-ip>:6379` where `<head-node-private-ip>` is the private network IP address of the head node.
|
|
|
|
Execute the script on the worker nodes:
|
|
```
|
|
parallel-ssh -h workers.txt -P -t 0 -I < setup-worker-docker.sh
|
|
```
|
|
|
|
|
|
## Running jobs on a cluster
|
|
|
|
On the head node, identify the id of the container that you launched as the Ray head.
|
|
|
|
```
|
|
docker ps
|
|
```
|
|
|
|
the container id appears in the first column of the output.
|
|
|
|
Now launch an interactive shell within the container:
|
|
|
|
```
|
|
docker exec -t -i <container-id> bash
|
|
```
|
|
|
|
Replace `<container-id>` with the container id found in the previous step.
|
|
|
|
Next, launch your application program.
|
|
The Python program should contain an initialization command that takes the Redis address as a parameter:
|
|
|
|
```
|
|
ray.init(redis_address="<redis-address>")
|
|
```
|
|
|
|
|
|
## Shutting down a cluster
|
|
|
|
Kill all running Docker images on the worker nodes:
|
|
```
|
|
parallel-ssh -h workers.txt -P 'docker kill $(docker ps -q)'
|
|
```
|
|
|
|
Kill all running Docker images on the head node:
|
|
```
|
|
docker kill $(docker ps -q)
|
|
```
|