Update cluster guide (#347)

* clarify cluster setup instructions
* update multinode documentation, update cluster script, fix minor bug in worker.py
* clarify cluster documentation and fix update_user_code
parent de200ff912
commit 3ee0fd8f34

4 changed files with 106 additions and 55 deletions
.gitignore (vendored): 3 changes

@@ -57,3 +57,6 @@
 *_pb2.py
 *.pb.h
 *.pb.cc
+
+# Ray cluster configuration
+scripts/nodes.txt
@@ -21,12 +21,18 @@ permissions for the private key file to `600` (i.e. only you can read and write
 it) so that `ssh` will work.
 - Whenever you want to use the `ec2.py` script, set the environment variables
 `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` to your Amazon EC2 access key ID
-and secret access key. These can be obtained from the [AWS
-homepage](http://aws.amazon.com/) by clicking Account > Security Credentials >
-Access Credentials.
+and secret access key. These can be generated from the [AWS
+homepage](http://aws.amazon.com/) by clicking My Account > Security Credentials >
+Access Keys, or by [creating an IAM user](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html).
 
 ### Launching a Cluster
 
+- Install the required dependencies on the machine you will be using to run the
+cluster launch scripts.
+```
+sudo pip install --upgrade boto
+```
+
 - Go into the `ray/scripts` directory.
 - Run `python ec2.py -k <keypair> -i <key-file> -s <num-slaves> launch
 <cluster-name>`, where `<keypair>` is the name of your EC2 key pair (that you
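Before moving on to the launch options in the next hunk, it can be handy to confirm that `boto` actually picks up the credentials you just exported. The snippet below is not part of the commit; it is a small sanity-check sketch assuming boto 2.x, the `us-west-1` region, and the `awskey` key pair used in the examples further down.

```python
# Optional sanity check (not from this commit): confirm that boto can reach EC2
# with the exported credentials and that the expected key pair exists.
import boto.ec2

# boto reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment.
conn = boto.ec2.connect_to_region("us-west-1")
for key_pair in conn.get_all_key_pairs():
    print(key_pair.name)  # "awskey" should appear here if the key pair exists
```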
@@ -39,7 +45,13 @@ and `<cluster-name>` is the name to give to your cluster.
 ```bash
 export AWS_SECRET_ACCESS_KEY=AaBbCcDdEeFGgHhIiJjKkLlMmNnOoPpQqRrSsTtU
 export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123
-python ec2.py --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 launch my-ray-cluster
+python ec2.py --key-pair=awskey \
+  --identity-file=awskey.pem \
+  --region=us-west-1 \
+  --instance-type=c4.4xlarge \
+  --spot-price=2.50 \
+  --slaves=1 \
+  launch my-ray-cluster
 ```
 
 The following options are worth pointing out:
@@ -55,6 +67,9 @@ capacity in one zone, and you should try to launch in another.
 - `--spot-price=<price>` will launch the worker nodes as [Spot
 Instances](http://aws.amazon.com/ec2/spot-instances/), bidding for the given
 maximum price (in dollars).
+- `--slaves=<num-slaves>` will launch a cluster with `(1 + num_slaves)` instances.
+The first instance is the head node, which in addition to hosting workers runs the
+Ray scheduler and application driver programs.
 
 ## Getting started with Ray on a cluster
 
@@ -70,24 +85,32 @@ cluster. For example
 12.34.56.789
 12.34.567.89
 The first node in the file is the "head" node. The scheduler will be started on
-the head node, and the driver should run on the head node as well.
+the head node, and the driver should run on the head node as well. If the nodes
+have public and private IP addresses (as in the case of EC2 instances), you can
+list the `<public-ip-address>, <private-ip-address>` in `nodes.txt` like
+
+12.34.56.789, 98.76.54.321
+12.34.567.89, 98.76.543.21
+The `cluster.py` administrative script will use the public IP addresses to ssh
+to the nodes. Ray will use the private IP addresses to send messages between the
+nodes during execution.
 
 2. Make sure that the nodes can all communicate with one another. On EC2, this
-can be done by creating a new security group and adding the inbound rule "all
-traffic" and adding the outbound rule "all traffic". Then add all of the nodes
-in your cluster to that security group.
+can be done by creating a new security group with the appropriate inbound and
+outbound rules and adding all of the nodes in your cluster to that security
+group. This is done automatically by the `ec2.py` script. If you have used the
+`ec2.py` script you can log into the hosts with the username `ubuntu`.
 
-3. Run something like
+3. From the `ray/scripts` directory, run something like
 
 ```
-python scripts/cluster.py --nodes nodes.txt \
-  --key-file key.pem \
-  --username ubuntu \
-  --installation-directory /home/ubuntu/
+python cluster.py --nodes=nodes.txt \
+  --key-file=awskey.pem \
+  --username=ubuntu
 ```
-where you replace `nodes.txt`, `key.pem`, `ubuntu`, and `/home/ubuntu/` by the
-appropriate values. This assumes that you can connect to each IP address
-`<ip-address>` in `nodes.txt` with the command
+where you replace `nodes.txt`, `key.pem`, and `ubuntu` by the appropriate
+values. This assumes that you can connect to each IP address `<ip-address>` in
+`nodes.txt` with the command
 ```
 ssh -i <key-file> <username>@<ip-address>
 ```
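As a side note, the two-column `nodes.txt` format described above is simple to parse. The snippet below is a standalone sketch, not code taken from `cluster.py`; in particular, reusing a single address for both the public and private columns when only one is given is an assumption.

```python
# Standalone sketch of parsing a nodes.txt in the format described above.
node_ip_addresses = []          # public addresses, used for ssh
node_private_ip_addresses = []  # private addresses, used by Ray during execution

with open("nodes.txt") as f:
    for line in f:
        parts = [part.strip() for part in line.split(",") if part.strip()]
        if not parts:
            continue  # skip blank lines
        node_ip_addresses.append(parts[0])
        # Assumption: if only one address is given, use it for both columns.
        node_private_ip_addresses.append(parts[1] if len(parts) > 1 else parts[0])

print(node_ip_addresses)
print(node_private_ip_addresses)
```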
@@ -95,23 +118,42 @@ appropriate values. This assumes that you can connect to each IP address
 cluster, run `cluster.install_ray()` in the interpreter. The interpreter should
 block until the installation has completed. The standard output from the nodes
 will be redirected to your terminal.
-5. To check that the installation succeeded, you can ssh to each node, cd into
-the directory `ray/test/`, and run the tests (e.g., `python runtest.py`).
-6. Start the cluster (the scheduler, object stores, and workers) with the
-command `cluster.start_ray("~/example_ray_code")`, where the argument is
-the local path to the directory that contains your Python code. This command will
-copy this source code to each node and will start the cluster. Each worker that
-is started will have a local copy of the ~/example_ray_code directory in their
-PYTHONPATH. After completing successfully, you can connect to the ssh to the
-head node and attach a shell to the cluster by first running the following code.
+5. To check that the installation succeeded, you can ssh to each node and run
+the tests.
 ```
-cd "$RAY_HOME/../user_source_files/example_ray_code";
-source "$RAY_HOME/setup-env.sh";
+cd $HOME/ray/
+source setup-env.sh # Add Ray to your Python path.
+python test/runtest.py # This tests basic functionality.
+python test/array_test.py # This tests some array libraries.
 ```
-Then within Python, run the following.
-```python
-import ray
-ray.init(scheduler_address="12.34.56.789:10001", objstore_address="12.34.56.789:20001", driver_address="12.34.56.789:30001")
+
+6. Start the cluster with `cluster.start_ray()`. If you would like to deploy
+source code to it, you can pass in the local path to the directory that contains
+your Python code. For example, `cluster.start_ray("~/example_ray_code")`. This
+will copy your source code to each node on the cluster, placing it in a
+directory on the PYTHONPATH.
+
+The `cluster.start_ray` command will start the Ray scheduler, object stores, and
+workers, and before finishing it will print instructions for connecting to the
+cluster via ssh.
+
+7. To connect to the cluster (either with a Python shell or with a script), ssh
+to the cluster's head node (as described by the output of the
+`cluster.start_ray` command. E.g.,
+```
+The cluster has been started. You can attach to the cluster by sshing to the head node with the following command.
+
+ssh -i awskey.pem ubuntu@12.34.56.789
+
+Then run the following commands.
+
+cd $HOME/ray
+source $HOME/ray/setup-env.sh # Add Ray to your Python path.
+
+Then within a Python interpreter, run the following commands.
+
+import ray
+ray.init(scheduler_address="98.76.54.321:10001", objstore_address="98.76.54.321:20001", driver_address="98.76.54.321:30001")
 ```
 
 7. Note that there are several more commands that can be run from within
@@ -126,7 +168,3 @@ Then within Python, run the following.
 processes).
 - `cluster.update_ray()` - This pulls the latest Ray source code and builds
 it.
-
-Once you've started a Ray cluster using the above instructions, to actually use
-Ray, ssh to the head node (the first node listed in the `nodes.txt` file) and
-attach a shell to the already running cluster.
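Taken together, the workflow the updated guide describes boils down to a handful of calls on the `cluster` object that `cluster.py` sets up. The fragment below is only a rough sketch: the method names come from the guide and from the script changes further down, but the helper functions and the way the `cluster` object is obtained are illustrative assumptions.

```python
def provision_and_start(cluster):
    """Hypothetical helper driving the documented workflow on a RayCluster object."""
    cluster.install_ray()                    # install Ray on every node
    cluster.start_ray("~/example_ray_code")  # deploy user code and start the cluster
    # start_ray prints the ssh instructions for attaching a driver to the head node.

def tear_down(cluster):
    """Hypothetical helper for stopping the cluster again."""
    cluster.stop_ray()                       # kill the scheduler, object stores, and workers
```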
@@ -658,7 +658,7 @@ def register_module(module, worker=global_worker):
       _logger().info("registering {}.".format(val.func_name))
       worker.register_function(val)
 
-def init(start_ray_local=False, num_workers=None, num_objstores=1, scheduler_address=None, objstore_address=None, driver_address=None, driver_mode=SCRIPT_MODE):
+def init(start_ray_local=False, num_workers=None, num_objstores=None, scheduler_address=None, objstore_address=None, driver_address=None, driver_mode=SCRIPT_MODE):
   """Either connect to an existing Ray cluster or start one and connect to it.
 
   This method handles two cases. Either a Ray cluster already exists and we
@@ -694,6 +694,7 @@ def init(start_ray_local=False, num_workers=None, num_objstores=1, scheduler_add
     if driver_mode not in [SCRIPT_MODE, PYTHON_MODE, SILENT_MODE]:
       raise Exception("If start_ray_local=True, then driver_mode must be in [SCRIPT_MODE, PYTHON_MODE, SILENT_MODE].")
     num_workers = 1 if num_workers is None else num_workers
+    num_objstores = 1 if num_objstores is None else num_objstores
     # Start the scheduler, object store, and some workers. These will be killed
     # by the call to cleanup(), which happens when the Python script exits.
     scheduler_address, objstore_addresses, driver_addresses = services.start_ray_local(num_objstores=num_objstores, num_workers=num_workers, worker_path=None)
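The added line follows the usual "None means use the default" idiom, mirroring the treatment of `num_workers` directly above it. The toy function below is made up purely to illustrate that idiom; it is not Ray code.

```python
# Toy illustration of the default-handling idiom used in init() above.
def start_local(num_workers=None, num_objstores=None):
    # Normalize None to the defaults inside the body, so callers can always
    # pass None (or nothing) without knowing the default values.
    num_workers = 1 if num_workers is None else num_workers
    num_objstores = 1 if num_objstores is None else num_objstores
    return num_workers, num_objstores

print(start_local())               # (1, 1) -- both fall back to the defaults
print(start_local(num_workers=4))  # (4, 1) -- only num_objstores falls back
```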
@@ -10,7 +10,6 @@ parser = argparse.ArgumentParser(description="Parse information about the cluste
 parser.add_argument("--nodes", type=str, required=True, help="Test file with node IP addresses, one line per address.")
 parser.add_argument("--key-file", type=str, required=True, help="Path to the file that contains the private key.")
 parser.add_argument("--username", type=str, required=True, help="User name for logging in.")
-parser.add_argument("--installation-directory", type=str, required=True, help="The directory in which to install Ray.")
 
 class RayCluster(object):
   """A class for setting up, starting, and stopping Ray on a cluster.
@@ -130,7 +129,7 @@ class RayCluster(object):
     """.format(self.installation_directory, self.installation_directory)
     self._run_command_over_ssh_on_all_nodes_in_parallel(install_ray_command)
 
-  def start_ray(self, user_source_directory, num_workers_per_node=10):
+  def start_ray(self, user_source_directory=None, num_workers_per_node=10):
     """Start Ray on a cluster.
 
     This method is used to start Ray on a cluster. It will ssh to the head node,
@@ -139,13 +138,14 @@ class RayCluster(object):
     workers.
 
     Args:
-      user_source_directory (str): The path to the local directory containing the
-        user's source code. Files and directories in this directory can be used
-        as modules in remote functions.
+      user_source_directory (Optional[str]): The path to the local directory
+        containing the user's source code. If provided, files and directories in
+        this directory can be used as modules in remote functions.
       num_workers_per_node (int): The number workers to start on each node.
     """
     # First update the worker code on the nodes.
-    remote_user_source_directory = self._update_user_code(user_source_directory)
+    if user_source_directory is not None:
+      remote_user_source_directory = self._update_user_code(user_source_directory)
 
     scripts_directory = os.path.join(self.installation_directory, "ray/scripts")
     # Start the scheduler
@@ -160,16 +160,18 @@ class RayCluster(object):
     # Start the workers on each node
     # The triple backslashes are used for two rounds of escaping, something like \\\" -> \" -> "
     start_workers_commands = []
+    remote_user_source_directory_str = "\\\"{}\\\"".format(remote_user_source_directory) if user_source_directory is not None else "None"
     for i, node_ip_address in enumerate(self.node_ip_addresses):
       start_workers_command = """
         cd "{}";
         source ../setup-env.sh;
         python -c "import ray; ray.services.start_node(\\\"{}:10001\\\", \\\"{}\\\", {}, user_source_directory=\\\"{}\\\")" > start_workers.out 2> start_workers.err < /dev/null &
-      """.format(scripts_directory, self.node_private_ip_addresses[0], self.node_private_ip_addresses[i], num_workers_per_node, remote_user_source_directory)
+      """.format(scripts_directory, self.node_private_ip_addresses[0], self.node_private_ip_addresses[i], num_workers_per_node, remote_user_source_directory_str)
       start_workers_commands.append(start_workers_command)
     self._run_command_over_ssh_on_all_nodes_in_parallel(start_workers_commands)
 
     setup_env_path = os.path.join(self.installation_directory, "ray/setup-env.sh")
+    cd_location = remote_user_source_directory if user_source_directory is not None else os.path.join(self.installation_directory, "ray")
     print """
 The cluster has been started. You can attach to the cluster by sshing to the head node with the following command.
 
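The "two rounds of escaping" comment in this hunk is easy to misread, so here is a minimal standalone demonstration of the same idea (the `hello` payload is made up). One backslash level is consumed by the Python string literal, the other by the double-quoted string that the remote shell evaluates.

```python
# Stage 1: in a Python literal, \\\" denotes the two characters \" .
fragment = "\\\""
print(fragment)  # prints: \"

# Stage 2: inside a double-quoted shell string, \" collapses to a bare ",
# which is what the inner `python -c` program finally sees.
command = 'python -c "print(\\\"hello\\\")"'
print(command)   # prints: python -c "print(\"hello\")"

import subprocess
subprocess.call(command, shell=True)  # prints: hello (assumes `python` is on PATH)
```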
@@ -178,13 +180,13 @@ class RayCluster(object):
 Then run the following commands.
 
 cd {}
-source {}
+source {} # Add Ray to your Python path.
 
-Then within a Python interpreter, run the following commands.
+Then within a Python interpreter or script, run the following commands.
 
 import ray
 ray.init(scheduler_address="{}:10001", objstore_address="{}:20001", driver_address="{}:30001")
-""".format(self.key_file, self.username, self.node_ip_addresses[0], remote_user_source_directory, setup_env_path, self.node_private_ip_addresses[0], self.node_private_ip_addresses[0], self.node_private_ip_addresses[0])
+""".format(self.key_file, self.username, self.node_ip_addresses[0], cd_location, setup_env_path, self.node_private_ip_addresses[0], self.node_private_ip_addresses[0], self.node_private_ip_addresses[0])
 
   def stop_ray(self):
     """Kill all of the processes in the Ray cluster.
@@ -195,33 +197,39 @@ class RayCluster(object):
     kill_cluster_command = "killall scheduler objstore python > /dev/null 2> /dev/null"
     self._run_command_over_ssh_on_all_nodes_in_parallel(kill_cluster_command)
 
-  def update_ray(self):
+  def update_ray(self, branch=None):
     """Pull the latest Ray source code and rebuild Ray.
 
     This method is used for updating the Ray source code on a Ray cluster. It
     will ssh to each node, will pull the latest source code from the Ray
     repository, and will rerun the build script (though currently it will not
     rebuild the third party libraries).
+
+    Args:
+      branch (Optional[str]): The branch to check out. If omitted, then stay on
+        the current branch.
     """
     ray_directory = os.path.join(self.installation_directory, "ray")
+    change_branch_command = "git checkout -f {}".format(branch) if branch is not None else ""
     update_cluster_command = """
       cd "{}" &&
+      {}
       git fetch &&
       git reset --hard "@{{upstream}}" -- &&
       (make -C "./build" clean || rm -rf "./build") &&
       ./build.sh
-    """.format(ray_directory)
+    """.format(ray_directory, change_branch_command)
     self._run_command_over_ssh_on_all_nodes_in_parallel(update_cluster_command)
 
   def _update_user_code(self, user_source_directory):
     """Update the user's source code on each node in the cluster.
 
     This method is used to update the user's source code on each node in the
-    cluster. The local user_source_directory will be copied under ray_source_files in
-    the ray installation directory on the worker node. For example, if the ray
-    installation directory is "/d/e/f" and we call _update_source_code("~/a/b/c"),
-    then the contents of "~/a/b/c" on the local machine will be copied to
-    "/d/e/f/user_source_files/c" on each node in the cluster.
+    cluster. The local user_source_directory will be copied under
+    ray_source_files in the home directory on the worker node. For example, if
+    we call _update_source_code("~/a/b/c"), then the contents of "~/a/b/c" on
+    the local machine will be copied to "~/user_source_files/c" on each
+    node in the cluster.
 
     Args:
       user_source_directory (str): The path on the local machine to the directory
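To see what the new `branch` argument actually does to the remote command, the fragment below rebuilds the same string outside the class (the directory and branch names are placeholders). With `branch=None` the checkout line simply disappears.

```python
# Standalone reconstruction of the command template used by update_ray above.
def build_update_command(ray_directory, branch=None):
    change_branch_command = "git checkout -f {}".format(branch) if branch is not None else ""
    return """
      cd "{}" &&
      {}
      git fetch &&
      git reset --hard "@{{upstream}}" -- &&
      (make -C "./build" clean || rm -rf "./build") &&
      ./build.sh
    """.format(ray_directory, change_branch_command)

print(build_update_command("$HOME/ray"))                 # no checkout step
print(build_update_command("$HOME/ray", "some-branch"))  # checks out some-branch first
```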
@@ -236,7 +244,7 @@ class RayCluster(object):
       raise Exception("Directory {} does not exist.".format(user_source_directory))
     # If user_source_directory is "/a/b/c", then local_directory_name is "c".
     local_directory_name = os.path.split(os.path.realpath(user_source_directory))[1]
-    remote_directory = os.path.join(self.installation_directory, "user_source_files", local_directory_name)
+    remote_directory = os.path.join("user_source_files", local_directory_name)
     # Remove and recreate the directory on the node.
     recreate_directory_command = """
       rm -r "{}";
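For reference, the path manipulation this hunk touches is small enough to trace by hand; the snippet below walks through it with the "~/a/b/c" example from the docstring (the path itself need not exist).

```python
import os

# Trace the remote-path computation from _update_user_code for "~/a/b/c".
user_source_directory = os.path.expanduser("~/a/b/c")
# If user_source_directory is "/a/b/c", then local_directory_name is "c".
local_directory_name = os.path.split(os.path.realpath(user_source_directory))[1]
remote_directory = os.path.join("user_source_files", local_directory_name)
print(local_directory_name)  # c
print(remote_directory)      # user_source_files/c
```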
@@ -290,7 +298,8 @@ if __name__ == "__main__":
   args = parser.parse_args()
   username = args.username
   key_file = args.key_file
-  installation_directory = args.installation_directory
+  # Install Ray in the user's home directory on the cluster.
+  installation_directory = "$HOME"
   node_ip_addresses = []
   node_private_ip_addresses = []
   for line in open(args.nodes).readlines():