Cluster setup instructions (#233)

* start updating cluster documentation with parallel ssh

* add using ray on a large cluster

* revert changes to using ray on a cluster

* update cluster documentation

* update title

* Some formatting changes, and added some notes.

* clarification

* Add warning about public versus private IP addresses.

* Typos and wording.

* Clarifications.

* Clarifications.
# Using Ray on a large cluster
Deploying Ray on a cluster currently requires a bit of manual work. The
instructions here illustrate how to use parallel ssh commands to simplify the
process of running commands and scripts on many machines simultaneously.
## Booting up a cluster on EC2
* Create an EC2 instance running Ray following instructions for [installation on
Ubuntu](install-on-ubuntu.md).
* Add any packages that you may need for running your application.
* Install the pssh package: `sudo apt-get install pssh`
* [Create an AMI Image](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/creating-an-ami-ebs.html)
of your installation.
* Use the EC2 console to launch additional instances from the AMI you created
  (or use the AWS CLI, as sketched below).
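If you prefer the command line, something like the following AWS CLI
invocation can also launch worker instances from the AMI (a sketch only; the
AMI ID, count, instance type, key name, and security group are placeholders
for your own values):
```
# Launch additional worker instances from the AMI (all values are placeholders).
aws ec2 run-instances \
  --image-id <ami-id> \
  --count <num-instances> \
  --instance-type <instance-type> \
  --key-name <key-name> \
  --security-group-ids <security-group-id>
```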
## Deploying Ray on a cluster
This section assumes that you have a cluster of machines running and that these
nodes have network connectivity to one another. It also assumes that Ray is
installed on each machine.
Additional assumptions:
* All of the following commands are run from a machine designated as
the _head node_.
* The head node will run Redis and the global scheduler.
* The head node is the launching point for driver programs and for
administrative tasks.
* The head node has ssh access to all other nodes.
* All nodes are accessible via ssh keys.
* Ray is checked out on each node at the location `$HOME/ray`.
**Note:** The commands below will probably need to be customized for your specific
setup.
### Connect to the head node
To run ssh commands from the cluster head node, we suggest enabling ssh agent
forwarding. This allows the session that you initiate with the head node to
connect to other nodes in the cluster and run scripts on them. You can enable
agent forwarding by running the following command, replacing `<ssh-key>` with
the path to the private key that you would use when logging in to the nodes in
the cluster.
```
ssh-add <ssh-key>
```
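To confirm that the key was added to the agent, you can list the loaded
identities (an optional sanity check):
```
ssh-add -l
```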
Now log in to the head node with the following command, where
`<head-node-public-ip>` is the public IP address of the head node (just choose
one of the nodes to be the head node).
```
ssh -A ubuntu@<head-node-public-ip>
```
### Build a list of node IP addresses
Populate a file `workers.txt` with one IP address per line. Do not include
the head node's IP address in this file. These should typically be private
network IP addresses, but any IP addresses that the head node can use to ssh
to the worker nodes will work here.
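On EC2, one way to generate this file is with the AWS CLI, assuming the CLI is
configured on the head node and your worker instances carry a distinguishing
tag (the tag value `ray-worker` below is just an example):
```
# List private IPs of running instances tagged ray-worker, one per line.
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=ray-worker" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PrivateIpAddress" \
  --output text | tr '\t' '\n' > workers.txt
```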
### Confirm that you can ssh to all nodes
```
for host in $(cat workers.txt); do
ssh $host uptime
done
```
You may be prompted to verify the host keys during this process.
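If you would rather not answer these prompts interactively, one option is to
pre-populate `~/.ssh/known_hosts` with `ssh-keyscan` (a convenience sketch;
note that this skips manual verification of the host keys):
```
# Collect host keys for every worker and append them to known_hosts.
ssh-keyscan -H $(cat workers.txt) >> ~/.ssh/known_hosts
```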
### Starting Ray
#### Starting Ray on the head node
On the head node, run the following:
```
./ray/scripts/start_ray.sh --head --num-workers=<num-workers> --redis-port <redis-port>
```
Replace `<redis-port>` with a port of your choice, e.g., `6379`. Also, replace
`<num-workers>` with the number of workers that you wish to start.
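For example, to start the head node with eight workers and Redis on its
default port (illustrative values only):
```
./ray/scripts/start_ray.sh --head --num-workers=8 --redis-port 6379
```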
#### Starting Ray on the worker nodes
Create a file `start_worker.sh` that contains something like the following:
```
export PATH=/home/ubuntu/anaconda2/bin/:$PATH
ray/scripts/start_ray.sh --num-workers=<num-workers> --redis-address=<head-node-ip>:<redis-port>
```
This script, when run on the worker nodes, will start up Ray. You will need to
replace `<head-node-ip>` with the IP address that worker nodes will use to
connect to the head node (most likely a **private IP address**). In this
example we also export the path to the Python installation since our remote
commands will not be executing in a login shell.
**Warning:** You may need to manually export the correct path to Python (you
will need to change the first line of `start_worker.sh` to find the version of
Python that Ray was built against). This is necessary because the `PATH`
environment variable used by `parallel-ssh` can differ from the `PATH`
environment variable that gets set when you `ssh` to the machine.
**Warning:** If the `parallel-ssh` command below appears to hang, then
`<head-node-ip>` may need to be a private IP address instead of a public IP
address (e.g., if you are using EC2).
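Before starting the workers, an optional diagnostic is to check which Python
the non-login shells will pick up by running a command over `parallel-ssh`:
```
parallel-ssh -h workers.txt -P "which python"
```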
Now use `parallel-ssh` to start up Ray on each worker node.
```
parallel-ssh -h workers.txt -P -I < start_worker.sh
```
Note that on some distributions the `parallel-ssh` command may be called `pssh`.
#### Verification
You have now started all of the Ray processes on the cluster (a rough way to
check them is sketched after this list). These include:
- Some worker processes on each machine.
- An object store on each machine.
- A local scheduler on each machine.
- One Redis server (on the head node).
- One global scheduler (on the head node).
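As a rough sanity check (process names differ between Ray versions, so treat
the count only as an indicator that something is running), you can count
Ray-related processes on each worker:
```
parallel-ssh -h workers.txt -P "pgrep -c -f ray"
```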
To confirm that the Ray cluster setup is working, start up Python on one of the
nodes in the cluster and enter the following commands to connect to the Ray
cluster.
```python
import ray
ray.init(redis_address="<redis-address>")
```
Here `<redis-address>` should have the form `<head-node-ip>:<redis-port>`.
Now you can define remote functions and execute tasks. For example:
```python
@ray.remote
def f(x):
    return x

ray.get([f.remote(f.remote(f.remote(0))) for _ in range(1000)])
```
### Stopping Ray
#### Stop Ray on the worker nodes
```
parallel-ssh -h workers.txt -P ray/scripts/stop_ray.sh
```
This command will execute the `stop_ray.sh` script on each of the worker nodes.
#### Stop Ray on the head node
```
ray/scripts/stop_ray.sh
```
## Syncing application files to other nodes
If your application reads input files or uses Python libraries, you may find it useful to copy a directory from the head node to the worker nodes.
You can do this using the `parallel-rsync` command:
```
parallel-rsync -h workers.txt -r <workload-dir> /home/ubuntu/<workload-dir>
```
where `<workload-dir>` is the directory you want to synchronize.
Note that the destination argument of this command must be an absolute path on the worker node.
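For example, to push a hypothetical input directory named `my_app` from the
head node to every worker:
```
parallel-rsync -h workers.txt -r my_app /home/ubuntu/my_app
```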