.. _ref-cluster-setup:

Manual Cluster Setup
====================

.. note::

    If you're using AWS, Azure, or GCP, you should use the automated `setup commands <autoscaling.html>`_.

The instructions in this document work well for small clusters. For larger
clusters, consider using the pssh package (``sudo apt-get install pssh``) or
the `setup commands for private clusters <autoscaling.html#quick-start-private-cluster>`_.
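
For example, with pssh you can fan a command out to every worker from a single machine. This is only a sketch: it assumes a hypothetical ``workers.txt`` file listing one worker hostname or IP per line, passwordless SSH to those machines, and it uses the worker-node ``ray start`` command described below. Depending on how pssh was installed, the executable may be named ``pssh`` or ``parallel-ssh``.

.. code-block:: bash

    # Run the worker-node start command on every host in workers.txt (hypothetical file).
    # -h points at the host file, -i prints each host's output inline.
    parallel-ssh -i -h workers.txt 'ray start --address=123.45.67.89:6379'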

Deploying Ray on a Cluster
--------------------------

This section assumes that you have a cluster running and that the nodes in the
cluster can communicate with each other. It also assumes that Ray is installed
on each machine. To install Ray, follow the `installation instructions`_.

.. _`installation instructions`: http://docs.ray.io/en/latest/installation.html
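
In many environments this just means installing the ``ray`` package with pip on every machine. A minimal sketch (check the installation instructions above for the exact package, version, and any extras you need):

.. code-block:: bash

    # Install (or upgrade) Ray on this machine; repeat on every node in the cluster.
    pip install -U ray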

Starting Ray on each machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On the head node (just choose some node to be the head node), run the following.
If the ``--port`` argument is omitted, Ray will choose port 6379, falling back to a
random port.

.. code-block:: bash

    ray start --head --port=6379

The command will print out the address of the Redis server that was started
(and some other address information).

**Then on all of the other nodes**, run the following. Make sure to replace
``<address>`` with the value printed by the command on the head node (it
should look something like ``123.45.67.89:6379``).

.. code-block:: bash

    ray start --address=<address>

If you wish to specify that a machine has 10 CPUs and 1 GPU, you can do this
with the flags ``--num-cpus=10`` and ``--num-gpus=1``. See the `Configuration <configure.html>`__ page for more information.
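
For example, a worker node with 10 CPUs and 1 GPU could be started like this (using the same ``<address>`` placeholder as above):

.. code-block:: bash

    # Advertise 10 CPUs and 1 GPU for this node when it joins the cluster.
    ray start --address=<address> --num-cpus=10 --num-gpus=1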

Now we've started all of the Ray processes on each node. This includes:

- Some worker processes on each machine.
- An object store on each machine.
- A raylet on each machine.
- Multiple Redis servers (on the head node).
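
If you want to confirm that these processes are actually running on a machine, one rough check (just a sketch; the exact process names vary between Ray versions) is to look for Ray-related processes:

.. code-block:: bash

    # Roughly list Ray-related processes (raylet, workers, Redis, etc.) on this node.
    ps aux | grep ray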

To run some commands, start up Python on one of the nodes in the cluster, and do
the following.

.. code-block:: python

    import ray
    ray.init(address="<address>")

Now you can define remote functions and execute tasks. For example, to verify
that the correct number of nodes have joined the cluster, you can run the
following.

.. code-block:: python

    import time

    @ray.remote
    def f():
        time.sleep(0.01)
        return ray.services.get_node_ip_address()

    # Get the set of IP addresses of the nodes that have joined the cluster.
    set(ray.get([f.remote() for _ in range(1000)]))
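
Alternatively, assuming a Ray version that exposes ``ray.nodes()`` and ``ray.cluster_resources()`` (recent versions do), you can ask Ray directly what it knows about the cluster. This is just a quick sanity-check sketch:

.. code-block:: python

    # Assumes ray.init(address="<address>") has already been called, as above.
    print(ray.nodes())               # one entry per node that has joined the cluster
    print(ray.cluster_resources())   # total CPUs, GPUs, and memory the cluster reports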

Stopping Ray
~~~~~~~~~~~~

When you want to stop the Ray processes, run ``ray stop`` on each node.
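
If you set up a pssh host file earlier, one way (again, just a sketch that reuses the hypothetical ``workers.txt`` file) to stop the whole cluster from the head node is:

.. code-block:: bash

    # Stop Ray on every worker listed in workers.txt (hypothetical host file).
    parallel-ssh -i -h workers.txt 'ray stop'

    # Then stop Ray on the head node itself.
    ray stop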