.. _ref-cluster-setup:

Manual Cluster Setup
====================

.. note::

  If you're using AWS, Azure, or GCP, you should use the automated `setup commands <autoscaling.html>`_.

The instructions in this document work well for small clusters. For larger
clusters, consider using the pssh package (``sudo apt-get install pssh``) or
the `setup commands for private clusters <autoscaling.html#quick-start-private-cluster>`_.

Deploying Ray on a Cluster
--------------------------

This section assumes that you have a cluster running and that the nodes in the
cluster can communicate with each other. It also assumes that Ray is installed
on each machine. To install Ray, follow the `installation instructions`_.

.. _`installation instructions`: http://ray.readthedocs.io/en/latest/installation.html
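
In most cases this amounts to installing the ``ray`` package with pip on every
machine; the linked instructions cover other options such as installing from wheels
or building from source. A minimal sketch:

.. code-block:: bash

  # Install Ray on each machine (see the installation instructions for alternatives).
  pip install ray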

Starting Ray on each machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On the head node (just choose some node to be the head node), run the following.
If the ``--redis-port`` argument is omitted, Ray will choose a port at random.

.. code-block:: bash

  ray start --head --redis-port=6379

The command will print out the address of the Redis server that was started
(and some other address information).

**Then on all of the other nodes**, run the following. Make sure to replace
``<address>`` with the value printed by the command on the head node (it
should look something like ``123.45.67.89:6379``).

.. code-block:: bash

  ray start --address=<address>
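
If the cluster has many worker nodes, the pssh package mentioned above can run this
command on every worker at once instead of logging into each machine. The following
is a minimal sketch, assuming a hypothetical ``workers.txt`` file with one worker
hostname per line and a head node at ``123.45.67.89:6379`` (on Ubuntu the pssh
package installs the ``parallel-ssh`` command):

.. code-block:: bash

  # Start Ray on every host listed in workers.txt, pointing each at the head node.
  parallel-ssh -i -h workers.txt "ray start --address=123.45.67.89:6379"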

If you wish to specify that a machine has 10 CPUs and 1 GPU, you can do this
with the flags ``--num-cpus=10`` and ``--num-gpus=1``. See the `Configuration <configure.html>`__ page for more information.
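
For example, a worker machine with 10 CPUs and 1 GPU could be started as follows
(a sketch; replace ``<address>`` with the head node address as above):

.. code-block:: bash

  # Advertise 10 CPUs and 1 GPU for this node.
  ray start --address=<address> --num-cpus=10 --num-gpus=1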

Now we've started all of the Ray processes on each node. This includes:

- Some worker processes on each machine.
- An object store on each machine.
- A raylet on each machine.
- Multiple Redis servers (on the head node).

To run some commands, start up Python on one of the nodes in the cluster, and do
the following.

.. code-block:: python

  import ray
  ray.init(address="<address>")

Now you can define remote functions and execute tasks. For example, to verify
that the correct number of nodes have joined the cluster, you can run the
following.

.. code-block:: python

  import time

  @ray.remote
  def f():
      time.sleep(0.01)
      return ray.services.get_node_ip_address()

  # Get the set of unique IP addresses of the nodes that have joined the cluster.
  set(ray.get([f.remote() for _ in range(1000)]))
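
The size of the returned set should match the number of machines on which you ran
``ray start``.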

Stopping Ray
~~~~~~~~~~~~

When you want to stop the Ray processes, run ``ray stop`` on each node.
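
As with starting Ray, this can be broadcast to all workers with pssh; a minimal
sketch, again assuming the hypothetical ``workers.txt`` host file from above:

.. code-block:: bash

  # Stop Ray on every worker, then on the head node itself.
  parallel-ssh -i -h workers.txt "ray stop"
  ray stop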