.. _ref-cluster-setup:

Manual Cluster Setup
====================

.. note::

    If you're using AWS, Azure, or GCP, you should use the automated `setup commands <autoscaling.html>`_.

The instructions in this document work well for small clusters. For larger
clusters, consider using the pssh package (``sudo apt-get install pssh``) or
the `setup commands for private clusters <autoscaling.html#quick-start-private-cluster>`_.
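
For example, with pssh you can fan a command out to every worker from a single machine. This is only a sketch: it assumes a hypothetical ``workers.txt`` file listing one worker hostname or IP per line, passwordless SSH to those machines, and it uses the worker-node ``ray start`` command described below. Depending on how pssh was installed, the executable may be named ``pssh`` or ``parallel-ssh``.

.. code-block:: bash

    # Run the worker-node start command on every host in workers.txt (hypothetical file).
    # -h points at the host file, -i prints each host's output inline.
    parallel-ssh -i -h workers.txt 'ray start --address=123.45.67.89:6379'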

Deploying Ray on a Cluster
--------------------------

This section assumes that you have a cluster running and that the nodes in the
cluster can communicate with each other. It also assumes that Ray is installed
on each machine. To install Ray, follow the `installation instructions`_.

.. _`installation instructions`: http://docs.ray.io/en/latest/installation.html
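
In many environments this just means installing the ``ray`` package with pip on every machine. A minimal sketch (check the installation instructions above for the exact package, version, and any extras you need):

.. code-block:: bash

    # Install (or upgrade) Ray on this machine; repeat on every node in the cluster.
    pip install -U ray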

Starting Ray on each machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On the head node (just choose some node to be the head node), run the following.
If the ``--port`` argument is omitted, Ray will choose port 6379, falling back to a
random port.

.. code-block:: bash

    ray start --head --port=6379

The command will print out the address of the Redis server that was started
(and some other address information).

**Then on all of the other nodes**, run the following. Make sure to replace
``<address>`` with the value printed by the command on the head node (it
should look something like ``123.45.67.89:6379``).

.. code-block:: bash

    ray start --address=<address>

If you wish to specify that a machine has 10 CPUs and 1 GPU, you can do this
with the flags ``--num-cpus=10`` and ``--num-gpus=1``. See the `Configuration <configure.html>`__ page for more information.
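
For example, a worker node with 10 CPUs and 1 GPU could be started like this (using the same ``<address>`` placeholder as above):

.. code-block:: bash

    # Advertise 10 CPUs and 1 GPU for this node when it joins the cluster.
    ray start --address=<address> --num-cpus=10 --num-gpus=1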

Now we've started all of the Ray processes on each node. This includes:

- Some worker processes on each machine.
- An object store on each machine.
- A raylet on each machine.
- Multiple Redis servers (on the head node).
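
If you want to confirm that these processes are actually running on a machine, one rough check (just a sketch; the exact process names vary between Ray versions) is to look for Ray-related processes:

.. code-block:: bash

    # Roughly list Ray-related processes (raylet, workers, Redis, etc.) on this node.
    ps aux | grep ray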

To run some commands, start up Python on one of the nodes in the cluster, and do
the following.

.. code-block:: python

    import ray
    ray.init(address="<address>")

Now you can define remote functions and execute tasks. For example, to verify
that the correct number of nodes have joined the cluster, you can run the
following.

.. code-block:: python

    import time

    @ray.remote
    def f():
        time.sleep(0.01)
        return ray.services.get_node_ip_address()

    # Get the set of IP addresses of the nodes that have joined the cluster.
    set(ray.get([f.remote() for _ in range(1000)]))
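
Alternatively, assuming a Ray version that exposes ``ray.nodes()`` and ``ray.cluster_resources()`` (recent versions do), you can ask Ray directly what it knows about the cluster. This is just a quick sanity-check sketch:

.. code-block:: python

    # Assumes ray.init(address="<address>") has already been called, as above.
    print(ray.nodes())               # one entry per node that has joined the cluster
    print(ray.cluster_resources())   # total CPUs, GPUs, and memory the cluster reports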

Stopping Ray
~~~~~~~~~~~~

When you want to stop the Ray processes, run ``ray stop`` on each node.
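
If you set up a pssh host file earlier, one way (again, just a sketch that reuses the hypothetical ``workers.txt`` file) to stop the whole cluster from the head node is:

.. code-block:: bash

    # Stop Ray on every worker listed in workers.txt (hypothetical host file).
    parallel-ssh -i -h workers.txt 'ray stop'

    # Then stop Ray on the head node itself.
    ray stop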