[tune,autoscaler] Test yaml, add better distributed docs (#5403)

Richard Liaw 2019-08-08 00:59:23 -07:00 committed by GitHub
parent 1f8ae17f60
commit ed89897a31
10 changed files with 184 additions and 130 deletions


@ -1,14 +1,27 @@
Tune Distributed Experiments
============================
Tune is commonly used for large-scale distributed hyperparameter optimization. Tune and Ray provide many utilities that enable an effective workflow for interacting with a cluster, including fast file mounting, one-line cluster launching, and result uploading to cloud storage.
Tune is commonly used for large-scale distributed hyperparameter optimization. This page will cover:
This page will overview the tooling for distributed experiments, covering how to connect to a cluster, how to launch a distributed experiment, and commonly used commands.
1. How to set up and launch a distributed experiment,
2. `commonly used commands <tune-distributed.html#common-commands>`_, including fast file mounting, one-line cluster launching, and result uploading to cloud storage.
Connecting to a cluster
-----------------------
**Quick Summary**: To run a distributed experiment with Tune, you need to:
One common approach to modifying an existing Tune experiment to go distributed is to set an `argparse` variable so that toggling between distributed and single-node is seamless. This allows Tune to utilize all the resources available to the Ray cluster.
1. Make sure your script has ``ray.init(redis_address=...)`` to connect to the existing Ray cluster.
2. If a Ray cluster does not exist, start one (instructions for `local machines <tune-distributed.html#local-cluster-setup>`_, `cloud <tune-distributed.html#launching-a-cloud-cluster>`_).
3. Run the script on the head node (or use ``ray submit``).
Running a distributed experiment
--------------------------------
Running a distributed (multi-node) experiment requires Ray to be started already. You can do this on local machines or on the cloud (instructions for `local machines <tune-distributed.html#local-cluster-setup>`_, `cloud <tune-distributed.html#launching-a-cloud-cluster>`_).
Across your machines, Tune will automatically detect the number of GPUs and CPUs without you needing to manage ``CUDA_VISIBLE_DEVICES``.
To execute a distributed experiment, call ``ray.init(redis_address=XXX)`` before ``tune.run``, where ``XXX`` is the Ray redis address, which defaults to ``localhost:6379``. The Tune Python script should be executed only on the head node of the Ray cluster.
One common approach to modifying an existing Tune experiment to go distributed is to set an ``argparse`` variable so that toggling between distributed and single-node is seamless.
.. code-block:: python
@ -20,58 +33,50 @@ One common approach to modifying an existing Tune experiment to go distributed i
args = parser.parse_args()
ray.init(redis_address=args.ray_redis_address)
Note that connecting to a cluster requires a pre-existing Ray cluster to be started already (`Manual Cluster Setup <using-ray-on-a-cluster.html>`_). The script should be run on the head node of the Ray cluster. Below, ``tune_script.py`` can be any script that runs a Tune hyperparameter search.
tune.run(...)
.. code-block:: bash
# Single-node execution
$ python tune_script.py
# On the head node, connect to an existing Ray cluster
$ python tune_script.py --ray-redis-address=localhost:XXXX
.. literalinclude:: ../../python/ray/tune/examples/mnist_pytorch.py
:language: python
:start-after: if __name__ == "__main__":
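As a rough, self-contained sketch of the ``argparse`` toggle described above (the objective function, search space, and resource values below are illustrative placeholders, not part of the included example):

.. code-block:: python

    import argparse

    import ray
    from ray import tune


    def easy_objective(config, reporter):
        # Placeholder training function; replace with your own model code.
        for step in range(10):
            reporter(mean_accuracy=config["lr"] * step)


    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--ray-redis-address",
        default=None,
        help="Address of an existing Ray cluster, e.g. localhost:6379.")
    args = parser.parse_args()

    # With no address, Ray starts locally (single-node); with an address,
    # the script joins the existing cluster and Tune can use all its resources.
    ray.init(redis_address=args.ray_redis_address)

    tune.run(
        easy_objective,
        num_samples=4,
        config={"lr": tune.grid_search([0.01, 0.1])},
        resources_per_trial={"cpu": 1})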
If you used a cluster configuration (starting a cluster with ``ray up`` or ``ray submit --start``), use:
.. code-block:: bash
Launching a cloud cluster
-------------------------
ray submit tune-default.yaml tune_script.py --args="--ray-redis-address=localhost:6379"
.. tip::
If you already have a list of nodes, skip down to the `Local Cluster Setup`_ section.
Ray currently supports AWS and GCP. Below, we will launch nodes on AWS that will default to using the Deep Learning AMI. See the `cluster setup documentation <autoscaling.html>`_.
.. literalinclude:: ../../python/ray/tune/examples/tune-default.yaml
:language: yaml
:name: tune-default.yaml
This code starts a cluster as specified by the given cluster configuration YAML file, uploads ``tune_script.py`` to the cluster, and runs ``python tune_script.py``.
.. code-block:: bash
ray submit tune-default.yaml tune_script.py --start
.. image:: images/tune-upload.png
:scale: 50%
:align: center
Analyze your results on TensorBoard by starting TensorBoard on the remote head machine.
.. code-block:: bash
# Go to http://localhost:6006 to access TensorBoard.
ray exec tune-default.yaml 'tensorboard --logdir=~/ray_results/ --port 6006' --port-forward 6006
Note that you can customize the directory of results by running: ``tune.run(local_dir=..)``. You can then point TensorBoard to that directory to visualize results. You can also use `awless <https://github.com/wallix/awless>`_ for easy cluster management on AWS.
1. In the examples, the Ray redis address commonly used is ``localhost:6379``.
2. If the Ray cluster is already started, you should not need to run anything on the worker nodes.
Local Cluster Setup
-------------------
If you run into issues (or want to add nodes manually), you can use the manual cluster setup `documentation here <using-ray-on-a-cluster.html>`__. At a glance, on the head node, run the following.
If you already have a list of nodes, you can follow the local private cluster setup `instructions here <autoscaling.html#quick-start-private-cluster>`_. Below is an example cluster configuration as ``tune-local-default.yaml``:
.. literalinclude:: ../../python/ray/tune/examples/tune-local-default.yaml
:language: yaml
``ray up`` starts Ray on the cluster of nodes.
.. code-block:: bash
ray up tune-default.yaml
``ray submit`` uploads ``tune_script.py`` to the cluster and runs ``python tune_script.py [args]``.
.. code-block:: bash
ray submit tune-default.yaml tune_script.py --args="--ray-redis-address=localhost:6379"
Manual Local Cluster Setup
~~~~~~~~~~~~~~~~~~~~~~~~~~
If you run into issues using the local cluster setup (or want to add nodes manually), you can set the cluster up manually. `Full documentation here <using-ray-on-a-cluster.html>`__. At a glance:
**On the head node**:
.. code-block:: bash
@ -86,8 +91,51 @@ The command will print out the address of the Redis server that was started (and
$ ray start --redis-address=<redis-address>
If you already have a list of nodes, you can follow the private autoscaling cluster setup `instructions here <autoscaling.html#quick-start-private-cluster>`_.
Then, you can run your Tune Python script on the head node as follows:
.. code-block:: bash
# On the head node, execute using the existing Ray cluster
$ python tune_script.py --ray-redis-address=<redis-address>
Launching a cloud cluster
-------------------------
.. tip::
If you already have a list of nodes, go to the `Local Cluster Setup`_ section.
Ray currently supports AWS and GCP. Below, we will launch nodes on AWS that will default to using the Deep Learning AMI. See the `cluster setup documentation <autoscaling.html>`_. Save the below cluster configuration (``tune-default.yaml``):
.. literalinclude:: ../../python/ray/tune/examples/tune-default.yaml
:language: yaml
:name: tune-default.yaml
``ray up`` starts Ray on the cluster of nodes.
.. code-block:: bash
ray up tune-default.yaml
``ray submit --start`` starts a cluster as specified by the given cluster configuration YAML file, uploads ``tune_script.py`` to the cluster, and runs ``python tune_script.py [args]``.
.. code-block:: bash
ray submit tune-default.yaml tune_script.py --start --args="--ray-redis-address=localhost:6379"
.. image:: images/tune-upload.png
:scale: 50%
:align: center
Analyze your results on TensorBoard by starting TensorBoard on the remote head machine.
.. code-block:: bash
# Go to http://localhost:6006 to access TensorBoard.
ray exec tune-default.yaml 'tensorboard --logdir=~/ray_results/ --port 6006' --port-forward 6006
Note that you can customize the directory of results by running: ``tune.run(local_dir=..)``. You can then point TensorBoard to that directory to visualize results. You can also use `awless <https://github.com/wallix/awless>`_ for easy cluster management on AWS.
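For instance, a small sketch of redirecting results to a custom directory (the path and the ``my_trainable`` name are placeholders):

.. code-block:: python

    from ray import tune

    # Results are written under /data/ray_results/my_experiment/ instead of
    # the default ~/ray_results; point TensorBoard's --logdir at this path.
    tune.run(my_trainable, name="my_experiment", local_dir="/data/ray_results")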
Pre-emptible Instances (Cloud)
------------------------------
@ -180,7 +228,7 @@ To summarize, here are the commands to run:
# wait a while until after all nodes have started
ray kill-random-node tune-default.yaml --hard
You should see Tune eventually continue the trials on a different worker node. See the `Saving and Recovery <tune-usage.html#saving-and-recovery>`__ section for more details.
You should see Tune eventually continue the trials on a different worker node. See the `Save and Restore <tune-usage.html#save-and-restore>`__ section for more details.
You can also specify ``tune.run(upload_dir=...)`` to sync results with cloud storage such as S3, persisting results in case you want to start and stop your cluster automatically.
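As a hedged sketch of that option (the bucket name and ``my_trainable`` are placeholders):

.. code-block:: python

    from ray import tune

    # Trial logs and checkpoints are periodically synced to the bucket, so
    # results persist even after the cluster is torn down.
    tune.run(my_trainable, upload_dir="s3://my-tune-results/experiment-1")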


@ -35,7 +35,7 @@ Tune includes a distributed implementation of `Population Based Training (PBT) <
})
tune.run( ... , scheduler=pbt_scheduler)
When the PBT scheduler is enabled, each trial variant is treated as a member of the population. Periodically, top-performing trials are checkpointed (this requires your Trainable to support `save and restore <tune-usage.html#saving-and-recovery>`__). Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation.
When the PBT scheduler is enabled, each trial variant is treated as a member of the population. Periodically, top-performing trials are checkpointed (this requires your Trainable to support `save and restore <tune-usage.html#save-and-restore>`__). Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation.
You can run this `toy PBT example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_example.py>`__ to get an idea of how PBT operates. When training in PBT mode, a single trial may see many different hyperparameters over its lifetime, which is recorded in its ``result.json`` file. The following figure generated by the example shows PBT optimizing a learning rate schedule over the course of a single experiment:
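For context, a fuller sketch of a PBT setup looks roughly like the following; the mutation range, metric name, and ``MyTrainable`` class are illustrative placeholders:

.. code-block:: python

    import random

    from ray import tune
    from ray.tune.schedulers import PopulationBasedTraining

    pbt_scheduler = PopulationBasedTraining(
        time_attr="training_iteration",
        metric="mean_accuracy",
        mode="max",
        perturbation_interval=5,  # clone/perturb every 5 iterations
        hyperparam_mutations={
            "lr": lambda: random.uniform(1e-4, 1e-1),
        })

    tune.run(
        MyTrainable,       # must support save and restore (checkpointing)
        num_samples=8,     # population size
        scheduler=pbt_scheduler)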
@ -69,7 +69,7 @@ Compared to the original version of HyperBand, this implementation provides bett
HyperBand
---------
.. note:: Note that the HyperBand scheduler requires your trainable to support saving and restoring, which is described in `Tune User Guide <tune-usage.html#saving-and-recovery>`__. Checkpointing enables the scheduler to multiplex many concurrent trials onto a limited size cluster.
.. note:: Note that the HyperBand scheduler requires your trainable to support saving and restoring, which is described in `Tune User Guide <tune-usage.html#save-and-restore>`__. Checkpointing enables the scheduler to multiplex many concurrent trials onto a limited size cluster.
Tune also implements the `standard version of HyperBand <https://arxiv.org/abs/1603.06560>`__. You can use it as such:
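A sketch of one way to wire it up; the metric name and ``max_t`` value below are illustrative, and ``MyTrainable`` is a placeholder:

.. code-block:: python

    from ray import tune
    from ray.tune.schedulers import HyperBandScheduler

    hyperband = HyperBandScheduler(
        time_attr="training_iteration",
        metric="mean_accuracy",
        mode="max",
        max_t=100)  # upper bound on iterations for any single trial

    tune.run(MyTrainable, num_samples=20, scheduler=hyperband)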


@ -255,8 +255,8 @@ If your trainable function / class creates further Ray actors or tasks that also
}
)
Saving and Recovery
-------------------
Save and Restore
----------------
When running a hyperparameter search, Tune can automatically and periodically save/checkpoint your model. Checkpointing is used for
@ -296,7 +296,7 @@ Checkpoints will be saved by training iteration to ``local_dir/exp_name/trial_na
Trainable (Trial) Checkpointing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-------------------------------
Checkpointing assumes that the model state will be saved to disk on whichever node the Trainable is running on. You can checkpoint with three different mechanisms: manually, periodically, and at termination.
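As a minimal sketch of a checkpointable Trainable under the class-based API (the stored state and file name are illustrative):

.. code-block:: python

    import os

    from ray.tune import Trainable


    class MyTrainable(Trainable):
        def _setup(self, config):
            self.step_count = 0

        def _train(self):
            self.step_count += 1
            return {"mean_accuracy": 0.01 * self.step_count}

        def _save(self, checkpoint_dir):
            # Write whatever state is needed to resume, and return its path.
            path = os.path.join(checkpoint_dir, "state.txt")
            with open(path, "w") as f:
                f.write(str(self.step_count))
            return path

        def _restore(self, checkpoint_path):
            # Rebuild the in-memory state from the saved checkpoint.
            with open(checkpoint_path) as f:
                self.step_count = int(f.read())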
@ -349,7 +349,7 @@ The checkpoint will be saved at a path that looks like ``local_dir/exp_name/tria
)
Fault Tolerance
~~~~~~~~~~~~~~~
---------------
Tune will automatically restart trials from the last checkpoint in case of trial failures/errors (if ``max_failures`` is set), both in the single-node and distributed settings.
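A small sketch of enabling this (the values are illustrative, and ``MyTrainable`` is the checkpointable class sketched above):

.. code-block:: python

    from ray import tune

    tune.run(
        MyTrainable,
        checkpoint_freq=3,  # checkpoint every 3 training iterations
        max_failures=5)     # retry a failed trial up to 5 times from its last checkpoint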


@ -48,14 +48,6 @@ If TensorBoard is installed, automatically visualize all trial results:
Distributed Quick Start
-----------------------
.. note::
This assumes that you have already set up your AWS account and AWS credentials (``aws configure``). To run this example, you will need to install the following:
.. code-block:: bash
$ pip install ray torch torchvision filelock
1. Import and initialize Ray by appending the following to your example script.
.. code-block:: python
@ -71,25 +63,35 @@ Distributed Quick Start
Alternatively, download a full example script here: :download:`mnist_pytorch.py <../../python/ray/tune/examples/mnist_pytorch.py>`
2. Download an example cluster yaml here: :download:`tune-default.yaml <../../python/ray/tune/examples/tune-default.yaml>`
2. Download the following example Ray cluster configuration as ``tune-local-default.yaml`` and replace the appropriate fields:
.. literalinclude:: ../../python/ray/tune/examples/tune-local-default.yaml
:language: yaml
Alternatively, download it here: :download:`tune-local-default.yaml <../../python/ray/tune/examples/tune-local-default.yaml>`. See `Ray cluster docs here <autoscaling.html>`_.
3. Run ``ray submit`` like the following.
.. code-block:: bash
ray submit tune-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start
ray submit tune-local-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start
This will start 3 AWS machines and run a distributed hyperparameter search across them. Append ``[--stop]`` to automatically shut down your nodes afterwards.
This will start Ray on all of your machines and run a distributed hyperparameter search across them.
To summarize, here is the full set of commands:
.. code-block:: bash
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/tune/examples/mnist_pytorch.py
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/tune/tune-default.yaml
ray submit tune-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/tune/examples/tune-local-default.yaml
ray submit tune-local-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start
Take a look at the `Distributed Experiments <tune-distributed.html>`_ documentation for more details, including setting up distributed experiments on local machines, using GCP, adding resilience to spot instance usage, and more.
Take a look at the `Distributed Experiments <tune-distributed.html>`_ documentation for more details, including:
1. Setting up distributed experiments on your local cluster
2. Using AWS and GCP
3. Spot instance usage/pre-emptible instances, and more.
Getting Started
---------------


@ -0,0 +1,34 @@
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import unittest
import yaml
from ray.autoscaler.autoscaler import fillout_defaults, validate_config
from ray.tests.utils import recursive_fnmatch
RAY_PATH = os.path.abspath(os.path.join(__file__, "../../"))
CONFIG_PATHS = recursive_fnmatch(
    os.path.join(RAY_PATH, "autoscaler"), "*.yaml")
CONFIG_PATHS += recursive_fnmatch(
    os.path.join(RAY_PATH, "tune/examples/"), "*.yaml")
class AutoscalingConfigTest(unittest.TestCase):
    def testValidateDefaultConfig(self):
        # Every YAML config shipped with Ray (the autoscaler defaults and the
        # Tune example configs) should pass autoscaler validation once
        # defaults are filled in.
        for config_path in CONFIG_PATHS:
            with open(config_path) as f:
                config = yaml.safe_load(f)
            config = fillout_defaults(config)
            try:
                validate_config(config)
            except Exception:
                self.fail("Config did not pass validation test!")


if __name__ == "__main__":
    unittest.main(verbosity=2)


@ -2,6 +2,7 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import fnmatch
import os
import subprocess
import sys
@ -116,3 +117,15 @@ def wait_for_condition(condition_predictor,
time_elapsed += retry_interval_ms
time.sleep(retry_interval_ms / 1000.0)
return False
def recursive_fnmatch(dirpath, pattern):
    """Searches a directory subtree for filenames matching a pattern.

    Similar to glob.glob(..., recursive=True), but also supports Python 2.7.
    """
    matches = []
    for root, dirnames, filenames in os.walk(dirpath):
        for filename in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(root, filename))
    return matches
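# Illustrative usage (the path below is a placeholder): gather every YAML
# file under a directory tree.
#   tune_yamls = recursive_fnmatch("/path/to/ray/tune/examples", "*.yaml")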


@ -1,51 +1,10 @@
# A unique identifier for the head node and workers of this cluster.
cluster_name: tune-example
# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 2
# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 2
# Cloud-provider specific configuration.
provider:
type: aws
region: us-west-2
# Availability zone(s), comma-separated, that nodes may be launched in.
# Nodes are currently spread between zones by a round-robin approach;
# however, this implementation detail should not be relied upon.
availability_zone: us-west-2a,us-west-2b
# How Ray will authenticate with newly launched nodes.
# By default Ray creates a new private keypair, but you can also use your own.
auth:
ssh_user: ubuntu
# Provider-specific config for the head node, e.g. instance type.
head_node:
InstanceType: c5.xlarge
ImageId: ami-0b294f219d14e6a82 # Deep Learning AMI (Ubuntu) Version 21.0
# Provider-specific config for worker nodes, e.g. instance type.
worker_nodes:
InstanceType: c5.xlarge
ImageId: ami-0b294f219d14e6a82 # Deep Learning AMI (Ubuntu) Version 21.0
# Run workers on spot by default. Comment this out to use on-demand.
InstanceMarketOptions:
MarketType: spot
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
}
# List of shell commands to run to set up each node.
setup_commands:
- pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.8.0.dev3-cp36-cp36m-manylinux1_x86_64.whl
- pip install torch torchvision tabulate tensorboard filelock
cluster_name: tune-default
provider: {type: aws, region: us-west-2}
auth: {ssh_user: ubuntu}
min_workers: 3
max_workers: 3
# Deep Learning AMI (Ubuntu) Version 21.0
head_node: {InstanceType: c5.xlarge, ImageId: ami-0b294f219d14e6a82}
worker_nodes: {InstanceType: c5.xlarge, ImageId: ami-0b294f219d14e6a82}
setup_commands: # Set up each node.
- pip install ray torch torchvision tabulate tensorboard


@ -0,0 +1,11 @@
cluster_name: local-default
provider:
    type: local
    head_ip: YOUR_HEAD_NODE_HOSTNAME
    worker_ips: [WORKER_NODE_1_HOSTNAME, WORKER_NODE_2_HOSTNAME, ... ]
auth: {ssh_user: YOUR_USERNAME, ssh_private_key: ~/.ssh/id_rsa}
## Typically for local clusters, min_workers == max_workers.
min_workers: 3
max_workers: 3
setup_commands: # Set up each node.
- pip install ray torch torchvision tabulate tensorboard


@ -10,7 +10,8 @@ import unittest
import ray
from ray import tune
from ray.tune.util import recursive_fnmatch, validate_save_restore
from ray.tests.utils import recursive_fnmatch
from ray.tune.util import validate_save_restore
from ray.rllib import _register_all


@ -4,9 +4,7 @@ from __future__ import print_function
import base64
import copy
import fnmatch
import logging
import os
import threading
import time
from collections import defaultdict
@ -213,18 +211,6 @@ def _from_pinnable(obj):
return obj[0]
def recursive_fnmatch(dirpath, pattern):
"""Looks at a file directory subtree for a filename pattern.
Similar to glob.glob(..., recursive=True) but also supports 2.7
"""
matches = []
for root, dirnames, filenames in os.walk(dirpath):
for filename in fnmatch.filter(filenames, pattern):
matches.append(os.path.join(root, filename))
return matches
def validate_save_restore(trainable_cls, config=None, use_object_store=False):
"""Helper method to check if your Trainable class will resume correctly.