[tune,autoscaler] Test yaml, add better distributed docs (#5403)

Richard Liaw 2019-08-08 00:59:23 -07:00 committed by GitHub
parent 1f8ae17f60
commit ed89897a31
10 changed files with 184 additions and 130 deletions


@ -1,14 +1,27 @@
Tune Distributed Experiments
============================
Tune is commonly used for large-scale distributed hyperparameter optimization. Tune and Ray provide many utilities that enable an effective workflow for interacting with a cluster, including fast file mounting, one-line cluster launching, and result uploading to cloud storage.
Tune is commonly used for large-scale distributed hyperparameter optimization. This page will cover:
This page will overview the tooling for distributed experiments, covering how to connect to a cluster, how to launch a distributed experiment, and commonly used commands.
1. How to set up and launch a distributed experiment,
2. `commonly used commands <tune-distributed.html#common-commands>`_, including fast file mounting, one-line cluster launching, and result uploading to cloud storage.
Connecting to a cluster
-----------------------
**Quick Summary**: To run a distributed experiment with Tune, you need to:
One common approach to modifying an existing Tune experiment to go distributed is to set an `argparse` variable so that toggling between distributed and single-node is seamless. This allows Tune to utilize all the resources available to the Ray cluster.
1. Make sure your script has ``ray.init(redis_address=...)`` to connect to the existing Ray cluster.
2. If a Ray cluster does not exist, start one (instructions for `local machines <tune-distributed.html#local-cluster-setup>`_, `cloud <tune-distributed.html#launching-a-cloud-cluster>`_).
3. Run the script on the head node (or use ``ray submit``).
Running a distributed experiment
--------------------------------
Running a distributed (multi-node) experiment requires Ray to be started already. You can do this on local machines or on the cloud (instructions for `local machines <tune-distributed.html#local-cluster-setup>`_, `cloud <tune-distributed.html#launching-a-cloud-cluster>`_).
Across your machines, Tune will automatically detect the number of GPUs and CPUs without you needing to manage ``CUDA_VISIBLE_DEVICES``.
To execute a distributed experiment, call ``ray.init(redis_address=XXX)`` before ``tune.run``, where ``XXX`` is the Ray redis address, which defaults to ``localhost:6379``. The Tune Python script should be executed only on the head node of the Ray cluster.
One common approach to modifying an existing Tune experiment to go distributed is to set an ``argparse`` variable so that toggling between distributed and single-node is seamless.
.. code-block:: python
@ -20,58 +33,50 @@ One common approach to modifying an existing Tune experiment to go distributed i
args = parser.parse_args()
ray.init(redis_address=args.ray_redis_address)
Note that connecting to a cluster requires a pre-existing Ray cluster to be started already (`Manual Cluster Setup <using-ray-on-a-cluster.html>`_). The script should be run on the head node of the Ray cluster. Below, ``tune_script.py`` can be any script that runs a Tune hyperparameter search.
tune.run(...)
.. code-block:: bash
# Single-node execution
$ python tune_script.py
# On the head node, connect to an existing Ray cluster
$ python tune_script.py --ray-redis-address=localhost:XXXX
.. literalinclude:: ../../python/ray/tune/examples/mnist_pytorch.py
:language: python
:start-after: if __name__ == "__main__":
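As a rough, self-contained sketch of the ``argparse`` toggle described above (the objective function, search space, and resource values below are illustrative placeholders, not part of the included example):

.. code-block:: python

    import argparse

    import ray
    from ray import tune


    def easy_objective(config, reporter):
        # Placeholder training function; replace with your own model code.
        for step in range(10):
            reporter(mean_accuracy=config["lr"] * step)


    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--ray-redis-address",
        default=None,
        help="Address of an existing Ray cluster, e.g. localhost:6379.")
    args = parser.parse_args()

    # With no address, Ray starts locally (single-node); with an address,
    # the script joins the existing cluster and Tune can use all its resources.
    ray.init(redis_address=args.ray_redis_address)

    tune.run(
        easy_objective,
        num_samples=4,
        config={"lr": tune.grid_search([0.01, 0.1])},
        resources_per_trial={"cpu": 1})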
If you used a cluster configuration (starting a cluster with ``ray up`` or ``ray submit --start``), use:
.. code-block:: bash
Launching a cloud cluster
-------------------------
ray submit tune-default.yaml tune_script.py --args="--ray-redis-address=localhost:6379"
.. tip::
If you already have a list of nodes, skip down to the `Local Cluster Setup`_ section.
Ray currently supports AWS and GCP. Below, we will launch nodes on AWS that will default to using the Deep Learning AMI. See the `cluster setup documentation <autoscaling.html>`_.
.. literalinclude:: ../../python/ray/tune/examples/tune-default.yaml
:language: yaml
:name: tune-default.yaml
This code starts a cluster as specified by the given cluster configuration YAML file, uploads ``tune_script.py`` to the cluster, and runs ``python tune_script.py``.
.. code-block:: bash
ray submit tune-default.yaml tune_script.py --start
.. image:: images/tune-upload.png
:scale: 50%
:align: center
Analyze your results on TensorBoard by starting TensorBoard on the remote head machine.
.. code-block:: bash
# Go to http://localhost:6006 to access TensorBoard.
ray exec tune-default.yaml 'tensorboard --logdir=~/ray_results/ --port 6006' --port-forward 6006
Note that you can customize the directory of results by running: ``tune.run(local_dir=..)``. You can then point TensorBoard to that directory to visualize results. You can also use `awless <https://github.com/wallix/awless>`_ for easy cluster management on AWS.
1. In the examples, the Ray redis address commonly used is ``localhost:6379``.
2. If the Ray cluster is already started, you should not need to run anything on the worker nodes.
Local Cluster Setup
-------------------
If you run into issues (or want to add nodes manually), you can use the manual cluster setup `documentation here <using-ray-on-a-cluster.html>`__. At a glance, on the head node, run the following.
If you already have a list of nodes, you can follow the local private cluster setup `instructions here <autoscaling.html#quick-start-private-cluster>`_. Below is an example cluster configuration as ``tune-local-default.yaml``:
.. literalinclude:: ../../python/ray/tune/examples/tune-local-default.yaml
:language: yaml
``ray up`` starts Ray on the cluster of nodes.
.. code-block:: bash
ray up tune-default.yaml
``ray submit`` uploads ``tune_script.py`` to the cluster and runs ``python tune_script.py [args]``.
.. code-block:: bash
ray submit tune-default.yaml tune_script.py --args="--ray-redis-address=localhost:6379"
Manual Local Cluster Setup
~~~~~~~~~~~~~~~~~~~~~~~~~~
If you run into issues using the local cluster setup (or want to add nodes manually), you can set the cluster up manually. `Full documentation here <using-ray-on-a-cluster.html>`__. At a glance:
**On the head node**:
.. code-block:: bash
@ -86,8 +91,51 @@ The command will print out the address of the Redis server that was started (and
$ ray start --redis-address=<redis-address>
If you already have a list of nodes, you can follow the private autoscaling cluster setup `instructions here <autoscaling.html#quick-start-private-cluster>`_.
Then, you can run your Tune Python script on the head node as follows:
.. code-block:: bash
# On the head node, execute using the existing Ray cluster
$ python tune_script.py --ray-redis-address=<redis-address>
Launching a cloud cluster
-------------------------
.. tip::
If you already have a list of nodes, go to the `Local Cluster Setup`_ section.
Ray currently supports AWS and GCP. Below, we will launch nodes on AWS that will default to using the Deep Learning AMI. See the `cluster setup documentation <autoscaling.html>`_. Save the below cluster configuration (``tune-default.yaml``):
.. literalinclude:: ../../python/ray/tune/examples/tune-default.yaml
:language: yaml
:name: tune-default.yaml
``ray up`` starts Ray on the cluster of nodes.
.. code-block:: bash
ray up tune-default.yaml
``ray submit --start`` starts a cluster as specified by the given cluster configuration YAML file, uploads ``tune_script.py`` to the cluster, and runs ``python tune_script.py [args]``.
.. code-block:: bash
ray submit tune-default.yaml tune_script.py --start --args="--ray-redis-address=localhost:6379"
.. image:: images/tune-upload.png
:scale: 50%
:align: center
Analyze your results on TensorBoard by starting TensorBoard on the remote head machine.
.. code-block:: bash
# Go to http://localhost:6006 to access TensorBoard.
ray exec tune-default.yaml 'tensorboard --logdir=~/ray_results/ --port 6006' --port-forward 6006
Note that you can customize the directory of results by running: ``tune.run(local_dir=..)``. You can then point TensorBoard to that directory to visualize results. You can also use `awless <https://github.com/wallix/awless>`_ for easy cluster management on AWS.
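For instance, a small sketch of redirecting results to a custom directory (the path and the ``my_trainable`` name are placeholders):

.. code-block:: python

    from ray import tune

    # Results are written under /data/ray_results/my_experiment/ instead of
    # the default ~/ray_results; point TensorBoard's --logdir at this path.
    tune.run(my_trainable, name="my_experiment", local_dir="/data/ray_results")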
Pre-emptible Instances (Cloud)
------------------------------
@ -180,7 +228,7 @@ To summarize, here are the commands to run:
# wait a while until after all nodes have started
ray kill-random-node tune-default.yaml --hard
You should see Tune eventually continue the trials on a different worker node. See the `Saving and Recovery <tune-usage.html#saving-and-recovery>`__ section for more details.
You should see Tune eventually continue the trials on a different worker node. See the `Save and Restore <tune-usage.html#save-and-restore>`__ section for more details.
You can also specify ``tune.run(upload_dir=...)`` to sync results with cloud storage such as S3, persisting results in case you want to start and stop your cluster automatically.
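As a hedged sketch of that option (the bucket name and ``my_trainable`` are placeholders):

.. code-block:: python

    from ray import tune

    # Trial logs and checkpoints are periodically synced to the bucket, so
    # results persist even after the cluster is torn down.
    tune.run(my_trainable, upload_dir="s3://my-tune-results/experiment-1")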


@ -35,7 +35,7 @@ Tune includes a distributed implementation of `Population Based Training (PBT) <
})
tune.run( ... , scheduler=pbt_scheduler)
When the PBT scheduler is enabled, each trial variant is treated as a member of the population. Periodically, top-performing trials are checkpointed (this requires your Trainable to support `save and restore <tune-usage.html#saving-and-recovery>`__). Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation.
When the PBT scheduler is enabled, each trial variant is treated as a member of the population. Periodically, top-performing trials are checkpointed (this requires your Trainable to support `save and restore <tune-usage.html#save-and-restore>`__). Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation.
You can run this `toy PBT example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_example.py>`__ to get an idea of how PBT operates. When training in PBT mode, a single trial may see many different hyperparameters over its lifetime, which is recorded in its ``result.json`` file. The following figure generated by the example shows PBT optimizing a learning rate schedule over the course of a single experiment:
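For context, a fuller sketch of a PBT setup looks roughly like the following; the mutation range, metric name, and ``MyTrainable`` class are illustrative placeholders:

.. code-block:: python

    import random

    from ray import tune
    from ray.tune.schedulers import PopulationBasedTraining

    pbt_scheduler = PopulationBasedTraining(
        time_attr="training_iteration",
        metric="mean_accuracy",
        mode="max",
        perturbation_interval=5,  # clone/perturb every 5 iterations
        hyperparam_mutations={
            "lr": lambda: random.uniform(1e-4, 1e-1),
        })

    tune.run(
        MyTrainable,       # must support save and restore (checkpointing)
        num_samples=8,     # population size
        scheduler=pbt_scheduler)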
@ -69,7 +69,7 @@ Compared to the original version of HyperBand, this implementation provides bett
HyperBand
---------
.. note:: Note that the HyperBand scheduler requires your trainable to support saving and restoring, which is described in `Tune User Guide <tune-usage.html#saving-and-recovery>`__. Checkpointing enables the scheduler to multiplex many concurrent trials onto a limited size cluster.
.. note:: Note that the HyperBand scheduler requires your trainable to support saving and restoring, which is described in `Tune User Guide <tune-usage.html#save-and-restore>`__. Checkpointing enables the scheduler to multiplex many concurrent trials onto a limited size cluster.
Tune also implements the `standard version of HyperBand <https://arxiv.org/abs/1603.06560>`__. You can use it as such:
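A sketch of one way to wire it up; the metric name and ``max_t`` value below are illustrative, and ``MyTrainable`` is a placeholder:

.. code-block:: python

    from ray import tune
    from ray.tune.schedulers import HyperBandScheduler

    hyperband = HyperBandScheduler(
        time_attr="training_iteration",
        metric="mean_accuracy",
        mode="max",
        max_t=100)  # upper bound on iterations for any single trial

    tune.run(MyTrainable, num_samples=20, scheduler=hyperband)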


@ -255,8 +255,8 @@ If your trainable function / class creates further Ray actors or tasks that also
}
)
Saving and Recovery
-------------------
Save and Restore
----------------
When running a hyperparameter search, Tune can automatically and periodically save/checkpoint your model. Checkpointing is used for
@ -296,7 +296,7 @@ Checkpoints will be saved by training iteration to ``local_dir/exp_name/trial_na
Trainable (Trial) Checkpointing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-------------------------------
Checkpointing assumes that the model state will be saved to disk on whichever node the Trainable is running on. You can checkpoint with three different mechanisms: manually, periodically, and at termination.
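As a minimal sketch of a checkpointable Trainable under the class-based API (the stored state and file name are illustrative):

.. code-block:: python

    import os

    from ray.tune import Trainable


    class MyTrainable(Trainable):
        def _setup(self, config):
            self.step_count = 0

        def _train(self):
            self.step_count += 1
            return {"mean_accuracy": 0.01 * self.step_count}

        def _save(self, checkpoint_dir):
            # Write whatever state is needed to resume, and return its path.
            path = os.path.join(checkpoint_dir, "state.txt")
            with open(path, "w") as f:
                f.write(str(self.step_count))
            return path

        def _restore(self, checkpoint_path):
            # Rebuild the in-memory state from the saved checkpoint.
            with open(checkpoint_path) as f:
                self.step_count = int(f.read())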
@ -349,7 +349,7 @@ The checkpoint will be saved at a path that looks like ``local_dir/exp_name/tria
)
Fault Tolerance
~~~~~~~~~~~~~~~
---------------
Tune will automatically restart trials from the last checkpoint in case of trial failures/errors (if ``max_failures`` is set), both in the single-node and distributed settings.
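A small sketch of enabling this (the values are illustrative, and ``MyTrainable`` is the checkpointable class sketched above):

.. code-block:: python

    from ray import tune

    tune.run(
        MyTrainable,
        checkpoint_freq=3,  # checkpoint every 3 training iterations
        max_failures=5)     # retry a failed trial up to 5 times from its last checkpoint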


@ -48,14 +48,6 @@ If TensorBoard is installed, automatically visualize all trial results:
Distributed Quick Start
-----------------------
.. note::
This assumes that you have already set up your AWS account and AWS credentials (``aws configure``). To run this example, you will need to install the following:
.. code-block:: bash
$ pip install ray torch torchvision filelock
1. Import and initialize Ray by appending the following to your example script.
.. code-block:: python
@ -71,25 +63,35 @@ Distributed Quick Start
Alternatively, download a full example script here: :download:`mnist_pytorch.py <../../python/ray/tune/examples/mnist_pytorch.py>`
2. Download an example cluster yaml here: :download:`tune-default.yaml <../../python/ray/tune/examples/tune-default.yaml>`
2. Download the following example Ray cluster configuration as ``tune-local-default.yaml`` and replace the appropriate fields:
.. literalinclude:: ../../python/ray/tune/examples/tune-local-default.yaml
:language: yaml
Alternatively, download it here: :download:`tune-local-default.yaml <../../python/ray/tune/examples/tune-local-default.yaml>`. See `Ray cluster docs here <autoscaling.html>`_.
3. Run ``ray submit`` like the following.
.. code-block:: bash
ray submit tune-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start
ray submit tune-local-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start
This will start 3 AWS machines and run a distributed hyperparameter search across them. Append ``[--stop]`` to automatically shut down your nodes afterwards.
This will start Ray on all of your machines and run a distributed hyperparameter search across them.
To summarize, here is the full set of commands:
.. code-block:: bash
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/tune/examples/mnist_pytorch.py
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/tune/tune-default.yaml
ray submit tune-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start
wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/tune/examples/tune-local-default.yaml
ray submit tune-local-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start
Take a look at the `Distributed Experiments <tune-distributed.html>`_ documentation for more details, including setting up distributed experiments on local machines, using GCP, adding resilience to spot instance usage, and more.
Take a look at the `Distributed Experiments <tune-distributed.html>`_ documentation for more details, including:
1. Setting up distributed experiments on your local cluster
2. Using AWS and GCP
3. Spot instance usage/pre-emptible instances, and more.
Getting Started
---------------


@ -0,0 +1,34 @@
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import os
import unittest
import yaml
from ray.autoscaler.autoscaler import fillout_defaults, validate_config
from ray.tests.utils import recursive_fnmatch
RAY_PATH = os.path.abspath(os.path.join(__file__, "../../"))
CONFIG_PATHS = recursive_fnmatch(
    os.path.join(RAY_PATH, "autoscaler"), "*.yaml")
CONFIG_PATHS += recursive_fnmatch(
    os.path.join(RAY_PATH, "tune/examples/"), "*.yaml")
class AutoscalingConfigTest(unittest.TestCase):
    def testValidateDefaultConfig(self):
        # Every YAML config shipped with Ray (the autoscaler defaults and the
        # Tune example configs) should pass autoscaler validation once
        # defaults are filled in.
        for config_path in CONFIG_PATHS:
            with open(config_path) as f:
                config = yaml.safe_load(f)
            config = fillout_defaults(config)
            try:
                validate_config(config)
            except Exception:
                self.fail("Config did not pass validation test!")


if __name__ == "__main__":
    unittest.main(verbosity=2)


@ -2,6 +2,7 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import fnmatch
import os
import subprocess
import sys
@ -116,3 +117,15 @@ def wait_for_condition(condition_predictor,
time_elapsed += retry_interval_ms
time.sleep(retry_interval_ms / 1000.0)
return False
def recursive_fnmatch(dirpath, pattern):
    """Searches a directory subtree for filenames matching a pattern.

    Similar to glob.glob(..., recursive=True), but also supports Python 2.7.
    """
    matches = []
    for root, dirnames, filenames in os.walk(dirpath):
        for filename in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(root, filename))
    return matches
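# Illustrative usage (the path below is a placeholder): gather every YAML
# file under a directory tree.
#   tune_yamls = recursive_fnmatch("/path/to/ray/tune/examples", "*.yaml")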


@ -1,51 +1,10 @@
# A unique identifier for the head node and workers of this cluster.
cluster_name: tune-example
# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 2
# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 2
# Cloud-provider specific configuration.
provider:
type: aws
region: us-west-2
# Availability zone(s), comma-separated, that nodes may be launched in.
# Nodes are currently spread between zones by a round-robin approach;
# however, this implementation detail should not be relied upon.
availability_zone: us-west-2a,us-west-2b
# How Ray will authenticate with newly launched nodes.
# By default Ray creates a new private keypair, but you can also use your own.
auth:
ssh_user: ubuntu
# Provider-specific config for the head node, e.g. instance type.
head_node:
InstanceType: c5.xlarge
ImageId: ami-0b294f219d14e6a82 # Deep Learning AMI (Ubuntu) Version 21.0
# Provider-specific config for worker nodes, e.g. instance type.
worker_nodes:
InstanceType: c5.xlarge
ImageId: ami-0b294f219d14e6a82 # Deep Learning AMI (Ubuntu) Version 21.0
# Run workers on spot by default. Comment this out to use on-demand.
InstanceMarketOptions:
MarketType: spot
# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
}
# List of shell commands to run to set up each node.
setup_commands:
- pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.8.0.dev3-cp36-cp36m-manylinux1_x86_64.whl
- pip install torch torchvision tabulate tensorboard filelock
cluster_name: tune-default
provider: {type: aws, region: us-west-2}
auth: {ssh_user: ubuntu}
min_workers: 3
max_workers: 3
# Deep Learning AMI (Ubuntu) Version 21.0
head_node: {InstanceType: c5.xlarge, ImageId: ami-0b294f219d14e6a82}
worker_nodes: {InstanceType: c5.xlarge, ImageId: ami-0b294f219d14e6a82}
setup_commands: # Set up each node.
- pip install ray torch torchvision tabulate tensorboard


@ -0,0 +1,11 @@
cluster_name: local-default
provider:
    type: local
    head_ip: YOUR_HEAD_NODE_HOSTNAME
    worker_ips: [WORKER_NODE_1_HOSTNAME, WORKER_NODE_2_HOSTNAME, ... ]
auth: {ssh_user: YOUR_USERNAME, ssh_private_key: ~/.ssh/id_rsa}
## Typically for local clusters, min_workers == max_workers.
min_workers: 3
max_workers: 3
setup_commands: # Set up each node.
- pip install ray torch torchvision tabulate tensorboard


@ -10,7 +10,8 @@ import unittest
import ray
from ray import tune
from ray.tune.util import recursive_fnmatch, validate_save_restore
from ray.tests.utils import recursive_fnmatch
from ray.tune.util import validate_save_restore
from ray.rllib import _register_all


@ -4,9 +4,7 @@ from __future__ import print_function
import base64
import copy
import fnmatch
import logging
import os
import threading
import time
from collections import defaultdict
@ -213,18 +211,6 @@ def _from_pinnable(obj):
return obj[0]
def recursive_fnmatch(dirpath, pattern):
"""Looks at a file directory subtree for a filename pattern.
Similar to glob.glob(..., recursive=True) but also supports 2.7
"""
matches = []
for root, dirnames, filenames in os.walk(dirpath):
for filename in fnmatch.filter(filenames, pattern):
matches.append(os.path.join(root, filename))
return matches
def validate_save_restore(trainable_cls, config=None, use_object_store=False):
"""Helper method to check if your Trainable class will resume correctly.