Ray Tune: Hyperparameter Optimization Framework
===============================================

Ray Tune is a hyperparameter optimization framework for long-running tasks such as RL and deep learning training. Ray Tune makes it easy to go from running one or more experiments on a single machine to running on a large cluster with efficient search algorithms.

Getting Started
---------------

To use Ray Tune, add a two-line modification to a function:

.. code-block:: python
    :emphasize-lines: 1,5

    def my_func(config, reporter):  # add the reporter parameter
        import time, numpy as np
        i = 0
        while True:
            reporter(timesteps_total=i, mean_accuracy=i ** config["alpha"])
            i += config["beta"]
            time.sleep(.01)

Then, kick off your experiment:

.. code-block:: python

    import ray
    from ray import tune

    tune.register_trainable("my_func", my_func)
    ray.init()

    tune.run_experiments({
        "my_experiment": {
            "run": "my_func",
            "stop": {"mean_accuracy": 100},
            "config": {
                "alpha": tune.grid_search([0.2, 0.4, 0.6]),
                "beta": tune.grid_search([1, 2]),
            },
        },
    })

This script runs a small grid search over the ``my_func`` function using Ray Tune, reporting status on the command line until the stopping condition of ``mean_accuracy >= 100`` is reached (for metrics like *loss* that decrease over time, specify `neg_mean_loss <https://github.com/ray-project/ray/blob/master/python/ray/tune/result.py#L40>`__ as a condition instead):

::

    == Status ==
    Using FIFO scheduling algorithm.
    Resources used: 4/8 CPUs, 0/0 GPUs
    Result logdir: ~/ray_results/my_experiment
    - my_func_0_alpha=0.2,beta=1: RUNNING [pid=6778], 209 s, 20604 ts, 7.29 acc
    - my_func_1_alpha=0.4,beta=1: RUNNING [pid=6780], 208 s, 20522 ts, 53.1 acc
    - my_func_2_alpha=0.6,beta=1: TERMINATED [pid=6789], 21 s, 2190 ts, 101 acc
    - my_func_3_alpha=0.2,beta=2: RUNNING [pid=6791], 208 s, 41004 ts, 8.37 acc
    - my_func_4_alpha=0.4,beta=2: RUNNING [pid=6800], 209 s, 41204 ts, 70.1 acc
    - my_func_5_alpha=0.6,beta=2: TERMINATED [pid=6809], 10 s, 2164 ts, 100 acc

In order to report incremental progress, ``my_func`` periodically calls the ``reporter`` function passed in by Ray Tune to return the current timestep and other metrics as defined in `ray.tune.result.TrainingResult <https://github.com/ray-project/ray/blob/master/python/ray/tune/result.py>`__. Incremental results will be synced to local disk on the head node of the cluster.

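The metrics are passed to ``reporter`` as keyword arguments named after ``TrainingResult`` fields. As a hedged sketch (``my_train_func`` is a hypothetical function, and ``mean_loss`` is assumed to be among the fields defined in ``result.py``):

.. code-block:: python

    def my_train_func(config, reporter):
        for step in range(100):
            loss = 1.0 / (step + 1)  # stand-in for a real training loss
            acc = 1.0 - loss         # stand-in for a real accuracy
            # report progress using TrainingResult field names
            reporter(timesteps_total=step, mean_loss=loss, mean_accuracy=acc)
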
Learn more `about specifying experiments <tune-config.html>`__.

Features
--------

Ray Tune has the following features:

- Scalable implementations of search algorithms such as `Population Based Training (PBT) <pbt.html>`__, `Median Stopping Rule <hyperband.html#median-stopping-rule>`__, Model-Based Optimization (HyperOpt), and `HyperBand <hyperband.html>`__.

- Integration with visualization tools such as `TensorBoard <https://www.tensorflow.org/get_started/summaries_and_tensorboard>`__, `rllab's VisKit <https://media.readthedocs.org/pdf/rllab/latest/rllab.pdf>`__, and a `parallel coordinates visualization <https://en.wikipedia.org/wiki/Parallel_coordinates>`__.

- Flexible trial variant generation, including grid search, random search, and conditional parameter distributions (see the sketch after this list).

- Resource-aware scheduling, including support for concurrent runs of algorithms that may themselves be parallel and distributed.

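For example, grid search and random search can be mixed in a single ``config``. This is a hedged sketch: ``tune.grid_search`` is used exactly as in the example above, while the lambda-based sampling of the hypothetical ``lr`` parameter assumes the variant generator accepts callables:

.. code-block:: python

    import numpy as np
    from ray import tune

    config = {
        # grid search: every listed value is tried
        "alpha": tune.grid_search([0.2, 0.4, 0.6]),
        # random search: a new value is sampled per trial (assumed lambda support)
        "lr": lambda spec: np.random.uniform(0.001, 0.1),
    }
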
Concepts
--------

.. image:: tune-api.svg

Ray Tune schedules a number of *trials* in a cluster. Each trial runs a user-defined Python function or class and is parameterized by a *config* variation passed to the user code.

In order to run any given function, you first need to register it under a name with ``register_trainable``. This makes all Ray workers aware of the function.

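A minimal sketch of registration (``my_func`` is the function from the example above; ``MyTrainableClass`` is a hypothetical ``Trainable`` subclass such as the one shown under Trial Checkpointing below):

.. code-block:: python

    from ray.tune import register_trainable

    register_trainable("my_func", my_func)                 # register a function
    register_trainable("my_trainable", MyTrainableClass)   # or a Trainable subclass
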
.. autofunction:: ray.tune.register_trainable

Ray Tune provides a ``run_experiments`` function that generates and runs the trials described by the experiment specification. The trials are scheduled and managed by a *trial scheduler* that implements the search algorithm (default is FIFO).

.. autofunction:: ray.tune.run_experiments

Ray Tune can be used anywhere Ray can, e.g. on your laptop with ``ray.init()`` embedded in a Python script, or in an `auto-scaling cluster <autoscaling.html>`__ for massive parallelism.

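For example, the same script can either start Ray locally or attach to an existing cluster. This is a hedged sketch: ``redis_address`` is assumed to be the cluster-connection argument of ``ray.init()`` in this version of Ray, and the head node address is a placeholder:

.. code-block:: python

    import ray
    from ray import tune

    # run locally on a single machine
    ray.init()

    # ...or attach to an existing cluster instead (placeholder address):
    # ray.init(redis_address="<head-node-ip>:6379")

    tune.run_experiments({...})
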
You can find the code for Ray Tune `here on GitHub <https://github.com/ray-project/ray/tree/master/python/ray/tune>`__.

Trial Schedulers
----------------

By default, Ray Tune schedules trials in serial order with the ``FIFOScheduler`` class. However, you can also specify a custom scheduling algorithm that can early stop trials, perturb parameters, or incorporate suggestions from an external service. Currently implemented trial schedulers include `Population Based Training (PBT) <pbt.html>`__, `Median Stopping Rule <hyperband.html#median-stopping-rule>`__, Model-Based Optimization (HyperOpt), and `HyperBand <hyperband.html>`__.

.. code-block:: python

    run_experiments({...}, scheduler=AsyncHyperBandScheduler())

HyperOpt Integration
--------------------

The ``HyperOptScheduler`` is a trial scheduler backed by HyperOpt that performs sequential model-based hyperparameter optimization. In order to use this scheduler, you will need to install HyperOpt via the following command:

.. code-block:: bash

    $ pip install --upgrade git+git://github.com/hyperopt/hyperopt.git

An example of this can be found in `hyperopt_example.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/hyperopt_example.py>`__.

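Like other trial schedulers, it is passed to ``run_experiments`` via the ``scheduler`` argument. A minimal sketch (constructor arguments are omitted here; see the class reference below for the exact signature):

.. code-block:: python

    from ray.tune.hpo_scheduler import HyperOptScheduler

    run_experiments({...}, scheduler=HyperOptScheduler())
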
.. autoclass:: ray.tune.hpo_scheduler.HyperOptScheduler

Visualizing Results
-------------------

Ray Tune logs trial results to a unique directory per experiment, e.g. ``~/ray_results/my_experiment`` in the above example. The log records are compatible with a number of visualization tools.

To visualize learning in TensorBoard, install TensorFlow:

.. code-block:: bash

    $ pip install tensorflow

Then, after you run an experiment, you can visualize it with TensorBoard by specifying the output directory of your results. Note that if you are running Ray on a remote cluster, you can forward the TensorBoard port to your local machine through SSH using ``ssh -L 6006:localhost:6006 <address>``:

.. code-block:: bash

    $ tensorboard --logdir=~/ray_results/my_experiment

.. image:: ray-tune-tensorboard.png

To use rllab's VisKit (you may have to install some dependencies), run:

.. code-block:: bash

    $ git clone https://github.com/rll/rllab.git
    $ python rllab/rllab/viskit/frontend.py ~/ray_results/my_experiment

.. image:: ray-tune-viskit.png

Finally, to view the results with a `parallel coordinates visualization <https://en.wikipedia.org/wiki/Parallel_coordinates>`__, open `ParallelCoordinatesVisualization.ipynb <https://github.com/ray-project/ray/blob/master/python/ray/tune/ParallelCoordinatesVisualization.ipynb>`__ as follows and run its cells:

.. code-block:: bash

    $ cd $RAY_HOME/python/ray/tune
    $ jupyter-notebook ParallelCoordinatesVisualization.ipynb

.. image:: ray-tune-parcoords.png

Trial Checkpointing
-------------------

To enable checkpointing, you must implement a Trainable class (Trainable functions are not checkpointable, since they never return control back to their caller). The easiest way to do this is to subclass the pre-defined ``Trainable`` class and implement its ``_train``, ``_save``, and ``_restore`` abstract methods `(example) <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/hyperband_example.py>`__. Implementing this interface is required to support resource multiplexing in schedulers such as HyperBand and PBT.

For TensorFlow model training, this would look something like this `(full tensorflow example) <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/tune_mnist_ray_hyperband.py>`__:

.. code-block:: python

    class MyClass(Trainable):
        def _setup(self):
            self.saver = tf.train.Saver()
            self.sess = ...
            self.iteration = 0

        def _train(self):
            self.sess.run(...)
            self.iteration += 1

        def _save(self, checkpoint_dir):
            return self.saver.save(
                self.sess, checkpoint_dir + "/save",
                global_step=self.iteration)

        def _restore(self, path):
            return self.saver.restore(self.sess, path)

Additionally, checkpointing can be used to provide fault-tolerance for experiments. This can be enabled by setting ``checkpoint_freq: N`` and ``max_failures: M`` to checkpoint trials every *N* iterations and recover from up to *M* crashes per trial, e.g.:

.. code-block:: python

    run_experiments({
        "my_experiment": {
            ...
            "checkpoint_freq": 10,
            "max_failures": 5,
        },
    })

The class interface that must be implemented to enable checkpointing is as follows:

.. autoclass:: ray.tune.trainable.Trainable

Client API
----------

You can modify an ongoing experiment by adding or deleting trials using the Tune Client API. To do this, verify that you have the ``requests`` library installed:

.. code-block:: bash

    $ pip install requests

To use the Client API, you can start your experiment with ``with_server=True``:

.. code-block:: python

    run_experiments({...}, with_server=True, server_port=4321)

Then, on the client side, you can use the following class. The server address defaults to ``localhost:4321``. If on a cluster, you may want to forward this port (e.g. ``ssh -L <local_port>:localhost:<remote_port> <address>``) so that you can use the Client on your local machine.

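A hedged sketch of client-side usage (the constructor argument and the ``get_all_trials`` / ``stop_trial`` method names are assumptions; see the class reference below for the authoritative API):

.. code-block:: python

    from ray.tune.web_server import TuneClient

    # point the client at the server started with with_server=True above
    client = TuneClient("localhost:4321")

    trials = client.get_all_trials()  # list the trials of the running experiment
    # client.stop_trial(trial_id)     # stop a specific trial by its id
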
.. autoclass:: ray.tune.web_server.TuneClient
    :members:

For an example notebook that uses the Client API, see the `Client API Example <https://github.com/ray-project/ray/tree/master/python/ray/tune/TuneClient.ipynb>`__.