Ray Tune: Hyperparameter Optimization Framework
===============================================

Ray Tune is a hyperparameter optimization framework for long-running tasks such as RL and deep learning training. Ray Tune makes it easy to go from running one or more experiments on a single machine to running on a large cluster with efficient search algorithms.

Getting Started
---------------

To use Ray Tune, make a two-line modification to your training function (highlighted below): accept a ``reporter`` argument and periodically call it with your metrics:

.. code-block:: python
    :emphasize-lines: 1,5

    def my_func(config, reporter):  # add the reporter parameter
        import time, numpy as np
        i = 0
        while True:
            reporter(timesteps_total=i, mean_accuracy=i ** config["alpha"])
            i += config["beta"]
            time.sleep(.01)

Then, kick off your experiment:

.. code-block:: python

    import ray
    from ray import tune

    tune.register_trainable("my_func", my_func)
    ray.init()

    tune.run_experiments({
        "my_experiment": {
            "run": "my_func",
            "stop": {"mean_accuracy": 100},
            "config": {
                "alpha": tune.grid_search([0.2, 0.4, 0.6]),
                "beta": tune.grid_search([1, 2]),
            },
        }
    })

This script runs a small grid search over the ``my_func`` function using Ray Tune, reporting status on the command line until the stopping condition of ``mean_accuracy >= 100`` is reached (for metrics like *loss* that decrease over time, specify `neg_mean_loss `__ as a condition instead):

::

    == Status ==
    Using FIFO scheduling algorithm.
    Resources used: 4/8 CPUs, 0/0 GPUs
    Result logdir: ~/ray_results/my_experiment
     - my_func_0_alpha=0.2,beta=1:  RUNNING [pid=6778], 209 s, 20604 ts, 7.29 acc
     - my_func_1_alpha=0.4,beta=1:  RUNNING [pid=6780], 208 s, 20522 ts, 53.1 acc
     - my_func_2_alpha=0.6,beta=1:  TERMINATED [pid=6789], 21 s, 2190 ts, 101 acc
     - my_func_3_alpha=0.2,beta=2:  RUNNING [pid=6791], 208 s, 41004 ts, 8.37 acc
     - my_func_4_alpha=0.4,beta=2:  RUNNING [pid=6800], 209 s, 41204 ts, 70.1 acc
     - my_func_5_alpha=0.6,beta=2:  TERMINATED [pid=6809], 10 s, 2164 ts, 100 acc

In order to report incremental progress, ``my_func`` periodically calls the ``reporter`` function passed in by Ray Tune to return the current timestep and other metrics as defined in `ray.tune.result.TrainingResult `__. Incremental results will be synced to local disk on the head node of the cluster.

Learn more `about specifying experiments `__.

Features
--------

Ray Tune has the following features:

-  Scalable implementations of search algorithms such as `Population Based Training (PBT) `__, `Median Stopping Rule `__, Model-Based Optimization (HyperOpt), and `HyperBand `__.

-  Integration with visualization tools such as `TensorBoard `__, `rllab's VisKit `__, and a `parallel coordinates visualization `__.

-  Flexible trial variant generation, including grid search, random search, and conditional parameter distributions.

-  Resource-aware scheduling, including support for concurrent runs of algorithms that may themselves be parallel and distributed.

Concepts
--------

.. image:: tune-api.svg

Ray Tune schedules a number of *trials* in a cluster. Each trial runs a user-defined Python function or class and is parameterized by a *config* variation passed to the user code.

In order to run any given function, you first need to register it under a name with ``register_trainable``. This makes all Ray workers aware of the function.

.. autofunction:: ray.tune.register_trainable

Ray Tune provides a ``run_experiments`` function that generates and runs the trials described by the experiment specification. The trials are scheduled and managed by a *trial scheduler* that implements the search algorithm (default is FIFO).

.. autofunction:: ray.tune.run_experiments
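To give a sense of how trial variant generation composes with ``run_experiments``, here is a minimal sketch that mixes a grid search with random sampling. The lambda-based sampling of ``lr``, the ``train_fn`` function, and the experiment name are illustrative assumptions; see the experiment specification docs for the exact syntax:

.. code-block:: python

    import random

    import ray
    from ray import tune

    def train_fn(config, reporter):
        # Report a synthetic metric so the stopping condition can fire.
        acc = 0.0
        for step in range(1000):
            acc += config["lr"] * random.random()
            reporter(timesteps_total=step, mean_accuracy=acc)

    tune.register_trainable("train_fn", train_fn)
    ray.init()

    tune.run_experiments({
        "variant_demo": {
            "run": "train_fn",
            "stop": {"mean_accuracy": 50},
            "config": {
                # Grid search expands into one trial per listed value.
                "batch_size": tune.grid_search([16, 64]),
                # Assumed: lambda-based random sampling per the experiment spec.
                "lr": lambda spec: random.uniform(0.01, 0.1),
            },
        }
    })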
Ray Tune can be used anywhere Ray can, e.g. on your laptop with ``ray.init()`` embedded in a Python script, or in an `auto-scaling cluster `__ for massive parallelism. You can find the code for Ray Tune `here on GitHub `__.

Trial Schedulers
----------------

By default, Ray Tune schedules trials in serial order with the ``FIFOScheduler`` class. However, you can also specify a custom scheduling algorithm that can early stop trials, perturb parameters, or incorporate suggestions from an external service. Currently implemented trial schedulers include `Population Based Training (PBT) `__, `Median Stopping Rule `__, Model-Based Optimization (HyperOpt), and `HyperBand `__.

.. code-block:: python

    run_experiments({...}, scheduler=AsyncHyperBandScheduler())

HyperOpt Integration
--------------------

The ``HyperOptScheduler`` is a trial scheduler backed by HyperOpt that performs sequential model-based hyperparameter optimization. In order to use this scheduler, you will need to install HyperOpt via the following command:

.. code-block:: bash

    $ pip install --upgrade git+git://github.com/hyperopt/hyperopt.git

An example of this can be found in `hyperopt_example.py `__.

.. autoclass:: ray.tune.hpo_scheduler.HyperOptScheduler
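As a rough sketch of how the pieces fit together, the scheduler is constructed and passed to ``run_experiments`` alongside a HyperOpt search space. The constructor arguments, the ``"space"`` key in the trial config, and the experiment name below are assumptions for illustration; ``hyperopt_example.py`` is the authoritative reference:

.. code-block:: python

    from hyperopt import hp

    from ray.tune import run_experiments
    from ray.tune.hpo_scheduler import HyperOptScheduler

    # A HyperOpt search space over two hypothetical parameters.
    space = {
        "lr": hp.loguniform("lr", -10, -1),
        "momentum": hp.uniform("momentum", 0.1, 0.9),
    }

    run_experiments(
        {
            "hyperopt_demo": {
                "run": "my_func",  # a previously registered trainable
                "stop": {"mean_accuracy": 100},
                # Assumed: the HyperOpt space is attached via the trial config.
                "config": {"space": space},
            }
        },
        scheduler=HyperOptScheduler(),
    )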
Visualizing Results
-------------------

Ray Tune logs trial results to a unique directory per experiment, e.g. ``~/ray_results/my_experiment`` in the above example. The log records are compatible with a number of visualization tools:

To visualize learning in TensorBoard, install TensorFlow:

.. code-block:: bash

    $ pip install tensorflow

Then, after you run an experiment, you can visualize it with TensorBoard by specifying the output directory of your results. Note that if you are running Ray on a remote cluster, you can forward the TensorBoard port to your local machine through SSH using ``ssh -L 6006:localhost:6006``:

.. code-block:: bash

    $ tensorboard --logdir=~/ray_results/my_experiment

.. image:: ray-tune-tensorboard.png

To use rllab's VisKit (you may have to install some dependencies), run:

.. code-block:: bash

    $ git clone https://github.com/rll/rllab.git
    $ python rllab/rllab/viskit/frontend.py ~/ray_results/my_experiment

.. image:: ray-tune-viskit.png

Finally, to view the results with a `parallel coordinates visualization `__, open `ParallelCoordinatesVisualization.ipynb `__ as follows and run its cells:

.. code-block:: bash

    $ cd $RAY_HOME/python/ray/tune
    $ jupyter-notebook ParallelCoordinatesVisualization.ipynb

.. image:: ray-tune-parcoords.png

Trial Checkpointing
-------------------

To enable checkpointing, you must implement a Trainable class (Trainable functions are not checkpointable, since they never return control back to their caller). The easiest way to do this is to subclass the pre-defined ``Trainable`` class and implement its ``_train``, ``_save``, and ``_restore`` abstract methods `(example) `__. Implementing this interface is required to support resource multiplexing in schedulers such as HyperBand and PBT.

For TensorFlow model training, this would look something like this `(full tensorflow example) `__:

.. code-block:: python

    class MyClass(Trainable):
        def _setup(self):
            self.saver = tf.train.Saver()
            self.sess = ...
            self.iteration = 0

        def _train(self):
            self.sess.run(...)
            self.iteration += 1

        def _save(self, checkpoint_dir):
            return self.saver.save(
                self.sess, checkpoint_dir + "/save", global_step=self.iteration)

        def _restore(self, path):
            return self.saver.restore(self.sess, path)

Additionally, checkpointing can be used to provide fault tolerance for experiments. This can be enabled by setting ``checkpoint_freq: N`` and ``max_failures: M`` to checkpoint trials every *N* iterations and recover from up to *M* crashes per trial, e.g.:

.. code-block:: python

    run_experiments({
        "my_experiment": {
            ...
            "checkpoint_freq": 10,
            "max_failures": 5,
        },
    })

The class interface that must be implemented to enable checkpointing is as follows:

.. autoclass:: ray.tune.trainable.Trainable
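A class-based trial such as ``MyClass`` above is registered and launched the same way as a function. As a minimal sketch (the experiment name and stopping criterion are illustrative), checkpointing and fault tolerance are then enabled through the experiment spec:

.. code-block:: python

    import ray
    from ray import tune

    # MyClass is the Trainable subclass sketched above.
    tune.register_trainable("my_class", MyClass)
    ray.init()

    tune.run_experiments({
        "checkpointed_experiment": {
            "run": "my_class",
            "stop": {"training_iteration": 100},
            "checkpoint_freq": 10,  # checkpoint every 10 iterations
            "max_failures": 5,      # retry a crashed trial up to 5 times
        },
    })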
Client API
----------

You can modify an ongoing experiment by adding or deleting trials using the Tune Client API. To do this, verify that you have the ``requests`` library installed:

.. code-block:: bash

    $ pip install requests

To use the Client API, start your experiment with ``with_server=True``:

.. code-block:: python

    run_experiments({...}, with_server=True, server_port=4321)

Then, on the client side, you can use the following class. The server address defaults to ``localhost:4321``. If on a cluster, you may want to forward this port (e.g. ``ssh -L <local port>:localhost:<server port>``) so that you can use the Client on your local machine.

.. autoclass:: ray.tune.web_server.TuneClient
    :members:

For an example notebook that uses the Client API, see the `Client API Example `__.
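For reference, a client-side session might look like the sketch below. The method names (``get_all_trials``, ``add_trial``, ``stop_trial``), their arguments, and the trial specification shape are assumptions; consult the ``TuneClient`` class documentation above and the example notebook for the exact interface:

.. code-block:: python

    from ray.tune.web_server import TuneClient

    # Assumed: the client takes the address of the Tune server started
    # with with_server=True (defaults to localhost:4321).
    client = TuneClient("localhost:4321")

    # List the trials of the running experiment (method names assumed).
    for trial in client.get_all_trials():
        print(trial)

    # Add a new trial with an explicit config, then stop an existing one.
    client.add_trial("my_func", {"run": "my_func",
                                 "config": {"alpha": 0.5, "beta": 1}})
    client.stop_trial("some_trial_id")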