.. tip:: To run everything sequentially, use :ref:`Ray Local Mode <tune-debugging>`.
Parallelism is determined by ``resources_per_trial`` (defaulting to 1 CPU, 0 GPU per trial) and the resources available to Tune (``ray.cluster_resources()``).
Tune will allocate the specified GPU and CPU from ``resources_per_trial`` to each individual trial. A trial will not be scheduled unless at least that amount of resources is available, preventing the cluster from being overloaded.
By default, Tune automatically runs N concurrent trials, where N is the number of CPUs (cores) on your machine.
.. code-block:: python

    # If you have 4 CPUs on your machine, this will run 4 concurrent trials at a time.
    tune.run(trainable, num_samples=10)
You can override this parallelism with ``resources_per_trial``:
.. code-block:: python

    # If you have 4 CPUs on your machine, this will run 2 concurrent trials at a time.
    tune.run(trainable, num_samples=10, resources_per_trial={"cpu": 2})
By default, each random variable and grid search point is sampled once. To take multiple random samples, add ``num_samples: N`` to the experiment config. If ``grid_search`` is provided as an argument, the grid will be repeated ``num_samples`` times.
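For example, the following sketch (with a hypothetical two-parameter search space) generates ``2 * 5 = 10`` trials, since the two-point grid is repeated ``num_samples=5`` times:

.. code-block:: python

    tune.run(
        trainable,
        config={
            "lr": tune.grid_search([0.001, 0.01]),  # grid points: evaluated exhaustively
            "momentum": tune.uniform(0.1, 0.9),     # random variable: re-sampled for each sample
        },
        num_samples=5,
    )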
Note that search spaces may not be interoperable across different search algorithms. For example, for many search algorithms, you will not be able to use a ``grid_search`` parameter. Read about this in the :ref:`Search Space API <tune-search-space>` page.
During training, Tune will automatically log the below metrics in addition to the user-provided values. All of these can be used as stopping conditions or passed as a parameter to Trial Schedulers/Search Algorithms.
You can restore a single trial checkpoint by using ``tune.run(restore=<checkpoint_dir>)``. By doing this, you can change any part of the experiment's configuration, such as the experiment's name.
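A minimal sketch, assuming a hypothetical trainable and checkpoint path:

.. code-block:: python

    tune.run(
        my_trainable,
        name="restored_experiment",  # the new name can differ from the original run
        restore="~/ray_results/original_experiment/trial_xxxx/checkpoint_10",  # hypothetical checkpoint path
        stop={"training_iteration": 20},
    )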
On a multinode cluster, Tune automatically creates a copy of all trial checkpoints on the head node. This requires the Ray cluster to be started with the :ref:`cluster launcher <ref-automatic-cluster>` and also requires rsync to be installed.
Note that you must use the ``tune.checkpoint_dir`` API to trigger syncing. Also, if running Tune on Kubernetes, be sure to use the :ref:`KubernetesSyncer <tune-kubernetes>` to transfer files between different pods.
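For reference, a minimal sketch of triggering checkpoint syncing from the Function API (the metric name and checkpoint contents here are placeholders):

.. code-block:: python

    import os
    from ray import tune

    def train_fn(config, checkpoint_dir=None):
        for step in range(100):
            # Writing into the directory yielded by tune.checkpoint_dir
            # is what triggers checkpoint creation and syncing.
            with tune.checkpoint_dir(step=step) as ckpt_dir:
                with open(os.path.join(ckpt_dir, "checkpoint"), "w") as f:
                    f.write(str(step))
            tune.report(mean_accuracy=step / 100)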
If a dictionary is passed in, the keys may be any field in the return result of ``tune.report`` in the Function API or ``step()`` (including the results from ``step`` and auto-filled metrics).
In the example below, each trial will be stopped either when it completes 10 iterations OR when it reaches a mean accuracy of 0.98. These metrics are assumed to be **increasing**.
.. code-block:: python

    # training_iteration is an auto-filled metric by Tune.
    tune.run(my_trainable, stop={"training_iteration": 10, "mean_accuracy": 0.98})
For more flexibility, you can pass in a function instead. If a function is passed in, it must take ``(trial_id, result)`` as arguments and return a boolean (``True`` if trial should be stopped and ``False`` otherwise).
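A minimal sketch, assuming the trainable reports a ``mean_accuracy`` metric:

.. code-block:: python

    def stopper(trial_id, result):
        # Stop this trial once it reaches the target accuracy.
        return result["mean_accuracy"] >= 0.98

    tune.run(my_trainable, stop=stopper)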
Finally, you can implement the ``Stopper`` abstract class to stop entire experiments. For example, the following stops all trials once the criterion is fulfilled by any individual trial, and prevents new ones from starting:
.. code-block:: python

    from ray.tune import Stopper

    class CustomStopper(Stopper):
        def __init__(self):
            self.should_stop = False

        def __call__(self, trial_id, result):
            if not self.should_stop and result['foo'] > 10:
                self.should_stop = True
            return self.should_stop

        def stop_all(self):
            """Returns whether to stop trials and prevent new ones from starting."""
            return self.should_stop

    stopper = CustomStopper()
    tune.run(my_trainable, stop=stopper)
Note that in the above example the currently running trials will not stop immediately but will do so once their current iterations are complete. See the :ref:`tune-stop-ref` documentation.
By default, Tune logs results in TensorBoard, CSV, and JSON formats. If you need to log something lower level like model weights or gradients, see :ref:`Trainable Logging <trainable-logging>`.
**Learn more about logging and customizations here**: :ref:`loggers-docstring`.
If you are running Ray on a remote multi-user cluster where you do not have sudo access, you can run the following commands to make sure tensorboard is able to write to the tmp directory:
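For example (the exact paths here are illustrative):

.. code-block:: bash

    $ export TMPDIR=/tmp/$USER; mkdir -p $TMPDIR; tensorboard --logdir=~/ray_results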
You can use a :ref:`Reporter <tune-reporter-doc>` object to customize the console output.
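For example, a sketch using ``CLIReporter`` to limit the metric columns printed to the console (the metric names are assumptions):

.. code-block:: python

    from ray.tune import CLIReporter

    # Only show these columns in the console output.
    reporter = CLIReporter(metric_columns=["mean_accuracy", "training_iteration"])
    tune.run(my_trainable, progress_reporter=reporter)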
Uploading Results
-----------------
If an upload directory is provided, Tune will automatically sync results from the ``local_dir`` to the given directory, natively supporting standard S3/gsutil URIs.
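A minimal sketch, assuming a hypothetical S3 bucket:

.. code-block:: python

    tune.run(
        my_trainable,
        local_dir="~/ray_results",
        sync_config=tune.SyncConfig(upload_dir="s3://my-bucket/tune-results"),
    )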
You can customize this to specify arbitrary storages with the ``sync_to_cloud`` argument in ``tune.SyncConfig``. This argument supports either strings with the same replacement fields OR arbitrary functions.
If a string is provided, then it must include replacement fields ``{source}`` and ``{target}``, like ``s3 sync {source} {target}``. Alternatively, a function can be provided with the following signature (the implementation sketched below is illustrative):
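.. code-block:: python

    import subprocess

    def custom_sync_func(source, target):
        # Illustrative implementation: shell out to an "aws s3 sync"-style
        # command; any transfer mechanism with this signature works.
        sync_cmd = "aws s3 sync {source} {target}".format(source=source, target=target)
        sync_process = subprocess.Popen(sync_cmd, shell=True)
        sync_process.wait()

    sync_config = tune.SyncConfig(
        upload_dir="s3://my-bucket/tune-results",  # hypothetical bucket
        sync_to_cloud=custom_sync_func,
    )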
By default, syncing occurs every 300 seconds. To change the frequency of syncing, set the ``TUNE_CLOUD_SYNC_S`` environment variable in the driver to the desired syncing period.
Note that uploading only happens when global experiment state is collected, and the frequency of this is determined by the ``TUNE_GLOBAL_CHECKPOINT_S`` environment variable. So the true upload period is given by ``max(TUNE_CLOUD_SYNC_S, TUNE_GLOBAL_CHECKPOINT_S)``.
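As a sketch, the period could be raised to 600 seconds by setting the variable in the driver process before calling ``tune.run``:

.. code-block:: python

    import os

    # Sync results to cloud storage every 600 seconds instead of the default 300.
    os.environ["TUNE_CLOUD_SYNC_S"] = "600"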
By default, Tune will run hyperparameter evaluations on multiple processes. However, if you need to debug your training process, it may be easier to do everything on a single process. You can force all Ray functions to occur on a single process with ``local_mode`` by calling the following before ``tune.run``.
.. code-block:: python

    ray.init(local_mode=True)
Local mode with multiple configuration evaluations will interleave computation, so it is most naturally used when running a single configuration evaluation.
Stopping after the first failure
--------------------------------
By default, ``tune.run`` will continue executing until all trials have terminated or errored. To stop the entire Tune run as soon as **any** trial errors:
.. code-block:: python

    tune.run(trainable, fail_fast=True)
This is useful when you are trying to set up a large hyperparameter experiment.
Further Questions or Issues?
----------------------------
You can post questions, issues, or feedback through the following channels:
1. `StackOverflow`_: For questions about how to use Ray.
2. `GitHub Issues`_: For bug reports and feature requests.