Stopping and Resuming Tune Trials
=================================

Ray Tune periodically checkpoints the experiment state so that it can be restarted when it fails or stops.
The checkpointing period is dynamically adjusted so that at least 95% of the time is used for handling
training results and scheduling.

If you send a SIGINT signal to the process running ``tune.run()`` (which is
usually what happens when you press Ctrl+C in the console), Ray Tune shuts
down training gracefully and saves a final experiment-level checkpoint.

How to resume a Tune run?
-------------------------

If you've stopped a run and want to resume from where you left off,
you can call ``tune.run()`` with ``resume=True`` like this:

.. code-block:: python
    :emphasize-lines: 5

    tune.run(
        train,
        # other configuration
        name="my_experiment",
        resume=True
    )

You will have to pass a ``name`` if you are using ``resume=True`` so that Ray Tune can detect the experiment
folder (which is usually stored at e.g. ``~/ray_results/my_experiment``).
If you forgot to pass a name in the first call, you can still pass the name when you resume the run.
Please note that in this case it is likely that your experiment name has a date suffix, so if you
ran ``tune.run(my_trainable)``, the ``name`` might look something like this:
``my_trainable_2021-01-29_10-16-44``.
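
For instance, resuming such an auto-named run might look like the following sketch (the experiment
name below is just the illustrative one from above; use the name of your own results folder):

.. code-block:: python

    tune.run(
        my_trainable,
        name="my_trainable_2021-01-29_10-16-44",
        resume=True
    )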

You can see which name you need to pass by taking a look at the results table
of your original tuning run:

.. code-block::
    :emphasize-lines: 5

    == Status ==
    Memory usage on this node: 11.0/16.0 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 1/16 CPUs, 0/0 GPUs, 0.0/4.69 GiB heap, 0.0/1.61 GiB objects
    Result logdir: /Users/ray/ray_results/my_trainable_2021-01-29_10-16-44
    Number of trials: 1/1 (1 RUNNING)

Another useful option to know about is ``resume="AUTO"``, which will attempt to resume the experiment if possible,
and otherwise will start a new experiment.
For more details and other options for ``resume``, see the :ref:`Tune run API documentation <tune-run-ref>`.
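
As a minimal sketch (reusing the hypothetical ``train`` function and experiment name from the example above),
this could look like:

.. code-block:: python

    tune.run(
        train,
        name="my_experiment",
        resume="AUTO"  # resume if experiment state is found, otherwise start a new run
    )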

.. _tune-stopping-ref:

How to stop Tune runs programmatically?
---------------------------------------

We've just covered the case in which you manually interrupt a Tune run.
But you can also control when trials are stopped early by passing the ``stop`` argument to ``tune.run``.
This argument takes a dictionary, a function, or a :class:`Stopper <ray.tune.stopper.Stopper>` class.

If a dictionary is passed in, the keys may be any field in the return result of ``tune.report`` in the
Function API or ``step()`` (including the results from ``step`` and auto-filled metrics).

Stopping with a dictionary
~~~~~~~~~~~~~~~~~~~~~~~~~~

In the example below, each trial will be stopped either when it completes ``10`` iterations or when it
reaches a mean accuracy of ``0.98``.
These metrics are assumed to be **increasing**.

.. code-block:: python

    # training_iteration is an auto-filled metric by Tune.
    tune.run(
        my_trainable,
        stop={"training_iteration": 10, "mean_accuracy": 0.98}
    )

Stopping with a function
~~~~~~~~~~~~~~~~~~~~~~~~

For more flexibility, you can pass in a function instead.
If a function is passed in, it must take ``(trial_id, result)`` as arguments and return a boolean
(``True`` if the trial should be stopped and ``False`` otherwise).

.. code-block:: python

    def stopper(trial_id, result):
        return result["mean_accuracy"] / result["training_iteration"] > 5

    tune.run(my_trainable, stop=stopper)

Stopping with a class
~~~~~~~~~~~~~~~~~~~~~

Finally, you can implement the :class:`Stopper <ray.tune.stopper.Stopper>` abstract class for stopping entire experiments.
For example, the following stops all trials once the stopping criterion is fulfilled by any individual trial,
and prevents new ones from starting:

.. code-block:: python

    from ray.tune import Stopper

    class CustomStopper(Stopper):
        def __init__(self):
            self.should_stop = False

        def __call__(self, trial_id, result):
            # Flip the flag once any trial exceeds the threshold on the example ``foo`` metric.
            if not self.should_stop and result['foo'] > 10:
                self.should_stop = True
            return self.should_stop

        def stop_all(self):
            """Returns whether to stop trials and prevent new ones from starting."""
            return self.should_stop

    stopper = CustomStopper()
    tune.run(my_trainable, stop=stopper)

Note that in the above example the currently running trials will not stop immediately but will do so
once their current iterations are complete.

Ray Tune comes with a set of out-of-the-box stopper classes. See the :ref:`Stopper <tune-stoppers>` documentation.
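
As an illustration, a minimal sketch using one of these built-in stoppers (here ``MaximumIterationStopper``,
assuming it is available under ``ray.tune.stopper`` in your Ray version) could look like:

.. code-block:: python

    from ray.tune.stopper import MaximumIterationStopper

    # Stop each trial after it has reported 100 training iterations.
    tune.run(my_trainable, stop=MaximumIterationStopper(max_iter=100))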

Stopping after the first failure
--------------------------------

By default, ``tune.run`` will continue executing until all trials have terminated or errored.
To stop the entire Tune run as soon as **any** trial errors:

.. code-block:: python

    tune.run(trainable, fail_fast=True)

This is useful when you are trying to set up a large hyperparameter experiment.