Stopping and Resuming Tune Trials
=================================

Ray Tune periodically checkpoints the experiment state so that it can be
restarted after a failure or interruption. The checkpointing period is
dynamically adjusted so that at least 95% of the time is spent handling
training results and scheduling.
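
As a rough, hypothetical sketch of that budgeting rule (the helper name and the minimum period are assumptions, not Tune's actual implementation): keeping checkpoint overhead at or below 5% means the period must be at least 19x the time a single checkpoint takes, since t / (t + 19t) = 5%.

```python
# Hypothetical sketch, NOT Tune's actual code: choose a checkpoint period
# so that checkpointing consumes at most 5% of wall-clock time.
# If one checkpoint takes t seconds, a period of 19 * t keeps the
# overhead at t / (t + 19 * t) = 5%.
def adjusted_checkpoint_period(checkpoint_duration_s: float,
                               min_period_s: float = 10.0) -> float:
    return max(min_period_s, 19.0 * checkpoint_duration_s)
```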

If you send a SIGINT signal to the process running ``Tuner.fit()`` (which
usually happens when you press Ctrl+C in the console), Ray Tune shuts
down training gracefully and saves a final experiment-level checkpoint.

Ray Tune also accepts the SIGUSR1 signal to interrupt training gracefully. This
should be used when running Ray Tune in a remote process (e.g. via the Ray
client), as Ray filters out SIGINT and SIGTERM signals by default.

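As an illustration of the mechanism (this is ordinary POSIX signal handling, not Tune-specific code), a process can observe SIGUSR1 like this; from a shell you would send it with ``kill -USR1 <pid>``:

```python
import os
import signal

# Record that SIGUSR1 arrived; Ray Tune installs its own handler that
# triggers a graceful shutdown instead of setting a flag like this.
received = {"sigusr1": False}

def _on_sigusr1(signum, frame):
    received["sigusr1"] = True

signal.signal(signal.SIGUSR1, _on_sigusr1)

# Signal ourselves for demonstration; normally another process (or
# `kill -USR1 <pid>` from a shell) sends the signal.
os.kill(os.getpid(), signal.SIGUSR1)
```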
How to resume a Tune run?
-------------------------

If you've stopped a run and want to resume from where you left off,
you can call ``Tuner.restore()`` like this:

.. code-block:: python
    :emphasize-lines: 4

    tuner = Tuner.restore(
        path="~/ray_results/my_experiment"
    )
    tuner.fit()

There are a few options for restoring an experiment:
``resume_unfinished``, ``resume_errored`` and ``restart_errored``. See ``Tuner.restore()`` for more details.

``path`` here is determined by the ``air.RunConfig.name`` you supplied to your ``Tuner()``.
If you didn't supply a name to ``Tuner``, your ``path`` likely looks something like
``~/ray_results/my_trainable_2021-01-29_10-16-44``.

You can see which name you need to pass by taking a look at the results table
of your original tuning run:

.. code-block::
    :emphasize-lines: 5

    == Status ==
    Memory usage on this node: 11.0/16.0 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 1/16 CPUs, 0/0 GPUs, 0.0/4.69 GiB heap, 0.0/1.61 GiB objects
    Result logdir: /Users/ray/ray_results/my_trainable_2021-01-29_10-16-44
    Number of trials: 1/1 (1 RUNNING)
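
The logdir in the table follows Tune's default layout of ``<local_dir>/<name>``. As a small illustration (the helper is hypothetical, not a Tune API), you could reconstruct the path to pass to ``Tuner.restore()`` like this:

```python
import os

# Hypothetical helper, not part of Ray Tune: build the experiment path
# from the default local_dir and the experiment name shown in the table.
def experiment_path(name: str, local_dir: str = "~/ray_results") -> str:
    return os.path.join(os.path.expanduser(local_dir), name)
```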

.. _tune-stopping-ref:

How to stop Tune runs programmatically?
---------------------------------------

We've just covered the case in which you manually interrupt a Tune run.
But you can also control when trials are stopped early by passing the ``stop`` argument
to the ``air.RunConfig`` of your ``Tuner``.
This argument takes a dictionary, a function, or a :class:`Stopper <ray.tune.stopper.Stopper>` class as an argument.

If a dictionary is passed in, the keys may be any field in the return result of ``session.report`` in the
Function API or ``step()`` (including the results from ``step`` and auto-filled metrics).

Stopping with a dictionary
~~~~~~~~~~~~~~~~~~~~~~~~~~

In the example below, each trial will be stopped either when it completes ``10`` iterations or when it
reaches a mean accuracy of ``0.98``.
These metrics are assumed to be **increasing**.

.. code-block:: python

    # training_iteration is an auto-filled metric by Tune.
    tune.Tuner(
        my_trainable,
        run_config=air.RunConfig(stop={"training_iteration": 10, "mean_accuracy": 0.98})
    ).fit()
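
To make the semantics concrete, here is the same rule in plain Python (a simulation for illustration, not Tune internals): a trial stops as soon as **any** metric in the dictionary reaches its threshold.

```python
# Simulates Tune's dictionary stopping rule for illustration only:
# stop when ANY listed metric meets or exceeds its threshold.
stop_criteria = {"training_iteration": 10, "mean_accuracy": 0.98}

def should_stop(result: dict) -> bool:
    return any(result.get(metric, float("-inf")) >= threshold
               for metric, threshold in stop_criteria.items())
```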

Stopping with a function
~~~~~~~~~~~~~~~~~~~~~~~~

For more flexibility, you can pass in a function instead.
If a function is passed in, it must take ``(trial_id, result)`` as arguments and return a boolean
(``True`` if the trial should be stopped and ``False`` otherwise).

.. code-block:: python

    def stopper(trial_id, result):
        return result["mean_accuracy"] / result["training_iteration"] > 5

    tune.Tuner(my_trainable, run_config=air.RunConfig(stop=stopper)).fit()

Stopping with a class
~~~~~~~~~~~~~~~~~~~~~

Finally, you can implement the :class:`Stopper <ray.tune.stopper.Stopper>` abstract class for stopping
entire experiments. For example, the following example stops all trials after the criterion is fulfilled
by any individual trial, and prevents new ones from starting:

.. code-block:: python

    from ray.tune import Stopper

    class CustomStopper(Stopper):
        def __init__(self):
            self.should_stop = False

        def __call__(self, trial_id, result):
            if not self.should_stop and result["foo"] > 10:
                self.should_stop = True
            return self.should_stop

        def stop_all(self):
            """Returns whether to stop trials and prevent new ones from starting."""
            return self.should_stop

    stopper = CustomStopper()
    tune.Tuner(my_trainable, run_config=air.RunConfig(stop=stopper)).fit()

Note that in the above example the currently running trials will not stop immediately but will do so
once their current iterations are complete.

Ray Tune comes with a set of out-of-the-box stopper classes. See the :ref:`Stopper <tune-stoppers>` documentation.
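
To see those semantics concretely, here is the custom stopper above exercised on mock results (re-declared without the ``ray.tune.Stopper`` base class so it runs without Ray installed; the metric values are illustrative):

```python
# Standalone copy of CustomStopper (no ray.tune.Stopper base) so the
# behavior can be demonstrated without Ray installed.
class CustomStopper:
    def __init__(self):
        self.should_stop = False

    def __call__(self, trial_id, result):
        # Once any trial reports result["foo"] > 10, stop every trial.
        if not self.should_stop and result["foo"] > 10:
            self.should_stop = True
        return self.should_stop

    def stop_all(self):
        return self.should_stop

stopper = CustomStopper()
```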

Stopping after the first failure
--------------------------------

By default, ``Tuner.fit()`` will continue executing until all trials have terminated or errored.
To stop the entire Tune run as soon as **any** trial errors:

.. code-block:: python

    tune.Tuner(trainable, run_config=air.RunConfig(failure_config=air.FailureConfig(fail_fast=True))).fit()

This is useful when you are trying to set up a large hyperparameter experiment.