How does Tune work?
===================

This page provides an overview of Tune's inner workings.
We describe in detail what happens when you call ``Tuner.fit()``, what the lifecycle of a Tune trial looks like,
and what the architectural components of Tune are.

.. tip:: Before you continue, be sure to have read :ref:`the Tune Key Concepts page <tune-60-seconds>`.

What happens in ``Tuner.fit``?
------------------------------

When calling the following:

.. code-block:: python

    space = {"x": tune.uniform(0, 1)}
    tuner = tune.Tuner(my_trainable, param_space=space, tune_config=tune.TuneConfig(num_samples=10))
    results = tuner.fit()

The provided ``my_trainable`` is evaluated multiple times in parallel
with different hyperparameters (sampled from ``uniform(0, 1)``).

Every Tune run consists of a "driver process" and many "worker processes".
The driver process is the Python process that calls ``Tuner.fit()`` (which calls ``ray.init()`` under the hood).
The Tune driver process runs on the node where you run your script (which calls ``Tuner.fit()``),
while Ray Tune trainable "actors" run on any node (either on the same node or on worker nodes (distributed Ray only)).

.. note:: :ref:`Ray Actors <actor-guide>` allow you to parallelize an instance of a class in Python.
   When you instantiate a class that is a Ray actor, Ray will start an instance of that class on a separate process,
   either on the same machine or on another machine if running a Ray cluster.
   This actor can then asynchronously execute method calls and maintain its own internal state.

The driver spawns parallel worker processes (:ref:`Ray actors <actor-guide>`)
that are responsible for evaluating each trial using its hyperparameter configuration and the provided trainable
(see the `ray trial executor source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/ray_trial_executor.py>`__).

While the Trainable is executing (:ref:`trainable-execution`), the Tune Driver communicates with each actor
via actor methods to receive intermediate training results and pause/stop actors (see :ref:`trial-lifecycle`).

When the Trainable terminates (or is stopped), the actor is also terminated.
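
To make the driver/worker split concrete, here is a minimal sketch that mimics the pattern with plain Python threads instead of Ray actors. The names ``run_trials`` and ``my_trainable`` are illustrative stand-ins, not Tune's API:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def my_trainable(config):
    # A stand-in objective: in real Tune this would train a model
    # and report metrics back to the driver.
    return {"score": config["x"] ** 2}

def run_trials(trainable, num_samples=10, seed=0):
    # The "driver" samples one config per trial from the search space
    # and farms the evaluations out to parallel workers, much like
    # Tune schedules one Ray actor per trial.
    rng = random.Random(seed)
    configs = [{"x": rng.uniform(0, 1)} for _ in range(num_samples)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(trainable, configs))
    return configs, results

configs, results = run_trials(my_trainable)
best = min(results, key=lambda r: r["score"])
```

Unlike this thread pool, Ray actors are separate processes and can live on other machines, but the driver-side shape of the loop (sample configs, dispatch, collect results) is the same.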

.. _trainable-execution:

The execution of a trainable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tune uses :ref:`Ray actors <actor-guide>` to parallelize the evaluation of multiple hyperparameter configurations.
Each actor is a Python process that executes an instance of the user-provided Trainable.

The definition of the user-provided Trainable will be
:ref:`serialized via cloudpickle <serialization-guide>` and sent to each actor process.
Each Ray actor will start an instance of the Trainable to be executed.

If the Trainable is a class, it will be executed iteratively by calling ``train/step``.
After each invocation, the driver is notified that a "result dict" is ready.
The driver will then pull the result via ``ray.get``.
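
A rough sketch of that iterative loop, using plain local Python in place of Ray actors and ``ray.get``. The class and driver loop below are simplified stand-ins for Tune's actual ``Trainable`` interface:

```python
class MyTrainable:
    # A minimal stand-in for the class-based Trainable API: setup() runs
    # once, and step() is called repeatedly, each call returning a result dict.
    def setup(self, config):
        self.lr = config["lr"]
        self.iteration = 0

    def step(self):
        self.iteration += 1
        # Pretend the "loss" improves with each iteration.
        return {"training_iteration": self.iteration,
                "loss": 1.0 / (self.iteration * self.lr)}

def driver_loop(trainable_cls, config, num_iterations):
    # The driver repeatedly invokes step() on the trainable and pulls each
    # result dict back; in real Tune that pull happens via ray.get().
    trainable = trainable_cls()
    trainable.setup(config)
    return [trainable.step() for _ in range(num_iterations)]

results = driver_loop(MyTrainable, {"lr": 0.5}, num_iterations=3)
```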

If the trainable is a callable or a function, it will be executed on the Ray actor process on a separate execution thread.
Whenever ``session.report`` is called, the execution thread is paused and waits for the driver to pull a
result (see `function_trainable.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/trainable/function_trainable.py>`__).
After pulling, the actor's execution thread will automatically resume.
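
That pause-and-resume handoff can be approximated with two threads, a bounded queue, and an event. Everything here (``_Report``, ``pull``) is a simplified stand-in for Tune's internal machinery, not its API:

```python
import threading
import queue

class _Report:
    # Mimics the handoff behind session.report(): the training thread parks
    # each result in a size-1 queue and blocks until the driver pulls it.
    def __init__(self):
        self._results = queue.Queue(maxsize=1)
        self._pulled = threading.Event()

    def report(self, **metrics):
        # Called from the training thread; pauses it until the pull happens.
        self._pulled.clear()
        self._results.put(metrics)
        self._pulled.wait()

    def pull(self):
        # Called from the driver; lets the training thread resume.
        result = self._results.get()
        self._pulled.set()
        return result

def train_fn(session):
    # A function "trainable" that reports one result per step.
    for i in range(3):
        session.report(step=i)

session = _Report()
worker = threading.Thread(target=train_fn, args=(session,))
worker.start()
results = [session.pull() for _ in range(3)]
worker.join()
```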

Resource Management in Tune
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before running a trial, the Ray Tune driver will check whether there are available
resources on the cluster (see :ref:`resource-requirements`).
It will compare the available resources with the resources required by the trial.

If there is space on the cluster, then the Tune Driver will start a Ray actor (worker).
This actor will be scheduled and executed on some node where the resources are available.

See :doc:`tune-resources` for more information.
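
A toy version of that bookkeeping, with resources as plain dicts. ``has_resources`` and ``try_start_trial`` are hypothetical helpers for illustration, not Tune's API:

```python
def has_resources(required, available):
    # A trial may start only if every required resource fits into
    # what the cluster currently has free.
    return all(available.get(key, 0) >= amount
               for key, amount in required.items())

def try_start_trial(required, available):
    # If there is room, "start" the trial by reserving its resources;
    # otherwise the trial stays PENDING.
    if not has_resources(required, available):
        return False
    for key, amount in required.items():
        available[key] -= amount
    return True

cluster = {"CPU": 8, "GPU": 1}
started = try_start_trial({"CPU": 4, "GPU": 1}, cluster)
```

In real Tune the reservation is handled by Ray's scheduler via placement groups, but the driver-side decision has this shape: compare, reserve, start.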

.. _trial-lifecycle:

Lifecycle of a Trial
--------------------

A trial's life cycle consists of 6 stages:

* **Initialization** (generation): A trial is first generated as a hyperparameter sample,
  and its parameters are configured according to what was provided in ``Tuner``.
  Trials are then placed into a queue to be executed (with status PENDING).

* **PENDING**: A pending trial is a trial to be executed on the machine.
  Every trial is configured with resource values. Whenever the trial's resource values are available,
  Tune will run the trial (by starting a Ray actor holding the config and the training function).

* **RUNNING**: A running trial is assigned a Ray Actor. There can be multiple running trials in parallel.
  See the :ref:`trainable execution <trainable-execution>` section for more details.

* **ERRORED**: If a running trial throws an exception, Tune will catch that exception and mark the trial as errored.
  Note that exceptions can be propagated from an actor to the main Tune driver process.
  If ``max_retries`` is set, Tune will set the trial back into "PENDING" and later start it from the last checkpoint.

* **TERMINATED**: A trial is terminated if it is stopped by a Stopper/Scheduler.
  If using the Function API, the trial is also terminated when the function stops.

* **PAUSED**: A trial can be paused by a Trial scheduler. This means that the trial's actor will be stopped.
  A paused trial can later be resumed from the most recent checkpoint.
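
The statuses after initialization can be modeled as a small state machine. The transition table below is a simplified approximation of Tune's actual logic, not a definitive specification:

```python
# Allowed status transitions, approximated from the descriptions above.
TRANSITIONS = {
    "PENDING": {"RUNNING"},
    "RUNNING": {"PAUSED", "ERRORED", "TERMINATED"},
    "PAUSED": {"PENDING", "RUNNING"},  # resumed from the latest checkpoint
    "ERRORED": {"PENDING"},            # retried if max_retries allows it
    "TERMINATED": set(),               # terminal state
}

class Trial:
    def __init__(self):
        self.status = "PENDING"  # every generated trial starts in the queue

    def set_status(self, new_status):
        # Reject transitions the lifecycle does not allow.
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(
                f"illegal transition {self.status} -> {new_status}")
        self.status = new_status
```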

Tune's Architecture
-------------------

.. image:: ../../images/tune-arch.png

The blue boxes refer to internal components, while green boxes are public-facing.

Tune's main components consist of ``TrialRunner``, ``Trial`` objects, ``TrialExecutor``, ``SearchAlg``,
``TrialScheduler``, and ``Trainable``.

.. _trial-runner-flow:

This is an illustration of the high-level training flow and how some of the components interact:

*Note: This figure is horizontally scrollable*

.. figure:: ../../images/tune-trial-runner-flow-horizontal.png
   :class: horizontal-scroll

TrialRunner
~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trial_runner.py>`__]
This is the main driver of the training loop. This component
uses the TrialScheduler to prioritize and execute trials,
queries the SearchAlgorithm for new
configurations to evaluate, and handles the fault tolerance logic.

**Fault Tolerance**: The TrialRunner executes checkpointing if ``checkpoint_freq``
is set, along with automatic trial restarting in case of trial failures (if ``max_failures`` is set).
For example, if a node is lost while a trial (specifically, the corresponding
Trainable of the trial) is still executing on that node and checkpointing
is enabled, the trial will then be reverted to a ``"PENDING"`` state and resumed
from the last available checkpoint when it is run.

The TrialRunner is also in charge of checkpointing the entire experiment execution state
upon each loop iteration. This allows users to restart their experiment
in case of machine failure.

See the docstring at :ref:`trialrunner-docstring`.
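
A minimal sketch of experiment-state checkpointing, assuming a JSON snapshot of trial statuses; Tune's real experiment checkpoint format is considerably richer than this:

```python
import json
import os
import tempfile

def save_experiment_state(trials, path):
    # Snapshot every trial's status so the experiment can be restarted
    # after a machine failure (done on each TrialRunner loop iteration).
    with open(path, "w") as f:
        json.dump(trials, f)

def restore_experiment_state(path):
    with open(path) as f:
        trials = json.load(f)
    # Anything that was mid-flight goes back to PENDING so it can be
    # rescheduled from its last available checkpoint.
    for trial in trials:
        if trial["status"] == "RUNNING":
            trial["status"] = "PENDING"
    return trials

state_file = os.path.join(tempfile.mkdtemp(), "experiment_state.json")
trials = [{"id": "t1", "status": "TERMINATED"},
          {"id": "t2", "status": "RUNNING"}]
save_experiment_state(trials, state_file)
restored = restore_experiment_state(state_file)
```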

Trial objects
~~~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trial.py>`__]
This is an internal data structure that contains metadata about each training run. Each Trial
object is mapped one-to-one with a Trainable object, but Trial objects are not themselves
distributed/remote. Trial objects transition among
the following states: ``"PENDING"``, ``"RUNNING"``, ``"PAUSED"``, ``"ERRORED"``, and
``"TERMINATED"``.

See the docstring at :ref:`trial-docstring`.

RayTrialExecutor
~~~~~~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/ray_trial_executor.py>`__]
The RayTrialExecutor is a component that interacts with the underlying execution framework.
It also manages resources to ensure the cluster isn't overloaded.

See the docstring at :ref:`raytrialexecutor-docstring`.

SearchAlg
~~~~~~~~~

[`source code <https://github.com/ray-project/ray/tree/master/python/ray/tune/search>`__]
The SearchAlgorithm is a user-provided object
that is used for querying new hyperparameter configurations to evaluate.

SearchAlgorithms will be notified every time a trial finishes
executing one training step (of ``train()``), every time a trial
errors, and every time a trial completes.
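
A toy searcher with the same shape of interface, here just sampling uniformly at random. The method names mirror the concept but simplify Tune's actual ``Searcher`` API:

```python
import random

class RandomSearch:
    # A minimal SearchAlgorithm-like object: suggest() hands out new
    # configurations, and the on_trial_* hooks receive results, which
    # smarter searchers (e.g. Bayesian optimization) would learn from.
    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self.completed = {}

    def suggest(self, trial_id):
        return {"x": self._rng.uniform(0, 1)}

    def on_trial_result(self, trial_id, result):
        pass  # called after each training step of the trial

    def on_trial_complete(self, trial_id, result):
        self.completed[trial_id] = result

searcher = RandomSearch()
config = searcher.suggest("trial_1")
searcher.on_trial_complete("trial_1", {"loss": config["x"]})
```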

TrialScheduler
~~~~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/schedulers>`__]
TrialSchedulers operate over a set of possible trials to run,
prioritizing trial execution given available cluster resources.

TrialSchedulers can kill or pause trials,
and can also reorder and prioritize incoming trials.

Trainables
~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trainable/trainable.py>`__]
These are user-provided objects that are used for
the training process. If a class is provided, it is expected to conform to the
Trainable interface. If a function is provided, it is wrapped into a
Trainable class, and the function itself is executed on a separate thread.

Trainables will execute one step of ``train()`` before notifying the TrialRunner.