Tune Internals
==============

This page gives an overview of Tune's design and architecture and provides docstrings for its internal components.

.. image:: ../../images/tune-arch.png

Blue boxes denote internal components, and green boxes denote public-facing components.

Main Components
---------------

Tune's main components are the TrialRunner, Trial objects, the TrialExecutor, the SearchAlg, the TrialScheduler, and Trainables.

.. _trial-runner-flow:

This is an illustration of the high-level training flow and how some of the components interact:

*Note: This figure is horizontally scrollable*

.. figure:: ../../images/tune-trial-runner-flow-horizontal.png
    :class: horizontal-scroll


TrialRunner
~~~~~~~~~~~
[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trial_runner.py>`__]
The TrialRunner is the main driver of the training loop. It
uses the TrialScheduler to prioritize and execute trials,
queries the SearchAlgorithm for new
configurations to evaluate, and handles fault tolerance.

**Fault Tolerance**: The TrialRunner checkpoints trials if ``checkpoint_freq``
is set and automatically restarts failed trials if ``max_failures`` is set.
For example, if a node is lost while a trial (specifically, the trial's
Trainable) is still executing on that node and checkpointing
is enabled, the trial reverts to the ``"PENDING"`` state and resumes
from the last available checkpoint the next time it runs.
The TrialRunner also checkpoints the state of the entire experiment
on each loop iteration. This allows users to restart their experiment
after a machine failure.
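
The recovery behavior described above can be illustrated with a small, self-contained sketch. It is not Tune's implementation: ``ToyTrial``, ``run_with_recovery``, and the ``fail_at`` parameter are hypothetical names introduced only for this example, and node loss is simulated rather than detected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTrial:
    """Hypothetical stand-in for a trial; not Tune's Trial class."""
    status: str = "PENDING"
    iteration: int = 0
    last_checkpoint: Optional[int] = None  # iteration stored in the checkpoint
    num_failures: int = 0


def run_with_recovery(trial, total_iters, checkpoint_freq, max_failures, fail_at):
    """Drive one trial to completion, recovering from simulated node losses.

    `fail_at` is a set of step counts (counted across restarts) at which
    the simulated node is lost. It is an assumption of this sketch.
    """
    steps_executed = 0
    while trial.status not in ("TERMINATED", "ERRORED"):
        if trial.status == "PENDING":
            # Resume from the last checkpoint, or from scratch if none exists.
            trial.iteration = trial.last_checkpoint or 0
            trial.status = "RUNNING"
        steps_executed += 1
        if steps_executed in fail_at:
            trial.num_failures += 1
            if trial.num_failures > max_failures:
                trial.status = "ERRORED"
            else:
                trial.status = "PENDING"  # restored on the next loop pass
            continue
        trial.iteration += 1
        if checkpoint_freq and trial.iteration % checkpoint_freq == 0:
            trial.last_checkpoint = trial.iteration
        if trial.iteration >= total_iters:
            trial.status = "TERMINATED"
    return trial, steps_executed
```

In this sketch, a failure on the fourth step with ``checkpoint_freq=2`` rolls the trial back to iteration 2, so the work done after that checkpoint is repeated once the trial is rescheduled.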

See the docstring at :ref:`trialrunner-docstring`.

Trial objects
~~~~~~~~~~~~~
[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trial.py>`__]
A Trial is an internal data structure that contains metadata about a single training run. Each Trial
object maps one-to-one to a Trainable object but is not itself
distributed/remote. Trial objects transition among
the following states: ``"PENDING"``, ``"RUNNING"``, ``"PAUSED"``, ``"ERRORED"``, and
``"TERMINATED"``.
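
As a rough illustration of such a metadata record, the sketch below keeps a status field restricted to the states listed above. ``TrialRecord`` and its fields are hypothetical and much smaller than the real Trial class.

```python
from dataclasses import dataclass, field

# State names as listed in the text above.
STATES = ("PENDING", "RUNNING", "PAUSED", "ERRORED", "TERMINATED")


@dataclass
class TrialRecord:
    """Hypothetical, local-only metadata record for one training run."""
    trial_id: str
    config: dict
    status: str = "PENDING"
    last_result: dict = field(default_factory=dict)

    def set_status(self, status):
        # Reject anything that is not one of the known states.
        if status not in STATES:
            raise ValueError(f"unknown trial status: {status!r}")
        self.status = status
```

The record lives in the driver process only; the object that actually trains is the Trainable it is paired with.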

See the docstring at :ref:`trial-docstring`.

TrialExecutor
~~~~~~~~~~~~~
[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trial_executor.py>`__]
The TrialExecutor is a component that interacts with the underlying execution framework.
It also manages resources to ensure the cluster isn't overloaded. By default, the TrialExecutor uses Ray to execute trials.
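
The resource bookkeeping mentioned above can be pictured with a minimal sketch. It assumes a single CPU count and is not how the TrialExecutor is implemented; ``ToyResourceTracker`` and its methods are hypothetical names.

```python
class ToyResourceTracker:
    """Hypothetical CPU bookkeeping; only starts trials that fit."""

    def __init__(self, total_cpus):
        self.total_cpus = total_cpus
        self.committed = {}  # trial_id -> CPUs currently reserved

    @property
    def available_cpus(self):
        return self.total_cpus - sum(self.committed.values())

    def try_start(self, trial_id, cpus):
        # Refuse to start a trial that would exceed the cluster capacity.
        if cpus > self.available_cpus:
            return False
        self.committed[trial_id] = cpus
        return True

    def stop(self, trial_id):
        # Release the reservation so that waiting trials can start.
        self.committed.pop(trial_id, None)
```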

See the docstring at :ref:`raytrialexecutor-docstring`.


SearchAlg
~~~~~~~~~
[`source code <https://github.com/ray-project/ray/tree/master/python/ray/tune/suggest>`__] The SearchAlgorithm is a user-provided object
that Tune queries for new hyperparameter configurations to evaluate.

SearchAlgorithms are notified each time a trial finishes
executing one training step (one call to ``train()``), each time a trial
errors, and each time a trial completes.
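
The sketch below shows one way an object could respond to these three notifications. It is a standalone toy random search, not a subclass of any Tune class; the method names ``suggest``, ``on_trial_result``, and ``on_trial_complete`` and the ``loss`` metric are assumptions made for illustration.

```python
import random


class ToyRandomSearch:
    """Hypothetical search algorithm that samples a learning rate uniformly."""

    def __init__(self, low, high, seed=0):
        self._rng = random.Random(seed)
        self._low, self._high = low, high
        self.best = None   # (trial_id, loss) of the best completed trial
        self.events = []   # log of received notifications

    def suggest(self, trial_id):
        # Called when a new configuration is needed.
        return {"lr": self._rng.uniform(self._low, self._high)}

    def on_trial_result(self, trial_id, result):
        # Called after each training step of a trial.
        self.events.append(("result", trial_id))

    def on_trial_complete(self, trial_id, result=None, error=False):
        # Called once when a trial errors or completes.
        if error:
            self.events.append(("error", trial_id))
            return
        self.events.append(("complete", trial_id))
        if self.best is None or result["loss"] < self.best[1]:
            self.best = (trial_id, result["loss"])
```

A more capable algorithm would use the intermediate and final results to bias later suggestions; this toy version only records them.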

TrialScheduler
~~~~~~~~~~~~~~
[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/schedulers>`__] TrialSchedulers operate over a set of possible trials to run,
prioritizing trial execution given available cluster resources.

TrialSchedulers can kill or pause trials
and can reorder or prioritize incoming trials.
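
A minimal sketch of these two responsibilities follows. It is a standalone toy, not a Tune scheduler: the decision constants, the ``grace_period`` and ``max_loss`` parameters, and the ``(trial_id, priority)`` tuples are assumptions chosen for the example.

```python
# Decisions a scheduler can return for a running trial (names are assumptions).
CONTINUE, PAUSE, STOP = "CONTINUE", "PAUSE", "STOP"


class ToyThresholdScheduler:
    """Hypothetical scheduler that stops trials whose loss stays too high."""

    def __init__(self, grace_period, max_loss):
        self.grace_period = grace_period
        self.max_loss = max_loss

    def on_trial_result(self, trial_id, result):
        # Kill a trial once it is past the grace period and still above the threshold.
        past_grace = result["training_iteration"] >= self.grace_period
        if past_grace and result["loss"] > self.max_loss:
            return STOP
        return CONTINUE

    def choose_trial_to_run(self, pending):
        # `pending` is a list of (trial_id, priority); the highest priority runs first.
        if not pending:
            return None
        return max(pending, key=lambda item: item[1])[0]
```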

Trainables
~~~~~~~~~~
[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trainable.py>`__]
Trainables are user-provided objects that carry out
the training process. If a class is provided, it is expected to conform to the
Trainable interface. If a function is provided, it is wrapped into a
Trainable class, and the function itself is executed on a separate thread.

A Trainable executes one step of ``train()`` and then notifies the TrialRunner.
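
To make the function-wrapping idea concrete, the sketch below runs a user function on a separate thread and hands back one reported result per ``train()`` call. It is a simplified illustration under stated assumptions, not Tune's wrapper: ``ToyFunctionTrainable``, the ``report`` callback, and the ``{"done": True}`` sentinel result are all hypothetical.

```python
import queue
import threading


class ToyFunctionTrainable:
    """Hypothetical wrapper that exposes a function through a train() method."""

    _DONE = object()  # sentinel marking the end of the user function

    def __init__(self, func, config):
        # A queue of size 1 makes the function wait until each result is consumed.
        self._results = queue.Queue(maxsize=1)
        self._thread = threading.Thread(
            target=self._run, args=(func, config), daemon=True
        )
        self._started = False

    def _run(self, func, config):
        try:
            func(config, self._results.put)
        finally:
            # Always signal completion, even if the function raised.
            self._results.put(self._DONE)

    def train(self):
        # Start the function lazily, then block until it reports one result.
        if not self._started:
            self._thread.start()
            self._started = True
        item = self._results.get()
        if item is self._DONE:
            return {"done": True}
        return item


def objective(config, report):
    # Example user function: reports one result per step.
    for step in range(3):
        report({"step": step, "score": config["x"] * step})
```

Each call to ``train()`` on ``ToyFunctionTrainable(objective, {"x": 2})`` returns the next reported dictionary, and the call after the last report returns the sentinel result.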


.. _raytrialexecutor-docstring:

RayTrialExecutor
----------------

.. autoclass:: ray.tune.ray_trial_executor.RayTrialExecutor
    :show-inheritance:
    :members:

.. _trialexecutor-docstring:

TrialExecutor
-------------

.. autoclass:: ray.tune.trial_executor.TrialExecutor
    :members:

.. _trialrunner-docstring:

TrialRunner
-----------

.. autoclass:: ray.tune.trial_runner.TrialRunner

.. _trial-docstring:

Trial
-----

.. autoclass:: ray.tune.trial.Trial

.. _tune-callbacks-docs:

Callbacks
---------

.. autoclass:: ray.tune.callback.Callback
   :members:


.. _resources-docstring:

PlacementGroupFactory
---------------------

.. autoclass:: ray.tune.utils.placement_groups.PlacementGroupFactory


Registry
--------

.. autofunction:: ray.tune.register_trainable

.. autofunction:: ray.tune.register_env