How does Tune work?
===================

This page provides an overview of Tune's inner workings.
We describe in detail what happens when you call ``Tuner.fit()``, what the lifecycle of a Tune trial looks like,
and what the architectural components of Tune are.

.. tip:: Before you continue, be sure to have read :ref:`the Tune Key Concepts page <tune-60-seconds>`.

What happens in ``Tuner.fit``?
------------------------------

When calling the following:

.. code-block:: python

    space = {"x": tune.uniform(0, 1)}
    tuner = tune.Tuner(my_trainable, param_space=space, tune_config=tune.TuneConfig(num_samples=10))
    results = tuner.fit()

The provided ``my_trainable`` is evaluated multiple times in parallel
with different hyperparameters (sampled from ``uniform(0, 1)``).

Every Tune run consists of a "driver process" and many "worker processes".
The driver process is the Python process that calls ``Tuner.fit()`` (which calls ``ray.init()`` under the hood).
The Tune driver process runs on the node where you run your script (which calls ``Tuner.fit()``),
while Ray Tune trainable "actors" run on any node (either on the same node or on worker nodes (distributed Ray only)).

.. note:: :ref:`Ray Actors <actor-guide>` allow you to parallelize an instance of a class in Python.
   When you instantiate a class that is a Ray actor, Ray will start an instance of that class on a separate process,
   either on the same machine or on another machine if running a Ray cluster.
   This actor can then asynchronously execute method calls and maintain its own internal state.

The driver spawns parallel worker processes (:ref:`Ray actors <actor-guide>`)
that are responsible for evaluating each trial using its hyperparameter configuration and the provided trainable
(see the `ray trial executor source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/ray_trial_executor.py>`__).

While the Trainable is executing (:ref:`trainable-execution`), the Tune Driver communicates with each actor
via actor methods to receive intermediate training results and pause/stop actors (see :ref:`trial-lifecycle`).

When the Trainable terminates (or is stopped), the actor is also terminated.
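
To make the driver/worker split concrete, here is a minimal sketch that mimics the pattern with plain Python threads instead of Ray actors. The names ``run_trials`` and ``my_trainable`` are illustrative stand-ins, not Tune's API:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def my_trainable(config):
    # A stand-in objective: in real Tune this would train a model
    # and report metrics back to the driver.
    return {"score": config["x"] ** 2}

def run_trials(trainable, num_samples=10, seed=0):
    # The "driver" samples one config per trial from the search space
    # and farms the evaluations out to parallel workers, much like
    # Tune schedules one Ray actor per trial.
    rng = random.Random(seed)
    configs = [{"x": rng.uniform(0, 1)} for _ in range(num_samples)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(trainable, configs))
    return configs, results

configs, results = run_trials(my_trainable)
best = min(results, key=lambda r: r["score"])
```

Unlike this thread pool, Ray actors are separate processes and can live on other machines, but the driver-side shape of the loop (sample configs, dispatch, collect results) is the same.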

.. _trainable-execution:

The execution of a trainable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Tune uses :ref:`Ray actors <actor-guide>` to parallelize the evaluation of multiple hyperparameter configurations.
Each actor is a Python process that executes an instance of the user-provided Trainable.

The definition of the user-provided Trainable will be
:ref:`serialized via cloudpickle <serialization-guide>` and sent to each actor process.
Each Ray actor will start an instance of the Trainable to be executed.

If the Trainable is a class, it will be executed iteratively by calling ``train/step``.
After each invocation, the driver is notified that a "result dict" is ready.
The driver will then pull the result via ``ray.get``.
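
A rough sketch of that iterative loop, using plain local Python in place of Ray actors and ``ray.get``. The class and driver loop below are simplified stand-ins for Tune's actual ``Trainable`` interface:

```python
class MyTrainable:
    # A minimal stand-in for the class-based Trainable API: setup() runs
    # once, and step() is called repeatedly, each call returning a result dict.
    def setup(self, config):
        self.lr = config["lr"]
        self.iteration = 0

    def step(self):
        self.iteration += 1
        # Pretend the "loss" improves with each iteration.
        return {"training_iteration": self.iteration,
                "loss": 1.0 / (self.iteration * self.lr)}

def driver_loop(trainable_cls, config, num_iterations):
    # The driver repeatedly invokes step() on the trainable and pulls each
    # result dict back; in real Tune that pull happens via ray.get().
    trainable = trainable_cls()
    trainable.setup(config)
    return [trainable.step() for _ in range(num_iterations)]

results = driver_loop(MyTrainable, {"lr": 0.5}, num_iterations=3)
```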

If the trainable is a callable or a function, it will be executed on the Ray actor process on a separate execution thread.
Whenever ``session.report`` is called, the execution thread is paused and waits for the driver to pull a
result (see `function_trainable.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/trainable/function_trainable.py>`__).
After pulling, the actor's execution thread will automatically resume.
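
That pause-and-resume handoff can be approximated with two threads, a bounded queue, and an event. Everything here (``_Report``, ``pull``) is a simplified stand-in for Tune's internal machinery, not its API:

```python
import threading
import queue

class _Report:
    # Mimics the handoff behind session.report(): the training thread parks
    # each result in a size-1 queue and blocks until the driver pulls it.
    def __init__(self):
        self._results = queue.Queue(maxsize=1)
        self._pulled = threading.Event()

    def report(self, **metrics):
        # Called from the training thread; pauses it until the pull happens.
        self._pulled.clear()
        self._results.put(metrics)
        self._pulled.wait()

    def pull(self):
        # Called from the driver; lets the training thread resume.
        result = self._results.get()
        self._pulled.set()
        return result

def train_fn(session):
    # A function "trainable" that reports one result per step.
    for i in range(3):
        session.report(step=i)

session = _Report()
worker = threading.Thread(target=train_fn, args=(session,))
worker.start()
results = [session.pull() for _ in range(3)]
worker.join()
```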

Resource Management in Tune
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before running a trial, the Ray Tune driver will check whether there are available
resources on the cluster (see :ref:`resource-requirements`).
It will compare the available resources with the resources required by the trial.

If there is space on the cluster, then the Tune Driver will start a Ray actor (worker).
This actor will be scheduled and executed on some node where the resources are available.

See :doc:`tune-resources` for more information.
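
A toy version of that bookkeeping, with resources as plain dicts. ``has_resources`` and ``try_start_trial`` are hypothetical helpers for illustration, not Tune's API:

```python
def has_resources(required, available):
    # A trial may start only if every required resource fits into
    # what the cluster currently has free.
    return all(available.get(key, 0) >= amount
               for key, amount in required.items())

def try_start_trial(required, available):
    # If there is room, "start" the trial by reserving its resources;
    # otherwise the trial stays PENDING.
    if not has_resources(required, available):
        return False
    for key, amount in required.items():
        available[key] -= amount
    return True

cluster = {"CPU": 8, "GPU": 1}
started = try_start_trial({"CPU": 4, "GPU": 1}, cluster)
```

In real Tune the reservation is handled by Ray's scheduler via placement groups, but the driver-side decision has this shape: compare, reserve, start.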

.. _trial-lifecycle:

Lifecycle of a Trial
--------------------

A trial's life cycle consists of 6 stages:

* **Initialization** (generation): A trial is first generated as a hyperparameter sample,
  and its parameters are configured according to what was provided in ``Tuner``.
  Trials are then placed into a queue to be executed (with status PENDING).

* **PENDING**: A pending trial is a trial to be executed on the machine.
  Every trial is configured with resource values. Whenever the trial's resource values are available,
  Tune will run the trial (by starting a Ray actor holding the config and the training function).

* **RUNNING**: A running trial is assigned a Ray Actor. There can be multiple running trials in parallel.
  See the :ref:`trainable execution <trainable-execution>` section for more details.

* **ERRORED**: If a running trial throws an exception, Tune will catch that exception and mark the trial as errored.
  Note that exceptions can be propagated from an actor to the main Tune driver process.
  If ``max_retries`` is set, Tune will set the trial back into "PENDING" and later start it from the last checkpoint.

* **TERMINATED**: A trial is terminated if it is stopped by a Stopper/Scheduler.
  If using the Function API, the trial is also terminated when the function stops.

* **PAUSED**: A trial can be paused by a Trial scheduler. This means that the trial's actor will be stopped.
  A paused trial can later be resumed from the most recent checkpoint.
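
The statuses after initialization can be modeled as a small state machine. The transition table below is a simplified approximation of Tune's actual logic, not a definitive specification:

```python
# Allowed status transitions, approximated from the descriptions above.
TRANSITIONS = {
    "PENDING": {"RUNNING"},
    "RUNNING": {"PAUSED", "ERRORED", "TERMINATED"},
    "PAUSED": {"PENDING", "RUNNING"},  # resumed from the latest checkpoint
    "ERRORED": {"PENDING"},            # retried if max_retries allows it
    "TERMINATED": set(),               # terminal state
}

class Trial:
    def __init__(self):
        self.status = "PENDING"  # every generated trial starts in the queue

    def set_status(self, new_status):
        # Reject transitions the lifecycle does not allow.
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(
                f"illegal transition {self.status} -> {new_status}")
        self.status = new_status
```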

Tune's Architecture
-------------------

.. image:: ../../images/tune-arch.png

The blue boxes refer to internal components, while green boxes are public-facing.

Tune's main components consist of ``TrialRunner``, ``Trial`` objects, ``TrialExecutor``, ``SearchAlg``,
``TrialScheduler``, and ``Trainable``.

.. _trial-runner-flow:

This is an illustration of the high-level training flow and how some of the components interact:

*Note: This figure is horizontally scrollable*

.. figure:: ../../images/tune-trial-runner-flow-horizontal.png
   :class: horizontal-scroll

TrialRunner
~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trial_runner.py>`__]
This is the main driver of the training loop. This component
uses the TrialScheduler to prioritize and execute trials,
queries the SearchAlgorithm for new
configurations to evaluate, and handles the fault tolerance logic.

**Fault Tolerance**: The TrialRunner executes checkpointing if ``checkpoint_freq``
is set, along with automatic trial restarting in case of trial failures (if ``max_failures`` is set).
For example, if a node is lost while a trial (specifically, the corresponding
Trainable of the trial) is still executing on that node and checkpointing
is enabled, the trial will then be reverted to a ``"PENDING"`` state and resumed
from the last available checkpoint when it is run.

The TrialRunner is also in charge of checkpointing the entire experiment execution state
upon each loop iteration. This allows users to restart their experiment
in case of machine failure.

See the docstring at :ref:`trialrunner-docstring`.
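
A minimal sketch of experiment-state checkpointing, assuming a JSON snapshot of trial statuses; Tune's real experiment checkpoint format is considerably richer than this:

```python
import json
import os
import tempfile

def save_experiment_state(trials, path):
    # Snapshot every trial's status so the experiment can be restarted
    # after a machine failure (done on each TrialRunner loop iteration).
    with open(path, "w") as f:
        json.dump(trials, f)

def restore_experiment_state(path):
    with open(path) as f:
        trials = json.load(f)
    # Anything that was mid-flight goes back to PENDING so it can be
    # rescheduled from its last available checkpoint.
    for trial in trials:
        if trial["status"] == "RUNNING":
            trial["status"] = "PENDING"
    return trials

state_file = os.path.join(tempfile.mkdtemp(), "experiment_state.json")
trials = [{"id": "t1", "status": "TERMINATED"},
          {"id": "t2", "status": "RUNNING"}]
save_experiment_state(trials, state_file)
restored = restore_experiment_state(state_file)
```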

Trial objects
~~~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trial.py>`__]
This is an internal data structure that contains metadata about each training run. Each Trial
object is mapped one-to-one with a Trainable object, but Trial objects are not themselves
distributed/remote. Trial objects transition among
the following states: ``"PENDING"``, ``"RUNNING"``, ``"PAUSED"``, ``"ERRORED"``, and
``"TERMINATED"``.

See the docstring at :ref:`trial-docstring`.

RayTrialExecutor
~~~~~~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/ray_trial_executor.py>`__]
The RayTrialExecutor is a component that interacts with the underlying execution framework.
It also manages resources to ensure the cluster isn't overloaded.

See the docstring at :ref:`raytrialexecutor-docstring`.

SearchAlg
~~~~~~~~~

[`source code <https://github.com/ray-project/ray/tree/master/python/ray/tune/search>`__]
The SearchAlgorithm is a user-provided object
that is used for querying new hyperparameter configurations to evaluate.

SearchAlgorithms will be notified every time a trial finishes
executing one training step (of ``train()``), every time a trial
errors, and every time a trial completes.
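
A toy searcher with the same shape of interface, here just sampling uniformly at random. The method names mirror the concept but simplify Tune's actual ``Searcher`` API:

```python
import random

class RandomSearch:
    # A minimal SearchAlgorithm-like object: suggest() hands out new
    # configurations, and the on_trial_* hooks receive results, which
    # smarter searchers (e.g. Bayesian optimization) would learn from.
    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self.completed = {}

    def suggest(self, trial_id):
        return {"x": self._rng.uniform(0, 1)}

    def on_trial_result(self, trial_id, result):
        pass  # called after each training step of the trial

    def on_trial_complete(self, trial_id, result):
        self.completed[trial_id] = result

searcher = RandomSearch()
config = searcher.suggest("trial_1")
searcher.on_trial_complete("trial_1", {"loss": config["x"]})
```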

TrialScheduler
~~~~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/schedulers>`__]
TrialSchedulers operate over a set of possible trials to run,
prioritizing trial execution given available cluster resources.

TrialSchedulers can kill or pause trials,
and can also reorder and prioritize incoming trials.

Trainables
~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trainable/trainable.py>`__]
These are user-provided objects that are used for
the training process. If a class is provided, it is expected to conform to the
Trainable interface. If a function is provided, it is wrapped into a
Trainable class, and the function itself is executed on a separate thread.

Trainables will execute one step of ``train()`` before notifying the TrialRunner.