Tune Design Guide
=================

In this part of the documentation, we overview the design and architecture
of Tune.

.. image:: images/tune-arch.png

The blue boxes refer to internal components, and green boxes are public-facing.
Please refer to the package reference for `user-facing APIs <tune-package-ref.html>`__.

Main Components
---------------

Tune's main components consist of TrialRunner, Trial objects, TrialExecutor, SearchAlg, TrialScheduler, and Trainable.

TrialRunner
~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trial_runner.py>`__]
This is the main driver of the training loop. This component
uses the TrialScheduler to prioritize and execute trials,
queries the SearchAlgorithm for new
configurations to evaluate, and handles the fault tolerance logic.

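The loop can be pictured with the simplified sketch below; the class and
method names are illustrative only and do not mirror the actual
``TrialRunner`` implementation.

.. code-block:: python

    # Hypothetical sketch of one iteration of the driver loop; this is not
    # the actual TrialRunner implementation or API.
    class SimplifiedTrialRunner:
        def __init__(self, search_alg, scheduler, executor):
            self.search_alg = search_alg  # supplies new configurations to evaluate
            self.scheduler = scheduler    # prioritizes, pauses, or stops trials
            self.executor = executor      # launches Trainables (via Ray by default)
            self.trials = []

        def step(self):
            # Ask the search algorithm for new trials to evaluate.
            self.trials.extend(self.search_alg.next_trials())
            # Let the scheduler decide which trial should run next.
            trial = self.scheduler.choose_trial_to_run(self)
            if trial is not None:
                self.executor.start_trial(trial)
            # Wait for one result and route it to the scheduler and search algorithm.
            trial, result = self.executor.fetch_result()
            self.scheduler.on_trial_result(self, trial, result)
            self.search_alg.on_trial_result(trial.trial_id, result)
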
**Fault Tolerance**: The TrialRunner executes checkpointing if ``checkpoint_freq``
is set, along with automatic trial restarting in case of trial failures (if ``max_failures`` is set).
For example, if a node is lost while a trial (specifically, the corresponding
Trainable of the trial) is still executing on that node and checkpointing
is enabled, the trial will then be reverted to a ``"PENDING"`` state and resumed
from the last available checkpoint when it is run.
The TrialRunner is also in charge of checkpointing the entire experiment execution state
upon each loop iteration. This allows users to restart their experiment
in case of machine failure.

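As an illustration, both behaviors are enabled through arguments to
``tune.run``; this is a hedged sketch in which ``my_trainable`` is a
placeholder for a user-defined Trainable, and argument names may vary
across Tune versions.

.. code-block:: python

    from ray import tune

    # Checkpoint every 10 training iterations, retry a failed trial up to
    # 3 times, and resume the experiment state after a crash.
    # ``my_trainable`` is a placeholder for a user-defined Trainable.
    tune.run(
        my_trainable,
        checkpoint_freq=10,
        max_failures=3,
        resume=True,
    )
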
Trial objects
~~~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trial.py>`__]
This is an internal data structure that contains metadata about each training run. Each Trial
object is mapped one-to-one with a Trainable object but is not itself
distributed/remote. Trial objects transition among
the following states: ``"PENDING"``, ``"RUNNING"``, ``"PAUSED"``, ``"ERRORED"``, and
``"TERMINATED"``.

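Conceptually, a Trial is little more than a metadata record plus a status
field; the following is a hypothetical simplification rather than the actual
class.

.. code-block:: python

    # Hypothetical simplification of the metadata a Trial carries; not the real class.
    class SimplifiedTrial:
        VALID_STATUSES = {"PENDING", "RUNNING", "PAUSED", "ERRORED", "TERMINATED"}

        def __init__(self, trainable_name, config):
            self.trainable_name = trainable_name  # which Trainable to run
            self.config = config                  # hyperparameter configuration
            self.status = "PENDING"               # trials start out pending
            self.last_result = None               # most recent training result

        def set_status(self, status):
            assert status in self.VALID_STATUSES
            self.status = status
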
TrialExecutor
~~~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trial_executor.py>`__]
The TrialExecutor is a component that interacts with the underlying execution framework.
It also manages resources to ensure the cluster isn't overloaded. By default, the TrialExecutor uses Ray to execute trials.

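Its responsibilities can be summarized with a minimal interface like the one
below; the method names are assumptions for illustration, not a copy of the
actual TrialExecutor API.

.. code-block:: python

    # Illustrative sketch of a trial executor's responsibilities; method names
    # are assumptions, not the actual TrialExecutor interface.
    class SimplifiedTrialExecutor:
        def has_resources(self, resources):
            """Return True if the cluster can currently fit these resources."""
            raise NotImplementedError

        def start_trial(self, trial):
            """Launch the trial's Trainable, e.g. as a remote Ray actor."""
            raise NotImplementedError

        def stop_trial(self, trial):
            """Stop the trial and free its resources."""
            raise NotImplementedError
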
SearchAlg
~~~~~~~~~

[`source code <https://github.com/ray-project/ray/tree/master/python/ray/tune/suggest>`__] The SearchAlgorithm is a user-provided object
that is used for querying new hyperparameter configurations to evaluate.

SearchAlgorithms will be notified every time a trial finishes
executing one training step (of ``train()``), every time a trial
errors, and every time a trial completes.

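These notifications can be pictured with a toy searcher like the one below;
the hook names mirror the description above and are illustrative, not a
definitive copy of the SearchAlgorithm interface.

.. code-block:: python

    import random

    # Toy searcher that proposes random learning rates and records results.
    # Hook names are illustrative of the notifications described above.
    class RandomSearcher:
        def __init__(self):
            self.results = {}

        def suggest(self, trial_id):
            # Propose a new hyperparameter configuration to evaluate.
            return {"lr": 10 ** random.uniform(-4, -1)}

        def on_trial_result(self, trial_id, result):
            # Called after each completed training step of ``train()``.
            self.results.setdefault(trial_id, []).append(result)

        def on_trial_complete(self, trial_id, result=None, error=False):
            # Called when a trial finishes or errors.
            if error:
                self.results.pop(trial_id, None)
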
TrialScheduler
~~~~~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/schedulers>`__] TrialSchedulers operate over a set of possible trials to run,
prioritizing trial execution given available cluster resources.

TrialSchedulers are given the ability to kill or pause trials,
and also to reorder/prioritize incoming trials.

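As a usage sketch, a built-in scheduler such as AsyncHyperBandScheduler can be
passed to ``tune.run``; ``my_trainable`` is a placeholder, and constructor
arguments may differ between Tune versions.

.. code-block:: python

    from ray import tune
    from ray.tune.schedulers import AsyncHyperBandScheduler

    # Illustrative only: the scheduler may pause or stop underperforming trials.
    scheduler = AsyncHyperBandScheduler(
        time_attr="training_iteration",
        max_t=100,
    )

    tune.run(
        my_trainable,        # placeholder for a user-defined Trainable
        num_samples=20,
        scheduler=scheduler,
    )
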
Trainables
~~~~~~~~~~

[`source code <https://github.com/ray-project/ray/blob/master/python/ray/tune/trainable.py>`__]
These are user-provided objects that are used for
the training process. If a class is provided, it is expected to conform to the
Trainable interface. If a function is provided, it is wrapped into a
Trainable class, and the function itself is executed on a separate thread.

Trainables will execute one step of ``train()`` before notifying the TrialRunner.
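
Both forms can be sketched as below; the class-based method names
(``_setup``/``_train``) follow older Tune releases and may differ in newer
versions, and the function form reports results through the ``reporter``
callback it receives.

.. code-block:: python

    from ray.tune import Trainable

    # Class-based API: conforms to the Trainable interface. Method names follow
    # older Tune releases and may differ in newer versions.
    class MyTrainable(Trainable):
        def _setup(self, config):
            self.lr = config["lr"]
            self.loss = 1.0

        def _train(self):
            # One training step; the returned dict is reported to the TrialRunner.
            self.loss *= 0.9
            return {"mean_loss": self.loss}

    # Function-based API: wrapped into a Trainable and run on a separate thread.
    def my_trainable_fn(config, reporter):
        loss = 1.0
        for _ in range(100):
            loss *= 0.9
            reporter(mean_loss=loss)  # one reported result per training step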