Added example to user guide for cloud checkpointing (#20045)

Co-authored-by: will <will@anyscale.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Will Drevo 2021-11-15 07:43:06 -08:00 committed by GitHub
parent 6ff4061f3a
commit fa878e2d4d
16 changed files with 401 additions and 182 deletions


@ -6,6 +6,7 @@
- Get rid of the badges in the top
- Get rid of the references section at the bottom
- Be sure not to delete the API reference section in the bottom of this file.
- add `.. _lightgbm-ray-tuning:` before the "Hyperparameter Tuning" section
- Adjust some link targets (e.g. for "Ray Tune") to anonymous references
  by adding a second underscore (use `target <link>`__)
- Search for `\ **` and delete this from the links (bold links are not supported)
@ -217,6 +218,8 @@ Example loading multiple parquet files:
        columns=columns,
        filetype=RayFileType.PARQUET)

.. _lightgbm-ray-tuning:

Hyperparameter Tuning
---------------------


@ -6,6 +6,7 @@
- remove the table of contents
- remove the PyTorch Lightning Compatibility section
- Be sure not to delete the API reference section in the bottom of this file.
- add `.. _ray-lightning-tuning:` before the "Hyperparameter Tuning with Ray Tune" section
- Adjust some link targets (e.g. for "Ray Tune") to anonymous references
  by adding a second underscore (use `target <link>`__)
- Search for `\ **` and delete this from the links (bold links are not supported)
@ -131,6 +132,8 @@ With sharded training, leverage the scalability of data parallel training while
See the `Pytorch Lightning docs <https://pytorch-lightning.readthedocs.io/en/stable/advanced/multi_gpu.html#sharded-training>`__ for more information on sharded training.

.. _ray-lightning-tuning:

Hyperparameter Tuning with Ray Tune
-----------------------------------


@ -69,6 +69,7 @@ In your training program, insert the following, and **customize** for each worke
And on each machine, launch a separate process that contains the index of the worker and information about all other nodes of the cluster.

.. _ray-train-tftrainer-example:

TFTrainer Example
-----------------


@ -127,55 +127,13 @@ If you used a cluster configuration (starting a cluster with ``ray up`` or ``ray
1. In the examples, the Ray redis address commonly used is ``localhost:6379``.
2. If the Ray cluster is already started, you should not need to run anything on the worker nodes.
Syncing
-------

In a distributed experiment, you should try to use :ref:`cloud checkpointing <tune-cloud-checkpointing>` to
reduce synchronization overhead. For this, you just have to specify an ``upload_dir`` in the
:class:`tune.SyncConfig <ray.tune.SyncConfig>`:
.. code-block:: python
@ -189,89 +147,10 @@ This will automatically store both the experiment state and the trial checkpoint
        )
    )
For more details or customization, see our
:ref:`guide on checkpointing <tune-checkpoint-syncing>`.
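The body of the code block above is collapsed between the two diff hunks. A minimal sketch of what it contains, assuming the same ``tune.SyncConfig`` usage shown in the user guide changes later in this commit, would be:

.. code-block:: python

    from ray import tune

    tune.run(
        trainable,
        name="experiment_name",
        sync_config=tune.SyncConfig(
            upload_dir="s3://bucket-name/sub-path/"
        )
    )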
.. _tune-distributed-spot:


@ -148,7 +148,7 @@ to decide which hyperparameter configuration lead to the best results. These met
can also be used to stop poorly performing trials early in order to avoid wasting
resources on those trials.

The :ref:`checkpoint saving <tune-checkpoint-syncing>` is optional. However, it is necessary if we want to use advanced
schedulers like `Population Based Training <https://docs.ray.io/en/master/tune/tutorials/tune-advanced-tutorial.html>`_.
In these cases, the created checkpoint directory will be passed as the ``checkpoint_dir`` parameter
to the training function.
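For illustration, a minimal function trainable following this pattern might look like the sketch below (the metric name and checkpoint contents are illustrative; the full version lives in the ``custom_func_checkpointing`` example added by this commit):

.. code-block:: python

    import json
    import os

    from ray import tune

    def train_func(config, checkpoint_dir=None):
        start = 0
        # If the trial is resumed (e.g. by PBT), Tune passes the directory
        # of the last checkpoint via ``checkpoint_dir``.
        if checkpoint_dir:
            with open(os.path.join(checkpoint_dir, "checkpoint")) as f:
                start = json.load(f)["step"] + 1

        for step in range(start, 10):
            # Write a new checkpoint for this step.
            with tune.checkpoint_dir(step=step) as new_checkpoint_dir:
                with open(os.path.join(new_checkpoint_dir, "checkpoint"), "w") as f:
                    json.dump({"step": step}, f)
            tune.report(mean_loss=1.0 / (step + 1))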


@ -20,7 +20,7 @@ Tune includes distributed implementations of early stopping algorithms such as `
.. tip:: The easiest scheduler to start with is the ``ASHAScheduler`` which will aggressively terminate low-performing trials.

When using schedulers, you may face compatibility issues, as shown in the below compatibility matrix. Certain schedulers cannot be used with Search Algorithms, and certain schedulers require :ref:`checkpointing to be implemented <tune-checkpoint-syncing>`.

Schedulers can dynamically change trial resource requirements during tuning. This is currently implemented in ``ResourceChangingScheduler``, which can wrap around any other scheduler.
@ -145,7 +145,7 @@ Tune includes a distributed implementation of `Population Based Training (PBT) <
    })
    tune.run( ... , scheduler=pbt_scheduler)

When the PBT scheduler is enabled, each trial variant is treated as a member of the population. Periodically, top-performing trials are checkpointed (this requires your Trainable to support :ref:`save and restore <tune-checkpoint-syncing>`). Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation.

You can run this :doc:`toy PBT example </tune/examples/pbt_function>` to get an idea of how PBT operates. When training in PBT mode, a single trial may see many different hyperparameters over its lifetime, which is recorded in its ``result.json`` file. The following figure generated by the example shows PBT with optimizing a LR schedule over the course of a single experiment:
@ -212,7 +212,7 @@ PB2 can be enabled by setting the ``scheduler`` parameter of ``tune.run``, e.g.:
    tune.run( ... , scheduler=pb2_scheduler)

When the PB2 scheduler is enabled, each trial variant is treated as a member of the population. Periodically, top-performing trials are checkpointed (this requires your Trainable to support :ref:`save and restore <tune-checkpoint-syncing>`). Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation.

The primary motivation for PB2 is the ability to find promising hyperparameters with only a small population size. With that in mind, you can run this :doc:`PB2 PPO example </tune/examples/pb2_ppo_example>` to compare PB2 vs. PBT, with a population size of ``4`` (as in the paper). The example uses the ``BipedalWalker`` environment so does not require any additional licenses.
@ -239,7 +239,7 @@ This class is a utility scheduler, allowing for trial resource requirements to b
* If you are using the Trainable (class) API for tuning, your Trainable must implement ``Trainable.update_resources``, which will let your model know about the new resources assigned. You can also obtain the current trial resources by calling ``Trainable.trial_resources``.
* If you are using the functional API for tuning, the current trial resources can be obtained by calling `tune.get_trial_resources()` inside the training function. The function should be able to :ref:`load and save checkpoints <tune-checkpoint-syncing>` (the latter preferably every iteration).

An example of this in use can be found here: :doc:`/tune/examples/xgboost_dynamic_resources_example`.
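As a rough sketch of the functional API case (the training loop and reported metric are illustrative, not taken from this commit), the training function can query its assigned resources each iteration:

.. code-block:: python

    from ray import tune

    def train_func(config, checkpoint_dir=None):
        for step in range(100):
            # With a ResourceChangingScheduler, the resources assigned to this
            # trial may change between iterations, so query them inside the loop.
            trial_resources = tune.get_trial_resources()
            print("Current trial resources:", trial_resources)

            # ... train for one step and save a checkpoint here ...
            tune.report(mean_loss=1.0 / (step + 1))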


@ -256,17 +256,6 @@ Use ``validate_save_restore`` to catch ``save_checkpoint``/``load_checkpoint`` e
    validate_save_restore(MyTrainableClass, use_object_store=True)
.. _tune-cloud-checkpointing:
Storing checkpoints on cloud storage
------------------------------------
Ray Tune trainables can sync trial logs and checkpoints to cloud storage (via the `upload_dir`). This is especially
useful when training a large number of distributed trials, as logs and checkpoints are otherwise synchronized
via SSH, which quickly can become a performance bottleneck.
To make use of cloud checkpointing, just specify an ``upload_dir`` in the
:ref:`tune.SyncConfig <tune-sync-config>`.
Advanced: Reusing Actors
~~~~~~~~~~~~~~~~~~~~~~~~


@ -0,0 +1,6 @@
:orphan:
custom_func_checkpointing
~~~~~~~~~~~~~~~~~~~~~~~~~
.. literalinclude:: /../../python/ray/tune/examples/custom_func_checkpointing.py


@ -13,7 +13,7 @@ Tune is a Python library for experiment execution and hyperparameter tuning at a
* Launch a multi-node :ref:`distributed hyperparameter sweep <tune-distributed>` in less than 10 lines of code.
* Supports any machine learning framework, :ref:`including PyTorch, XGBoost, MXNet, and Keras <tune-guides>`.
* Automatically manages :ref:`checkpoints <tune-checkpoint-syncing>` and logging to :ref:`TensorBoard <tune-logging>`.
* Choose among state of the art algorithms such as :ref:`Population Based Training (PBT) <tune-scheduler-pbt>`, :ref:`BayesOptSearch <bayesopt>`, :ref:`HyperBand/ASHA <tune-scheduler-hyperband>`.
* Move your models from training to serving on the same infrastructure with `Ray Serve`_.
@ -70,7 +70,7 @@ A key problem with machine learning frameworks is the need to restructure all of
With Tune, you can optimize your model just by :ref:`adding a few code snippets <tune-tutorial>`.

Further, Tune actually removes boilerplate from your training workflow, automatically :ref:`managing checkpoints <tune-checkpoint-syncing>` and :ref:`logging results to tools <tune-logging>` such as MLflow and TensorBoard.

Multi-GPU & distributed training out of the box


@ -66,7 +66,7 @@ See the documentation: :ref:`trainable-docs` and :ref:`examples <tune-general-ex
tune.run and Trials
-------------------

Use :ref:`tune.run <tune-run-ref>` to execute hyperparameter tuning. This function manages your experiment and provides many features such as :ref:`logging <tune-logging>`, :ref:`checkpointing <tune-checkpoint-syncing>`, and :ref:`early stopping <tune-stopping>`.

.. code-block:: python
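The body of this code block is collapsed in the diff context; a minimal ``tune.run`` call with a made-up trainable and search space might look like:

.. code-block:: python

    from ray import tune

    def trainable(config):
        # Report a dummy score so Tune has a metric to track.
        tune.report(score=config["x"] ** 2)

    tune.run(
        trainable,
        config={"x": tune.uniform(0, 10)},
        num_samples=10,
    )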


@ -284,10 +284,10 @@ globally could have side effects. For instance, it could influence the
way your dataset is split. Thus, we leave it up to the user to make
these global configuration changes.

.. _tune-checkpoint-syncing:

Checkpointing and synchronization
---------------------------------

When running a hyperparameter search, Tune can automatically and periodically save/checkpoint your model. This allows you to:
@ -295,49 +295,299 @@ When running a hyperparameter search, Tune can automatically and periodically sa
* use pre-emptible machines (by automatically restoring from last checkpoint)
* pause trials when using Trial Schedulers such as HyperBand and PBT

Tune stores checkpoints on the node where the trials are executed. If you are training on more than one node,
this means that some trial checkpoints may be on the head node and others are not.
When trials are restored (e.g. after a failure or when the experiment was paused), they may be scheduled on
different nodes, but still would need access to the latest checkpoint. To make sure this works, Ray Tune
comes with facilities to synchronize trial checkpoints between nodes.
Generally we consider three cases:
1. When using a shared directory (e.g. via NFS)
2. When using cloud storage (e.g. S3 or GS)
3. When using neither
The default option here is 3, which will be automatically used if nothing else is configured.
Using a shared directory
~~~~~~~~~~~~~~~~~~~~~~~~
If all Ray nodes have access to a shared filesystem, e.g. via NFS, they can all write to this directory.
In this case, we don't need any synchronization at all, as it is implicitly done by the operating system.
For this case, we only need to tell Ray Tune not to do any syncing at all (as syncing is the default):
.. code-block:: python

    from ray import tune

    tune.run(
        trainable,
        name="experiment_name",
        local_dir="/path/to/shared/storage/",
        sync_config=tune.SyncConfig(
            syncer=None  # Disable syncing
        )
    )

Note that the driver (on the head node) will have access to all checkpoints locally (in the
shared directory) for further processing.

.. _tune-cloud-checkpointing:

Using cloud storage
~~~~~~~~~~~~~~~~~~~
If all nodes have access to cloud storage, e.g. S3 or GS, the remote trials can automatically synchronize their
checkpoints. We then end up with a similar situation as in the first case,
only that the consolidated directory including all logs and checkpoints lives on cloud storage.

This approach is especially useful when training a large number of distributed trials,
as logs and checkpoints are otherwise synchronized via SSH, which quickly can become a performance bottleneck.

For this case, we tell Ray Tune to use an ``upload_dir`` to store checkpoints at.
This will automatically store both the experiment state and the trial checkpoints at that directory:
.. code-block:: python
from ray import tune
tune.run(
trainable,
name="experiment_name",
sync_config=tune.SyncConfig(
upload_dir="s3://bucket-name/sub-path/"
)
)
We don't have to provide a ``syncer`` here as it will be automatically detected. However, you can provide
a string if you want to use a custom command:
.. code-block:: python
from ray import tune
tune.run(
trainable,
name="experiment_name",
sync_config=tune.SyncConfig(
upload_dir="s3://bucket-name/sub-path/",
syncer="aws s3 sync {source} {target}", # Custom sync command
)
)
If a string is provided, then it must include replacement fields ``{source}`` and ``{target}``,
as demonstrated in the example above.
The consolidated data will be available in the cloud bucket. This means that the driver
(on the head node) will not have access to all checkpoints locally. If you want to process
e.g. the best checkpoint further, you will first have to fetch it from the cloud storage.
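For example, fetching one trial checkpoint from the bucket before processing it could look like the sketch below (bucket name and paths are hypothetical):

.. code-block:: python

    import subprocess

    # Download one trial checkpoint from cloud storage before processing it
    # locally; the S3 path mirrors the experiment/trial directory layout.
    subprocess.check_call([
        "aws", "s3", "sync",
        "s3://bucket-name/sub-path/experiment_name/trial_name/checkpoint_000010",
        "/tmp/best_checkpoint",
    ])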
Default syncing (no shared/cloud storage)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you're using neither a shared filesystem nor cloud storage, Ray Tune will resort to the
default syncing mechanism, which utilizes ``rsync`` (via SSH) to synchronize checkpoints across
nodes.

Please note that this approach is likely the least efficient one - you should always try to use
shared or cloud storage if possible when training on a multi-node cluster.
For the syncing to work, the head node must be able to SSH into the worker nodes. If you are using
the Ray cluster launcher this is usually the case (note that Kubernetes is an exception, but
:ref:`see here for more details <tune-kubernetes>`).
If you don't provide a ``tune.SyncConfig`` at all, rsync-based syncing will be used.
If you want to customize syncing behavior, you can again specify a custom sync template:
.. code-block:: python
from ray import tune
tune.run(
trainable,
name="experiment_name",
sync_config=tune.SyncConfig(
# Do not specify an upload dir here
syncer="rsync -savz -e "ssh -i ssh_key.pem" {source} {target}", # Custom sync command
)
)
Alternatively, a function can be provided with the following signature:
.. code-block:: python
    import subprocess

    def custom_sync_func(source, target):
sync_cmd = "rsync {source} {target}".format(
source=source,
target=target)
sync_process = subprocess.Popen(sync_cmd, shell=True)
sync_process.wait()
tune.run(
trainable,
name="experiment_name",
sync_config=tune.SyncConfig(
syncer=custom_sync_func,
sync_period=60 # Synchronize more often
)
)
When syncing results back to the driver, the source would be a path similar to ``ubuntu@192.0.0.1:/home/ubuntu/ray_results/trial1``, and the target would be a local path.
Note that we adjusted the sync period in the example above. Setting this to a lower number will pull
checkpoints from remote nodes more often. This will lead to more robust trial recovery,
but it will also lead to more synchronization overhead (as SSH is usually slow).
As in the first case, the driver (on the head node) will have access to all checkpoints locally
for further processing.
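For instance, a minimal sketch of driver-side post-processing via the analysis object returned by ``tune.run`` (metric and mode are illustrative):

.. code-block:: python

    from ray import tune

    analysis = tune.run(trainable, metric="mean_loss", mode="min")

    # With default (rsync-based) syncing, the best trial's checkpoint
    # directory is available locally on the head node.
    print("Best checkpoint directory:", analysis.best_checkpoint)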
Checkpointing examples
----------------------
Let's cover how to configure your checkpoint storage location, checkpointing frequency, and how to resume from a previous run.
A simple (cloud) checkpointing example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Cloud storage-backed Tune checkpointing is the recommended best practice for both performance and reliability reasons.
It also enables checkpointing if you are using Ray on Kubernetes, which does not work out of the box with the
default rsync-based syncing (rsync relies on SSH). If you'd rather checkpoint locally or use rsync-based checkpointing, see :ref:`here <rsync-checkpointing>`.
Prerequisites to use cloud checkpointing in Ray Tune for the example below:
Your ``my_trainable`` is either a:
1. **Model with an existing Ray integration**
* XGBoost (:ref:`example <xgboost-ray-tuning>`)
* Pytorch (:ref:`example <tune-pytorch-lightning>`)
* Pytorch Lightning (:ref:`example <ray-lightning-tuning>`)
* Keras (:doc:`example </tune/examples/tune_mnist_keras>`)
* Tensorflow (:ref:`example <ray-train-tftrainer-example>`)
* LightGBM (:ref:`example <lightgbm-ray-tuning>`)
2. **Custom training function**
* All this means is that your function has to expose a ``checkpoint_dir`` argument in the function signature, and call ``tune.checkpoint_dir``. See :doc:`this example </tune/examples/custom_func_checkpointing>`; it's quite simple to do.
Let's assume for this example you're running this script from your laptop, and connecting to your remote Ray cluster via ``ray.init()``, making your script on your laptop the "driver".
.. code-block:: python
import ray
from ray import tune
from your_module import my_trainable
ray.init(address="<cluster-IP>:<port>") # set `address=None` to train on laptop
# configure how checkpoints are sync'd to the scheduler/sampler
# we recommend cloud storage checkpointing as it survives the cluster when
# instances are terminated, and has better performance
    sync_config = tune.SyncConfig(
upload_dir="s3://my-checkpoints-bucket/path/", # requires AWS credentials
)
# this starts the run!
tune.run(
my_trainable,
# name of your experiment
name="my-tune-exp",
# a directory where results are stored before being
# sync'd to head node/cloud storage
local_dir="/tmp/mypath",
# see above! we will sync our checkpoints to S3 directory
sync_config=sync_config,
        # we'll keep the best five checkpoints at all times
        # (ranked by the "auc" metric reported by the trainable; larger is better)
        checkpoint_score_attr="auc",
        keep_checkpoints_num=5,
# a very useful trick! this will resume from the last run specified by
# sync_config (if one exists), otherwise it will start a new tuning run
resume="AUTO",
)
In this example, checkpoints will be saved:
* **Locally**: not saved! Nothing will be sync'd to the driver (your laptop) automatically (because cloud syncing is enabled)
* **S3**: ``s3://my-checkpoints-bucket/path/my-tune-exp/<trial_name>/checkpoint_<step>``
* **On head node**: ``/tmp/mypath/my-tune-exp/<trial_name>/checkpoint_<step>`` (but only for trials run on that node)
* **On worker nodes**: ``/tmp/mypath/my-tune-exp/<trial_name>/checkpoint_<step>`` (but only for trials run on that node)

If your run stopped for any reason (finished, errored, user CTRL+C), you can restart it any time by running the script above again -- note with ``resume="AUTO"``, it will detect the previous run so long as the ``sync_config`` points to the same location.

If, however, you prefer not to use ``resume="AUTO"`` (or are on an older version of Ray) you can resume manually:
.. code-block:: python

    # Restored previous trial from the given checkpoint
    tune.run(
        # our same trainable as before
        my_trainable,
        # The name can be different from your original name
        name="my-tune-exp-restart",
        # restore a single trial from its checkpoint directory,
        # e.g. one of the paths listed above
        restore="/tmp/mypath/my-tune-exp/<trial_name>/checkpoint_<step>",
    )
.. _rsync-checkpointing:
A simple local/rsync checkpointing example
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Local or rsync checkpointing can be a good option if:
1. You want to tune on a single laptop Ray cluster
2. You aren't using Ray on Kubernetes (rsync doesn't work with Ray on Kubernetes)
3. You don't want to use S3
Let's take a look at an example:
.. code-block:: python
import ray
from ray import tune
from your_module import my_trainable
ray.init(address="<cluster-IP>:<port>") # set `address=None` to train on laptop
# configure how checkpoints are sync'd to the scheduler/sampler
    sync_config = tune.SyncConfig()  # the default mode is to use rsync
# this starts the run!
tune.run(
my_trainable,
# name of your experiment
name="my-tune-exp",
# a directory where results are stored before being
# sync'd to head node/cloud storage
local_dir="/tmp/mypath",
# sync our checkpoints via rsync
# you don't have to pass an empty sync config - but we
# do it here for clarity and comparison
sync_config=sync_config,
        # we'll keep the best five checkpoints at all times
        # (ranked by the "auc" metric reported by the trainable; larger is better)
        checkpoint_score_attr="auc",
keep_checkpoints_num=5,
# a very useful trick! this will resume from the last run specified by
# sync_config (if one exists), otherwise it will start a new tuning run
resume="AUTO",
)
.. _tune-distributed-checkpointing:
@ -346,7 +596,7 @@ Distributed Checkpointing
On a multinode cluster, Tune automatically creates a copy of all trial checkpoints on the head node. This requires the Ray cluster to be started with the :ref:`cluster launcher <cluster-cloud>` and also requires rsync to be installed.

Note that you must use the ``tune.checkpoint_dir`` API to trigger syncing (or use a model type with a built-in Ray Tune integration as described here). See :doc:`/tune/examples/custom_func_checkpointing` for an example.

If you are running Ray Tune on Kubernetes, you should usually use
:ref:`cloud checkpointing <tune-sync-config>` or a shared filesystem for checkpoint sharing.
@ -466,7 +716,6 @@ For more flexibility, you can pass in a function instead. If a function is passe
.. code-block:: python

    def stopper(trial_id, result):
        return result["mean_accuracy"] / result["training_iteration"] > 5
@ -650,9 +899,9 @@ If a string is provided, then it must include replacement fields ``{source}`` an
        sync_process = subprocess.Popen(sync_cmd, shell=True)
        sync_process.wait()

By default, syncing occurs every 300 seconds. To change the frequency of syncing, set the ``sync_period`` attribute of the sync config to the desired syncing period.

Note that uploading only happens when global experiment state is collected, and the frequency of this is determined by the sync period. So the true upload period is given by ``max(sync period, TUNE_GLOBAL_CHECKPOINT_S)``.
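For example, to sync less often than the 300 second default (the values here are arbitrary):

.. code-block:: python

    from ray import tune

    tune.run(
        trainable,
        sync_config=tune.SyncConfig(
            upload_dir="s3://bucket-name/sub-path/",
            sync_period=600,  # sync every 10 minutes instead of every 5
        )
    )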
Make sure that worker nodes have write access to the cloud storage. Failing to do so would cause error messages like ``Error message (1): fatal error: Unable to locate credentials``.

For AWS setup, this involves adding an IamInstanceProfile configuration for worker nodes. Please :ref:`see here for more tips <aws-cluster-s3>`.


@ -6,6 +6,7 @@
- Get rid of the badges in the top
- Get rid of the references section at the bottom
- Be sure not to delete the API reference section in the bottom of this file.
- add `.. _xgboost-ray-tuning:` before the "Hyperparameter Tuning" section
- Adjust some link targets (e.g. for "Ray Tune") to anonymous references
  by adding a second underscore (use `target <link>`__)
- Search for `\ **` and delete this from the links (bold links are not supported)
@ -220,6 +221,9 @@ Example loading multiple parquet files:
        columns=columns,
        filetype=RayFileType.PARQUET)

.. _xgboost-ray-tuning:

Hyperparameter Tuning
---------------------


@ -430,6 +430,15 @@ py_test(
args = ["--smoke-test"] args = ["--smoke-test"]
) )
py_test(
name = "custom_func_checkpointing",
size = "small",
srcs = ["examples/custom_func_checkpointing.py"],
deps = [":tune_lib"],
tags = ["team:ml", "exclusive", "example"],
args = ["--smoke-test"]
)
py_test(
    name = "test_torch_trainable",
    size = "medium",


@ -17,6 +17,7 @@ General Examples
- `PBT with Function API <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_function.py>`__: Example of using the function API with a PopulationBasedTraining scheduler.
- `pbt_ppo_example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_ppo_example.py>`__: Example of optimizing a distributed RLlib algorithm (PPO) with the PopulationBasedTraining scheduler.
- `logging_example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/logging_example.py>`__: Example of custom loggers and custom trial directory naming.
- `custom_func_checkpointing <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/custom_func_checkpointing.py>`__: Example of custom checkpointing logic using the function API.

Search Algorithm Examples
-------------------------


@ -0,0 +1,75 @@
# If you want to use checkpointing with a custom training function (not a Ray
# integration like PyTorch or Tensorflow), you must expose a
# ``checkpoint_dir`` argument in the function signature, and call
# ``tune.checkpoint_dir``:
import os
import time
import json
import argparse
from ray import tune
def evaluation_fn(step, width, height):
time.sleep(0.1)
return (0.1 + width * step / 100)**(-1) + height * 0.1
def train_func(config, checkpoint_dir=None):
start = 0
width, height = config["width"], config["height"]
if checkpoint_dir:
with open(os.path.join(checkpoint_dir, "checkpoint")) as f:
state = json.loads(f.read())
start = state["step"] + 1
for step in range(start, 100):
intermediate_score = evaluation_fn(step, width, height)
# Obtain a checkpoint directory
with tune.checkpoint_dir(step=step) as checkpoint_dir:
path = os.path.join(checkpoint_dir, "checkpoint")
with open(path, "w") as f:
f.write(json.dumps({"step": step}))
tune.report(iterations=step, mean_loss=intermediate_score)
# You can restore a single trial checkpoint by using
# ``tune.run(restore=<checkpoint_dir>)`` By doing this, you can change
# whatever experiments' configuration such as the experiment's name.
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--smoke-test", action="store_true", help="Finish quickly for testing")
parser.add_argument(
"--server-address",
type=str,
default=None,
required=False,
help="The address of server to connect to if using "
"Ray Client.")
args, _ = parser.parse_known_args()
if args.server_address:
import ray
ray.init(f"ray://{args.server_address}")
analysis = tune.run(
train_func,
name="hyperband_test",
metric="mean_loss",
mode="min",
num_samples=5,
stop={"training_iteration": 1 if args.smoke_test else 10},
config={
"steps": 10,
"width": tune.randint(10, 100),
"height": tune.loguniform(10, 100)
})
print("Best hyperparameters: ", analysis.best_config)
print("Best checkpoint directory: ", analysis.best_checkpoint)
with open(os.path.join(analysis.best_checkpoint, "checkpoint"), "r") as f:
print("Best checkpoint: ", json.load(f))


@ -220,7 +220,7 @@ class ResourceChangingScheduler(TrialScheduler):
    If the functional API is used, the current trial resources can be obtained
    by calling `tune.get_trial_resources()` inside the training function.
    The function should be able to
    :ref:`load and save checkpoints <tune-checkpoint-syncing>`
    (the latter preferably every iteration).

    If the Trainable (class) API is used, when the resources of a