.. _tune-faq:

Ray Tune FAQ
------------

Here we try to answer questions that come up often. If you still have questions
after reading this, let us know!

.. contents::
    :local:
    :depth: 1

Which search algorithm/scheduler should I choose?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ray Tune offers :ref:`many different search algorithms <tune-search-alg>`
and :ref:`schedulers <tune-schedulers>`.
Deciding which one to use mostly depends on your problem:

* Is it a small or large problem (how long does it take to train? How costly
  are the resources, like GPUs)? Can you run many trials in parallel?
* How many hyperparameters would you like to tune?
* What values are valid for hyperparameters?

**If your model returns incremental results** (e.g. results per epoch in deep learning,
results per each added tree in GBDTs, etc.), using early stopping usually allows for sampling
more configurations, as unpromising trials are pruned before they run their full course.
Please note that not all search algorithms can use information from pruned trials.
Early stopping cannot be used without incremental results. In the case of the functional API,
this means that ``tune.report()`` has to be called more than once, usually in a loop.

**If your model is small**, you can usually try to run many different configurations.
A **random search** can be used to generate configurations. You can also grid search
over some values. You should probably still use
:ref:`ASHA for early termination of bad trials <tune-scheduler-hyperband>` (if your problem
supports early stopping).

**If your model is large**, you can try to either use
**Bayesian Optimization-based search algorithms** like :ref:`BayesOpt <bayesopt>` or
:ref:`Dragonfly <Dragonfly>` to get good parameter configurations after a few
trials. :ref:`Ax <tune-ax>` is similar but more robust to noisy data.
Please note that these algorithms only work well with **a small number of hyperparameters**.
Alternatively, you can use :ref:`Population Based Training <tune-scheduler-pbt>`, which
works well with few trials, e.g. 8 or even 4. However, this will output a hyperparameter *schedule* rather
than one fixed set of hyperparameters.

**If you have a small number of hyperparameters**, Bayesian Optimization methods
work well. Take a look at :ref:`BOHB <tune-scheduler-bohb>` or :ref:`Optuna <tune-optuna>`
with the :ref:`ASHA <tune-scheduler-hyperband>` scheduler to combine the
benefits of Bayesian Optimization with early stopping.

**If you only have continuous values for hyperparameters**, most Bayesian Optimization
methods will work well. Discrete or categorical variables still
work, but less well with an increasing number of categories.

**If you have many categorical values for hyperparameters**, consider using random search,
or a TPE-based Bayesian Optimization algorithm such as :ref:`Optuna <tune-optuna>` or
:ref:`HyperOpt <tune-hyperopt>`.

**Our go-to solution** is usually to use **random search** with :ref:`ASHA for early stopping <tune-scheduler-hyperband>`
for smaller problems. Use :ref:`BOHB <tune-scheduler-bohb>` for **larger problems** with a **small number of hyperparameters**
and :ref:`Population Based Training <tune-scheduler-pbt>` for **larger problems** with a **large number of hyperparameters**,
if a learning schedule is acceptable.
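
As an illustration of the "Bayesian Optimization with early stopping" combination mentioned
above, here is a minimal sketch pairing Optuna with the ASHA scheduler. The trainable
``train_fn``, the ``lr`` search space, and the ``mean_loss`` metric are assumptions for
this example:

.. code-block:: python

    from ray import tune
    from ray.tune.suggest.optuna import OptunaSearch
    from ray.tune.schedulers import ASHAScheduler

    tune.run(
        train_fn,  # your trainable, reporting "mean_loss" incrementally
        config={"lr": tune.loguniform(1e-5, 1e-1)},
        metric="mean_loss",
        mode="min",
        search_alg=OptunaSearch(),
        scheduler=ASHAScheduler(),
        num_samples=50,
    )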

How do I choose hyperparameter ranges?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A good start is to look at the papers that introduced the algorithms, and also
to see what other people are using.

Most algorithms also have sensible defaults for some of their parameters.
For instance, `XGBoost's parameter overview <https://xgboost.readthedocs.io/en/latest/parameter.html>`_
reports ``max_depth=6`` as the default maximum decision tree depth. Here, anything
between 2 and 10 might make sense (though that naturally depends on your problem).

For **learning rates**, we suggest using a **loguniform distribution** between
**1e-5** and **1e-1**: ``tune.loguniform(1e-5, 1e-1)``.

For **batch sizes**, we suggest trying **powers of 2**, for instance, 2, 4, 8,
16, 32, 64, 128, 256, etc. The magnitude depends on your problem. For easy
problems with lots of data, use higher batch sizes; for harder problems with
not so much data, use lower batch sizes.

For **layer sizes**, we also suggest trying **powers of 2**. For small problems
(e.g. Cartpole), use smaller layer sizes. For larger problems, try larger ones.

For **discount factors** in reinforcement learning, we suggest sampling uniformly
between 0.9 and 1.0. Depending on the problem, a much stricter range above 0.97
or even above 0.99 can make sense (e.g. for Atari).
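
Putting these rules of thumb together, a search space might look like the following
sketch (the parameter names are assumptions; adapt them to your model):

.. code-block:: python

    from ray import tune

    config = {
        "lr": tune.loguniform(1e-5, 1e-1),             # learning rate
        "batch_size": tune.choice([16, 32, 64, 128]),  # powers of 2
        "hidden_size": tune.choice([64, 128, 256]),    # powers of 2
        "gamma": tune.uniform(0.9, 1.0),               # RL discount factor
    }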

How can I use nested/conditional search spaces?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sometimes you might need to define parameters whose values depend on the values
of other parameters. Ray Tune offers some methods to define these.

Nested spaces
'''''''''''''
You can nest hyperparameter definitions in sub-dictionaries:

.. code-block:: python

    config = {
        "a": {
            "x": tune.uniform(0, 10)
        },
        "b": tune.choice([1, 2, 3])
    }

The trial config will be nested exactly like the input config.
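
In your trainable, the nested values are then accessed just like entries of a regular
nested dict, as in this brief sketch:

.. code-block:: python

    def train_fn(config):
        x = config["a"]["x"]  # nested value
        b = config["b"]
        # ... use x and b in training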

Conditional spaces
''''''''''''''''''
:ref:`Custom and conditional search spaces are explained in detail here <tune_custom-search>`.
In short, you can pass custom functions to ``tune.sample_from()`` that can
return values that depend on other values:

.. code-block:: python

    config = {
        "a": tune.randint(5, 10),
        "b": tune.sample_from(lambda spec: np.random.randint(0, spec.config.a))
    }

Conditional grid search
'''''''''''''''''''''''
If you would like to grid search over two parameters that depend on each other,
this might not work out of the box. For instance, say that *a* should be a value
between 5 and 10 and *b* should be a value between 0 and *a*. In this case, we
cannot use ``tune.sample_from`` because it doesn't support grid searching.

The solution here is to create a list of valid *tuples* with the help of a
helper function, like this:

.. code-block:: python

    def _iter():
        for a in range(5, 10):
            for b in range(a):
                yield a, b

    config = {
        "ab": tune.grid_search(list(_iter())),
    }

Your trainable can then do something like ``a, b = config["ab"]`` to split
the ``a`` and ``b`` variables and use them afterwards.

How does early termination (e.g. Hyperband/ASHA) work?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Early termination algorithms look at the intermediate values reported to
them, e.g. via ``tune.report()`` after each training
epoch. After a certain number of steps, they then remove the worst
performing trials and keep only the best performing trials. The goodness of a trial
is determined by ordering trials by the objective metric, for instance accuracy
or loss.

In ASHA, you can decide how many trials are early terminated.
``reduction_factor=4`` means that only 25% of all trials are kept each
time they are reduced. With ``grace_period=n`` you can force ASHA to
train each trial at least for ``n`` epochs.
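
For instance, a minimal sketch of configuring ASHA with these two parameters could look
like this (the trainable ``train_fn`` and the ``mean_loss`` metric are assumptions):

.. code-block:: python

    from ray import tune
    from ray.tune.schedulers import ASHAScheduler

    # Keep only the best 25% of trials at each reduction, but train every
    # trial for at least 10 epochs before it can be terminated.
    asha = ASHAScheduler(grace_period=10, reduction_factor=4)

    tune.run(
        train_fn,
        metric="mean_loss",
        mode="min",
        scheduler=asha,
        num_samples=100,
    )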

Why are all my trials returning "1" iteration?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**This mostly applies to the Tune function API.**

Ray Tune counts iterations internally every time ``tune.report()`` is
called. If you only call ``tune.report()`` once at the end of the training,
the counter has only been incremented once. If you're using the class API,
the counter is increased after each call to ``step()``.

Note that it might make sense to report metrics more often than once. For
instance, if you train your algorithm for 1000 timesteps, consider reporting
intermediate performance values every 100 steps. That way, schedulers
like Hyperband/ASHA can terminate badly performing trials early.
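
A sketch of what this could look like with the function API, assuming a hypothetical
``train_one_step()`` helper:

.. code-block:: python

    from ray import tune

    def train_fn(config):
        for step in range(1000):
            loss = train_one_step(config)  # hypothetical training step
            if (step + 1) % 100 == 0:
                # Each call increments `training_iteration` by one.
                tune.report(mean_loss=loss)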

What are all these extra outputs?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You'll notice that Ray Tune not only reports hyperparameters (from the
``config``) or metrics (passed to ``tune.report()``), but also some other
outputs.

.. code-block:: bash

    Result for easy_objective_c64c9112:
      date: 2020-10-07_13-29-18
      done: false
      experiment_id: 6edc31257b564bf8985afeec1df618ee
      experiment_tag: 7_activation=tanh,height=-53.116,steps=100,width=13.885
      hostname: ubuntu
      iterations: 0
      iterations_since_restore: 1
      mean_loss: 4.688385317424468
      neg_mean_loss: -4.688385317424468
      node_ip: 192.168.1.115
      pid: 5973
      time_since_restore: 7.605552673339844e-05
      time_this_iter_s: 7.605552673339844e-05
      time_total_s: 7.605552673339844e-05
      timestamp: 1602102558
      timesteps_since_restore: 0
      training_iteration: 1
      trial_id: c64c9112

See the :ref:`tune-autofilled-metrics` section for a glossary.

How do I set resources?
~~~~~~~~~~~~~~~~~~~~~~~
If you want to allocate specific resources to a trial, you can use the
``resources_per_trial`` parameter of ``tune.run()``, to which you can pass
a dict or a :class:`PlacementGroupFactory <ray.tune.utils.placement_groups.PlacementGroupFactory>` object:

.. code-block:: python

    tune.run(
        train_fn,
        resources_per_trial={
            "cpu": 2,
            "gpu": 0.5,
            "custom_resources": {"hdd": 80}
        }
    )

The example above showcases three things:

1. The ``cpu`` and ``gpu`` options set how many CPUs and GPUs are available for
   each trial, respectively. **Trials cannot request more resources** than these
   (exception: see 3).
2. It is possible to request **fractional GPUs**. A value of 0.5 means that
   half of the memory of the GPU is made available to the trial. You will have
   to make sure yourself that your model still fits within the fractional memory.
3. You can request custom resources that you supplied to Ray when starting the cluster.
   Trials will only be scheduled on single nodes that can provide all the resources you
   requested.

One important thing to keep in mind is that each Ray worker (and thus each
Ray Tune trial) will only be scheduled on **one machine**. That means that if
you, for instance, request 2 GPUs for your trial, but your cluster consists
of 4 machines with 1 GPU each, the trial will never be scheduled.

In other words, you will have to make sure that your Ray cluster
has machines that can actually fulfill your resource requests.

In some cases your trainable might want to start other remote actors, for instance if you're
leveraging distributed training via Ray Train. In these cases, you can use
:ref:`placement groups <ray-placement-group-doc-ref>` to request additional resources:

.. code-block:: python

    tune.run(
        train_fn,
        resources_per_trial=tune.PlacementGroupFactory([
            {"CPU": 2, "GPU": 0.5, "hdd": 80},
            {"CPU": 1},
            {"CPU": 1},
        ], strategy="PACK")
    )

Here, you're requesting 2 additional CPUs for remote tasks. These two additional
actors do not necessarily have to live on the same node as your main trainable.
In fact, you can control this via the ``strategy`` parameter. In this example, ``PACK``
will try to schedule the actors on the same node, but allows them to be scheduled
on other nodes as well. Please refer to the
:ref:`placement groups documentation <ray-placement-group-doc-ref>` to learn more
about these placement strategies.

Why is my training stuck and Ray reporting that pending actors or tasks cannot be scheduled?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is usually caused by Ray actors or tasks being started by the
trainable without the trainable resources accounting for them, leading to a deadlock.
This can also be caused "stealthily" by using other libraries in the trainable that are
based on Ray, such as Modin. In order to fix the issue, request additional resources for
the trial using :ref:`placement groups <ray-placement-group-doc-ref>`, as outlined in
the section above.

For example, if your trainable is using Modin dataframes, operations on those will spawn
Ray tasks. By allocating an additional CPU bundle to the trial, those tasks will be able
to run without being starved of resources.

.. code-block:: python

    import modin.pandas as pd

    def train_fn(config, checkpoint_dir=None):
        # some Modin operations here
        tune.report(metric=metric)

    tune.run(
        train_fn,
        resources_per_trial=tune.PlacementGroupFactory([
            {"CPU": 1},  # this bundle will be used by the trainable itself
            {"CPU": 1},  # this bundle will be used by Modin
        ], strategy="PACK")
    )

How can I pass further parameter values to my trainable?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ray Tune expects your trainable functions to accept only up to two parameters,
``config`` and ``checkpoint_dir``. But sometimes there are cases where
you want to pass constant arguments, like the number of epochs to run,
or a dataset to train on. Ray Tune offers a wrapper function to achieve
just that, called :func:`tune.with_parameters() <ray.tune.with_parameters>`:

.. code-block:: python

    from ray import tune

    import numpy as np

    def train(config, checkpoint_dir=None, num_epochs=10, data=None):
        for i in range(num_epochs):
            for sample in data:
                # ... train on sample
                pass

    # Some huge dataset
    data = np.random.random(size=100000000)

    tune.run(
        tune.with_parameters(train, num_epochs=10, data=data))

This function works similarly to ``functools.partial``, but it stores
the parameters directly in the Ray object store. This means that you
can pass even huge objects like datasets, and Ray makes sure that these
are efficiently stored and retrieved on your cluster machines.

:func:`tune.with_parameters() <ray.tune.with_parameters>`
also works with class trainables. Please see
:ref:`here for further details <tune-with-parameters>` and examples.

How can I reproduce experiments?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Reproducing experiments and experiment results means that you get the exact same
results when running an experiment again and again. To achieve this, the
conditions have to be exactly the same each time you run the experiment.
In terms of ML training and tuning, this mostly concerns
the random number generators that are used for sampling in various places of the
training and tuning lifecycle.

Random number generators are used to create randomness, for instance to sample a
value for a hyperparameter you defined. There is no true randomness in computing; rather,
there are sophisticated algorithms that generate numbers that *seem* to be random and
fulfill all properties of a random distribution. These algorithms can be *seeded* with
an initial state, after which the generated random numbers are always the same.

.. code-block:: python

    import random
    random.seed(1234)
    print([random.randint(0, 100) for _ in range(10)])

    # The output of this will always be
    # [99, 56, 14, 0, 11, 74, 4, 85, 88, 10]

The most commonly used random number generators from Python libraries are those in the
native ``random`` submodule and the ``numpy.random`` module.

.. code-block:: python

    # This should suffice to initialize the RNGs for most Python-based libraries
    import random
    import numpy as np
    random.seed(1234)
    np.random.seed(5678)

In your tuning and training run, there are several places where randomness occurs, and
at all these places we will have to introduce seeds to make sure we get the same behavior.

* **Search algorithm**: Search algorithms have to be seeded to generate the same
  hyperparameter configurations in each run. Some search algorithms can be explicitly instantiated with a
  random seed (look for a ``seed`` parameter in the constructor). For others, try to use
  the above code block.
* **Schedulers**: Schedulers like Population Based Training rely on resampling some
  of the parameters, requiring randomness. Use the code block above to set the initial
  seeds.
* **Training function**: In addition to initializing the configurations, the training
  functions themselves have to use seeds. This could concern e.g. the data splitting.
  You should make sure to set the seed at the start of your training function.

PyTorch and TensorFlow use their own RNGs, which have to be initialized, too:

.. code-block:: python

    import torch
    torch.manual_seed(0)

    import tensorflow as tf
    tf.random.set_seed(0)

You should thus seed both Ray Tune's schedulers and search algorithms, and the
training code. The schedulers and search algorithms should always be seeded with the
same seed. This is also true for the training code, but often it is beneficial that
the seeds differ *between different training runs*.

Here's a blueprint for how to do all this in your training code:

.. code-block:: python

    import random
    import numpy as np
    from ray import tune


    def trainable(config):
        # config["seed"] is set deterministically, but differs between training runs
        random.seed(config["seed"])
        np.random.seed(config["seed"])
        # torch.manual_seed(config["seed"])
        # ... training code


    config = {
        "seed": tune.randint(0, 10000),
        # ...
    }

    if __name__ == "__main__":
        # Set seed for the search algorithms/schedulers
        random.seed(1234)
        np.random.seed(1234)
        # Don't forget to check if the search alg has a `seed` parameter
        tune.run(
            trainable,
            config=config
        )

**Please note** that it is not always possible to control all sources of non-determinism.
For instance, if you use schedulers like ASHA or PBT, some trials might finish earlier
than other trials, affecting the behavior of the schedulers. Which trials finish first
can however depend on the current system load, network communication, or other factors
in the environment that we cannot control with random seeds. This is also true for search
algorithms such as Bayesian Optimization, which take previous results into account when
sampling new configurations. This can be tackled by
using the **synchronous modes** of PBT and Hyperband, where the schedulers wait for all trials to
finish an epoch before deciding which trials to promote.

We strongly advise you to try reproducing results on smaller toy problems first before
relying on reproducibility for larger experiments.

.. _tune-bottlenecks:

How can I avoid bottlenecks?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Sometimes you might run into a message like this:

.. code-block::

    The `experiment_checkpoint` operation took 2.43 seconds to complete, which may be a performance bottleneck

Most commonly, the ``experiment_checkpoint`` operation is throwing this warning, but it might be something else,
like ``process_trial_result``.

These operations should usually take less than 500ms to complete. When they consistently take longer, this might
indicate a problem or inefficiencies. To get rid of this message, it is important to understand where it comes
from.

These are the main reasons this problem comes up:

**The trial config is very large**

This is the case if you e.g. try to pass a dataset or other large object via the ``config`` parameter.
If this is the case, the dataset is serialized and written to disk repeatedly during experiment
checkpointing, which takes a long time.

**Solution**: Use :func:`tune.with_parameters <ray.tune.with_parameters>` to pass large objects to
function trainables via the object store. For class trainables, you can do this manually via ``ray.put()``
and ``ray.get()``. If you need to pass a class definition, consider passing an
indicator (e.g. a string) instead and let the trainable select the class. Generally, your config
dictionary should only contain primitive types, like numbers or strings.
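
For class trainables, a minimal sketch of the manual ``ray.put()``/``ray.get()`` approach
could look like this (``load_large_dataset()`` is a hypothetical helper):

.. code-block:: python

    import ray
    from ray import tune

    # Store the large object once; only the small reference goes into the config.
    data_ref = ray.put(load_large_dataset())  # hypothetical loader

    class MyTrainable(tune.Trainable):
        def setup(self, config):
            # Retrieve the dataset from the object store inside the trainable.
            self.data = ray.get(config["data_ref"])

    tune.run(MyTrainable, config={"data_ref": data_ref})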

**The trial result is very large**

This is the case if you return large data or other sizable objects via the return value of ``step()`` in
your class trainable or via ``tune.report()`` in your function trainable. The effect is the same as above:
the results are repeatedly serialized and written to disk, and this can take a long time.

**Solution**: Usually you should be able to write data to the trial directory instead. You can then pass a
filename back (or assume it is at a known location). The trial dir is usually the current working directory. Class
trainables have the ``Trainable.logdir`` property, and function trainables can use the :func:`ray.tune.get_trial_dir`
function to retrieve the logdir. If you really have to, you can also ``ray.put()`` an object into the Ray
object store and retrieve it with ``ray.get()`` on the other side. Generally, your result dictionary
should only contain primitive types, like numbers or strings.
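
A brief sketch of this pattern for a function trainable (the ``loss`` value and
``predictions`` array are dummy stand-ins for whatever your training produces):

.. code-block:: python

    import os
    import numpy as np
    from ray import tune

    def train_fn(config, checkpoint_dir=None):
        # Dummy stand-ins for real training outputs.
        loss = 0.5
        predictions = np.zeros(10_000_000)
        # Write the bulky artifact to the trial directory ...
        out_path = os.path.join(tune.get_trial_dir(), "predictions.npy")
        np.save(out_path, predictions)
        # ... and report only small values, passing the artifact by filename.
        tune.report(mean_loss=loss, predictions_file=out_path)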

**You are training a large number of trials on a cluster, or you are saving huge checkpoints**

Checkpoints and logs are synced between nodes - usually at least to the driver on the head node,
but sometimes between worker nodes if needed (e.g. when
using :ref:`Population Based Training <tune-scheduler-pbt>`). If these checkpoints are very large (e.g. for
NLP models), or if you are training a large number of trials, this syncing can take a long time.

If nothing else is specified, syncing happens via SSH, which can lead to network overhead as connections are
not kept open by Ray Tune.

**Solution**: There are multiple solutions, depending on your needs:

1. You can disable syncing to the driver in the :class:`tune.SyncConfig <ray.tune.SyncConfig>`. In this case,
   logs and checkpoints will not be synced to the driver, so if you need to access them later, you will have to
   transfer them to where you need them manually.

2. You can use :ref:`cloud checkpointing <tune-cloud-checkpointing>` to save logs and checkpoints to a specified
   ``upload_dir`` (see the sketch below).
   This is the preferred way to deal with this. All syncing will be taken care of automatically, as all nodes
   are able to access the cloud storage. Additionally, your results will be safe, so even when you're working on
   preemptible instances, you won't lose any of your data.
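
A minimal sketch of the cloud checkpointing option, assuming a trainable ``train_fn``
and an S3 bucket you control (the bucket name is hypothetical):

.. code-block:: python

    from ray import tune

    tune.run(
        train_fn,
        sync_config=tune.SyncConfig(
            upload_dir="s3://my-bucket/tune-results"  # hypothetical bucket
        ),
    )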

**You are reporting results too often**

Each result is processed by the search algorithm, trial scheduler, and callbacks (including loggers and the
trial syncer). If you're reporting a large number of results per trial (e.g. multiple results per second),
this can take a long time.

**Solution**: The solution here is obvious: just don't report results that often. In class trainables, ``step()``
should maybe process a larger chunk of data. In function trainables, you can report only every n-th iteration
of the training loop. Try to report only as many results as you really need to make scheduling or searching
decisions. If you need more fine-grained metrics for logging or tracking, consider using a separate logging
mechanism for this instead of the Ray Tune-provided progress logging of results.