A Guide To Logging & Outputs in Tune
====================================

By default, Tune logs results to TensorBoard, CSV, and JSON formats.
If you need to log something lower level like model weights or gradients, see :ref:`Trainable Logging <trainable-logging>`.
You can learn more about logging and customizations here: :ref:`loggers-docstring`.

.. _tune-logging:

How to configure logging in Tune?
---------------------------------

Tune will log the results of each trial to a sub-folder under a specified local dir, which defaults to ``~/ray_results``.

.. code-block:: python

    # This logs to two different trial folders:
    # ~/ray_results/trainable_name/trial_name_1 and ~/ray_results/trainable_name/trial_name_2
    # trainable_name and trial_name are autogenerated.
    tune.run(trainable, num_samples=2)

You can specify the ``local_dir`` and experiment ``name``:

.. code-block:: python

    # This logs to 2 different trial folders:
    # ./results/test_experiment/trial_name_1 and ./results/test_experiment/trial_name_2
    # Only trial_name is autogenerated.
    tune.run(trainable, num_samples=2, local_dir="./results", name="test_experiment")

To specify custom trial folder names, you can pass the ``trial_name_creator`` argument to ``tune.run``.
This takes a function with the following signature:

.. code-block:: python

    def trial_name_string(trial):
        """
        Args:
            trial (Trial): A generated trial object.

        Returns:
            trial_name (str): String representation of Trial.
        """
        return str(trial)

    tune.run(
        MyTrainableClass,
        name="example-experiment",
        num_samples=1,
        trial_name_creator=trial_name_string
    )

To learn more about Trials, see its detailed API documentation: :ref:`trial-docstring`.
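
As a concrete sketch, a name function can build a shorter, more readable folder name. This is only an illustration: it relies on the ``trainable_name`` and ``trial_id`` attributes of the ``Trial`` object, and ``short_trial_name`` is a hypothetical helper, not part of Tune.

```python
# A minimal sketch of a trial_name_creator function. It assumes only
# two Trial attributes: trial.trainable_name and trial.trial_id.
def short_trial_name(trial):
    # Yields names like "MyTrainableClass_a826033a" instead of the
    # full default trial representation.
    return f"{trial.trainable_name}_{trial.trial_id}"
```

You would pass it as ``tune.run(..., trial_name_creator=short_trial_name)``, just like ``trial_name_string`` above.
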
.. _tensorboard:

How to log to TensorBoard?
--------------------------

Tune automatically outputs TensorBoard files during ``tune.run``.
To visualize learning in TensorBoard, install tensorboardX:

.. code-block:: bash

    $ pip install tensorboardX

Then, after you run an experiment, you can visualize your experiment with TensorBoard by specifying
the output directory of your results.

.. code-block:: bash

    $ tensorboard --logdir=~/ray_results/my_experiment

If you are running Ray on a remote multi-user cluster where you do not have sudo access,
you can run the following commands to make sure TensorBoard is able to write to the tmp directory:

.. code-block:: bash

    $ export TMPDIR=/tmp/$USER; mkdir -p $TMPDIR; tensorboard --logdir=~/ray_results

.. image:: ../images/ray-tune-tensorboard.png

If using TensorFlow ``2.x``, Tune also automatically generates TensorBoard HParams output, as shown below:

.. code-block:: python

    tune.run(
        ...,
        config={
            "lr": tune.grid_search([1e-5, 1e-4]),
            "momentum": tune.grid_search([0, 0.9])
        }
    )

.. image:: ../../images/tune-hparams.png

.. _tune-console-output:

How to control console output?
------------------------------

User-provided fields are output automatically on a best-effort basis.
You can use a :ref:`Reporter <tune-reporter-doc>` object to customize the console output.

.. code-block:: bash

    == Status ==
    Memory usage on this node: 11.4/16.0 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 4/12 CPUs, 0/0 GPUs, 0.0/3.17 GiB heap, 0.0/1.07 GiB objects
    Result logdir: /Users/foo/ray_results/myexp
    Number of trials: 4 (4 RUNNING)
    +----------------------+----------+---------------------+-----------+--------+--------+----------------+-------+
    | Trial name           | status   | loc                 |    param1 | param2 |    acc | total time (s) |  iter |
    |----------------------+----------+---------------------+-----------+--------+--------+----------------+-------|
    | MyTrainable_a826033a | RUNNING  | 10.234.98.164:31115 | 0.303706  | 0.0761 | 0.1289 |        7.54952 |    15 |
    | MyTrainable_a8263fc6 | RUNNING  | 10.234.98.164:31117 | 0.929276  | 0.158  | 0.4865 |         7.0501 |    14 |
    | MyTrainable_a8267914 | RUNNING  | 10.234.98.164:31111 | 0.068426  | 0.0319 | 0.9585 |         7.0477 |    14 |
    | MyTrainable_a826b7bc | RUNNING  | 10.234.98.164:31112 | 0.729127  | 0.0748 | 0.1797 |        7.05715 |    14 |
    +----------------------+----------+---------------------+-----------+--------+--------+----------------+-------+

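One way to trim this table down (a sketch, not the only option; ``my_trainable`` here is a placeholder for your own trainable) is to pass a ``CLIReporter`` with the metric columns you want shown:

```python
from ray import tune
from ray.tune import CLIReporter

# Show only the listed metrics as columns in the console status table.
reporter = CLIReporter(metric_columns=["acc", "training_iteration"])

tune.run(my_trainable, progress_reporter=reporter)
```
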
.. _tune-log_to_file:

How to redirect stdout and stderr to files?
-------------------------------------------

The stdout and stderr streams are usually printed to the console.
For remote actors, Ray collects these logs and prints them to the head process.

However, if you would like to collect the stream outputs in files for later
analysis or troubleshooting, Tune offers a utility parameter, ``log_to_file``,
for this.

By passing ``log_to_file=True`` to ``tune.run()``, stdout and stderr will be logged
to ``trial_logdir/stdout`` and ``trial_logdir/stderr``, respectively:

.. code-block:: python

    tune.run(
        trainable,
        log_to_file=True)

If you would like to specify the output files, you can either pass one filename,
where the combined output will be stored, or two filenames, for stdout and stderr,
respectively:

.. code-block:: python

    tune.run(
        trainable,
        log_to_file="std_combined.log")

    tune.run(
        trainable,
        log_to_file=("my_stdout.log", "my_stderr.log"))

The file names are relative to the trial's logdir. You can pass absolute paths,
too.

If ``log_to_file`` is set, Tune will automatically register a new logging handler
for Ray's base logger and log the output to the specified stderr output file.

.. _trainable-logging:

How to Configure Trainable Logging?
-----------------------------------

By default, Tune only logs the *training result dictionaries* from your Trainable.
However, you may want to visualize the model weights, model graph,
or use a custom logging library that requires multi-process logging.
For example, you may want to do this if you're trying to log images to TensorBoard.

You can do this in the trainable, as shown below:

.. tip:: Make sure that any logging calls or objects stay within scope of the Trainable.
    You may see pickling or other serialization errors or inconsistent logs otherwise.

.. tabbed:: Function API

    ``library`` refers to whatever 3rd party logging library you are using.

    .. code-block:: python

        import os

        from ray.air import session

        def trainable(config):
            trial_id = session.get_trial_id()
            library.init(
                name=trial_id,
                id=trial_id,
                resume=trial_id,
                reinit=True,
                allow_val_change=True)
            library.set_log_path(os.getcwd())

            for step in range(100):
                library.log_model(...)
                library.log(results, step=step)
                session.report(results)

.. tabbed:: Class API

    .. code-block:: python

        class CustomLogging(tune.Trainable):
            def setup(self, config):
                trial_id = self.trial_id
                library.init(
                    name=trial_id,
                    id=trial_id,
                    resume=trial_id,
                    reinit=True,
                    allow_val_change=True)
                library.set_log_path(os.getcwd())

            def step(self):
                library.log_model(...)

            def log_result(self, result):
                res_dict = {
                    str(k): v
                    for k, v in result.items()
                    if (v and "config" not in k and not isinstance(v, str))
                }
                step = result["training_iteration"]
                library.log(res_dict, step=step)

Note: For both functional and class trainables, the current working directory is changed to a
directory specific to that trainable once it's launched on a remote actor.

In the distributed case, these logs will be synced back to the driver under your logger path,
allowing you to visualize and analyze the logs of all distributed training workers on a single machine.

How to Build Custom Loggers?
----------------------------

You can create a custom logger by inheriting the LoggerCallback interface (:ref:`logger-interface`):

.. code-block:: python

    import json
    import os
    from typing import Dict, List

    from ray.tune.logger import LoggerCallback


    class CustomLoggerCallback(LoggerCallback):
        """Custom logger interface"""

        def __init__(self, filename: str = "log.txt"):
            self._trial_files = {}
            self._filename = filename

        def log_trial_start(self, trial: "Trial"):
            trial_logfile = os.path.join(trial.logdir, self._filename)
            self._trial_files[trial] = open(trial_logfile, "at")

        def log_trial_result(self, iteration: int, trial: "Trial", result: Dict):
            if trial in self._trial_files:
                self._trial_files[trial].write(json.dumps(result))

        def on_trial_complete(self, iteration: int, trials: List["Trial"],
                              trial: "Trial", **info):
            if trial in self._trial_files:
                self._trial_files[trial].close()
                del self._trial_files[trial]

You can then pass in your own logger as follows:

.. code-block:: python

    from ray import tune

    tune.run(
        MyTrainableClass,
        name="experiment_name",
        callbacks=[CustomLoggerCallback("log_test.txt")]
    )

By default, Ray Tune creates JSON, CSV, and TensorBoardX logger callbacks if you don't pass them yourself.
You can disable this behavior by setting the ``TUNE_DISABLE_AUTO_CALLBACK_LOGGERS`` environment variable to ``"1"``.
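
For example, a minimal sketch of setting that variable from Python (using only the variable named above; it must be set before ``tune.run()`` is called):

```python
import os

# Disable Tune's default JSON/CSV/TensorBoardX logger callbacks.
# Must run before tune.run() starts the experiment.
os.environ["TUNE_DISABLE_AUTO_CALLBACK_LOGGERS"] = "1"
```

Alternatively, export the variable in the shell before launching your script.
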
An example of creating a custom logger can be found in :doc:`/tune/examples/includes/logging_example`.