[train] add TorchTensorboardProfilerCallback (#22345)
The [original PR](https://github.com/ray-project/ray/pull/21864) was [reverted](https://github.com/ray-project/ray/pull/22117) because it caused `torch` (more specifically, `torch>=1.8.1`) to be required to use `ray.train`:

```
  File "ray_sgd_training.py", line 18, in <module>
    from ray import train
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/__init__.py", line 2, in <module>
    from ray.train.callbacks import TrainingCallback
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/__init__.py", line 8, in <module>
    from ray.train.callbacks.profile import TorchTensorboardProfilerCallback
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/profile.py", line 6, in <module>
    from torch.profiler import profile
ModuleNotFoundError: No module named 'torch.profiler'
```

A [minimal installation test suite](https://github.com/ray-project/ray/pull/22300) was added to detect this.

Further, in this PR we make the following changes:

1. Move `TorchWorkerProfiler` to `ray.train.torch` so all torch imports are centralized.
2. Add import validation logic to `TorchWorkerProfiler.__init__` so an exception is raised only if the user tries to initialize a `TorchWorkerProfiler` without a valid version of `torch` installed:

```
>>> import ray
>>> import ray.train
>>> import ray.train.torch
>>> from ray.train.torch import TorchWorkerProfiler
>>> twp = TorchWorkerProfiler()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/matt/workspace/ray/python/ray/train/torch.py", line 365, in __init__
    "Torch Profiler requires torch>=1.8.1. "
ImportError: Torch Profiler requires torch>=1.8.1. Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler.
```
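For illustration, here is a minimal sketch of the deferred-import validation pattern described above, condensed from the `python/ray/train/torch.py` changes in the diff below (not a drop-in snippet):

```python
# Minimal sketch of the deferred-import validation pattern.
try:
    from torch.profiler import profile
except ImportError:
    # `import ray.train` should still succeed without torch>=1.8.1 installed.
    profile = None


class TorchWorkerProfiler:
    def __init__(self, trace_dir=None):
        # Raise only when the profiler is actually constructed, not at import time.
        if profile is None:
            raise ImportError(
                "Torch Profiler requires torch>=1.8.1. "
                "Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler."
            )
```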
This commit is contained in: parent 35a157948e, commit 8f9e0d7f6b
13 changed files with 393 additions and 7 deletions
@@ -236,6 +236,7 @@ MOCK_MODULES = [
    "torch.distributed",
    "torch.nn",
    "torch.nn.parallel",
    "torch.profiler",
    "torch.utils.data",
    "torch.utils.data.distributed",
    "wandb",
@@ -86,6 +86,14 @@ MLflowLoggerCallback

.. autoclass:: ray.train.callbacks.MLflowLoggerCallback

.. _train-api-torch-tensorboard-profiler-callback:

TorchTensorboardProfilerCallback
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: ray.train.callbacks.TorchTensorboardProfilerCallback

ResultsPreprocessors
~~~~~~~~~~~~~~~~~~~~
@@ -175,6 +183,14 @@ train.torch.get_device

.. autofunction:: ray.train.torch.get_device

.. _train-api-torch-worker-profiler:

train.torch.TorchWorkerProfiler
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: ray.train.torch.TorchWorkerProfiler
    :members:

TensorFlow Training Function Utilities
--------------------------------------
@@ -20,6 +20,7 @@ In this guide, we cover examples for the following use cases:

* How do I :ref:`monitor <train-monitoring>` my training?
* How do I run my training on pre-emptible instances
  (:ref:`fault tolerance <train-fault-tolerance>`)?
* How do I :ref:`profile <train-profiling>` my training?
* How do I use Ray Train to :ref:`train with a large dataset <train-datasets>`?
* How do I :ref:`tune <train-tune>` my Ray Train model?
@@ -429,6 +430,7 @@ The following ``TrainingCallback``\s are available and will log the intermediate

2. :ref:`train-api-json-logger-callback`
3. :ref:`train-api-tbx-logger-callback`
4. :ref:`train-api-mlflow-logger-callback`
5. :ref:`train-api-torch-tensorboard-profiler-callback`

Example: Logging to MLflow and TensorBoard
++++++++++++++++++++++++++++++++++++++++++
@@ -919,6 +921,60 @@ number of retries is configurable through the ``max_retries`` argument of the

.. TODO.

.. _train-profiling:

Profiling
---------

Ray Train comes with an integration with `PyTorch Profiler <https://pytorch.org/blog/introducing-pytorch-profiler-the-new-and-improved-performance-tool/>`_.
Specifically, it comes with a :ref:`TorchWorkerProfiler <train-api-torch-worker-profiler>` utility class and a :ref:`train-api-torch-tensorboard-profiler-callback` callback
that allow you to use the PyTorch Profiler as you would in a non-distributed PyTorch script, and synchronize the generated TensorBoard traces onto
the disk from which your script was executed.

**Step 1: Update training function with** ``TorchWorkerProfiler``

.. code-block:: python

    from ray.train.torch import TorchWorkerProfiler

    def train_func():
        twp = TorchWorkerProfiler()
        with profile(..., on_trace_ready=twp.trace_handler) as p:
            ...
            profile_results = twp.get_and_clear_profile_traces()
            train.report(..., **profile_results)
        ...

**Step 2: Run training function with** ``TorchTensorboardProfilerCallback``

.. code-block:: python

    from ray.train import Trainer
    from ray.train.callbacks import TorchTensorboardProfilerCallback

    trainer = Trainer(backend="torch", num_workers=2)
    trainer.start()
    trainer.run(train_func, callbacks=[TorchTensorboardProfilerCallback()])
    trainer.shutdown()

**Step 3: Visualize the logs**

.. code-block:: bash

    # Navigate to the run directory of the trainer.
    # For example `cd /home/ray_results/train_2021-09-01_12-00-00/run_001/pytorch_profiler`
    $ cd <TRAINER_RUN_DIR>/pytorch_profiler

    # Install the PyTorch Profiler TensorBoard Plugin.
    $ pip install torch_tb_profiler

    # Start the TensorBoard UI.
    $ tensorboard --logdir .

    # View the PyTorch Profiler traces.
    $ open http://localhost:6006/#pytorch_profiler

.. _train-datasets:

Distributed Data Ingest (Ray Datasets)
@@ -39,6 +39,15 @@ py_test(
    deps = [":train_lib"]
)

py_test(
    name = "torch_tensorboard_profiler_example",
    size = "small",
    main = "examples/torch_tensorboard_profiler_example.py",
    srcs = ["examples/torch_tensorboard_profiler_example.py"],
    tags = ["team:ml", "exclusive"],
    deps = [":train_lib"]
)

py_test(
    name = "transformers_example",
    size = "large",
@@ -5,11 +5,13 @@ from ray.train.callbacks.logging import (
    TBXLoggerCallback,
)
from ray.train.callbacks.print import PrintCallback
from ray.train.callbacks.profile import TorchTensorboardProfilerCallback

__all__ = [
    "TrainingCallback",
    "JsonLoggerCallback",
    "MLflowLoggerCallback",
    "TBXLoggerCallback",
    "TorchTensorboardProfilerCallback",
    "PrintCallback",
]
@@ -1,13 +1,22 @@
import abc
from typing import List, Dict

-from ray.train.callbacks.results_preprocessors import ResultsPreprocessor
+from ray.train.callbacks.results_preprocessors import (
+    ResultsPreprocessor,
+    ExcludedKeysResultsPreprocessor,
+    SequentialResultsPreprocessor,
+)
+from ray.train.constants import ALL_RESERVED_KEYS


class TrainingCallback(abc.ABC):
    """Abstract Train callback class."""

    results_preprocessor: ResultsPreprocessor = None
    # Reserved keys used by this specific Callback.
    # This should be set in a Callback class implementation so that the keys
    # are not filtered out. See ``_preprocess_results`` for more details.
    RESERVED_KEYS = {}

    def start_training(self, logdir: str, config: Dict, **info):
        """Called once on training start.

@@ -34,10 +43,37 @@ class TrainingCallback(abc.ABC):
                the training function from each worker.
            **info: kwargs dict for forward compatibility.
        """
-        if self.results_preprocessor:
-            results = self.results_preprocessor.preprocess(results)
+        results = self._preprocess_results(results)
        self.handle_result(results, **info)

    def _preprocess_results(self, results: List[Dict]) -> List[Dict]:
        """Preprocesses the reported training results.

        This will:

        * Exclude all keys that are present in ``ALL_RESERVED_KEYS`` but
          not in ``self.RESERVED_KEYS``.
        * Execute ``self.results_preprocessor`` if defined.

        Args:
            results (List[Dict]): List of results from the training
                function. Each value in the list corresponds to the output of
                the training function from each worker.

        Returns:
            The preprocessed results.
        """
        results_to_exclude = ALL_RESERVED_KEYS.difference(self.RESERVED_KEYS)
        system_preprocessor = ExcludedKeysResultsPreprocessor(results_to_exclude)
        if self.results_preprocessor:
            self.results_preprocessor = SequentialResultsPreprocessor(
                [system_preprocessor, self.results_preprocessor]
            )
        else:
            self.results_preprocessor = system_preprocessor
        results = self.results_preprocessor.preprocess(results)
        return results

    def handle_result(self, results: List[Dict], **info):
        """Called every time train.report() is called after preprocessing.
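For context, a minimal hypothetical sketch of how the reserved-key mechanism above is consumed: a callback that declares a key in `RESERVED_KEYS` keeps that key in the results passed to `handle_result`, while other reserved keys are filtered out by `_preprocess_results`. `MyProfileConsumerCallback` is illustrative only and not part of this commit.

```python
from typing import Dict, List

from ray.train.callbacks import TrainingCallback
from ray.train.constants import PYTORCH_PROFILER_KEY


class MyProfileConsumerCallback(TrainingCallback):
    # Opt in to the profiler key; reserved keys not listed here are
    # stripped from the results before handle_result is called.
    RESERVED_KEYS = {PYTORCH_PROFILER_KEY}

    def handle_result(self, results: List[Dict], **info):
        for result in results:
            traces = result.get(PYTORCH_PROFILER_KEY) or []
            print(f"Worker reported {len(traces)} profiler trace(s)")
```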
@@ -59,14 +59,14 @@ class TrainCallbackLogdirManager:
        self._logdir = Path(logdir) if logdir else None
        self._create_logdir = create_logdir

-    def setup_logdir(self, default_logdir: str) -> Path:
+    def setup_logdir(self, default_logdir: Union[str, Path]) -> Path:
        """Sets up the logdir.

        The directory will be created if it does not exist and
        ``create_logdir`` is set to True.

        Args:
-            default_logdir (str): The default logdir to use, only if the
+            default_logdir (str|Path): The default logdir to use, only if the
                ``TrainCallbackLogdirManager`` was not initialized with a ``logdir``.

        Returns:
python/ray/train/callbacks/profile.py (new file, 53 lines)
@@ -0,0 +1,53 @@
import logging
from pathlib import Path
from typing import List, Dict, Optional, Union

from ray.train.callbacks import TrainingCallback
from ray.train.callbacks.logging import TrainCallbackLogdirManager
from ray.train.callbacks.results_preprocessors import IndexedResultsPreprocessor
from ray.train.constants import PYTORCH_PROFILER_KEY

logger = logging.getLogger(__name__)

DRIVER_TRACE_DIR_NAME = "pytorch_profiler"


class TorchTensorboardProfilerCallback(TrainingCallback):
    """Synchronizes PyTorch Profiler traces onto disk.

    This should typically be used in conjunction with ``TorchWorkerProfiler``,
    though the actual requirement is for the ``_train_torch_profiler`` key
    to be populated in the results from ``train.report()``.

    Args:
        logdir (Optional[str]): The directory to store traces. If ``None``,
            this will use a default temporary dir.
        workers_to_log (Optional[int|List[int]]): Worker indices to log.
            If ``None``, will log all workers. By default, this will log all
            workers.
    """

    RESERVED_KEYS = [PYTORCH_PROFILER_KEY]

    def __init__(
        self,
        logdir: Optional[str] = None,
        workers_to_log: Optional[Union[int, List[int]]] = None,
    ) -> None:
        super().__init__()
        self._logdir = logdir
        self._logdir_manager = TrainCallbackLogdirManager(logdir=logdir)
        self.results_preprocessor = IndexedResultsPreprocessor(indices=workers_to_log)

    def start_training(self, logdir: str, **info):
        default_logdir = Path(logdir).joinpath(DRIVER_TRACE_DIR_NAME)
        self._logdir_manager.setup_logdir(default_logdir=default_logdir)

    def handle_result(self, results: List[Dict], **info):
        for result in results:
            if PYTORCH_PROFILER_KEY in result and result[PYTORCH_PROFILER_KEY]:
                profile_traces = result[PYTORCH_PROFILER_KEY]
                for (name, data) in profile_traces:
                    path = self._logdir_manager.logdir_path.joinpath(name)
                    with path.open("w") as f:
                        f.write(data)
@@ -64,3 +64,13 @@ TRAIN_ENABLE_WORKER_SPREAD_ENV = "TRAIN_ENABLE_WORKER_SPREAD"
# The key used to identify whether we have already warned about ray.train
# functions being used outside of the session
SESSION_MISUSE_LOG_ONCE_KEY = "train_warn_session_misuse"

# Reserved keyword used by the ``TorchWorkerProfiler`` and
# ``TorchTensorboardProfilerCallback`` for passing PyTorch Profiler data
# through ``train.report()``
PYTORCH_PROFILER_KEY = "_train_torch_profiler"

# Reserved keys used across all Callbacks.
# By default these will be filtered out from ``train.report()``.
# See ``TrainingCallback._preprocess_results`` for more details.
ALL_RESERVED_KEYS = {PYTORCH_PROFILER_KEY}
@@ -0,0 +1,84 @@
import argparse

import torch
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present
from torch.profiler import profile, record_function, schedule

import ray
import ray.train as train
from ray.train import Trainer
from ray.train.callbacks import TBXLoggerCallback
from ray.train.callbacks.profile import TorchTensorboardProfilerCallback
from ray.train.torch import TorchWorkerProfiler


def train_func():
    twp = TorchWorkerProfiler()
    with profile(
        activities=[],
        schedule=schedule(wait=0, warmup=0, active=1),
        on_trace_ready=twp.trace_handler,
    ) as p:

        # Setup model.
        model = torch.nn.Linear(1, 1)
        model = train.torch.prepare_model(model)
        loss_fn = torch.nn.MSELoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

        # Setup data.
        input = torch.randn(1000, 1)
        labels = input * 2
        dataset = torch.utils.data.TensorDataset(input, labels)
        dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
        dataloader = train.torch.prepare_data_loader(dataloader)

        # Train.
        for epoch in range(5):
            with record_function("train_epoch"):
                for X, y in dataloader:
                    pred = model(X)
                    loss = loss_fn(pred, y)
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()

            with record_function("train_checkpoint"):
                state_dict = model.state_dict()
                consume_prefix_in_state_dict_if_present(state_dict, "module.")
                train.save_checkpoint(epoch=epoch, model_weights=state_dict)

            p.step()

            with record_function("train_report"):
                profile_results = twp.get_and_clear_profile_traces()
                train.report(epoch=epoch, **profile_results)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--address", required=False, type=str, help="the address to use for Ray"
    )
    parser.add_argument(
        "--num-workers",
        "-n",
        type=int,
        default=2,
        help="Sets number of workers for training.",
    )
    parser.add_argument(
        "--use-gpu", action="store_true", default=False, help="Enables GPU training"
    )

    args = parser.parse_args()

    ray.init(address=args.address)

    callbacks = [TorchTensorboardProfilerCallback(), TBXLoggerCallback()]
    trainer = Trainer(
        backend="torch", num_workers=args.num_workers, use_gpu=args.use_gpu
    )
    trainer.start()
    trainer.run(train_func, callbacks=callbacks)
    trainer.shutdown()
@@ -11,7 +11,12 @@ import ray
import ray.train as train
from ray.train import Trainer
from ray.train.backend import BackendConfig, Backend
-from ray.train.callbacks import JsonLoggerCallback, PrintCallback, TBXLoggerCallback
+from ray.train.callbacks import (
+    JsonLoggerCallback,
+    PrintCallback,
+    TBXLoggerCallback,
+    TorchTensorboardProfilerCallback,
+)
from ray.train.callbacks.logging import MLflowLoggerCallback, TrainCallbackLogdirManager
from ray.train.constants import (
    TRAINING_ITERATION,
@@ -255,6 +260,47 @@ def test_mlflow(ray_start_4_cpus, tmp_path):
    assert rewards == [4, 5, 6]


def test_torch_tensorboard_profiler_callback(ray_start_4_cpus, tmp_path):
    config = TestConfig()

    temp_dir = tmp_path
    num_workers = 4
    num_epochs = 2

    def train_func():
        from ray.train.torch import TorchWorkerProfiler
        from torch.profiler import profile, record_function, schedule

        twp = TorchWorkerProfiler()
        with profile(
            activities=[],
            schedule=schedule(wait=0, warmup=0, active=1),
            on_trace_ready=twp.trace_handler,
        ) as p:

            for epoch in range(num_epochs):
                with record_function("test_function"):
                    pass

                p.step()

                profile_results = twp.get_and_clear_profile_traces()
                train.report(epoch=epoch, **profile_results)

    callback = TorchTensorboardProfilerCallback(temp_dir)
    trainer = Trainer(config, num_workers=num_workers)
    trainer.start()
    trainer.run(train_func, callbacks=[callback])

    assert temp_dir.exists()

    count = 0
    for path in temp_dir.iterdir():
        assert path.is_file()
        count += 1
    assert count == num_workers * num_epochs


if __name__ == "__main__":
    import pytest
    import sys
@@ -6,7 +6,7 @@ import ray
import ray.train as train
from ray.train import Trainer
from ray.train.backend import BackendConfig, Backend
-from ray.train.callbacks.callback import TrainingCallback
+from ray.train.callbacks import TrainingCallback
from ray.train.worker_group import WorkerGroup
@@ -1,14 +1,17 @@
import tempfile
from dataclasses import dataclass
import io
import logging
import os

from datetime import timedelta
from pathlib import Path
from typing import Optional, Dict, Any

import ray
from ray import train
from ray.train.backend import BackendConfig, Backend, EncodedData
from ray.train.constants import PYTORCH_PROFILER_KEY
from ray.train.worker_group import WorkerGroup
from ray.train.utils import get_address_and_port
@@ -23,6 +26,11 @@ from torch.utils.data import (
    SequentialSampler,
)

try:
    from torch.profiler import profile
except ImportError:
    profile = None

logger = logging.getLogger(__name__)
@@ -338,3 +346,68 @@ def prepare_data_loader(
        data_loader = _WrappedDataLoader(data_loader, device)

    return data_loader


WORKER_TRACE_DIR_NAME = "pytorch_profiler_worker_traces"


class TorchWorkerProfiler:
    """Utility class for running PyTorch Profiler on a Train worker.

    Args:
        trace_dir (Optional[str]): The directory to store traces on the
            worker node. If ``None``, this will use a default temporary dir.
    """

    def __init__(self, trace_dir: Optional[str] = None):
        if profile is None:
            raise ImportError(
                "Torch Profiler requires torch>=1.8.1. "
                "Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler."
            )

        trace_dir = trace_dir or Path(tempfile.gettempdir()).joinpath(
            WORKER_TRACE_DIR_NAME
        )
        self.trace_dir = Path(trace_dir)
        self.trace_dir.mkdir(parents=True, exist_ok=True)
        # Accumulated traces.
        self.profiler_trace_filenames = []

    def trace_handler(self, p: profile):
        """A stateful PyTorch Profiler trace handler.

        This will export the chrome trace to a file on disk.

        These exported traces can then be fetched by calling
        ``get_and_clear_profile_traces``.

        Args:
            p (profile): A PyTorch Profiler profile.
        """
        trace_filename = f"worker_{train.world_rank()}_epoch_{p.step_num}.pt.trace.json"
        trace_path = self.trace_dir.joinpath(trace_filename)

        logger.debug(f"Writing worker trace to {trace_path}.")
        p.export_chrome_trace(str(trace_path))
        self.profiler_trace_filenames.append(trace_filename)

    def get_and_clear_profile_traces(self):
        """Reads unread Profiler traces from this worker.

        Returns:
            The traces in a format consumable by
            ``TorchTensorboardProfilerCallback``.
        """

        def get_trace(filename):
            trace_path = self.trace_dir.joinpath(filename)
            return trace_path.read_text()

        traces = [
            (trace_filename, get_trace(trace_filename))
            for trace_filename in self.profiler_trace_filenames
        ]

        self.profiler_trace_filenames = []
        return {PYTORCH_PROFILER_KEY: traces}