This PR:
* Allows the user to set `keep_checkpoints_num` and `checkpoint_score_attr` in `RunConfig` using the `CheckpointStrategy` dataclass
* Adds two new fields to the `Result` object - `best_checkpoints` - a list of saved best checkpoints as determined by `CheckpointingConfig`.
Adding a FAQ page. Currently has some basic questions that have come up in the past.
Explaining how to use Matplotlib due to threading in the distributed training function.
This PR adds a feature that allows user to make their training runs more reproducible. I've implemented this feature by following PyTorch's guide on how to limit sources of randomness (https://pytorch.org/docs/stable/notes/randomness.html).
These changes will make it easier for us to benchmark Ray Train, and also make it easier for users to reproduce their experiments.
Interface for DataParallelTrainer and updates to ScalingConfig definition.
Depends on #22986
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
The [original PR](https://github.com/ray-project/ray/pull/21864) was [reverted](https://github.com/ray-project/ray/pull/22117) because it caused `torch` (more specifically, `torch>=1.8.1`) to be required to use `ray.train`.
```
| File "ray_sgd_training.py", line 18, in <module>
| from ray import train
| File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/__init__.py", line 2, in <module>
| from ray.train.callbacks import TrainingCallback
| File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/__init__.py", line 8, in <module>
| from ray.train.callbacks.profile import TorchTensorboardProfilerCallback
| File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/profile.py", line 6, in <module>
| from torch.profiler import profile
| ModuleNotFoundError: No module named 'torch.profiler'
```
A [minimal installation test suite](https://github.com/ray-project/ray/pull/22300) was added to detect this. Further, in this PR we make the following changes:
1. Move `TorchWorkerProfiler` to `ray.train.torch` so all torch imports are centralized.
2. Add import validation logic to `TorchWorkerProfiler.__init__` so an exception will only be raised if the user tries to initialize a `TorchWorkerProfiler` without having a valid version of `torch` installed:
```
>>> import ray
>>> import ray.train
>>> import ray.train.torch
>>> from ray.train.torch import TorchWorkerProfiler
>>> twp = TorchWorkerProfiler()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/matt/workspace/ray/python/ray/train/torch.py", line 365, in __init__
"Torch Profiler requires torch>=1.8.1. "
ImportError: Torch Profiler requires torch>=1.8.1. Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler.
```
Implement a TorchTensorboardProfilerCallback and corresponding TorchWorkerProfiler to support distributed PyTorch Profiler With TensorBoard integration.