Commit graph

33 commits

Author SHA1 Message Date
Antoni Baum
3625c4760f
[ML/Train] Add TensorflowTrainer interface (#23072)
Interface for TensorflowTrainer

Depends on #22988

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2022-03-15 14:02:17 -07:00
Balaji Veeramani
c694ed4594
[Train] Add enable_reproducibility (#22851)
This PR adds a feature that allows users to make their training runs more reproducible. I've implemented this feature by following PyTorch's guide on how to limit sources of randomness (https://pytorch.org/docs/stable/notes/randomness.html).

These changes will make it easier for us to benchmark Ray Train, and also make it easier for users to reproduce their experiments.
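
As a rough illustration of what limiting sources of randomness can look like, here is a minimal sketch. It is not the code from this PR: `seed_everything` is a hypothetical helper name, and the NumPy and PyTorch steps are assumptions that only run when those packages are installed.

```python
import os
import random


def seed_everything(seed: int = 0) -> None:
    """Hypothetical helper: seed the common sources of randomness."""
    # Python's built-in random number generator.
    random.seed(seed)

    # NumPy is optional in this sketch; skip it if it is not installed.
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass

    # PyTorch is optional too; nothing more to do without it.
    try:
        import torch
    except ImportError:
        return

    torch.manual_seed(seed)
    # Disable cuDNN autotuning, which may pick different kernels per run.
    torch.backends.cudnn.benchmark = False
    # Some CUDA versions need this set before deterministic mode is used.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    if hasattr(torch, "use_deterministic_algorithms"):
        torch.use_deterministic_algorithms(True)
```

Calling such a helper with the same seed at the start of every run should make Python-level random draws repeat; full determinism on GPUs depends on the operations used.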
2022-03-15 11:07:34 -07:00
Amog Kamsetty
e1f24a244b
[ml/train] Training Interfaces [3/4]: DataParallelTrainer interface (#22988)
Interface for DataParallelTrainer and updates to ScalingConfig definition.

Depends on #22986

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2022-03-15 08:11:05 -07:00
Max Pumperla
2b8faae40c
[docs] re/move old core examples (#22802)
2022-03-10 12:17:00 -08:00
Max Pumperla
11c40e363d
[docs] external promo content (#22823)
2022-03-10 11:39:44 -08:00
Max Pumperla
d53d0e0f50
[docs] Typo - fixes #22761 (#22763)
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
2022-03-02 10:34:46 +01:00
Amog Kamsetty
80e0d9cea4
[Train] Update docs for ray.train.torch import (#22555)
Update more examples to include the ray.train.torch import line. Follow up to #21969
2022-02-23 19:22:27 -08:00
Hao Chen
78597d3089
[train] Minor fixes on Ray Train user guide doc (#22379)
Fixes some typos and format issues.
2022-02-15 10:09:27 -08:00
matthewdeng
8f9e0d7f6b
[train] add TorchTensorboardProfilerCallback (#22345)
The [original PR](https://github.com/ray-project/ray/pull/21864) was [reverted](https://github.com/ray-project/ray/pull/22117) because it caused `torch` (more specifically, `torch>=1.8.1`) to be required to use `ray.train`.

```
  | File "ray_sgd_training.py", line 18, in <module>
  | from ray import train
  | File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/__init__.py", line 2, in <module>
  | from ray.train.callbacks import TrainingCallback
  | File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/__init__.py", line 8, in <module>
  | from ray.train.callbacks.profile import TorchTensorboardProfilerCallback
  | File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/profile.py", line 6, in <module>
  | from torch.profiler import profile
  | ModuleNotFoundError: No module named 'torch.profiler'
```

A [minimal installation test suite](https://github.com/ray-project/ray/pull/22300) was added to detect this. Further, in this PR we make the following changes:
1. Move `TorchWorkerProfiler` to `ray.train.torch` so all torch imports are centralized.
2. Add import validation logic to `TorchWorkerProfiler.__init__` so an exception will only be raised if the user tries to initialize a `TorchWorkerProfiler` without having a valid version of `torch` installed:

```
>>> import ray
>>> import ray.train
>>> import ray.train.torch
>>> from ray.train.torch import TorchWorkerProfiler
>>> twp = TorchWorkerProfiler()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/matt/workspace/ray/python/ray/train/torch.py", line 365, in __init__
    "Torch Profiler requires torch>=1.8.1. "
ImportError: Torch Profiler requires torch>=1.8.1. Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler.
```
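
The deferred import check described above can be sketched in general form as follows. This is an illustration of the pattern only, not the code in `ray.train.torch`: the class name `OptionalDependencyProfiler` and its `module_name` parameter are hypothetical.

```python
import importlib


class OptionalDependencyProfiler:
    """Hypothetical stand-in for a class that needs an optional package."""

    def __init__(self, module_name: str = "torch.profiler"):
        # Validate the dependency only when an instance is created, so that
        # importing the enclosing package never fails on its own.
        try:
            self._module = importlib.import_module(module_name)
        except ImportError as exc:
            raise ImportError(
                f"{module_name} is required to use {type(self).__name__}."
            ) from exc
```

With this layout, users who never construct the class are unaffected by the missing dependency, while users who do get an actionable error at the point of use.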
2022-02-14 16:16:55 -08:00
Max Pumperla
5cc9355303
[Docs ] Tune docs overhaul (first part) (#22112)
Continuing the docs overhaul, Tune now has:

- [x] better landing page
- [x] a getting started guide
- [x] user guide was cut down, partially merged with FAQ, and partially integrated with tutorials
- [x] the new user guide contains guides to tune features and practical integrations
- [x] we rewrote some of the feature guides for clarity 
- [x] we got rid of sphinx-gallery for this sub-project (only data and core left), as it looks bad and is unnecessarily complicated anyway (plus, makes the build slower)
- [x] sphinx-gallery examples are now moved to markdown notebooks, as started in #22030.
- [x] Examples are tested in the new framework, of course.

There's still a lot one can do, but this is already getting too large. Will follow up with more fine-tuning next week.

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-02-07 15:47:03 +00:00
matthewdeng
014a9959f1
Revert "[train] add TorchTensorboardProfilerCallback (#21864)" (#22117)
This reverts commit f064306de9.
2022-02-04 08:54:16 -08:00
matthewdeng
f064306de9
[train] add TorchTensorboardProfilerCallback (#21864)
Implement a TorchTensorboardProfilerCallback and corresponding TorchWorkerProfiler to support distributed PyTorch Profiler With TensorBoard integration.
2022-02-03 19:28:12 -08:00
Junwen Yao
eb8adc6105
[train] add a utility function to turn off TF autosharding (#21887)
This PR adds a utility function to turn off TF autosharding as a temporary solution.

Closes #19324.
2022-01-28 16:09:06 -08:00
Max Pumperla
4dd221f848
[Docs] Ray Data docs target state (#21931)
Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html)

The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular, we have:

- [x] A Getting Started Guide
- [x] An explicit User / How-To Guide
- [x] A dedicated Key Concepts page
- [x] A consistent naming convention, `Ray Data`, whenever the project is referred to.

This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838.
2022-01-27 13:14:36 -08:00
Max Pumperla
7953c9ca57
[docs] integrate algolia docsearch, move to sphinx panels (#21814)
2022-01-24 17:00:41 -08:00
matthewdeng
8119b62640
[train] refactor callback logdir and results preprocessors (#21468)
* [train] Add TorchTensorboardProfilerCallback and introduce ResultsPreprocessors

* simplify profiler

* read on get_and_clear_profile_traces

* refactor callbacks

* remove var

* Update python/ray/train/callbacks/logging.py

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

* Update python/ray/train/callbacks/results_prepocessors/keys.py

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

* address comments; add tests

* fix test

* address comments

* docs

* address comments

* fix test

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-01-21 17:23:34 -08:00
matthewdeng
165a025641
[train] update worker batch size docs (#21761)
Making it explicit how the user should think about batch size for PyTorch in a distributed setting, similar to what's already done for TensorFlow.

![image](https://user-images.githubusercontent.com/3967392/150421340-df73f574-8531-4626-88a6-b80442ea6b7f.png)
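
One common way to reason about this, shown here as a small sketch, is to treat the configured batch size as the global batch size and split it evenly across workers. The commit message does not state the formula, so this convention and the helper name `per_worker_batch_size` are assumptions made for illustration.

```python
def per_worker_batch_size(global_batch_size: int, num_workers: int) -> int:
    """Hypothetical helper: split a global batch evenly across workers."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if global_batch_size % num_workers != 0:
        raise ValueError("global_batch_size must be divisible by num_workers")
    # Each worker processes an equal share of the global batch per step.
    return global_batch_size // num_workers
```

For example, a global batch of 64 split over 4 workers gives each worker 16 samples per step.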
2022-01-21 17:22:47 -08:00
xwjiang2010
9af8f11191
Revert "[docs] Clean up doc structure (first part) (#21667)" (#21763)
This reverts commit 38e46c9fb3.
2022-01-20 15:30:56 -08:00
Max Pumperla
38e46c9fb3
[docs] Clean up doc structure (first part) (#21667)
2022-01-20 16:19:04 +01:00
Amog Kamsetty
bcae6ba6c9
[Train] _WrappedDataLoader yield tuples (#21467)
Fixes bug with _WrappedDataLoader that yields a generator instead of a tuple.

Addresses https://discuss.ray.io/t/ray-train-creates-typeerror-generator-object-is-not-subscriptable/4605/10
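
The class of bug described here can be reproduced with plain Python. The sketch below is not the `_WrappedDataLoader` source: `to_device`, `iterate_buggy`, and `iterate_fixed` are hypothetical names, and `to_device` stands in for whatever per-item processing the real loader does.

```python
def to_device(item):
    """Hypothetical stand-in for moving one item to a device."""
    return item


def iterate_buggy(batches):
    for batch in batches:
        # A generator expression is lazy and cannot be indexed,
        # so `batch[0]` in user code raises TypeError.
        yield (to_device(item) for item in batch)


def iterate_fixed(batches):
    for batch in batches:
        # Materializing into a tuple restores indexing and unpacking.
        yield tuple(to_device(item) for item in batch)
```

Indexing a batch from `iterate_buggy` fails with "'generator' object is not subscriptable", which matches the error named in the linked thread, while batches from `iterate_fixed` can be indexed as usual.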
2022-01-10 12:40:36 -08:00
Amog Kamsetty
123aa7cd2b
[Train] Improve usability for GPU Training (#21464)
Minor changes to improve the user experience for GPU Training.

Addresses https://discuss.ray.io/t/ray-train-doesnt-detect-gpu/4608
2022-01-07 11:53:53 -08:00
Balaji Veeramani
7efe1bef11
[Train] Add PrintCallback (#21261)
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2022-01-03 14:03:04 -08:00
Amog Kamsetty
57db4640ca
[Train] [Tune] Refactor MLflow (#20802)
Pulls out Tune's MLflow logging logic into a shared MLflow util.
Adds an MLflow logger callback to Ray Train.

Closes #20642
2021-12-21 17:17:52 -08:00
Junwen Yao
8325a32d66
[Train] Update saving / loading checkpoint documentation (#20973)
This PR updates saving / loading checkpoint examples.

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-12-14 09:53:17 -08:00
Amog Kamsetty
c03b937b95
[Train] Minor migration guide update (#20683)
* update docs

* tf
2021-11-29 12:42:28 -08:00
Amog Kamsetty
9796ae56d5
[Train][Data] Change usages of iter_datasets to iter_epochs (#20487)
2021-11-17 18:05:51 -08:00
Amog Kamsetty
4f88796d5a
[Train] Move to beta (#20378)
2021-11-16 08:19:30 -08:00
Amog Kamsetty
a74cf7ff1c
[Train] Torch Prepare utilities (#20254)
* update

* formatting

* fix failures

* fix session tests

* address comments

* add to api docs

* package refactor

* wip

* wip

* wip

* finish

* finish

* fix

* comment

* fix

* install horovod for docs

* address comment

* Update python/ray/train/session.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update python/ray/train/torch.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* address comments

* try fix docs

* fix doc build failure

* fix

* fix

* fix

* try fix doc highlighting

* fix docs

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-11-15 07:34:17 -08:00
Amog Kamsetty
65a17da2ec
[Train] Refactor Backends (#20312)
* wip

* finish

* comment

* fix

* install horovod for docs

* address comment

* fix doc build failure
2021-11-13 11:05:53 -08:00
matthewdeng
e77cc926be
[train] minor doc updates (#20271)
2021-11-12 17:20:23 -08:00
Amog Kamsetty
1803d88943
[Train] Simplify single worker training (#19814)
* wip

* update

* fix

* fix

* fix

* fix
2021-10-28 10:54:35 -07:00
matthewdeng
aa5499ef0f
[Train] implement CheckpointStrategy (#19111)
* [SGD] implement CheckpointStrategy

* address comments

* update docs

* Update doc/source/train/user_guide.rst

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

* best checkpoint

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2021-10-27 11:31:04 -07:00
matthewdeng
4674c78050
[Train] Rename Ray SGD v2 to Ray Train (#19436)
2021-10-18 22:27:46 -07:00