.. _train-docs:

Ray Train: Distributed Deep Learning
====================================

.. _`issue on GitHub`: https://github.com/ray-project/ray/issues

.. tip:: Get in touch with us if you're using or considering using `Ray Train <https://forms.gle/PXFcJmHwszCwQhqX7>`_!
Ray Train is a lightweight library for distributed deep learning, allowing you
to scale up and speed up training for your deep learning models.

The main features are:

- **Ease of use**: Scale your single-process training code to a cluster in just a couple of lines of code.
- **Composability**: Ray Train interoperates with :ref:`Ray Tune <tune-main>` to tune your distributed model and :ref:`Ray Datasets <datasets>` to train on large amounts of data.
- **Interactivity**: Ray Train fits in your workflow with support to run from any environment, including seamless Jupyter notebook support.

.. note::

    This API is in its Beta release (as of Ray 1.9) and may be revised in
    future Ray releases. If you encounter any bugs, please file an
    `issue on GitHub`_.

    If you are looking for the previous API documentation, see :ref:`sgd-index`.

Intro to Ray Train
------------------

Ray Train is a library that aims to simplify distributed deep learning.

**Frameworks**: Ray Train is built to abstract away the coordination and configuration setup of distributed deep learning frameworks such as PyTorch Distributed and TensorFlow Distributed, allowing users to focus only on implementing training logic. A minimal sketch of this pattern follows the list below.

* For PyTorch, Ray Train automatically handles the construction of the distributed process group.
* For TensorFlow, Ray Train automatically handles the coordination of the ``TF_CONFIG`` environment variable. The current implementation assumes that the user will use a ``MultiWorkerMirroredStrategy``, but this will change in the near future.
* For Horovod, Ray Train automatically handles the construction of the Horovod runtime and Rendezvous server.
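
For example, moving between backends is mostly a matter of the ``backend`` argument passed to the ``Trainer`` (introduced in the Quick Start below); the framework-specific logic lives inside your training function. A minimal sketch, with a placeholder training function:

.. code-block:: python

    from ray.train import Trainer

    def train_func():
        # Framework-specific training logic (model setup, training loop) goes here.
        ...

    # Swap backend="torch" for "tensorflow" or "horovod" as needed.
    trainer = Trainer(backend="torch", num_workers=4)
    trainer.start()
    results = trainer.run(train_func)
    trainer.shutdown()

The Quick Start below fills in the training function for each framework.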

**Built for data scientists/ML practitioners**: Ray Train has support for standard ML tools and features that practitioners love (a brief usage sketch follows this list):

* Callbacks for early stopping
* Checkpointing
* Integration with TensorBoard, Weights & Biases, and MLflow
* Jupyter notebooks
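
Roughly, metrics reporting and checkpointing hook into the training function through the ``ray.train`` utilities, and logging integrations are passed to the ``Trainer`` as callbacks. A minimal sketch, assuming the Ray 1.9 ``ray.train.callbacks`` module:

.. code-block:: python

    from ray import train
    from ray.train import Trainer
    from ray.train.callbacks import JsonLoggerCallback  # writes reported metrics to JSON logs

    def train_func():
        for epoch in range(3):
            loss = 1.0 / (epoch + 1)                # placeholder for a real training loss
            train.report(epoch=epoch, loss=loss)    # surfaces metrics to callbacks on the driver
            train.save_checkpoint(epoch=epoch)      # persists state so training can be resumed

    trainer = Trainer(backend="torch", num_workers=2)
    trainer.start()
    trainer.run(train_func, callbacks=[JsonLoggerCallback()])
    trainer.shutdown()

Other logging integrations follow the same pattern of passing callbacks to ``trainer.run``.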

**Integration with the Ray Ecosystem**: Distributed deep learning often comes with a lot of complexity, and Ray Train interoperates with the rest of the Ray ecosystem to manage it:

* Use :ref:`Ray Datasets <datasets>` with Ray Train to handle and train on large amounts of data.
* Use :ref:`Ray Tune <tune-main>` with Ray Train to leverage cutting-edge hyperparameter techniques and distribute both your training and tuning (see the sketch below).
* You can leverage the :ref:`Ray cluster launcher <cluster-cloud>` to launch autoscaling or spot instance clusters to train your model at scale on any cloud.
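
As a rough sketch of the Tune integration (assuming the ``Trainer.to_tune_trainable`` helper from the Ray 1.9 API), a ``Trainer`` plus a training function can be turned into a Tune trainable, and Tune then launches one distributed training run per hyperparameter sample:

.. code-block:: python

    from ray import tune
    from ray.train import Trainer

    def train_func(config):
        # Use config["lr"] (and any other hyperparameters) inside your training loop.
        ...

    trainer = Trainer(backend="torch", num_workers=2)
    trainable = trainer.to_tune_trainable(train_func)

    analysis = tune.run(
        trainable,
        config={"lr": tune.loguniform(1e-4, 1e-1)},
        num_samples=4,
    )

Because each Tune trial runs a full distributed training job, both tuning and training scale out across the cluster.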

Quick Start
-----------

Ray Train abstracts away the complexity of setting up a distributed training
system. Let's walk through the following simple examples:

.. tabs::

  .. group-tab:: PyTorch

    This example shows how you can use Ray Train with PyTorch.

    First, set up your dataset and model.

    .. literalinclude:: /../../python/ray/train/examples/torch_quick_start.py
      :language: python
      :start-after: __torch_setup_begin__
      :end-before: __torch_setup_end__

    Now define your single-worker PyTorch training function.

    .. literalinclude:: /../../python/ray/train/examples/torch_quick_start.py
      :language: python
      :start-after: __torch_single_begin__
      :end-before: __torch_single_end__

    This training function can be executed with:

    .. literalinclude:: /../../python/ray/train/examples/torch_quick_start.py
      :language: python
      :start-after: __torch_single_run_begin__
      :end-before: __torch_single_run_end__

    Now let's convert this to a distributed multi-worker training function!

    All you have to do is use the ``ray.train.torch.prepare_model`` and
    ``ray.train.torch.prepare_data_loader`` utility functions to
    easily set up your model and data for distributed training.
    This will automatically wrap your model with ``DistributedDataParallel``
    and place it on the right device, and add ``DistributedSampler`` to your DataLoaders.

    .. literalinclude:: /../../python/ray/train/examples/torch_quick_start.py
      :language: python
      :start-after: __torch_distributed_begin__
      :end-before: __torch_distributed_end__

    Then, instantiate a ``Trainer`` that uses a ``"torch"`` backend
    with 4 workers, and use it to run the new training function!

    .. literalinclude:: /../../python/ray/train/examples/torch_quick_start.py
      :language: python
      :start-after: __torch_trainer_begin__
      :end-before: __torch_trainer_end__

    See :ref:`train-porting-code` for a more comprehensive example.

  .. group-tab:: TensorFlow

    This example shows how you can use Ray Train to set up `Multi-worker training
    with Keras <https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras>`_.

    First, set up your dataset and model.

    .. literalinclude:: /../../python/ray/train/examples/tensorflow_quick_start.py
      :language: python
      :start-after: __tf_setup_begin__
      :end-before: __tf_setup_end__

    Now define your single-worker TensorFlow training function.

    .. literalinclude:: /../../python/ray/train/examples/tensorflow_quick_start.py
      :language: python
      :start-after: __tf_single_begin__
      :end-before: __tf_single_end__

    This training function can be executed with:

    .. literalinclude:: /../../python/ray/train/examples/tensorflow_quick_start.py
      :language: python
      :start-after: __tf_single_run_begin__
      :end-before: __tf_single_run_end__

    Now let's convert this to a distributed multi-worker training function!
    All you need to do is:

    1. Set the *global* batch size - each worker will process the same size
       batch as in the single-worker code.
    2. Choose your TensorFlow distributed training strategy. In this example
       we use the ``MultiWorkerMirroredStrategy``.

    .. literalinclude:: /../../python/ray/train/examples/tensorflow_quick_start.py
      :language: python
      :start-after: __tf_distributed_begin__
      :end-before: __tf_distributed_end__

    Then, instantiate a ``Trainer`` that uses a ``"tensorflow"`` backend
    with 4 workers, and use it to run the new training function!

    .. literalinclude:: /../../python/ray/train/examples/tensorflow_quick_start.py
      :language: python
      :start-after: __tf_trainer_begin__
      :end-before: __tf_trainer_end__

    See :ref:`train-porting-code` for a more comprehensive example.

**Next steps:** Check out the :ref:`User Guide <train-user-guide>`!