The `Ray Lightning Library <https://github.com/ray-project/ray_lightning>`__ provides plugins for distributed training with Ray.
These PyTorch Lightning plugins on Ray enable quick and easy parallel training while still leveraging all the benefits of PyTorch Lightning. The library offers the following plugins:
* `PyTorch Distributed Data Parallel <https://github.com/ray-project/ray_lightning#pytorch-distributed-data-parallel-plugin-on-ray>`__
* `Fairscale <https://github.com/ray-project/ray_lightning#model-parallel-sharded-training-on-ray>`__ for model parallel training.
Once you add your plugin to the PyTorch Lightning Trainer, you can parallelize training across all the cores on your laptop, or across a massive multi-node, multi-GPU cluster, with no additional code changes.
Install the Ray Lightning Library with pip.
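The package name below is taken from the project README:

.. code-block:: bash

    pip install ray_lightning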
To use, simply pass the plugin to your PyTorch Lightning ``Trainer``. For full details, you can check out the `README here <https://github.com/ray-project/ray_lightning#distributed-pytorch-lightning-training-on-ray>`__.
Here is an example of using the ``RayPlugin`` for Distributed Data Parallel training on a Ray cluster:
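(A minimal sketch: the toy ``LitRegressor`` model and random data below are illustrative, and the plugin arguments shown, ``num_workers``, ``num_cpus_per_worker``, and ``use_gpu``, follow the project README; check the README for the full list of options.)

.. code-block:: python

    import ray
    import torch
    import pytorch_lightning as pl
    from torch.utils.data import DataLoader, TensorDataset
    from ray_lightning import RayPlugin

    # Start Ray locally, or use ``ray.init(address="auto")`` to attach to a
    # running Ray cluster.
    ray.init()


    # A toy LightningModule; substitute your own model here.
    class LitRegressor(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(8, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.mse_loss(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)


    train_loader = DataLoader(
        TensorDataset(torch.randn(256, 8), torch.randn(256, 1)), batch_size=32
    )

    # Four Ray training workers with one CPU each; set ``use_gpu=True`` to give
    # each worker a GPU instead.
    plugin = RayPlugin(num_workers=4, num_cpus_per_worker=1, use_gpu=False)

    # Pass the plugin to the Trainer; Ray launches the worker processes and
    # PyTorch DDP handles the gradient communication between them.
    trainer = pl.Trainer(max_epochs=2, plugins=[plugin])
    trainer.fit(LitRegressor(), train_loader)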
With this plugin, PyTorch DDP is used as the distributed training communication protocol, while Ray is used to launch and manage the training worker processes.
Multi-node Distributed Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Using the same examples above, you can run distributed training on a multi-node cluster with just a couple of simple steps.
First, use Ray's :ref:`Cluster Launcher <ref-cluster-quick-start>` to start a Ray cluster:
.. code-block:: bash

    ray up my_cluster_config.yaml
Then, run your Ray script using one of the following options:
1. on the head node of the cluster (``python train_script.py``)
Check out the :ref:`PyTorch Lightning with Ray Tune tutorial <tune-pytorch-lightning-ref>` for a full example of how you can use these callbacks and run a tuning experiment for your PyTorch Lightning model.
These integrations also support the case where you want a distributed hyperparameter tuning experiment, but each trial (training run) needs to be distributed as well.
In this case, you want to use the `Ray Lightning Library's <https://github.com/ray-project/ray_lightning>`_ integration with Ray Tune.
With this integration, you can run multiple PyTorch Lightning training runs in parallel, each with a different hyperparameter configuration, while each individual training run is itself parallelized across multiple workers.
All you have to do is move your training code into a function, pass the function to ``Tuner()``, and make sure to add the appropriate callback (either ``TuneReportCallback`` or ``TuneReportCheckpointCallback``) to your PyTorch Lightning Trainer.
.. warning:: Make sure to use the callbacks from the Ray Lightning library and not the ones from the Tune library, i.e. use ``ray_lightning.tune.TuneReportCallback`` and not ``ray.tune.integration.pytorch_lightning.TuneReportCallback``.
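Putting the pieces together, here is a minimal sketch. It assumes a hypothetical ``MyLitModel`` LightningModule that takes an ``lr`` argument and logs ``val_loss`` during validation, and the ``TuneReportCallback`` and ``get_tune_resources`` usages follow the Ray Lightning README, so verify them against the version you install:

.. code-block:: python

    from pytorch_lightning import Trainer
    from ray import tune
    from ray_lightning import RayPlugin
    from ray_lightning.tune import TuneReportCallback, get_tune_resources


    def train_fn(config):
        # ``MyLitModel`` is a placeholder for your own LightningModule; it
        # should log "val_loss" in its validation step.
        model = MyLitModel(lr=config["lr"])

        # Report "val_loss" back to Tune at the end of every validation epoch.
        report_cb = TuneReportCallback({"loss": "val_loss"}, on="validation_end")

        # Each trial launches two Ray workers for its own DDP training run.
        trainer = Trainer(
            max_epochs=4,
            callbacks=[report_cb],
            plugins=[RayPlugin(num_workers=2, use_gpu=False)],
        )
        trainer.fit(model)


    tuner = tune.Tuner(
        tune.with_resources(
            train_fn,
            # Reserve enough cluster resources for the workers each trial starts.
            resources=get_tune_resources(num_workers=2, use_gpu=False),
        ),
        param_space={"lr": tune.loguniform(1e-4, 1e-1)},
        tune_config=tune.TuneConfig(metric="loss", mode="min", num_samples=4),
    )
    results = tuner.fit()
    print(results.get_best_result().config)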