RaySGD PyTorch
==============

.. warning:: This is still an experimental API and is subject to change in the near future.

.. tip:: Help us make RaySGD better; take this 1-minute `User Survey <https://forms.gle/26EMwdahdgm7Lscy9>`_!

Ray's ``PyTorchTrainer`` simplifies distributed model training for PyTorch. The ``PyTorchTrainer`` is a wrapper around ``torch.distributed.launch`` with a Python API, so you can incorporate distributed training into a larger Python application instead of having to launch training outside of Python.

----------

**With Ray**:

Wrap your training with this:

.. code-block:: python

    ray.init(args.address)

    trainer1 = PyTorchTrainer(
        model_creator,
        data_creator,
        optimizer_creator,
        loss_creator,
        num_replicas=<NUM_GPUS_YOU_HAVE> * <NUM_NODES>,
        use_gpu=True,
        batch_size=512,
        backend="nccl")
    stats = trainer1.train()
    print(stats)
    trainer1.shutdown()
    print("success!")

Then, start a Ray cluster `via autoscaler <autoscaling.html>`_ or `manually <using-ray-on-a-cluster.html>`_.

.. code-block:: bash

    ray up CLUSTER.yaml
    python train.py --address="localhost:<PORT>"
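
In the Python snippet above, ``args.address`` is simply the value of the ``--address`` flag used in this command. One hypothetical way to wire that up with ``argparse``:

.. code-block:: python

    import argparse

    import ray

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--address", type=str, default=None,
        help="Address of the existing Ray cluster to connect to.")
    args = parser.parse_args()

    # If no --address is given, this starts Ray locally instead of
    # connecting to a cluster.
    ray.init(args.address)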

----------

**Before, with PyTorch**:

In your training program, insert the following:

.. code-block:: python

    torch.distributed.init_process_group(
        backend='YOUR BACKEND', init_method='env://')

    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=[args.local_rank], output_device=args.local_rank)

Then, separately, on each machine:

.. code-block:: bash

    # Node 1: (IP: 192.168.1.1, with a free port 1234)
    $ python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
        --nnodes=4 --node_rank=0 --master_addr="192.168.1.1"
        --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
        and all other arguments of your training script)

    # Node 2:
    $ python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
        --nnodes=4 --node_rank=1 --master_addr="192.168.1.1"
        --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
        and all other arguments of your training script)

    # Node 3:
    $ python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
        --nnodes=4 --node_rank=2 --master_addr="192.168.1.1"
        --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
        and all other arguments of your training script)

    # Node 4:
    $ python -m torch.distributed.launch --nproc_per_node=NUM_GPUS_YOU_HAVE
        --nnodes=4 --node_rank=3 --master_addr="192.168.1.1"
        --master_port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
        and all other arguments of your training script)

PyTorchTrainer Example
----------------------

Below is an example of using Ray's PyTorchTrainer. Under the hood, ``PyTorchTrainer`` creates *replicas* of your model (controlled by ``num_replicas``), each of which is managed by a worker.

.. literalinclude:: ../../../python/ray/experimental/sgd/examples/train_example.py
   :language: python
   :start-after: __torch_train_example__
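
Because the distributed setup is captured entirely by the constructor arguments, the same creator functions can be reused to go from local debugging to a multi-node run. A hedged sketch; the replica counts, batch sizes, and backend below are illustrative:

.. code-block:: python

    # Local debugging: a single replica on CPU.
    trainer = PyTorchTrainer(
        model_creator, data_creator, optimizer_creator, loss_creator,
        num_replicas=1, use_gpu=False, batch_size=64)

    # Cluster run: one replica per GPU across the cluster, using NCCL.
    # trainer = PyTorchTrainer(
    #     model_creator, data_creator, optimizer_creator, loss_creator,
    #     num_replicas=<NUM_GPUS_YOU_HAVE> * <NUM_NODES>, use_gpu=True,
    #     batch_size=512, backend="nccl")

    stats = trainer.train()  # runs distributed training and returns training stats
    print(stats)
    trainer.shutdown()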

Hyperparameter Optimization on Distributed PyTorch
--------------------------------------------------

``PyTorchTrainer`` naturally integrates with Tune via the ``PyTorchTrainable`` interface. The same arguments given to ``PyTorchTrainer`` should be passed into ``tune.run(config=...)``, as shown below.

.. literalinclude:: ../../../python/ray/experimental/sgd/examples/tune_example.py
   :language: python
   :start-after: __torch_tune_example__
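
For orientation, a rough sketch of that pattern follows. It assumes ``PyTorchTrainable`` reads the same constructor arguments out of ``config``; the ``tune_example.py`` included above is the exact, working version.

.. code-block:: python

    from ray import tune
    from ray.experimental.sgd.pytorch import PyTorchTrainable

    # These keys mirror the PyTorchTrainer arguments used earlier, reusing the
    # same creator functions; treat the exact set of keys as illustrative.
    config = {
        "model_creator": model_creator,
        "data_creator": data_creator,
        "optimizer_creator": optimizer_creator,
        "loss_creator": loss_creator,
        "num_replicas": 2,
        "use_gpu": False,
        "batch_size": 512,
    }

    tune.run(
        PyTorchTrainable,
        config=config,
        stop={"training_iteration": 2},
        num_samples=1)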

Package Reference
-----------------

.. autoclass:: ray.experimental.sgd.pytorch.PyTorchTrainer
    :members:

    .. automethod:: __init__

.. autoclass:: ray.experimental.sgd.pytorch.PyTorchTrainable
    :members: