RaySGD: Distributed Training Wrappers
=====================================

RaySGD is a lightweight library for distributed deep learning, providing thin wrappers around PyTorch and TensorFlow native modules for data parallel training.

The main features are:

- **Ease of use**: Scale PyTorch's native ``DistributedDataParallel`` and TensorFlow's ``tf.distribute.MirroredStrategy`` without needing to monitor individual nodes.
- **Composability**: RaySGD is built on top of the Ray Actor API, enabling seamless integration with existing Ray applications such as RLlib, Tune, and Ray Serve.
- **Scale up and down**: Start on a single CPU. Scale up to multi-node, multi-CPU, or multi-GPU clusters by changing two lines of code, as shown in the sketch after this list.
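
For instance, the two-line change amounts to different arguments to the ``TorchTrainer`` constructor introduced in Getting Started below. This is a sketch reusing the creator functions defined there; the worker count of 100 is illustrative, not prescriptive:

.. code-block:: python

    # Single machine, CPU only:
    trainer = TorchTrainer(
        model_creator=model_creator,
        data_creator=data_creator,
        optimizer_creator=optimizer_creator,
        loss_creator=torch.nn.MSELoss,
        num_workers=1,
        use_gpu=False)

    # Multi-node, multi-GPU cluster -- only these two lines change:
    trainer = TorchTrainer(
        model_creator=model_creator,
        data_creator=data_creator,
        optimizer_creator=optimizer_creator,
        loss_creator=torch.nn.MSELoss,
        num_workers=100,
        use_gpu=True)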

.. tip:: Join our `community Slack <https://forms.gle/9TSdDYUgxYs8SA9e8>`_ to discuss Ray!

Getting Started
---------------

You can start a ``TorchTrainer`` with the following:

.. code-block:: python

    import ray
    import torch
    from torch.utils.data import DataLoader

    from ray.util.sgd import TorchTrainer
    from ray.util.sgd.torch.examples.train_example import LinearDataset


    def model_creator(config):
        """Returns the model to be distributed across workers."""
        return torch.nn.Linear(1, 1)


    def optimizer_creator(model, config):
        """Returns an optimizer for the given model."""
        return torch.optim.SGD(model.parameters(), lr=1e-2)


    def data_creator(config):
        """Returns training and validation data loaders."""
        train_loader = DataLoader(LinearDataset(2, 5), config["batch_size"])
        val_loader = DataLoader(LinearDataset(2, 5), config["batch_size"])
        return train_loader, val_loader


    ray.init()

    trainer1 = TorchTrainer(
        model_creator=model_creator,
        data_creator=data_creator,
        optimizer_creator=optimizer_creator,
        loss_creator=torch.nn.MSELoss,
        num_workers=2,
        use_gpu=False,
        config={"batch_size": 64})

    stats = trainer1.train()
    print(stats)
    trainer1.shutdown()
    print("success!")

.. tip:: Get in touch with us if you're using or considering using `RaySGD <https://forms.gle/26EMwdahdgm7Lscy9>`_!