.. _sgd-index:

=====================================
RaySGD: Distributed Training Wrappers
=====================================

.. _`issue on GitHub`: https://github.com/ray-project/ray/issues

RaySGD is a lightweight library for distributed deep learning, providing thin wrappers around PyTorch and TensorFlow native modules for data parallel training.

The main features are:

  - **Ease of use**: Scale PyTorch's native ``DistributedDataParallel`` and TensorFlow's ``tf.distribute.MirroredStrategy`` without needing to monitor individual nodes.
  - **Composability**: RaySGD is built on top of the Ray Actor API, so it integrates with existing Ray applications such as RLlib, Tune, and Ray Serve.
  - **Scale up and down**: Start on a single CPU, then scale up to multi-node, multi-CPU, or multi-GPU clusters by changing two lines of code.
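
For readers new to data parallel training, the toy sketch below shows the underlying idea in plain Python: each worker computes a gradient on its own shard of the batch, and the gradients are averaged before the shared weight is updated. It is only an illustration of the concept, not RaySGD's or PyTorch's implementation; the model ``y = w * x``, the data, and the ``gradient`` helper are all made up for this example.

```python
# Toy data-parallel SGD step for the model y = w * x.
# Illustration only: no Ray, PyTorch, or TensorFlow involved.

def gradient(w, shard):
    """Mean gradient of the squared error (w * x - y) ** 2 over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
shards = [data[:2], data[2:]]  # one equal-sized shard per (simulated) worker

w = 0.0
lr = 0.01

# Each worker computes a gradient on its own shard ...
local_grads = [gradient(w, shard) for shard in shards]
# ... and the gradients are averaged before the update (the all-reduce step).
avg_grad = sum(local_grads) / len(local_grads)

# With equal-sized shards, the average matches the full-batch gradient.
full_grad = gradient(w, data)
print(local_grads, avg_grad, full_grad)

# Every worker applies the same update, so the model replicas stay in sync.
w_new = w - lr * avg_grad
```

Because the shards are the same size, the averaged gradient equals the gradient a single process would compute on the whole batch, which is why adding workers does not change the result of a step in this toy setting.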


Getting Started
---------------

The following example defines a custom ``TrainingOperator`` and trains it with a ``TorchTrainer`` on two CPU workers:

.. code-block:: python

    import ray
    from ray.util.sgd import TorchTrainer
    from ray.util.sgd.torch import TrainingOperator
    from ray.util.sgd.torch.examples.train_example import LinearDataset

    import torch
    from torch.utils.data import DataLoader

    class CustomTrainingOperator(TrainingOperator):
        def setup(self, config):
            # Load data. LinearDataset(2, 5) yields samples of y = 2x + 5.
            train_loader = DataLoader(LinearDataset(2, 5), batch_size=config["batch_size"])
            val_loader = DataLoader(LinearDataset(2, 5), batch_size=config["batch_size"])

            # Create model.
            model = torch.nn.Linear(1, 1)

            # Create optimizer.
            optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

            # Create loss.
            loss = torch.nn.MSELoss()

            # Register model, optimizer, and loss.
            self.model, self.optimizer, self.criterion = self.register(
                models=model,
                optimizers=optimizer,
                criterion=loss)

            # Register data loaders.
            self.register_data(train_loader=train_loader, validation_loader=val_loader)


    ray.init()

    trainer1 = TorchTrainer(
        training_operator_cls=CustomTrainingOperator,
        num_workers=2,
        use_gpu=False,
        config={"batch_size": 64})

    stats = trainer1.train()
    print(stats)
    trainer1.shutdown()
    print("success!")
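
The feature list above mentions scaling by changing two lines of code, but this page does not name them. Assuming they are the ``num_workers`` and ``use_gpu`` arguments passed to ``TorchTrainer`` in the example, the sketch below compares a local configuration with a cluster one. ``trainer_kwargs`` is a hypothetical helper written for this illustration, not part of RaySGD, and the worker count of 8 is arbitrary.

```python
# Hypothetical helper (not part of RaySGD): collects the keyword arguments
# used in the TorchTrainer call from the example above.
def trainer_kwargs(num_workers, use_gpu, batch_size=64):
    return {
        "num_workers": num_workers,
        "use_gpu": use_gpu,
        "config": {"batch_size": batch_size},
    }


# Assumption: scaling up only touches num_workers and use_gpu.
local = trainer_kwargs(num_workers=1, use_gpu=False)
cluster = trainer_kwargs(num_workers=8, use_gpu=True)

# List the arguments whose values differ between the two setups.
changed = sorted(key for key in local if local[key] != cluster[key])
print(changed)  # ['num_workers', 'use_gpu']
```

Under the same assumption, either dictionary could be unpacked into the call shown above, for example ``TorchTrainer(training_operator_cls=CustomTrainingOperator, **local)``; the GPU variant would additionally need a Ray cluster that actually has GPUs available.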

.. tip:: Get in touch with us if you're using or considering using `RaySGD <https://forms.gle/26EMwdahdgm7Lscy9>`_!