RaySGD: Distributed Deep Learning
=================================

.. image:: raysgdlogo.png
   :scale: 20%
   :align: center

RaySGD is a lightweight library for distributed deep learning, providing thin wrappers around PyTorch and TensorFlow native modules for data parallel training.

The main features are:

- **Ease of use**: Scale PyTorch's native ``DistributedDataParallel`` and TensorFlow's ``tf.distribute.MirroredStrategy`` without needing to monitor individual nodes.
- **Composability**: RaySGD is built on top of the Ray Actor API, enabling seamless integration with existing Ray applications such as RLlib, Tune, and Ray Serve.
- **Scale up and down**: Start on a single CPU. Scale up to multi-node, multi-CPU, or multi-GPU clusters by changing two lines of code (see the scaling sketch at the end of this page).

.. tip:: We need your feedback! RaySGD is currently early in its development, and we're hoping to hear from people using or considering it. We'd love to `get in touch <https://forms.gle/26EMwdahdgm7Lscy9>`_!

Getting Started
---------------

You can start a ``PyTorchTrainer`` with the following:

.. code-block:: python

    import torch
    import torch.nn as nn

    import ray
    from ray.experimental.sgd import PyTorchTrainer
    from ray.experimental.sgd.examples.train_example import LinearDataset


    def model_creator(config):
        """Returns the model to train."""
        return nn.Linear(1, 1)


    def optimizer_creator(model, config):
        """Returns optimizer."""
        return torch.optim.SGD(model.parameters(), lr=1e-2)


    def data_creator(batch_size, config):
        """Returns training dataloader, validation dataloader."""
        return LinearDataset(2, 5), LinearDataset(2, 5, size=400)


    ray.init()

    trainer1 = PyTorchTrainer(
        model_creator,
        data_creator,
        optimizer_creator,
        loss_creator=nn.MSELoss,
        num_replicas=2,
        use_gpu=True,
        batch_size=512,
        backend="nccl")
    stats = trainer1.train()
    print(stats)

    trainer1.shutdown()
    print("success!")