Distributed SGD (Experimental)
==============================

Ray includes an implementation of synchronous distributed stochastic gradient descent (SGD), which is competitive in performance with implementations in Horovod and Distributed TensorFlow.

Ray SGD is built on top of the Ray task and actor abstractions to provide seamless integration into existing Ray applications.

Interface
---------

To use Ray SGD, define a `model class <https://github.com/ray-project/ray/blob/master/python/ray/experimental/sgd/model.py>`__:

.. autoclass:: ray.experimental.sgd.Model
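
For illustration only, a minimal sketch of what such a model class might look like is shown below. The exact abstract methods are defined in the linked ``model.py``; the method names ``get_loss``, ``get_optimizer``, and ``get_feed_dict`` used here are assumptions, as is the toy TensorFlow graph:

.. code-block:: python

    import numpy as np
    import tensorflow as tf

    from ray.experimental.sgd import Model


    class MyModel(Model):
        """Hypothetical sketch of a Model subclass (method names assumed)."""

        def __init__(self):
            # A toy linear-regression graph; any TensorFlow model works here.
            self.x = tf.placeholder(tf.float32, [None, 10])
            self.y = tf.placeholder(tf.float32, [None, 1])
            pred = tf.layers.dense(self.x, 1)
            self.loss = tf.losses.mean_squared_error(self.y, pred)
            self.optimizer = tf.train.GradientDescentOptimizer(0.01)

        def get_loss(self):
            # Assumed interface method: return the scalar loss tensor.
            return self.loss

        def get_optimizer(self):
            # Assumed interface method: return the tf.train optimizer to use.
            return self.optimizer

        def get_feed_dict(self):
            # Assumed interface method: feed a batch of data for one SGD step.
            return {
                self.x: np.random.uniform(size=(64, 10)),
                self.y: np.random.uniform(size=(64, 1)),
            }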

Then, pass a model creator function to the ``ray.experimental.sgd.DistributedSGD`` class. To drive the distributed training, ``sgd.step()`` can be called repeatedly:

.. code-block:: python

    model_creator = lambda worker_idx, device_idx: YourModelClass()

    sgd = DistributedSGD(
        model_creator,
        num_workers=2,
        devices_per_worker=4,
        gpu=True,
        strategy="ps")

    for i in range(NUM_ITERS):
        sgd.step()

Under the hood, Ray SGD will create *replicas* of your model on each hardware device (GPU) allocated to workers (controlled by ``num_workers``). Multiple devices can be managed by each worker process (controlled by ``devices_per_worker``). Each model instance will be in a separate TF variable scope. The ``DistributedSGD`` class coordinates the distributed computation and application of gradients to improve the model.
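
For example, with the configuration shown above (``num_workers=2`` and ``devices_per_worker=4``), Ray SGD creates 2 * 4 = 8 model replicas in total, one per GPU.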

There are two distributed SGD strategies available for use:

- ``strategy="simple"``: Gradients are averaged centrally on the driver before being applied to each model replica. This is a reference implementation for debugging purposes.

- ``strategy="ps"``: Gradients are computed and averaged within each node. Gradients are then averaged across nodes through a number of parameter server actors. To pipeline the computation of gradients and their transmission across the network, we use a custom TensorFlow op that can read and write to the Ray object store directly.

Note that when ``num_workers=1``, only local allreduce will be used and the choice of distributed strategy is irrelevant.
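
As a minimal sketch, a configuration for debugging with the reference strategy might look like the following (the parameter values are illustrative, and ``model_creator`` is the creator function defined above):

.. code-block:: python

    # Illustrative debugging configuration: the reference "simple" strategy
    # averages gradients centrally on the driver, which is easier to inspect.
    sgd = DistributedSGD(
        model_creator,
        num_workers=2,
        devices_per_worker=1,
        gpu=False,
        strategy="simple")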

The full documentation for ``DistributedSGD`` is as follows:

.. autoclass:: ray.experimental.sgd.DistributedSGD

Examples
--------

For examples of end-to-end usage, check out the `ImageNet synthetic data test <https://github.com/ray-project/ray/blob/master/python/ray/experimental/sgd/test_sgd.py>`__ and also the simple `MNIST training example <https://github.com/ray-project/ray/blob/master/python/ray/experimental/sgd/mnist_example.py>`__, which includes examples of how to access the model weights and monitor accuracy as training progresses.
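
A rough sketch of such a monitoring loop is shown below; the ``foreach_model`` accessor name is an assumption for illustration (check the ``DistributedSGD`` API above and the MNIST example for the actual accessors), and ``compute_accuracy`` stands in for a user-defined evaluation function:

.. code-block:: python

    for i in range(NUM_ITERS):
        sgd.step()
        if i % 10 == 0:
            # Hypothetical: apply a user-defined evaluation function to each
            # model replica and report the per-replica results.
            accuracies = sgd.foreach_model(lambda model: compute_accuracy(model))
            print("iter", i, "accuracy per replica:", accuracies)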

Performance
-----------

When using the new Ray backend (which will be enabled by default in Ray 0.6+), we `expect <https://github.com/ray-project/ray/pull/3033>`__ performance competitive with other synchronous SGD implementations on 25Gbps Ethernet.

.. figure:: sgd.png
    :width: 756px

    Images per second reached when distributing the training of a ResNet-101 TensorFlow model (from the official TF benchmark). All experiments were run on p3.16xl instances connected by 25Gbps Ethernet, and workers were allocated 4 GPUs per node, as in the Horovod benchmark.