RaySGD TensorFlow
=================

RaySGD's ``TFTrainer`` simplifies distributed model training for TensorFlow. The ``TFTrainer`` is a wrapper around ``MultiWorkerMirroredStrategy`` with a Python API that makes it easy to incorporate distributed training into a larger Python application, as opposed to writing custom logic for setting up environment variables and starting separate processes.

.. important:: This API has only been tested with TensorFlow 2.0rc and is still highly experimental. Please file bug reports if you run into any - thanks!

.. tip:: We need your feedback! RaySGD is currently early in its development, and we're hoping to get feedback from people using or considering it. We'd love `to get in touch <https://forms.gle/26EMwdahdgm7Lscy9>`_!

----------

**With Ray**:

Wrap your training with this:

.. code-block:: python

    import ray
    # The exact import path depends on your Ray version; at the time of
    # writing, TFTrainer lives under the experimental SGD package.
    from ray.experimental.sgd.tf import TFTrainer

    ray.init(address=args.address)

    trainer = TFTrainer(
        model_creator=model_creator,
        data_creator=data_creator,
        num_replicas=4,
        use_gpu=True,
        verbose=True,
        config={
            "fit_config": {
                "steps_per_epoch": num_train_steps,
            },
            "evaluate_config": {
                "steps": num_eval_steps,
            }
        })
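
The snippet above assumes ``model_creator`` and ``data_creator`` are defined elsewhere in your script and continues from the constructed ``trainer``. As a rough sketch (the exact signatures here are an assumption; see the full example at the bottom of this page for the canonical version), they are plain functions that take the trainer ``config`` and return a compiled Keras model and a pair of ``tf.data`` datasets, and training is then driven from the process that created the trainer:

.. code-block:: python

    import tensorflow as tf

    def model_creator(config):
        # Build and compile a tf.keras model; TFTrainer replicates it on
        # each worker under MultiWorkerMirroredStrategy.
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(10, activation="relu", input_shape=(4,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")
        return model

    def data_creator(config):
        # Return (train_dataset, validation_dataset) as tf.data.Dataset objects.
        x = tf.random.uniform((128, 4))
        y = tf.random.uniform((128, 1))
        train = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)
        val = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)
        return train, val

    # Once the trainer is constructed, training and evaluation are one call each.
    train_stats = trainer.train()
    val_stats = trainer.validate()
    trainer.shutdown()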

Then, start a Ray cluster `via autoscaler <autoscaling.html>`_ or `manually <using-ray-on-a-cluster.html>`_.

.. code-block:: bash

    ray up CLUSTER.yaml
    python train.py --address="localhost:<PORT>"
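
The Python snippet above reads the cluster address from ``args.address``. How that address reaches your script is up to you; a minimal sketch using ``argparse`` (the ``--address`` flag name is just a convention, matching the command line above):

.. code-block:: python

    import argparse

    # Parse the address of the running Ray cluster from the command line.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--address",
        type=str,
        default=None,
        help="Address of the Ray cluster, e.g. localhost:6379. "
             "If omitted, ray.init starts a local Ray instance instead.")
    args = parser.parse_args()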

----------

**Before, with TensorFlow**:

In your training program, insert the following, and **customize** it for each worker:

.. code-block:: python

    import json
    import os

    os.environ['TF_CONFIG'] = json.dumps({
        'cluster': {
            'worker': ["localhost:12345", "localhost:23456"]
        },
        'task': {'type': 'worker', 'index': 0}
    })

    ...

    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    with strategy.scope():
        multi_worker_model = model_creator()

And on each machine, launch a separate process that contains the index of the worker and information about all other nodes of the cluster.
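
The key point is that the ``'index'`` field must differ on every machine, while the ``'cluster'`` description stays the same. A minimal sketch of that per-worker customization (the ``build_tf_config`` helper and ``WORKER_INDEX`` environment variable are hypothetical, just to illustrate the bookkeeping that ``TFTrainer`` handles for you):

.. code-block:: python

    import json
    import os

    WORKERS = ["localhost:12345", "localhost:23456"]

    def build_tf_config(worker_index):
        # Same cluster layout on every machine; only the task index changes.
        return json.dumps({
            "cluster": {"worker": WORKERS},
            "task": {"type": "worker", "index": worker_index},
        })

    # e.g. launched as: WORKER_INDEX=0 python train.py  (and WORKER_INDEX=1
    # on the second machine), before any TensorFlow code runs.
    os.environ["TF_CONFIG"] = build_tf_config(int(os.environ["WORKER_INDEX"]))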

TFTrainer Example
-----------------

Below is an example of using Ray's TFTrainer. Under the hood, ``TFTrainer`` will create *replicas* of your model (controlled by ``num_replicas``), each of which is managed by a worker.

.. literalinclude:: ../../../python/ray/experimental/sgd/tf/examples/tensorflow_train_example.py
   :language: python