.. _train-faq:

Ray Train FAQ
=============

How fast is Ray Train compared to PyTorch, TensorFlow, etc.?
------------------------------------------------------------

At its core, training speed should be the same - while Ray Train launches distributed training workers via Ray Actors,
communication during training (e.g. gradient synchronization) is handled by the backend training framework itself.

For example, when running Ray Train with the ``TorchTrainer``,
distributed training communication is done with Torch's ``DistributedDataParallel``.
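
As a rough illustration, here is a minimal sketch of a ``TorchTrainer`` run; ``ray.train.torch.prepare_model`` is the utility that wraps the model in ``DistributedDataParallel``, so gradient synchronization stays inside PyTorch. The toy model, the synthetic data, the training-loop details, and the exact import paths (which vary across Ray versions) are illustrative assumptions, not the only way to set this up.

.. code-block:: python

    import torch
    import torch.nn as nn

    import ray.train.torch
    from ray.air.config import ScalingConfig
    from ray.train.torch import TorchTrainer


    def train_loop_per_worker(config):
        # Illustrative toy model; any nn.Module works here.
        model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
        # prepare_model moves the model to the worker's device and wraps it in
        # DistributedDataParallel, so Torch handles gradient synchronization.
        model = ray.train.torch.prepare_model(model)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        for _ in range(config["epochs"]):
            x, y = torch.randn(32, 4), torch.randn(32, 1)  # synthetic batch
            loss = nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()  # DDP all-reduces gradients across workers here.
            optimizer.step()


    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        train_loop_config={"epochs": 2},
        scaling_config=ScalingConfig(num_workers=2),
    )
    trainer.fit()

The Ray Actors only launch and supervise the worker processes; the gradient all-reduce itself is still performed by ``DistributedDataParallel``.
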

Take a look at the :ref:`PyTorch <pytorch-training-parity>` and :ref:`TensorFlow <tf-training-parity>` benchmarks to check performance parity.

How do I set resources?
-----------------------

By default, each worker will reserve 1 CPU resource, and an additional 1 GPU resource if ``use_gpu=True``.

To override these resource requests or request additional custom resources,
you can initialize the ``Trainer`` with ``resources_per_worker`` specified in ``scaling_config``.

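
For example, a sketch that reserves 2 CPUs and 1 GPU for each of two workers might look like the following; the resource numbers, the placeholder training loop, and the ``ScalingConfig`` import path are illustrative and can differ across Ray versions.

.. code-block:: python

    from ray.air.config import ScalingConfig
    from ray.train.torch import TorchTrainer


    def train_loop_per_worker(config):
        # Placeholder training loop; see the sketch in the previous answer.
        pass


    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        scaling_config=ScalingConfig(
            num_workers=2,
            use_gpu=True,
            # Override the default of 1 CPU per worker: each worker now
            # reserves 2 CPUs alongside its 1 GPU.
            resources_per_worker={"CPU": 2, "GPU": 1},
        ),
    )
    trainer.fit()

Custom resources can be requested the same way, by adding them as extra keys in the ``resources_per_worker`` dict.
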

.. note::

    Some GPU utility functions (e.g. :ref:`train-api-torch-get-device`, :ref:`train-api-torch-prepare-model`)
    currently assume each worker is allocated exactly 1 GPU. The partial GPU and multi GPU use-cases
    can still be run with Ray Train today without these functions.