.. _train-faq:
Ray Train FAQ
=============
How fast is Ray Train compared to PyTorch, TensorFlow, etc.?
------------------------------------------------------------
At its core, training speed should be the same: while Ray Train launches distributed training workers via Ray Actors,
communication during training (e.g. gradient synchronization) is handled by the backend training framework itself.
For example, when running Ray Train with the ``TorchTrainer``,
distributed training communication is done with Torch's ``DistributedDataParallel``.
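
As an illustration, here is a minimal sketch assuming the AIR-style API used elsewhere on this page (``TorchTrainer``, ``ScalingConfig``, and ``session.report()``). Ray Train only launches and supervises the workers; the gradient all-reduce in ``backward()`` is performed by PyTorch's ``DistributedDataParallel``.

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    from ray.air import session
    from ray.air.config import ScalingConfig
    from ray.train.torch import TorchTrainer, prepare_model


    def train_loop_per_worker():
        # prepare_model() wraps the model in DistributedDataParallel, so
        # communication during training is handled by PyTorch itself.
        model = prepare_model(nn.Linear(4, 1))
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

        for epoch in range(3):
            x, y = torch.randn(8, 4), torch.randn(8, 1)
            loss = F.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()  # gradients are synchronized by DDP, not by Ray
            optimizer.step()
            session.report({"epoch": epoch, "loss": loss.item()})


    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    )
    result = trainer.fit()
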
How do I set resources?
-----------------------
By default, each worker will reserve 1 CPU resource, and an additional 1 GPU resource if ``use_gpu=True``.
To override these resource requests or request additional custom resources,
you can initialize the ``Trainer`` with ``resources_per_worker`` specified in ``scaling_config``.

.. note::

    Some GPU utility functions (e.g. :ref:`train-api-torch-get-device`, :ref:`train-api-torch-prepare-model`)
    currently assume each worker is allocated exactly 1 GPU. The partial GPU and multi GPU use-cases
    can still be run with Ray Train today without these functions.
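
For example, a minimal sketch (assuming the ``ScalingConfig`` from ``ray.air.config``; the resource values below are purely illustrative):

.. code-block:: python

    from ray.air.config import ScalingConfig
    from ray.train.torch import TorchTrainer

    trainer = TorchTrainer(
        # Reuses the train_loop_per_worker defined in the previous example.
        train_loop_per_worker=train_loop_per_worker,
        scaling_config=ScalingConfig(
            num_workers=4,
            use_gpu=True,
            # Override the default of 1 CPU (plus 1 GPU when use_gpu=True)
            # per worker, e.g. to share a GPU between two workers.
            resources_per_worker={"CPU": 2, "GPU": 0.5},
        ),
    )
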
How can I use Matplotlib with Ray Train?
-----------------------------------------
If you try to create a Matplotlib plot in the training function, you may encounter an error:

.. code-block::

    UserWarning: Starting a Matplotlib GUI outside of the main thread will likely fail.

To handle this, consider the following approaches:
1. If the plotting logic does not depend on any code in your training function, simply move the Matplotlib logic out and execute it before or after ``trainer.fit()`` (see the sketch after this list).
2. If you are plotting metrics, you can pass the metrics via ``session.report()`` and create a :ref:`custom callback <train-custom-callbacks>` to plot the results.
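
For approach 1, here is a minimal sketch (reusing the ``trainer`` from the first example above, and assuming the metrics reported via ``session.report()`` are available on the returned result as ``metrics_dataframe``):

.. code-block:: python

    import matplotlib.pyplot as plt

    # Run training first. Each session.report({"epoch": ..., "loss": ...})
    # call inside the training loop becomes a row in result.metrics_dataframe.
    result = trainer.fit()

    # Plot on the driver's main thread, outside of the training function.
    df = result.metrics_dataframe
    plt.plot(df["epoch"], df["loss"])
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.savefig("loss.png")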