AIR builds its training data pipeline on :ref:`Ray Datasets <datasets>`, which is a scalable, framework-agnostic data loading and preprocessing library. Datasets enables AIR to seamlessly load data for local and distributed training with Train.
**Training**: Then, AIR passes the preprocessed dataset to Train workers (Ray actors) launched by the Trainer. Each worker calls ``get_dataset_shard`` to get a handle to its assigned data shard, and then calls one of ``iter_batches``, ``iter_torch_batches``, or ``iter_tf_batches`` to loop over the data.
The following is a simple example of how to configure ingest for a dummy ``TorchTrainer``. Below, we are passing a small tensor dataset to the Trainer via the ``datasets`` argument. In the Trainer's ``train_loop_per_worker``, we access the preprocessed dataset using ``get_dataset_shard()``.
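A minimal sketch of what that example can look like is shown below (assuming Ray AIR's ``TorchTrainer``; the dataset size, batch size, and worker count are illustrative placeholders):

.. code-block:: python

    import ray
    from ray.air import session, ScalingConfig
    from ray.train.torch import TorchTrainer

    # A small in-memory tensor dataset; a real workload would use ray.data.read_*().
    train_ds = ray.data.range_tensor(1000)

    def train_loop_per_worker():
        # Get the shard of the "train" dataset assigned to this worker.
        data_shard = session.get_dataset_shard("train")
        # Loop over the shard in batches. iter_torch_batches() / iter_tf_batches()
        # are the framework-specific equivalents.
        for batch in data_shard.iter_batches(batch_size=128):
            pass  # Replace with your forward/backward pass.

    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=2),
        datasets={"train": train_ds},
    )
    trainer.fit()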
Shuffling or data randomization is important for training high-quality models. By default, AIR randomizes the order in which data files (blocks) are read. AIR also offers options for further randomizing data records within each file:
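As a sketch of two common approaches (exact option names may vary across Ray versions): a full global shuffle of all records, and a local shuffle buffer applied while iterating inside the training loop:

.. code-block:: python

    import ray

    ds = ray.data.range_tensor(1000)

    # Option 1: global shuffle of all records (most thorough, but most expensive).
    shuffled_ds = ds.random_shuffle()

    # Option 2: local shuffle while iterating, using a fixed-size shuffle buffer.
    # This runs inside each worker's training loop.
    for batch in ds.iter_batches(batch_size=128, local_shuffle_buffer_size=10_000):
        pass  # Replace with your training step.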
You can use the ``DatasetConfig`` object to configure how Datasets are preprocessed and split across training workers. Each ``DataParallelTrainer`` has a default ``_dataset_config`` class field. It is a mapping
from dataset names to ``DatasetConfig`` objects, and implements the default behavior described in the :ref:`overview <ingest_basics>`:
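As an illustrative sketch of such a mapping (``fit`` and ``split`` are ``DatasetConfig`` fields; the exact defaults are defined on each Trainer class):

.. code-block:: python

    from ray.air.config import DatasetConfig

    # A per-dataset configuration mapping, keyed by dataset name.
    # "*" acts as a fallback for any dataset not listed explicitly.
    dataset_config = {
        "train": DatasetConfig(fit=True, split=True),    # fit the preprocessor and shard across workers
        "*": DatasetConfig(fit=False, split=False),      # other datasets are transformed but not split
    }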
.. tabbed:: Example: Disable Transform on Aux Dataset

    This example shows overriding the transform config for the "side" dataset. This means that
    the original dataset will be returned by ``.get_dataset_shard("side")``.

    .. literalinclude:: doc_code/air_ingest.py
        :language: python
        :start-after: __config_2__
        :end-before: __config_2_end__
Dataset Resources
~~~~~~~~~~~~~~~~~
Datasets uses Ray tasks to execute data processing operations. These tasks use CPU resources in the cluster during execution, which may compete with resources needed for Training.
.. tabbed:: Unreserved CPUs

    By default, Dataset tasks use cluster CPU resources for execution. This can sometimes
    conflict with Trainer resource requests. For example, if Trainers allocate all CPU resources
    in the cluster, then no Datasets tasks can run.

    .. literalinclude:: ./doc_code/air_ingest.py
        :language: python
        :start-after: __resource_allocation_1_begin__
        :end-before: __resource_allocation_1_end__

    Unreserved CPUs work well when:

    * you are running only one Trainer and the cluster has enough CPUs; or
    * your Trainers are configured to use GPUs and not CPUs.
.. tabbed:: Using Reserved CPUs (experimental)

    The ``_max_cpu_fraction_per_node`` option can be used to exclude CPUs from placement
    group scheduling. In the example below, setting this parameter to ``0.8`` enables Tune
    trials to run smoothly without risk of deadlock by reserving 20% of node CPUs for
    Dataset execution.

    .. literalinclude:: ./doc_code/air_ingest.py
        :language: python
        :start-after: __resource_allocation_2_begin__
        :end-before: __resource_allocation_2_end__

    You should use reserved CPUs when:

    * you are running multiple concurrent CPU Trainers using Tune; or
    * you want to ensure predictable Datasets performance.

    .. warning::

        ``_max_cpu_fraction_per_node`` is experimental and not currently recommended for use with
        autoscaling clusters (scale-up will not trigger properly).
So why was the data ingest only 116MiB/s above? That's sufficient for many models, but one would expect
higher throughput given that the trainer is doing nothing except reading the data. Based on the stats above,
there was no object spilling, but there was a high batch delay.
Perhaps AIR was spending too much time loading blocks from other machines, since we were using a
multi-node cluster. We can test this by setting ``prefetch_blocks=10`` to prefetch blocks more
aggressively and rerunning training.
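A sketch of where this setting goes, assuming the training loop iterates with ``iter_batches`` (the batch size here is a placeholder):

.. code-block:: python

    from ray.air import session

    def train_loop_per_worker():
        data_shard = session.get_dataset_shard("train")
        # Prefetch up to 10 blocks ahead of the current batch to hide the
        # latency of fetching blocks from other nodes.
        for batch in data_shard.iter_batches(batch_size=4096, prefetch_blocks=10):
            pass  # Training step goes here.

Rerunning training with this change produces much better stats: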
.. code::

    P50/P95/Max batch delay (s)     0.0006792084998323844 0.0009853049503362856 0.12657493300002898
    Num epochs read                 47
    Num batches read                4700
    Num bytes read                  458984.95 MiB
    Mean throughput                 15136.18 MiB/s
That's much better! Now we can see that our DummyTrainer is ingesting data at a rate of 15000MiB/s
and reading through many more epochs of training. This high throughput indicates
that all of the data fit into memory on a single node.
Going from DummyTrainer to your real Trainer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Once you're happy with the ingest performance of the DummyTrainer with synthetic data, the next step is to adapt it for your real workload. This involves:
* **Scaling the DummyTrainer**: Change the scaling config of the DummyTrainer and the cluster configuration to reflect your target workload.
* **Switching the Dataset**: Change the dataset from synthetic tensor data to reading your real dataset.
* **Switching the Trainer**: Swap the DummyTrainer with your real Trainer.
Switching these components one by one allows performance problems to be easily isolated and reproduced.
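For example, the dataset switch might look like the following sketch (the S3 path is a placeholder):

.. code-block:: python

    import ray

    # Before: synthetic tensor data used for the ingest benchmark.
    synthetic_ds = ray.data.range_tensor(50_000, shape=(80,))

    # After: read your real dataset instead (any ray.data.read_* API works here).
    real_ds = ray.data.read_parquet("s3://your-bucket/your-dataset/")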
Performance Tips
----------------
**Memory availability**: To maximize ingest performance, consider using machines with enough memory to fit the entire dataset. This avoids the need for disk spilling, streamed ingest, or fetching data across the network. As a rule of thumb, a Ray cluster with fewer, larger nodes will outperform a cluster with many smaller nodes due to better memory locality.
**Autoscaling**: We generally recommend first trying out AIR training with a fixed-size cluster. This makes it easier to understand and debug issues. Once you are happy with performance, you can enable autoscaling, for example to scale out experiment sweeps with Tune. We also recommend starting with a single node type; autoscaling with heterogeneous clusters can optimize costs, but may complicate performance debugging.
**Partitioning**: By default, Datasets will automatically select the read parallelism based on the current cluster size and number of files. If you run into out-of-memory errors during preprocessing, consider increasing the number of blocks to reduce their size. To increase the max number of partitions, you can manually set the ``parallelism`` option when calling ``ray.data.read_*()``. To change the number of partitions at runtime, use ``ds.repartition(N)``. As a rule of thumb, blocks should be no more than 1-2GiB each.
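A sketch of both knobs (the path and partition count are placeholders):

.. code-block:: python

    import ray

    # Increase the maximum number of read partitions (blocks) up front.
    ds = ray.data.read_parquet("s3://your-bucket/your-dataset/", parallelism=1000)

    # Or change the number of blocks after the initial read.
    ds = ds.repartition(1000)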
When you pass Datasets to a Tuner, they are executed independently for each trial, which can duplicate data reads in the cluster. To share Dataset blocks between trials, call ``ds = ds.fully_executed()`` before passing the Dataset to the Tuner. This ensures that the initial read operation is not repeated per trial.
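A sketch of this pattern (the trainer here is a dummy setup like the one shown earlier; dataset and scaling values are placeholders):

.. code-block:: python

    import ray
    from ray.air import session, ScalingConfig
    from ray.train.torch import TorchTrainer
    from ray.tune import Tuner

    ds = ray.data.range_tensor(1000)
    # Materialize the dataset once so all Tune trials share the same blocks
    # instead of repeating the initial read per trial.
    ds = ds.fully_executed()

    def train_loop_per_worker():
        shard = session.get_dataset_shard("train")
        for batch in shard.iter_batches(batch_size=128):
            pass  # Training step goes here.

    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        scaling_config=ScalingConfig(num_workers=2),
        datasets={"train": ds},
    )
    Tuner(trainer).fit()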