.. _datasets_faq:

===
FAQ
===

These are some Frequently Asked Questions that we've seen pop up for Ray Datasets.

.. note::

    For a general conceptual overview of Ray Datasets, see our :ref:`Key Concepts docs `.

    If you still have questions after reading this FAQ, please reach out on `our Discourse `__!

.. contents::
    :local:
    :depth: 2

What problems does Ray Datasets solve?
======================================

Ray Datasets aims to solve the problems of slow, resource-inefficient, unscalable data loading and preprocessing pipelines for two core use cases:

1. **Model training:** resulting in poor training throughput and low GPU utilization as the trainers are bottlenecked on preprocessing and data loading.
2. **Batch inference:** resulting in poor batch inference throughput and low GPU utilization.

In order to solve these problems without sacrificing usability, Ray Datasets simplifies parallel and pipelined data processing on Ray, providing a higher-level API while internally handling data batching, task parallelism and pipelining, and memory management.

Who is using Ray Datasets?
==========================

To give an idea of Datasets use cases, we list a few notable users running Datasets integrations in production below:

* Predibase is using Ray Datasets for ML ingest and batch inference in their OSS declarative ML framework, `Ludwig `__, and internally in their `AutoML product `__.
* Amazon is using Ray Datasets for large-scale I/O in their scalable data catalog, `DeltaCAT `__.
* Shopify is using Ray Datasets for ML ingest and batch inference in their ML platform, `Merlin `__.
* Ray Datasets is used as the data processing engine for the `Ray-based Apache Beam runner `__.
* Ray Datasets is used as the preprocessing and batch inference engine for :ref:`Ray AIR `.

If you're using Ray Datasets, please let us know about your experience on the `Slack `__ or `Discourse `__; we'd love to hear from you!

What should I use Ray Datasets for?
===================================

Ray Datasets is the standard way to load, process, and exchange data in Ray libraries and applications, with a particular emphasis on ease-of-use, performance, and scalability in both data size and cluster size. Within that, Datasets is designed for two core use cases:

* **ML (training) ingest:** Loading, preprocessing, and ingesting data into one or more (possibly distributed) model trainers.
* **Batch inference:** Loading, preprocessing, and performing parallel batch inference on data.

We have designed the Datasets APIs, data model, execution model, and integrations with these use cases in mind, and have captured these use cases in large-scale nightly tests to ensure that we're hitting our scalability, performance, and efficiency marks. See our :ref:`ML preprocessing feature guide ` for more information on this positioning.

What should I not use Ray Datasets for?
=======================================

Ray Datasets is not meant to be used for generic ETL pipelines (like Spark) or scalable data science (like Dask, Modin, or Mars). However, each of these frameworks is :ref:`runnable on Ray `, and Datasets integrates tightly with them, allowing for efficient exchange of distributed data partitions, often with zero-copy. Check out the :ref:`dataset creation feature guide ` to learn more about these integrations.
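As a rough sketch of what that interchange can look like (the toy DataFrame, column names, and scaling below are made up for illustration, and this assumes the Dask-on-Ray integration is available), you might hand a Dask DataFrame to Datasets for some last-mile preprocessing and then convert the result back:

.. code-block:: python

    import dask.dataframe as dd
    import pandas as pd
    import ray
    from ray.util.dask import enable_dask_on_ray

    ray.init()
    enable_dask_on_ray()  # Route Dask computations through Ray's scheduler.

    # A small Dask DataFrame standing in for the output of an upstream ETL job.
    ddf = dd.from_pandas(
        pd.DataFrame({"feature": range(1000), "label": [0, 1] * 500}),
        npartitions=4,
    )

    # Hand the distributed partitions over to Ray Datasets...
    ds = ray.data.from_dask(ddf)

    # ...do some last-mile preprocessing...
    ds = ds.map_batches(
        lambda df: df.assign(feature=df["feature"] / 1000.0),
        batch_format="pandas",
    )

    # ...and convert back if a downstream step expects a Dask DataFrame.
    ddf_out = ds.to_dask()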
Datasets specifically targets the ML ingest and batch inference use cases, with a focus on data loading and last-mile preprocessing for ML pipelines. For more information on this distinction, what we mean by last-mile preprocessing, and how Ray Datasets fits into a larger ML pipeline picture, please see our :ref:`ML preprocessing feature guide `.

For data loading for training, how does Ray Datasets compare to other solutions?
==================================================================================

There are several ML framework-specific and general solutions for loading data into model trainers. Below, we summarize some advantages Datasets offers over these more specialized ingest frameworks, with a short ingest sketch after the Torch and TensorFlow comparisons.

Torch datasets (and data loaders)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* **Framework-agnostic:** Datasets is framework-agnostic and portable between different distributed training frameworks, while `Torch datasets `__ are specific to Torch.
* **No built-in I/O layer:** Torch datasets do not have an I/O layer for common file formats or in-memory exchange with other frameworks; users need to bring in other libraries and roll this integration themselves.
* **Generic distributed data processing:** Datasets is more general: it can handle generic distributed operations, including global per-epoch shuffling, which would otherwise have to be implemented by stitching together two separate systems. Torch datasets would require such stitching for anything more involved than batch-based preprocessing, and they do not natively support shuffling across worker shards. See our `blog post `__ on why this shared infrastructure is important for 3rd generation ML architectures.
* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between processes, in contrast to the multi-processing-based pipelines of Torch datasets.

TensorFlow datasets
~~~~~~~~~~~~~~~~~~~

* **Framework-agnostic:** Datasets is framework-agnostic and portable between different distributed training frameworks, while `TensorFlow datasets `__ is specific to TensorFlow.
* **Unified single-node and distributed:** Datasets unifies single- and multi-node training under the same abstraction. TensorFlow datasets presents `separate concepts `__ for distributed data loading and prevents code from being seamlessly scaled to larger clusters.
* **Eager execution:** Datasets executes operations eagerly by default, while TensorFlow datasets are lazy by default. The former makes iterative development and debugging easier; when you need the optimizations that lazy execution enables, Ray Datasets has a lazy execution mode that you can turn on when productionizing your integration.
* **Generic distributed data processing:** Datasets is more general: it can handle generic distributed operations, including global per-epoch shuffling, which would otherwise have to be implemented by stitching together two separate systems. TensorFlow datasets would require such stitching for anything more involved than basic preprocessing, and it does not natively support full shuffling across worker shards; only file interleaving is supported. See our `blog post `__ on why this shared infrastructure is important for 3rd generation ML architectures.
* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between processes, in contrast to the multi-processing-based pipelines of TensorFlow datasets.
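To make the framework-agnostic point above concrete, here is a minimal sketch of feeding a ``Dataset`` into a Torch training loop; the in-memory DataFrame and column names are placeholders, and the same dataset could instead be consumed from TensorFlow via ``ds.to_tf()`` without changing the loading code:

.. code-block:: python

    import pandas as pd
    import ray

    ray.init()

    # Placeholder tabular data; in practice this would come from a read such as
    # ray.data.read_parquet(...).
    df = pd.DataFrame(
        {"x": [float(i) for i in range(128)], "label": [i % 2 for i in range(128)]}
    )
    ds = ray.data.from_pandas(df)

    # Wrap the Dataset as a Torch IterableDataset yielding (features, label) batches.
    torch_ds = ds.to_torch(label_column="label", batch_size=16)

    for features, labels in torch_ds:
        pass  # A training step (forward/backward pass) would go here.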
Petastorm
~~~~~~~~~

* **Supported data types:** `Petastorm `__ only supports Parquet data, while Ray Datasets supports many file formats.
* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between processes, in contrast to the multi-processing-based pipelines used by Petastorm.
* **No data processing:** Petastorm does not expose any data processing APIs.

NVTabular
~~~~~~~~~

* **Supported data types:** `NVTabular `__ only supports tabular (Parquet, CSV, Avro) data, while Ray Datasets supports many other file formats.
* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between processes, in contrast to the multi-processing-based pipelines used by NVTabular.
* **Heterogeneous compute:** NVTabular doesn't support mixing heterogeneous resources in dataset transforms (e.g. both CPU and GPU transformations), while Ray Datasets supports this.
* **ML-specific ops:** NVTabular offers a rich set of ML-specific preprocessing operations; the equivalent for Ray Datasets, :ref:`Ray AIR preprocessors `, is currently a work in progress.

For batch (offline) inference, why should I use Ray Datasets instead of an actor pool?
=======================================================================================

Ray Datasets provides its own autoscaling actor pool via the actor compute strategy for :meth:`ds.map_batches() `, allowing you to perform CPU- or GPU-based batch inference on this actor pool. Using this instead of the `Ray actor pool `__ has a few advantages (a usage sketch follows this list):

* The Ray Datasets actor pool is autoscaling and supports easy-to-configure task dependency prefetching, pipelining data transfer with compute.
* Ray Datasets takes care of orchestrating the tasks, batching the data, and managing the memory.
* With :ref:`Ray Datasets pipelining `, you can precisely configure how preprocessing is pipelined with batch inference, making it easy to tweak parallelism vs. pipelining to maximize your GPU utilization.
* Ray Datasets provides a broad and performant I/O layer, which you would otherwise have to roll yourself.
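Here is a minimal sketch of the actor compute strategy; ``BatchPredictor`` and its trivial ``bias`` "model" are stand-ins for real model loading and inference, and you could additionally pass ``num_gpus=1`` to reserve a GPU for each actor:

.. code-block:: python

    import pandas as pd
    import ray

    ray.init()

    class BatchPredictor:
        def __init__(self):
            # Expensive setup (e.g. loading model weights) happens once per actor.
            self.bias = 1.0

        def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
            # "Inference" on one batch; a real model's forward pass would go here.
            batch["prediction"] = batch["x"] + self.bias
            return batch

    ds = ray.data.from_pandas(pd.DataFrame({"x": [float(i) for i in range(1000)]}))

    # compute="actors" runs the callable class on Datasets' autoscaling actor pool.
    predictions = ds.map_batches(
        BatchPredictor,
        compute="actors",
        batch_size=256,
        batch_format="pandas",
    )

Because the callable class is constructed once per actor, the model stays loaded across batches instead of being re-initialized for every task.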
How fast is Ray Datasets?
=========================

We're still working on open benchmarks, but we've done some benchmarking on synthetic data, and we've helped several users port from solutions built on Petastorm, the Torch multiprocessing data loader, and TensorFlow datasets; after moving to Ray Datasets, these users saw large training throughput improvements (4-8x) as well as model accuracy improvements (due to global per-epoch shuffling). Please see our `recent blog post on Ray Datasets `__ for more information on this benchmarking.

Does all of my data need to fit into memory?
============================================

No. With Ray's support for :ref:`spilling objects to disk `, you only need to be able to fit your data in memory or on disk. However, keeping your data in distributed memory may speed up your workload; this can be done for arbitrarily large datasets by windowing them, creating :ref:`pipelines `.

How much data can Ray Datasets handle?
======================================

Ray Datasets has been tested at multi-petabyte scale for I/O and multi-terabyte scale for shuffling, and we're continuously working on improving this scalability. If you have a very large dataset that you'd like to process and you're running into scalability issues, please reach out to us on our `Discourse `__.

How do I get my data into Ray Datasets?
=======================================

Ray Datasets supports creating a ``Dataset`` from local and distributed in-memory data via integrations with common data libraries, as well as from local and remote storage systems via our support for many common file formats and storage backends. Check out our :ref:`feature guide for creating datasets ` for details.

.. _datasets_streaming_faq:

How do I do streaming/online data loading and processing?
==========================================================

Streaming data loading and data processing can be accomplished by using :ref:`DatasetPipelines `. By windowing a dataset, you can stream data transformations across subsets of the data, even windowing down to the reading of each file. See the :ref:`pipelining feature guide ` for more information.

When should I use :ref:`pipelining `?
===============================================================

Pipelining is useful in a few scenarios:

* You have two chained operations using different resources (e.g. CPU and GPU) that you want to saturate; this is the case for both ML ingest (CPU-based preprocessing and GPU-based training) and batch inference (CPU-based preprocessing and GPU-based batch inference).
* You want to do streaming data loading and processing in order to keep the size of the working set small; see the previous FAQ on :ref:`how to do streaming data loading and processing `.
* You want to decrease the time-to-first-batch (latency) for a certain operation at the end of your workload. This is the case for training and inference, since it prevents GPUs from sitting idle (which is costly), and it can be advantageous for other latency-sensitive consumers of datasets.

When should I use global per-epoch shuffling?
=============================================

Background
~~~~~~~~~~

When training a machine learning model, shuffling your training dataset is important in general in order to ensure that your model isn't overfitting on some unintended pattern in your data, e.g. sorting on the label column, or time-correlated samples. Per-epoch shuffling in particular can improve your model's precision gain per epoch by reducing the likelihood of bad (unrepresentative) batches getting you permanently stuck in local minima: if you get unlucky and your last few batches have noisy labels that pull your learned weights in the wrong direction, shuffling before the next epoch lets you bounce out of such a gradient rut.

In the distributed data-parallel training case, the current status quo solution is typically to have a per-shard in-memory shuffle buffer that you fill up and pop random batches from, without mixing data across shards between epochs. Ray Datasets also offers fully global random shuffling via :meth:`ds.random_shuffle() `, and doing so on an epoch-repeated dataset pipeline to provide global per-epoch shuffling is as simple as ``ray.data.read().repeat().random_shuffle_each_window()``.
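Expanded into a slightly fuller sketch (the in-memory stand-in data, epoch count, and batch size below are arbitrary; a real workload would start from a read such as ``ray.data.read_parquet(...)``):

.. code-block:: python

    import pandas as pd
    import ray

    ray.init()

    # Stand-in data; in practice this would be a read, e.g. ray.data.read_parquet(...).
    ds = ray.data.from_pandas(
        pd.DataFrame({"x": range(10_000), "label": [0, 1] * 5_000})
    )

    # Repeat the dataset once per epoch and globally shuffle each epoch's window.
    pipe = ds.repeat(3).random_shuffle_each_window()

    for epoch, epoch_pipe in enumerate(pipe.iter_epochs()):
        # Each epoch yields a freshly (globally) shuffled view of the full dataset.
        for batch in epoch_pipe.iter_batches(batch_size=1024):
            pass  # A training step would consume the batch here.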
But when should you opt for global per-epoch shuffling instead of local shuffle buffer shuffling?

How to choose a shuffling policy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Global per-epoch shuffling should only be used if your model is sensitive to the randomness of the training data. There is a `theoretical foundation `__ for all gradient-descent-based model trainers benefiting from improved (global) shuffle quality, and we've found that this is particularly pronounced for tabular data/models in practice.

However, the more global your shuffle is, the more expensive the shuffling operation becomes. This cost compounds when doing distributed data-parallel training on a multi-node cluster due to data transfer costs, and it can be prohibitive when using very large datasets.

The best route for determining the best tradeoff between preprocessing time + cost and per-epoch shuffle quality is to measure the precision gain per training step for your particular model under different shuffling policies:

* no shuffling,
* local (per-shard) limited-memory shuffle buffer,
* local (per-shard) shuffling,
* windowed (pseudo-global) shuffling, and
* fully global shuffling.

From the perspective of keeping preprocessing time in check, as long as your data loading + shuffling throughput is higher than your training throughput, your GPU should be saturated, so we recommend that users with shuffle-sensitive models push their shuffle quality higher until this threshold is hit.

What is Arrow and how does Ray Datasets use it?
===============================================

`Apache Arrow `__ is a columnar memory format and a single-node data processing and I/O library that Ray Datasets leverages extensively. You can think of Ray Datasets as orchestrating distributed processing of Arrow data. See our :ref:`key concepts ` for more information on how Ray Datasets uses Arrow.

How much performance tuning does Ray Datasets require?
======================================================

Ray Datasets doesn't perform query optimization, so some manual performance tuning may be necessary depending on your use case and data scale. Please see our :ref:`performance tuning guide ` for more information.

How can I contribute to Ray Datasets?
=====================================

We're always happy to accept external contributions! If you have a question or a feature request, want to contribute to Ray Datasets, or want to tell us about your use case, please reach out to us on `Discourse `__; if you're confident that you've found a bug, please open an issue on the `Ray GitHub repo `__. Please see our :ref:`contributing guide ` for more information!