diff --git a/doc/source/_toc.yml b/doc/source/_toc.yml
index dfe16208f..bbe9939ce 100644
--- a/doc/source/_toc.yml
+++ b/doc/source/_toc.yml
@@ -21,6 +21,7 @@ parts:
     title: Processing the NYC taxi dataset
   - file: data/examples/big_data_ingestion
     title: Large-scale ML Ingest
+  - file: data/faq
   - file: data/package-ref
   - file: data/integrations
diff --git a/doc/source/data/dataset.rst b/doc/source/data/dataset.rst
index 2ccf5ed96..bf3bb9caf 100644
--- a/doc/source/data/dataset.rst
+++ b/doc/source/data/dataset.rst
@@ -108,6 +108,18 @@ Advanced users can refer directly to the Ray Datasets :ref:`API reference `.
+
+If you still have questions after reading this FAQ, please reach out on
+`our Discourse `__!
+
+.. contents::
+  :local:
+  :depth: 2
+
+
+What problems does Ray Datasets solve?
+======================================
+
+Ray Datasets aims to solve the problem of slow, resource-inefficient, and unscalable
+data loading and preprocessing pipelines for two core use cases:
+
+1. **Model training:** trainers are bottlenecked on data loading and preprocessing,
+   resulting in poor training throughput and low GPU utilization.
+2. **Batch inference:** slow data loading and preprocessing results in poor batch
+   inference throughput and low GPU utilization.
+
+To solve these problems without sacrificing usability, Ray Datasets simplifies
+parallel and pipelined data processing on Ray, providing a higher-level API while
+internally handling data batching, task parallelism and pipelining, and memory
+management.
+
+Who is using Ray Datasets?
+==========================
+
+To give an idea of Datasets use cases, we list a few notable users running Datasets
+integrations in production below:
+
+* Predibase is using Ray Datasets for ML ingest and batch inference in their OSS
+  declarative ML framework, `Ludwig `__, and
+  internally in their `AutoML product `__.
+* Amazon is using Ray Datasets for large-scale I/O in their scalable data catalog,
+  `DeltaCAT `__.
+* Shopify is using Ray Datasets for ML ingest and batch inference in their ML platform,
+  `Merlin `__.
+* Ray Datasets is used as the data processing engine for the
+  `Ray-based Apache Beam runner `__.
+* Ray Datasets is used as the preprocessing and batch inference engine for
+  :ref:`Ray AIR `.
+
+If you're using Ray Datasets, please let us know about your experience on
+`Slack `__ or
+`Discourse `__; we'd love to hear from you!
+
+What should I use Ray Datasets for?
+===================================
+
+Ray Datasets is the standard way to load, process, and exchange data in Ray libraries
+and applications, with a particular emphasis on ease-of-use, performance, and
+scalability in both data size and cluster size. Within that, Datasets is designed for
+two core use cases:
+
+* **ML (training) ingest:** Loading, preprocessing, and ingesting data into one or more
+  (possibly distributed) model trainers.
+* **Batch inference:** Loading, preprocessing, and performing parallel batch
+  inference on data.
+
+We have designed the Datasets APIs, data model, execution model, and integrations with
+these use cases in mind, and we exercise them in large-scale nightly tests to ensure
+that we're hitting our scalability, performance, and efficiency marks.
+
+See our :ref:`ML preprocessing feature guide ` for more
+information on this positioning.
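+
+As a rough illustration of these two use cases, here is a minimal sketch; the S3 path,
+column names, and ``preprocess``/``predict`` functions below are hypothetical
+stand-ins rather than part of the Ray API:
+
+.. code-block:: python
+
+    import ray
+
+    def preprocess(batch):
+        # Stand-in last-mile preprocessing; ``batch`` is a pandas DataFrame here.
+        batch["value"] = batch["value"] / 100.0
+        return batch
+
+    def predict(batch):
+        # Stand-in "model"; a real model would run inference on the batch.
+        batch["prediction"] = batch["value"] > 0.5
+        return batch
+
+    # Load a (hypothetical) Parquet dataset; other formats and backends work similarly.
+    ds = ray.data.read_parquet("s3://my-bucket/training-data")
+    ds = ds.map_batches(preprocess, batch_format="pandas")
+
+    # ML ingest: split the preprocessed data across two (possibly distributed) trainers.
+    shards = ds.split(2, equal=True)
+
+    # Batch inference: apply the stand-in model over the data in parallel.
+    predictions = ds.map_batches(predict, batch_format="pandas")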
+
+What should I not use Ray Datasets for?
+=======================================
+
+Ray Datasets is not meant to be used for generic ETL pipelines (like Spark) or
+scalable data science (like Dask, Modin, or Mars). However, each of these frameworks
+is :ref:`runnable on Ray `, and Datasets integrates tightly with
+these frameworks, allowing efficient exchange of distributed data partitions, often
+with zero-copy. Check out the
+:ref:`dataset creation feature guide ` to learn
+more about these integrations.
+
+Datasets specifically targets the ML ingest and batch inference use cases, with a
+focus on data loading and last-mile preprocessing for ML pipelines. For more
+information on this distinction, what we mean by last-mile preprocessing, and how Ray
+Datasets fits into a larger ML pipeline picture, please see our
+:ref:`ML preprocessing feature guide `.
+
+For data loading for training, how does Ray Datasets compare to other solutions?
+=================================================================================
+
+There are several ML framework-specific and general solutions for loading data into
+model trainers. Below, we summarize some advantages Datasets offers over these more
+specific ingest frameworks.
+
+Torch datasets (and data loaders)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+* **Framework-agnostic:** Datasets is framework-agnostic and portable between different
+  distributed training frameworks, while
+  `Torch datasets `__ are specific to Torch.
+* **No built-in I/O layer:** Torch datasets do not have an I/O layer for common file
+  formats or in-memory exchange with other frameworks; users need to bring in other
+  libraries and roll this integration themselves.
+* **Generic distributed data processing:** Datasets is more general: it can handle
+  generic distributed operations, including global per-epoch shuffling,
+  which would otherwise have to be implemented by stitching together two separate
+  systems. Torch datasets would require such stitching for anything more involved
+  than batch-based preprocessing, and do not natively support shuffling across worker
+  shards. See our
+  `blog post `__
+  on why this shared infrastructure is important for 3rd generation ML architectures.
+* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between
+  processes, in contrast to the multi-processing-based pipelines of Torch datasets.
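+
+For example, a minimal sketch of feeding a Dataset into a Torch training loop
+(assuming a hypothetical tabular dataset with a ``label`` column) might look like:
+
+.. code-block:: python
+
+    import ray
+
+    # Hypothetical tabular dataset with numeric feature columns and a "label" column.
+    ds = ray.data.read_parquet("s3://my-bucket/training-data")
+
+    # Convert to a Torch IterableDataset; Datasets handles the I/O and batching.
+    torch_ds = ds.to_torch(label_column="label", batch_size=256)
+
+    for features, labels in torch_ds:
+        pass  # Feed each (features, labels) batch to your Torch training loop here.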
+
+TensorFlow datasets
+~~~~~~~~~~~~~~~~~~~
+
+* **Framework-agnostic:** Datasets is framework-agnostic and portable between different
+  distributed training frameworks, while
+  `TensorFlow datasets `__
+  is specific to TensorFlow.
+* **Unified single-node and distributed:** Datasets unifies single and multi-node
+  training under the same abstraction. TensorFlow datasets presents
+  `separate concepts `__
+  for distributed data loading and prevents code from being seamlessly scaled to larger
+  clusters.
+* **Lazy execution:** Datasets executes operations eagerly by default, while TensorFlow
+  datasets are lazy by default. The former provides easier iterative development and
+  debuggability; when you need the optimizations that lazy execution enables,
+  Ray Datasets has a lazy execution mode that you can turn on when productionizing your
+  integration.
+* **Generic distributed data processing:** Datasets is more general: it can handle
+  generic distributed operations, including global per-epoch shuffling,
+  which would otherwise have to be implemented by stitching together two separate
+  systems. TensorFlow datasets would require such stitching for anything more involved
+  than basic preprocessing, and does not natively support full shuffling across worker
+  shards; only file interleaving is supported. See our
+  `blog post `__
+  on why this shared infrastructure is important for 3rd generation ML architectures.
+* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between
+  processes, in contrast to the multi-processing-based pipelines of TensorFlow datasets.
+
+Petastorm
+~~~~~~~~~
+
+* **Supported data types:** `Petastorm `__ only supports Parquet data, while
+  Ray Datasets supports many file formats.
+* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between
+  processes, in contrast to the multi-processing-based pipelines used by Petastorm.
+* **No data processing:** Petastorm does not expose any data processing APIs.
+
+NVTabular
+~~~~~~~~~
+
+* **Supported data types:** `NVTabular `__ only supports tabular
+  (Parquet, CSV, Avro) data, while Ray Datasets supports many other file formats.
+* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between
+  processes, avoiding the overhead of multiprocessing-based data loading pipelines.
+* **Heterogeneous compute:** NVTabular doesn't support mixing heterogeneous resources in
+  dataset transforms (e.g. both CPU and GPU transformations), while Ray Datasets
+  supports this.
+* **ML-specific ops:** NVTabular offers many ML-specific preprocessing operations; the
+  equivalent functionality for Ray Datasets is a work in progress:
+  :ref:`Ray AIR preprocessors `.
+
+For batch (offline) inference, why should I use Ray Datasets instead of an actor pool?
+=======================================================================================
+
+Ray Datasets provides its own autoscaling actor pool via the actor compute strategy for
+:meth:`ds.map_batches() `, allowing you to perform CPU- or
+GPU-based batch inference on this actor pool. Using this instead of the
+`Ray actor pool `__
+has a few advantages (see the sketch after this list):
+
+* The Ray Datasets actor pool autoscales and supports easy-to-configure task dependency
+  prefetching, pipelining data transfer with compute.
+* Ray Datasets takes care of orchestrating the tasks, batching the data, and managing
+  the memory.
+* With :ref:`Ray Datasets pipelining `, you can precisely configure
+  how preprocessing is pipelined with batch inference, easily tweaking parallelism vs.
+  pipelining to maximize your GPU utilization.
+* Ray Datasets provides a broad and performant I/O layer, which you would otherwise have
+  to roll yourself.
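+
+For example, a sketch of GPU batch inference on the Datasets actor pool might look like
+the following; the data path, stand-in model, pool sizing, and resource arguments are
+hypothetical and may need adjusting for your model, cluster, and Ray version:
+
+.. code-block:: python
+
+    import ray
+    from ray.data import ActorPoolStrategy
+
+    class BatchInferModel:
+        def __init__(self):
+            # Stand-in for expensive setup such as loading model weights onto the
+            # GPU; this runs once per actor, not once per batch.
+            self.model = lambda batch: batch
+
+        def __call__(self, batch):
+            # Called on each batch; returns the predictions for that batch.
+            return self.model(batch)
+
+    ds = ray.data.read_parquet("s3://my-bucket/inference-data")  # Hypothetical path.
+
+    # Run inference on an autoscaling pool of 2 to 8 actors, each holding one GPU.
+    predictions = ds.map_batches(
+        BatchInferModel,
+        compute=ActorPoolStrategy(2, 8),
+        batch_size=256,
+        num_gpus=1,
+    )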
+
+How fast is Ray Datasets?
+=========================
+
+We're still working on open benchmarks, but we've done some benchmarking on synthetic
+data and have helped several users port from solutions using Petastorm, the Torch
+multiprocessing data loader, and TensorFlow datasets; those users have seen large
+training throughput improvements (4-8x) and model accuracy improvements (due to global
+per-epoch shuffling) with Ray Datasets.
+
+Please see our
+`recent blog post on Ray Datasets `__
+for more information on this benchmarking.
+
+Does all of my data need to fit into memory?
+============================================
+
+No, with Ray's support for :ref:`spilling objects to disk `, you only
+need to be able to fit your data into memory OR disk. However, keeping your data in
+distributed memory may speed up your workload; you can do this on arbitrarily large
+datasets by windowing them, creating :ref:`pipelines `.
+
+How much data can Ray Datasets handle?
+======================================
+
+Ray Datasets has been tested at multi-petabyte scale for I/O and multi-terabyte scale
+for shuffling, and we're continuously working on improving this scalability. If you
+have a very large dataset that you'd like to process and you're running into
+scalability issues, please reach out to us on our `Discourse `__.
+
+How do I get my data into Ray Datasets?
+=======================================
+
+Ray Datasets supports creating a ``Dataset`` from local and distributed in-memory data
+via integrations with common data libraries, as well as from local and remote storage
+systems via our support for many common file formats and storage backends.
+
+Check out our :ref:`feature guide for creating datasets ` for
+details.
+
+.. _datasets_streaming_faq:
+
+How do I do streaming/online data loading and processing?
+==========================================================
+
+Streaming data loading and data processing can be accomplished by using
+:ref:`DatasetPipelines `. By windowing a dataset, you can
+stream data transformations across subsets of the data, even windowing down to the
+reading of each file.
+
+See the :ref:`pipelining feature guide ` for more information.
+
+When should I use :ref:`pipelining `?
+===============================================================
+
+Pipelining is useful in a few scenarios:
+
+* You have two chained operations using different resources (e.g. CPU and GPU) that you
+  want to saturate; this is the case for both ML ingest (CPU-based preprocessing and
+  GPU-based training) and batch inference (CPU-based preprocessing and GPU-based batch
+  inference), as sketched below.
+* You want to do streaming data loading and processing in order to keep the size of the
+  working set small; see the previous FAQ on
+  :ref:`how to do streaming data loading and processing `.
+* You want to decrease the time-to-first-batch (latency) for a certain operation at the
+  end of your workload. This matters for training and inference since it prevents
+  GPUs from sitting idle (which is costly), and it can also benefit other
+  latency-sensitive consumers of datasets.
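+
+As a rough illustration of the first two scenarios, here is a sketch of windowing a
+dataset so that CPU-based preprocessing is pipelined with GPU-based batch inference;
+the data path, window size, and stand-in functions are hypothetical:
+
+.. code-block:: python
+
+    import ray
+
+    def preprocess(batch):
+        # Stand-in CPU-based preprocessing.
+        return batch
+
+    def infer(batch):
+        # Stand-in GPU-based batch inference.
+        return batch
+
+    ds = ray.data.read_parquet("s3://my-bucket/large-dataset")  # Hypothetical path.
+
+    # Window the dataset so that only ~20 blocks are materialized at a time; the
+    # resulting DatasetPipeline streams through the data, overlapping preprocessing
+    # with inference and keeping the working set small.
+    pipe = ds.window(blocks_per_window=20)
+    pipe = pipe.map_batches(preprocess)
+    pipe = pipe.map_batches(infer, num_gpus=1)
+
+    for batch in pipe.iter_batches(batch_size=1024):
+        pass  # Consume results as they become available.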
+
+When should I use global per-epoch shuffling?
+=============================================
+
+Background
+~~~~~~~~~~
+
+When training a machine learning model, shuffling your training dataset is generally
+important to ensure that your model isn't overfitting to some unintended pattern in
+your data, e.g. an ordering by label or time-correlated samples. Per-epoch
+shuffling in particular can improve your model's precision gain per epoch by reducing
+the likelihood of bad (unrepresentative) batches getting you permanently stuck in local
+minima: if you get unlucky and your last few batches have noisy labels that pull your
+learned weights in the wrong direction, shuffling before the next epoch lets you bounce
+out of such a gradient rut. In the distributed data-parallel training case, the
+status quo solution is typically a per-shard in-memory shuffle buffer that you
+fill up and pop random batches from, without mixing data across shards between epochs.
+
+Ray Datasets also offers fully global random shuffling via
+:meth:`ds.random_shuffle() `; any
+gradient-descent-based model trainer can benefit from the improved (global) shuffle
+quality, and we've found that this is particularly pronounced for tabular data/models
+in practice. However, the more global your shuffle is, the more expensive the shuffling
+operation; this cost compounds when doing distributed data-parallel training on a
+multi-node cluster due to data transfer, and it can be prohibitive when using very
+large datasets.
+
+The best way to determine the right tradeoff between preprocessing time/cost and
+per-epoch shuffle quality is to measure the precision gain per training step for your
+particular model under different shuffling policies:
+
+* no shuffling,
+* local (per-shard) limited-memory shuffle buffer,
+* local (per-shard) shuffling,
+* windowed (pseudo-global) shuffling, and
+* fully global shuffling.
+
+From the perspective of keeping preprocessing time in check, as long as your data
+loading and shuffling throughput is higher than your training throughput, your GPU
+should be saturated, so we recommend that users with shuffle-sensitive models push
+their shuffle quality higher until this threshold is hit.
+
+What is Arrow and how does Ray Datasets use it?
+===============================================
+
+`Apache Arrow `__ is a columnar memory format and a
+single-node data processing and I/O library that Ray Datasets leverages extensively.
+You can think of Ray Datasets as orchestrating distributed processing of Arrow data.
+
+See our :ref:`key concepts ` for more information on how Ray Datasets
+uses Arrow.
+
+How much performance tuning does Ray Datasets require?
+======================================================
+
+Ray Datasets doesn't perform query optimization, so some manual performance
+tuning may be necessary depending on your use case and data scale. Please see our
+:ref:`performance tuning guide ` for more information.
+
+How can I contribute to Ray Datasets?
+=====================================
+
+We're always happy to accept external contributions! If you have a question, a feature
+request, or want to contribute to Ray Datasets or tell us about your use case, please
+reach out to us on `Discourse `__; if you're confident that you've
+found a bug, please open an issue on the
+`Ray GitHub repo `__. Please see our
+:ref:`contributing guide ` for more information!