[Datasets] Add FAQ to Datasets docs. (#24932)

This PR adds a FAQ to Datasets docs.

Docs preview: https://ray--24932.org.readthedocs.build/en/24932/

## Checks

- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Clark Zinzow 2022-05-19 15:44:22 -07:00 committed by GitHub
parent 44fd7fd1d0
commit 822cbc420b
3 changed files with 349 additions and 0 deletions


@@ -21,6 +21,7 @@ parts:
title: Processing the NYC taxi dataset
- file: data/examples/big_data_ingestion
title: Large-scale ML Ingest
- file: data/faq
- file: data/package-ref
- file: data/integrations


@@ -108,6 +108,18 @@ Advanced users can refer directly to the Ray Datasets :ref:`API reference <data_
:classes: btn-outline-info btn-block
---
**Ray Datasets FAQ**
^^^
Find answers to commonly asked questions in our detailed FAQ.
+++
.. link-button:: datasets_faq
:type: ref
:text: Ray Datasets FAQ
:classes: btn-outline-info btn-block
---
API
^^^

doc/source/data/faq.rst

@@ -0,0 +1,336 @@
.. _datasets_faq:
===
FAQ
===
These are some Frequently Asked Questions that we've seen pop up for Ray Datasets.
.. note::
For a general conceptual overview of Ray Datasets, see our
:ref:`Key Concepts docs <data_key_concepts>`.
If you still have questions after reading this FAQ, please reach out on
`our Discourse <https://discuss.ray.io/>`__!
.. contents::
:local:
:depth: 2
What problems does Ray Datasets solve?
======================================
Ray Datasets aims to solve the problems of slow, resource-inefficient, and unscalable
data loading and preprocessing pipelines for two core use cases:
1. **Model training:** slow loading and preprocessing result in poor training throughput
and low GPU utilization, as the trainers are bottlenecked on the data pipeline.
2. **Batch inference:** the same bottlenecks result in poor batch inference throughput
and low GPU utilization.
In order to solve these problems without sacrificing usability, Ray Datasets simplifies
parallel and pipelined data processing on Ray, providing a higher-level API while
internally handling data batching, task parallelism and pipelining, and memory
management.
Who is using Ray Datasets?
==========================
To give an idea of Datasets use cases, we list a few notable users running Datasets
integrations in production below:
* Predibase is using Ray Datasets for ML ingest and batch inference in their OSS
declarative ML framework, `Ludwig <https://github.com/ludwig-ai/ludwig>`__, and
internally in their `AutoML product <https://predibase.com/>`__.
* Amazon is using Ray Datasets for large-scale I/O in their scalable data catalog,
`DeltaCAT <https://github.com/ray-project/deltacat>`__.
* Shopify is using Ray Datasets for ML ingest and batch inference in their ML platform,
`Merlin <https://shopify.engineering/merlin-shopify-machine-learning-platform>`__.
* Ray Datasets is used as the data processing engine for the
`Ray-based Apache Beam runner <https://github.com/ray-project/ray_beam_runner>`__.
* Ray Datasets is used as the preprocessing and batch inference engine for
:ref:`Ray AIR <air>`.
If you're using Ray Datasets, please let us know about your experience on the
`Slack <https://forms.gle/9TSdDYUgxYs8SA9e8>`__ or
`Discourse <https://discuss.ray.io/>`__; we'd love to hear from you!
What should I use Ray Datasets for?
===================================
Ray Datasets is the standard way to load, process, and exchange data in Ray libraries
and applications, with a particular emphasis on ease-of-use, performance, and
scalability in both data size and cluster size. Within that, Datasets is designed for
two core use cases:
* **ML (training) ingest:** Loading, preprocessing, and ingesting data into one or more
(possibly distributed) model trainers.
* **Batch inference:** Loading, preprocessing, and performing parallel batch
inference on data.
We have designed the Datasets APIs, data model, execution model, and
integrations with these use cases in mind, and have captured these use cases in
large-scale nightly tests to ensure that we're hitting our scalability, performance,
and efficiency marks for these use cases.
See our :ref:`ML preprocessing feature guide <datasets-ml-preprocessing>` for more
information on this positioning.
What should I not use Ray Datasets for?
=======================================
Ray Datasets is not meant to be used for generic ETL pipelines (like Spark) or
scalable data science (like Dask, Modin, or Mars). However, each of these frameworks
is :ref:`runnable on Ray <data_integrations>`, and Datasets integrates tightly with
them, allowing for efficient exchange of distributed data partitions, often with
zero-copy. Check out the
:ref:`dataset creation feature guide <dataset_from_in_memory_data_distributed>` to learn
more about these integrations.
Datasets is specifically targeting
the ML ingest and batch inference use cases, with a focus on data loading and last-mile
preprocessing for ML pipelines. For more information on this distinction, what we
mean by last-mile preprocessing, and how Ray Datasets fits into a larger ML pipeline
picture, please see our :ref:`ML preprocessing feature guide <datasets-ml-preprocessing>`.
For data loading for training, how does Ray Datasets compare to other solutions?
================================================================================
There are several ML framework-specific and general solutions for loading data into
model trainers. Below, we summarize some advantages Datasets offers over these more
specific ingest frameworks.
Torch datasets (and data loaders)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* **Framework-agnostic:** Datasets is framework-agnostic and portable between different
distributed training frameworks, while
`Torch datasets <https://pytorch.org/docs/stable/data.html>`__ are specific to Torch.
* **No built-in I/O layer:** Torch datasets do not have an I/O layer for common file formats or in-memory exchange
with other frameworks; users need to bring in other libraries and roll this
integration themselves.
* **Generic distributed data processing:** Datasets is more general: it can handle
generic distributed operations, including global per-epoch shuffling,
which would otherwise have to be implemented by stitching together two separate
systems. Torch datasets would require such stitching for anything more involved
than batch-based preprocessing, and do not natively support shuffling across worker
shards. See our
`blog post <https://www.anyscale.com/blog/deep-dive-data-ingest-in-a-third-generation-ml-architecture>`__
on why this shared infrastructure is important for 3rd generation ML architectures.
* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between
processes, in contrast to the multi-processing-based pipelines of Torch datasets.
TensorFlow datasets
~~~~~~~~~~~~~~~~~~~
* **Framework-agnostic:** Datasets is framework-agnostic and portable between different
distributed training frameworks, while
`TensorFlow datasets <https://www.tensorflow.org/api_docs/python/tf/data/Dataset>`__
is specific to TensorFlow.
* **Unified single-node and distributed:** Datasets unifies single and multi-node training under
the same abstraction. TensorFlow datasets presents
`separate concepts <https://www.tensorflow.org/api_docs/python/tf/distribute/DistributedDataset>`__
for distributed data loading and prevents code from being seamlessly scaled to larger
clusters.
* **Lazy execution:** Datasets executes operations eagerly by default, while TensorFlow
datasets are lazy by default. The former provides easier iterative development and
debuggability; when you need the optimizations that lazy execution enables, Ray Datasets
has a lazy execution mode that you can turn on when productionizing your integration.
* **Generic distributed data processing:** Datasets is more general: it can handle
generic distributed operations, including global per-epoch shuffling,
which would otherwise have to be implemented by stitching together two separate
systems. TensorFlow datasets would require such stitching for anything more involved
than basic preprocessing, and does not natively support full shuffling across worker
shards; only file interleaving is supported. See our
`blog post <https://www.anyscale.com/blog/deep-dive-data-ingest-in-a-third-generation-ml-architecture>`__
on why this shared infrastructure is important for 3rd generation ML architectures.
* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between
processes, in contrast to the multi-processing-based pipelines of TensorFlow datasets.
Petastorm
~~~~~~~~~
* **Supported data types:** `Petastorm <https://github.com/uber/petastorm>`__ only supports Parquet data, while
Ray Datasets supports many file formats.
* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between
processes, in contrast to the multi-processing-based pipelines used by Petastorm.
* **No data processing:** Petastorm does not expose any data processing APIs.
NVTabular
~~~~~~~~~
* **Supported data types:** `NVTabular <https://github.com/NVIDIA-Merlin/NVTabular>`__ only supports tabular
(Parquet, CSV, Avro) data, while Ray Datasets supports many other file formats.
* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between
processes, in contrast to the multi-processing-based pipelines used by NVTabular.
* **Heterogeneous compute:** NVTabular doesn't support mixing heterogeneous resources in dataset transforms (e.g.
both CPU and GPU transformations), while Ray Datasets supports this.
* **ML-specific ops:** NVTabular provides many ML-specific preprocessing operations;
equivalent functionality is currently a work in progress for Ray Datasets via
:ref:`Ray AIR preprocessors <air-key-concepts>`.
.. _datasets_streaming_faq:
For batch (offline) inference, why should I use Ray Datasets instead of an actor pool?
======================================================================================
Ray Datasets provides its own autoscaling actor pool via the actor compute strategy for
:meth:`ds.map_batches() <ray.data.Dataset.map_batches>`, allowing you to perform CPU- or
GPU-based batch inference on this actor pool. Using this instead of the
`Ray actor pool <https://github.com/ray-project/ray/blob/b17cbd825fe3fbde4fe9b03c9dd33be2454d4737/python/ray/util/actor_pool.py#L6>`__
has a few advantages:
* The Ray Datasets actor pool autoscales and supports easy-to-configure task dependency
prefetching, pipelining data transfer with compute.
* Ray Datasets takes care of orchestrating the tasks, batching the data, and managing
the memory.
* With :ref:`Ray Datasets pipelining <dataset_pipeline_concept>`, you can precisely
configure how preprocessing is pipelined with batch inference, making it easy to tweak
parallelism vs. pipelining to maximize your GPU utilization.
* Ray Datasets provides a broad and performant I/O layer, which you would otherwise have
to roll yourself.
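As a minimal sketch, here is what actor-pool batch inference with
:meth:`ds.map_batches() <ray.data.Dataset.map_batches>` can look like. The model below
is a stand-in lambda; in practice you would load your own model in ``__init__`` and pass
``num_gpus=1`` to give each actor a GPU.
.. code-block:: python

    import ray
    import pandas as pd

    class BatchInferModel:
        def __init__(self):
            # Stand-in for loading a real model (and moving it to a GPU).
            self.model = lambda df: pd.DataFrame({"score": df["feature"] * 2})

        def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
            return self.model(batch)

    ds = ray.data.from_pandas(pd.DataFrame({"feature": range(1024)}))
    # compute="actors" runs the UDF on an autoscaling actor pool instead of
    # stateless tasks; add num_gpus=1 for GPU-based inference.
    predictions = ds.map_batches(
        BatchInferModel, compute="actors", batch_size=256, batch_format="pandas")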
How fast is Ray Datasets?
=========================
We're still working on open benchmarks, but we've done some benchmarking on synthetic
data and have helped several users port from solutions built on Petastorm, the Torch
multiprocessing data loader, and TensorFlow datasets; those users saw large training
throughput improvements (4-8x) and model accuracy improvements (due to global per-epoch
shuffling) after moving to Ray Datasets.
Please see our
`recent blog post on Ray Datasets <https://www.anyscale.com/blog/ray-datasets-for-machine-learning-training-and-scoring>`__
for more information on this benchmarking.
Does all of my data need to fit into memory?
============================================
No. With Ray's support for :ref:`spilling objects to disk <object-spilling>`, your data
only needs to fit into memory and disk combined, not memory alone. However, keeping your
data in distributed memory may speed up your workload; this can be done on arbitrarily
large datasets by windowing them into :ref:`pipelines <dataset_pipeline_concept>`.
How much data can Ray Datasets handle?
======================================
Ray Datasets has been tested at multi-petabyte scale for I/O and multi-terabyte scale for
shuffling, and we're continuously working on improving this scalability. If you have a
very large dataset that you'd like to process and you're running into scalability
issues, please reach out to us on our `Discourse <https://discuss.ray.io/>`__.
How do I get my data into Ray Datasets?
=======================================
Ray Datasets supports creating a ``Dataset`` from local and distributed in-memory data
via integrations with common data libraries, as well as from local and remote storage
systems via our support for many common file formats and storage backends.
Check out our :ref:`feature guide for creating datasets <creating_datasets>` for
details.
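A few representative ways to create a ``Dataset`` (the S3 path and the Dask DataFrame
below are placeholders for your own data):
.. code-block:: python

    import ray
    import pandas as pd

    # From in-memory data, e.g. a pandas DataFrame.
    ds = ray.data.from_pandas(pd.DataFrame({"x": range(100)}))

    # From files in local or remote storage (the paths below are placeholders).
    # ds = ray.data.read_parquet("s3://my-bucket/my-dataset")
    # ds = ray.data.read_csv("/path/to/csv_dir")

    # From other distributed DataFrame libraries, e.g. a Dask DataFrame.
    # ds = ray.data.from_dask(dask_df)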
How do I do streaming/online data loading and processing?
=========================================================
Streaming data loading and data processing can be accomplished by using
:ref:`DatasetPipelines <dataset_pipeline_concept>`. By windowing a dataset, you can
stream data transformations across subsets of the data, even windowing down to the
reading of each file.
See the :ref:`pipelining feature guide <data_pipeline_usage>` for more information.
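As a rough sketch, windowing a dataset yields a ``DatasetPipeline`` that loads and
transforms a few blocks at a time instead of materializing the whole dataset
(``ray.data.range`` stands in for a real data source here):
.. code-block:: python

    import ray

    pipe = (
        ray.data.range(100000)         # stand-in for e.g. read_parquet("s3://...")
        .window(blocks_per_window=10)  # stream over 10 blocks at a time
        .map_batches(lambda batch: [x * 2 for x in batch])
    )
    # Consume the pipeline incrementally, e.g. to feed a trainer or write results.
    for batch in pipe.iter_batches():
        pass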
When should I use :ref:`pipelining <dataset_pipeline_concept>`?
===============================================================
Pipelining is useful in a few scenarios:
* You have two chained operations using different resources (e.g. CPU and GPU) that you
want to saturate; this is the case for both ML ingest (CPU-based preprocessing and
GPU-based training) and batch inference (CPU-based preprocessing and GPU-based batch
inference).
* You want to do streaming data loading and processing in order to keep the size of the
working set small; see previous FAQ on
:ref:`how to do streaming data loading and processing <datasets_streaming_faq>`.
* You want to decrease the time-to-first-batch (latency) for a certain operation at the
end of your workload. This is the case for training and inference since this prevents
GPUs from being idle (which is costly), and can be advantageous for some other
latency-sensitive consumers of datasets.
When should I use global per-epoch shuffling?
=============================================
Background
~~~~~~~~~~
When training a machine learning model, shuffling your training dataset is important in
general in order to ensure that your model isn't overfitting on some unintended pattern
in your data, e.g. sorting on the label column, or time-correlated samples. Per-epoch
shuffling in particular can improve your model's precision gain per epoch by reducing
the likelihood of bad (unrepresentative) batches getting you permanently stuck in local
minima: if you get unlucky and your last few batches have noisy labels that pull your
learned weights in the wrong direction, shuffling before the next epoch lets you bounce
out of such a gradient rut. In the distributed data-parallel training case, the current
status quo solution is typically to have a per-shard in-memory shuffle buffer that you
fill up and pop random batches from, without mixing data across shards between epochs.
Ray Datasets also offers fully global random shuffling via
:meth:`ds.random_shuffle() <ray.data.Dataset.random_shuffle>`, and doing so on an
epoch-repeated dataset pipeline to provide global per-epoch shuffling is as simple as
``ray.data.read().repeat().random_shuffle_each_window()``. But when should you opt for
global per-epoch shuffling instead of local shuffle buffer shuffling?
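Here is a minimal sketch of global per-epoch shuffling on an epoch-repeated pipeline
(``ray.data.range`` stands in for your real training dataset):
.. code-block:: python

    import ray

    pipe = (
        ray.data.range(10000)  # stand-in for your training dataset
        .repeat(2)             # one window per epoch; here, two epochs
        .random_shuffle_each_window()
    )
    for epoch, epoch_pipe in enumerate(pipe.iter_epochs()):
        for batch in epoch_pipe.iter_batches(batch_size=1024):
            pass  # train on the globally shuffled batch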
How to choose a shuffling policy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Global per-epoch shuffling should only be used if your model is sensitive to the
randomness of the training data. There is a
`theoretical foundation <https://arxiv.org/abs/1709.10432>`__ for all
gradient-descent-based model trainers benefiting from improved (global) shuffle quality,
and we've found that this is particularly pronounced for tabular data/models in practice.
However, the more global your shuffle is, the more expensive the shuffling operation
becomes. This compounds when doing distributed data-parallel training on a multi-node
cluster due to data transfer costs, and the cost can be prohibitive for very large
datasets.
The best way to determine the right tradeoff between preprocessing time + cost and
per-epoch shuffle quality is to measure the precision gain per training step for your
particular model under different shuffling policies:
* no shuffling,
* local (per-shard) limited-memory shuffle buffer,
* local (per-shard) shuffling,
* windowed (pseudo-global) shuffling, and
* fully global shuffling.
From the perspective of keeping preprocessing time in check, as long as your data
loading + shuffling throughput is higher than your training throughput, your GPU should
be saturated, so we recommend that users with shuffle-sensitive models push their
shuffle quality higher until this threshold is hit.
What is Arrow and how does Ray Datasets use it?
===============================================
`Apache Arrow <https://arrow.apache.org/>`__ is a columnar memory format and a
single-node data processing and I/O library that Ray Datasets leverages extensively. You
can think of Ray Datasets as orchestrating distributed processing of Arrow data.
See our :ref:`key concepts <data_key_concepts>` for more information on how Ray Datasets
uses Arrow.
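For example, you can ask ``map_batches`` to hand your function Arrow ``Table`` batches
and operate directly on the columnar data (a minimal sketch; the column name and
transform are arbitrary):
.. code-block:: python

    import ray
    import pandas as pd
    import pyarrow as pa
    import pyarrow.compute as pc

    ds = ray.data.from_pandas(pd.DataFrame({"value": range(100)}))

    def add_one(table: pa.Table) -> pa.Table:
        # Operate directly on the Arrow columnar data, no conversion to pandas.
        return table.set_column(0, "value", pc.add(table.column("value"), 1))

    ds = ds.map_batches(add_one, batch_format="pyarrow")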
How much performance tuning does Ray Datasets require?
======================================================
Ray Datasets doesn't perform query optimization, so some manual performance
tuning may be necessary depending on your use case and data scale. Please see our
:ref:`performance tuning guide <data_performance_tips>` for more information.
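For instance, two common knobs are the read parallelism and the batch size used for
transforms (a sketch; the path is a placeholder and the identity transform stands in for
your own function):
.. code-block:: python

    import ray

    # Increase read parallelism to produce more, smaller blocks.
    ds = ray.data.read_parquet("s3://my-bucket/my-dataset", parallelism=1000)

    # Use larger batches to amortize per-batch overhead in transforms.
    ds = ds.map_batches(lambda batch: batch, batch_size=4096)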
How can I contribute to Ray Datasets?
=====================================
We're always happy to accept external contributions! If you have a question, a feature
request, or want to contribute to Ray Datasets or tell us about your use case, please
reach out to us on `Discourse <https://discuss.ray.io/>`__; if you're confident that
you've found a bug, please open an issue on the
`Ray GitHub repo <https://github.com/ray-project/ray>`__. Please see our
:ref:`contributing guide <getting-involved>` for more information!