.. _datasets_faq:

===
FAQ
===

These are some Frequently Asked Questions that we've seen pop up for Ray Datasets.

.. note::

   For a general conceptual overview of Ray Datasets, see our
   :ref:`Key Concepts docs <data_key_concepts>`.

If you still have questions after reading this FAQ, please reach out on
`our Discourse <https://discuss.ray.io/>`__!

.. contents::
   :local:
   :depth: 2

|
What problems does Ray Datasets solve?
======================================

Ray Datasets aims to solve the problems of slow, resource-inefficient, and unscalable
data loading and preprocessing pipelines for two core use cases:

1. **Model training:** resulting in poor training throughput and low GPU utilization as
   the trainers are bottlenecked on preprocessing and data loading.
2. **Batch inference:** resulting in poor batch inference throughput and low GPU
   utilization.

In order to solve these problems without sacrificing usability, Ray Datasets simplifies
parallel and pipelined data processing on Ray, providing a higher-level API while
internally handling data batching, task parallelism and pipelining, and memory
management.

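To make this concrete, here's a minimal sketch of both flows; the S3 path is a
hypothetical placeholder, and the preprocessing and inference functions are stand-ins
for real logic:

.. code-block:: python

    import ray

    # Load a Parquet dataset in parallel across the cluster (placeholder path).
    ds = ray.data.read_parquet("s3://my-bucket/train-data")

    # Batched, parallel preprocessing; Datasets handles batching, task
    # parallelism, and memory management internally.
    ds = ds.map_batches(lambda df: df.fillna(0), batch_format="pandas")

    # 1. Model training: consume the dataset as framework-native shards.
    torch_ds = ds.to_torch(batch_size=256)

    # 2. Batch inference: apply a (stand-in) model to each batch in parallel.
    predictions = ds.map_batches(
        lambda df: df.sum(axis=1).to_frame("pred"), batch_format="pandas"
    )
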
Who is using Ray Datasets?
==========================

To give an idea of Datasets use cases, we list a few notable users running Datasets
integrations in production below:

* Predibase is using Ray Datasets for ML ingest and batch inference in their OSS
  declarative ML framework, `Ludwig <https://github.com/ludwig-ai/ludwig>`__, and
  internally in their `AutoML product <https://predibase.com/>`__.
* Amazon is using Ray Datasets for large-scale I/O in their scalable data catalog,
  `DeltaCAT <https://github.com/ray-project/deltacat>`__.
* Shopify is using Ray Datasets for ML ingest and batch inference in their ML platform,
  `Merlin <https://shopify.engineering/merlin-shopify-machine-learning-platform>`__.
* Ray Datasets is used as the data processing engine for the
  `Ray-based Apache Beam runner <https://github.com/ray-project/ray_beam_runner>`__.
* Ray Datasets is used as the preprocessing and batch inference engine for
  :ref:`Ray AIR <air>`.

If you're using Ray Datasets, please let us know about your experience on
`Slack <https://forms.gle/9TSdDYUgxYs8SA9e8>`__ or
`Discourse <https://discuss.ray.io/>`__; we'd love to hear from you!

What should I use Ray Datasets for?
===================================

Ray Datasets is the standard way to load, process, and exchange data in Ray libraries
and applications, with a particular emphasis on ease of use, performance, and
scalability in both data size and cluster size. Within that, Datasets is designed for
two core use cases:

* **ML (training) ingest:** Loading, preprocessing, and ingesting data into one or more
  (possibly distributed) model trainers.
* **Batch inference:** Loading, preprocessing, and performing parallel batch
  inference on data.

We have designed the Datasets APIs, data model, execution model, and
integrations with these use cases in mind, and have captured these use cases in
large-scale nightly tests to ensure that we're hitting our scalability, performance,
and efficiency marks for these use cases.

See our :ref:`ML preprocessing feature guide <datasets-ml-preprocessing>` for more
information on this positioning.

What should I not use Ray Datasets for?
=======================================

Ray Datasets is not meant to be used for generic ETL pipelines (like Spark) or
scalable data science (like Dask, Modin, or Mars). However, each of these frameworks
is :ref:`runnable on Ray <data_integrations>`, and Datasets integrates tightly with
these frameworks, allowing for efficient exchange of distributed data partitions, often
with zero-copy. Check out the
:ref:`dataset creation feature guide <dataset_from_in_memory_data_distributed>` to learn
more about these integrations.

Datasets specifically targets the ML ingest and batch inference use cases, with a focus
on data loading and last-mile preprocessing for ML pipelines. For more information on
this distinction, what we mean by last-mile preprocessing, and how Ray Datasets fits
into a larger ML pipeline picture, please see our
:ref:`ML preprocessing feature guide <datasets-ml-preprocessing>`.

For data loading for training, how does Ray Datasets compare to other solutions?
================================================================================

There are several ML framework-specific and general solutions for loading data into
model trainers. Below, we summarize some advantages Datasets offers over these more
specific ingest frameworks.

Torch datasets (and data loaders)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* **Framework-agnostic:** Datasets is framework-agnostic and portable between different
  distributed training frameworks, while
  `Torch datasets <https://pytorch.org/docs/stable/data.html>`__ are specific to Torch.
* **No built-in I/O layer:** Torch datasets do not have an I/O layer for common file
  formats or in-memory exchange with other frameworks; users need to bring in other
  libraries and roll this integration themselves.
* **Generic distributed data processing:** Datasets is more general: it can handle
  generic distributed operations, including global per-epoch shuffling,
  which would otherwise have to be implemented by stitching together two separate
  systems. Torch datasets would require such stitching for anything more involved
  than batch-based preprocessing, and do not natively support shuffling across worker
  shards. See our
  `blog post <https://www.anyscale.com/blog/deep-dive-data-ingest-in-a-third-generation-ml-architecture>`__
  on why this shared infrastructure is important for 3rd generation ML architectures.
* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between
  processes, in contrast to the multi-processing-based pipelines of Torch datasets.

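As a minimal sketch of what this portability looks like in practice (the column names
here are hypothetical), the same ``Dataset`` can be handed to Torch without writing any
Torch-specific loading code:

.. code-block:: python

    import ray

    # A small in-memory dataset; real workloads would read from storage.
    ds = ray.data.from_items(
        [{"feature": float(i), "label": i % 2} for i in range(100)]
    )

    # Consume the same dataset as a Torch IterableDataset...
    torch_ds = ds.to_torch(label_column="label", batch_size=32)
    for features, labels in torch_ds:
        pass  # feed into your training loop

    # ...or hand it to another framework or a distributed trainer instead,
    # without rewriting the loading and preprocessing code.
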
TensorFlow datasets
~~~~~~~~~~~~~~~~~~~

* **Framework-agnostic:** Datasets is framework-agnostic and portable between different
  distributed training frameworks, while
  `TensorFlow datasets <https://www.tensorflow.org/api_docs/python/tf/data/Dataset>`__
  are specific to TensorFlow.
* **Unified single-node and distributed:** Datasets unifies single- and multi-node
  training under the same abstraction. TensorFlow datasets present
  `separate concepts <https://www.tensorflow.org/api_docs/python/tf/distribute/DistributedDataset>`__
  for distributed data loading and prevent code from being seamlessly scaled to larger
  clusters.
* **Eager execution:** Datasets executes operations eagerly by default, while TensorFlow
  datasets are lazy by default. The former provides easier iterative development and
  debuggability; when you need the optimizations that become available with lazy
  execution, Ray Datasets has a lazy execution mode that you can turn on when
  productionizing your integration.
* **Generic distributed data processing:** Datasets is more general: it can handle
  generic distributed operations, including global per-epoch shuffling,
  which would otherwise have to be implemented by stitching together two separate
  systems. TensorFlow datasets would require such stitching for anything more involved
  than basic preprocessing, and do not natively support full shuffling across worker
  shards; only file interleaving is supported. See our
  `blog post <https://www.anyscale.com/blog/deep-dive-data-ingest-in-a-third-generation-ml-architecture>`__
  on why this shared infrastructure is important for 3rd generation ML architectures.
* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between
  processes, in contrast to the multi-processing-based pipelines of TensorFlow datasets.

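For example, here's a hedged sketch of the eager-by-default workflow, assuming the
experimental ``Dataset.lazy()`` API for opting into lazy execution:

.. code-block:: python

    import ray

    # Eager by default: each operation runs immediately, which makes
    # interactive development and debugging straightforward.
    ds = ray.data.range(1000).map_batches(lambda batch: [x * 2 for x in batch])

    # Opt into lazy execution when productionizing, so chained operations
    # can be optimized before running; execution triggers on consumption.
    lazy_ds = ray.data.range(1000).lazy().map_batches(
        lambda batch: [x * 2 for x in batch]
    )
    lazy_ds.take(5)
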
Petastorm
~~~~~~~~~

* **Supported data types:** `Petastorm <https://github.com/uber/petastorm>`__ only
  supports Parquet data, while Ray Datasets supports many file formats.
* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between
  processes, in contrast to the multi-processing-based pipelines used by Petastorm.
* **No data processing:** Petastorm does not expose any data processing APIs.

NVTabular
~~~~~~~~~

* **Supported data types:** `NVTabular <https://github.com/NVIDIA-Merlin/NVTabular>`__
  only supports tabular (Parquet, CSV, Avro) data, while Ray Datasets supports many
  other file formats.
* **Lower overhead:** Datasets is lower overhead: it supports zero-copy exchange between
  processes, in contrast to the multi-processing-based pipelines used by NVTabular.
* **Heterogeneous compute:** NVTabular doesn't support mixing heterogeneous resources in
  dataset transforms (e.g. both CPU and GPU transformations), while Ray Datasets
  supports this.
* **ML-specific ops:** NVTabular offers many great ML-specific preprocessing
  operations; the equivalent functionality is currently a work in progress for Ray
  Datasets: :ref:`Ray AIR preprocessors <air-key-concepts>`.

.. _datasets_streaming_faq:

For batch (offline) inference, why should I use Ray Datasets instead of an actor pool?
=======================================================================================

Ray Datasets provides its own autoscaling actor pool via the actor compute strategy for
:meth:`ds.map_batches() <ray.data.Dataset.map_batches>`, allowing you to perform CPU- or
GPU-based batch inference on this actor pool. Using this instead of the
`Ray actor pool <https://github.com/ray-project/ray/blob/b17cbd825fe3fbde4fe9b03c9dd33be2454d4737/python/ray/util/actor_pool.py#L6>`__
has a few advantages:

* The Ray Datasets actor pool is autoscaling and supports easy-to-configure task
  dependency prefetching, pipelining data transfer with compute.
* Ray Datasets takes care of orchestrating the tasks, batching the data, and managing
  the memory.
* With :ref:`Ray Datasets pipelining <dataset_pipeline_concept>`, you can
  precisely configure pipelining of preprocessing with batch inference, allowing you to
  easily tweak parallelism vs. pipelining to maximize your GPU utilization.
* Ray Datasets provides a broad and performant I/O layer, which you would otherwise have
  to roll yourself.

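As an illustrative sketch (the model is a stand-in and the resource values are
arbitrary), GPU batch inference on this autoscaling actor pool looks roughly like:

.. code-block:: python

    import ray
    from ray.data import ActorPoolStrategy

    class Model:
        def __init__(self):
            # Stand-in for loading real model weights (e.g., onto a GPU).
            self.scale = 2.0

        def __call__(self, batch):
            batch["prediction"] = batch["feature"] * self.scale
            return batch

    ds = ray.data.from_items([{"feature": float(i)} for i in range(10_000)])

    # Run inference on an autoscaling pool of 2 to 8 actors, one GPU each;
    # drop num_gpus if running on a CPU-only cluster.
    predictions = ds.map_batches(
        Model,
        compute=ActorPoolStrategy(2, 8),
        batch_format="pandas",
        num_gpus=1,
    )
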
How fast is Ray Datasets?
=========================

We're still working on open benchmarks, but we've done some benchmarking on synthetic
data and have helped several users port from solutions using Petastorm, the Torch
multi-processing data loader, and TensorFlow datasets; these users have seen large
training throughput improvements (4-8x) and model accuracy improvements (due to global
per-epoch shuffling) using Ray Datasets.

Please see our
`recent blog post on Ray Datasets <https://www.anyscale.com/blog/ray-datasets-for-machine-learning-training-and-scoring>`__
for more information on this benchmarking.

Does all of my data need to fit into memory?
============================================

No. With Ray's support for :ref:`spilling objects to disk <object-spilling>`, you only
need to be able to fit your data into memory OR disk. However, keeping your data in
distributed memory may speed up your workload, which can be done on arbitrarily large
datasets by windowing them, creating :ref:`pipelines <dataset_pipeline_concept>`.

How much data can Ray Datasets handle?
======================================

Ray Datasets has been tested at multi-petabyte scale for I/O and multi-terabyte scale
for shuffling, and we're continuously working on improving this scalability. If you have
a very large dataset that you'd like to process and you're running into scalability
issues, please reach out to us on our `Discourse <https://discuss.ray.io/>`__.

How do I get my data into Ray Datasets?
=======================================

Ray Datasets supports creating a ``Dataset`` from local and distributed in-memory data
via integrations with common data libraries, as well as from local and remote storage
systems via our support for many common file formats and storage backends.

Check out our :ref:`feature guide for creating datasets <creating_datasets>` for
details.

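A few representative creation paths, as a sketch (the S3 path is a placeholder):

.. code-block:: python

    import pandas as pd
    import ray

    # From local in-memory data.
    ds = ray.data.from_pandas(pd.DataFrame({"x": [1, 2, 3]}))

    # From files on local or remote storage (placeholder path).
    ds = ray.data.read_parquet("s3://my-bucket/my-data")

    # From distributed in-memory data, e.g. a Dask DataFrame's partitions,
    # often with zero-copy: ds = ray.data.from_dask(dask_df)
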
How do I do streaming/online data loading and processing?
=========================================================

Streaming data loading and data processing can be accomplished by using
:ref:`DatasetPipelines <dataset_pipeline_concept>`. By windowing a dataset, you can
stream data transformations across subsets of the data, even windowing down to the
reading of each file.

See the :ref:`pipelining feature guide <data_pipeline_usage>` for more information.

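For instance, a minimal sketch of windowed, streaming execution (the path is a
placeholder):

.. code-block:: python

    import ray

    # Window the dataset so only a subset of blocks is materialized at a time.
    pipe = (
        ray.data.read_parquet("s3://my-bucket/my-data")
        .window(blocks_per_window=10)
        .map_batches(lambda df: df.dropna(), batch_format="pandas")
    )

    # Consumption streams through the windows one at a time.
    for batch in pipe.iter_batches():
        pass
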
When should I use :ref:`pipelining <dataset_pipeline_concept>`?
===============================================================

Pipelining is useful in a few scenarios:

* You have two chained operations using different resources (e.g. CPU and GPU) that you
  want to saturate; this is the case for both ML ingest (CPU-based preprocessing and
  GPU-based training) and batch inference (CPU-based preprocessing and GPU-based batch
  inference). A sketch of this scenario follows the list.
* You want to do streaming data loading and processing in order to keep the size of the
  working set small; see the previous FAQ on
  :ref:`how to do streaming data loading and processing <datasets_streaming_faq>`.
* You want to decrease the time-to-first-batch (latency) for a certain operation at the
  end of your workload. This is the case for training and inference, since pipelining
  prevents GPUs from being idle (which is costly), and can be advantageous for some
  other latency-sensitive consumers of datasets.

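As a sketch of the first scenario (CPU-based preprocessing pipelined with GPU-based
inference; the path, transforms, and pool sizes are placeholders):

.. code-block:: python

    import ray
    from ray.data import ActorPoolStrategy

    def preprocess(df):
        # CPU-bound preprocessing runs in regular Ray tasks.
        return df.fillna(0)

    class GPUModel:
        def __call__(self, df):
            # Stand-in for GPU-bound inference.
            return df

    pipe = (
        ray.data.read_parquet("s3://my-bucket/my-data")  # placeholder path
        .window(blocks_per_window=10)
        .map_batches(preprocess, batch_format="pandas")
        .map_batches(
            GPUModel,
            compute=ActorPoolStrategy(1, 4),
            batch_format="pandas",
            num_gpus=1,
        )
    )
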
When should I use global per-epoch shuffling?
=============================================

Background
~~~~~~~~~~

When training a machine learning model, shuffling your training dataset is important in
general in order to ensure that your model isn't overfitting on some unintended pattern
in your data, e.g. sorting on the label column or time-correlated samples. Per-epoch
shuffling in particular can improve your model's precision gain per epoch by reducing
the likelihood of bad (unrepresentative) batches getting you permanently stuck in local
minima: if you get unlucky and your last few batches have noisy labels that pull your
learned weights in the wrong direction, shuffling before the next epoch lets you bounce
out of such a gradient rut. In the distributed data-parallel training case, the current
status quo solution is typically to have a per-shard in-memory shuffle buffer that you
fill up and pop random batches from, without mixing data across shards between epochs.
Ray Datasets also offers fully global random shuffling via
:meth:`ds.random_shuffle() <ray.data.Dataset.random_shuffle>`, and doing so on an
epoch-repeated dataset pipeline to provide global per-epoch shuffling is as simple as
``ray.data.read().repeat().random_shuffle_each_window()``. But when should you opt for
global per-epoch shuffling instead of local shuffle buffer shuffling?

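Concretely, a minimal sketch of global per-epoch shuffling over a small in-memory
dataset:

.. code-block:: python

    import ray

    num_epochs = 3

    # Repeat the dataset once per epoch and globally shuffle each epoch's data.
    pipe = (
        ray.data.from_items([{"feature": float(i)} for i in range(1_000)])
        .repeat(num_epochs)
        .random_shuffle_each_window()
    )

    # Iterate over the epochs; each epoch sees a fresh global shuffle.
    for epoch_ds in pipe.iter_epochs():
        for batch in epoch_ds.iter_batches():
            pass  # feed into your training loop
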
How to choose a shuffling policy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Global per-epoch shuffling should only be used if your model is sensitive to the
randomness of the training data. There is a
`theoretical foundation <https://arxiv.org/abs/1709.10432>`__ for all
gradient-descent-based model trainers benefiting from improved (global) shuffle quality,
and we've found that this is particularly pronounced for tabular data/models in
practice. However, the more global your shuffle is, the more expensive the shuffling
operation becomes. This compounds when doing distributed data-parallel training on a
multi-node cluster due to data transfer costs, and the cost can be prohibitive when
using very large datasets.

The best route for determining the best tradeoff between preprocessing time + cost and
per-epoch shuffle quality is to measure the precision gain per training step for your
particular model under different shuffling policies:

* no shuffling,
* local (per-shard) limited-memory shuffle buffer,
* local (per-shard) shuffling,
* windowed (pseudo-global) shuffling, and
* fully global shuffling.

From the perspective of keeping preprocessing time in check: as long as your data
loading + shuffling throughput is higher than your training throughput, your GPU should
be saturated, so we recommend that users with shuffle-sensitive models push their
shuffle quality higher until this threshold is hit.

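As a sketch of the windowed (pseudo-global) and fully global options from this list
(the window size and epoch count are arbitrary):

.. code-block:: python

    import ray

    ds = ray.data.from_items([{"feature": float(i)} for i in range(10_000)])

    # Windowed (pseudo-global) per-epoch shuffling: each window is shuffled
    # independently, trading shuffle quality for a smaller working set.
    windowed = ds.window(blocks_per_window=10).repeat(2).random_shuffle_each_window()

    # Fully global per-epoch shuffling: the entire dataset is shuffled each epoch.
    global_shuffled = ds.repeat(2).random_shuffle_each_window()
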
What is Arrow and how does Ray Datasets use it?
===============================================

`Apache Arrow <https://arrow.apache.org/>`__ is a columnar memory format and a
single-node data processing and I/O library that Ray Datasets leverages extensively. You
can think of Ray Datasets as orchestrating distributed processing of Arrow data.

See our :ref:`key concepts <data_key_concepts>` for more information on how Ray Datasets
uses Arrow.

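One place Arrow surfaces directly in the API: you can transform batches as Arrow
tables, a minimal sketch of which is below.

.. code-block:: python

    import pyarrow as pa
    import pyarrow.compute as pc
    import ray

    # Creates an Arrow-backed dataset with a single "value" column.
    ds = ray.data.range_table(100)

    # Blocks are stored in Arrow format, so requesting Arrow batches
    # avoids conversion overhead.
    def add_doubled_column(table: pa.Table) -> pa.Table:
        return table.append_column("double", pc.multiply(table["value"], 2))

    ds = ds.map_batches(add_doubled_column, batch_format="pyarrow")
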
How much performance tuning does Ray Datasets require?
======================================================

Ray Datasets doesn't perform query optimization, so some manual performance
tuning may be necessary depending on your use case and data scale. Please see our
:ref:`performance tuning guide <data_performance_tips>` for more information.

How can I contribute to Ray Datasets?
=====================================

We're always happy to accept external contributions! If you have a question or a
feature request, want to contribute to Ray Datasets, or want to tell us about your use
case, please reach out to us on `Discourse <https://discuss.ray.io/>`__; if you're
confident that you've found a bug, please open an issue on the
`Ray GitHub repo <https://github.com/ray-project/ray>`__. Please see our
:ref:`contributing guide <getting-involved>` for more information!