[docs] Change data tagline to "Distributed Data Preprocessing" (#27434)

Eric Liang 2022-08-03 16:57:07 -07:00 committed by GitHub
parent 55209692ee
commit cd9cabcadf


@@ -2,9 +2,9 @@
 .. _datasets:
-==================================================
-Ray Datasets: Distributed Data Loading and Compute
-==================================================
+============================================
+Ray Datasets: Distributed Data Preprocessing
+============================================
 .. _datasets-intro:
@@ -29,7 +29,19 @@ is already supported.
 https://docs.google.com/drawings/d/16AwJeBNR46_TsrkOmMbGaBK7u-OPsf_V8fHjU-d2PPQ/edit
-Ray Datasets simplifies general purpose parallel GPU and CPU compute in Ray; for
+Data Loading and Preprocessing for ML Training
+==============================================
+Ray Datasets is designed to load and preprocess data for distributed :ref:`ML training pipelines <train-docs>`.
+Compared to other loading solutions, Datasets are more flexible (e.g., can express higher-quality `per-epoch global shuffles <examples/big_data_ingestion.html>`__) and provide `higher overall performance <https://www.anyscale.com/blog/why-third-generation-ml-platforms-are-more-performant>`__.
+Ray Datasets is not intended as a replacement for more general data processing systems.
+:ref:`Learn more about how Ray Datasets works with other ETL systems <datasets-ml-preprocessing>`.
+Datasets for Parallel Compute
+=============================
+Datasets also simplifies general purpose parallel GPU and CPU compute in Ray; for
 instance, for :ref:`GPU batch inference <transforming_datasets>`.
 It provides a higher-level API for Ray tasks and actors for such embarrassingly parallel compute,
 internally handling operations like batching, pipelining, and memory management.
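
For readers skimming this change, a minimal sketch of the loading-and-preprocessing workflow the new "Data Loading and Preprocessing for ML Training" section describes is below. The Parquet path, the column name, and the preprocessing function are hypothetical placeholders, not part of this commit:

    import ray

    # Hypothetical input path; any supported datasource (CSV, Parquet, etc.) works.
    ds = ray.data.read_parquet("s3://example-bucket/train")

    # Hypothetical preprocessing function applied in parallel across the cluster.
    # With batch_format="pandas", each batch arrives as a pandas DataFrame.
    def add_feature(batch):
        batch["feature_squared"] = batch["feature"] ** 2
        return batch

    ds = ds.map_batches(add_feature, batch_format="pandas")

    # Per-epoch global shuffle, one of the capabilities the new section calls out.
    ds = ds.random_shuffle()

    # Feed the shuffled data to a (hypothetical) training loop.
    for batch in ds.iter_batches(batch_size=1024):
        pass  # train_step(batch)
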
@@ -41,15 +53,6 @@ internally handling operations like batching, pipelining, and memory management.
 As part of the Ray ecosystem, Ray Datasets can leverage the full functionality of Ray's distributed scheduler,
 e.g., using actors for optimizing setup time and GPU scheduling.
-Data Loading and Preprocessing for ML Training
-==============================================
-Ray Datasets are designed to load and preprocess data for distributed :ref:`ML training pipelines <train-docs>`.
-Compared to other loading solutions, Datasets are more flexible (e.g., can express higher-quality `per-epoch global shuffles <examples/big_data_ingestion.html>`__) and provides `higher overall performance <https://www.anyscale.com/blog/why-third-generation-ml-platforms-are-more-performant>`__.
-Ray Datasets is not intended as a replacement for more general data processing systems.
-:ref:`Learn more about how Ray Datasets works with other ETL systems <datasets-ml-preprocessing>`.
 ----------------------
 Where to Go from Here?
 ----------------------
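
Similarly, a hedged sketch of the "Datasets for Parallel Compute" use case (GPU batch inference with stateful actors). The Predictor class and the placeholder model are hypothetical, and this assumes the actor compute strategy and forwarded num_gpus argument accepted by map_batches:

    import ray

    # Hypothetical stateful predictor; loading the model in __init__ lets the
    # actor pool amortize setup cost across many batches.
    class Predictor:
        def __init__(self):
            # Placeholder "model": a real pipeline would load weights onto the GPU here.
            self.model = lambda batch: batch

        def __call__(self, batch):
            return self.model(batch)

    ds = ray.data.read_parquet("s3://example-bucket/inference")  # placeholder path

    # Run inference with a pool of GPU actors instead of stateless tasks.
    predictions = ds.map_batches(
        Predictor,
        compute="actors",  # actor-based compute strategy (assumed; default is tasks)
        batch_size=256,
        num_gpus=1,        # remote arg so each actor is scheduled on a GPU
    )
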