.. include:: /_includes/data/announcement.rst

.. _datasets:

==================================================
Ray Datasets: Distributed Data Loading and Compute
==================================================

.. _datasets-intro:

Ray Datasets are the standard way to load and exchange data in Ray libraries and applications.
They provide basic distributed data transformations such as maps
(:meth:`map_batches <ray.data.Dataset.map_batches>`),
global and grouped aggregations (:class:`GroupedDataset <ray.data.GroupedDataset>`), and
shuffling operations (:meth:`random_shuffle <ray.data.Dataset.random_shuffle>`,
:meth:`sort <ray.data.Dataset.sort>`,
:meth:`repartition <ray.data.Dataset.repartition>`),
and are compatible with a variety of file formats, data sources, and distributed frameworks.
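
The transformations named above can be chained as in the following minimal sketch.
It assumes a Ray 1.x release, where a dataset of plain Python objects hands its
batches to ``map_batches`` as Python lists; other releases may use a different batch
format. The helper ``double_batch`` and the wrapper ``main`` are illustrative names,
not part of Ray.

```python
def double_batch(batch):
    """Double every value in a batch.

    Assumption: the batch arrives as a plain Python list, which is how
    Ray 1.x passes batches of a dataset made of simple Python objects.
    """
    return [value * 2 for value in batch]


def main():
    # Imported here so the helper above can be used without Ray installed.
    import ray

    ds = ray.data.range(1000)          # dataset of the integers 0..999
    ds = ds.map_batches(double_batch)  # distributed map over batches
    ds = ds.random_shuffle()           # global random shuffle
    ds = ds.sort()                     # global sort of the values
    ds = ds.repartition(4)             # redistribute the data into 4 blocks
    print(ds.take(5))
```

Call ``main()`` in an environment where Ray is installed to run the pipeline.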

Here's an overview of the integrations with other processing frameworks, file formats, and supported operations,
as well as a glimpse at the Ray Datasets API.
Check our :ref:`compatibility matrix<data-compatibility>` to see if your favorite format
is already supported.

.. image:: images/dataset.svg

..
  https://docs.google.com/drawings/d/16AwJeBNR46_TsrkOmMbGaBK7u-OPsf_V8fHjU-d2PPQ/edit

Ray Datasets simplifies general-purpose parallel GPU and CPU compute in Ray, for
instance :ref:`GPU batch inference <transforming_datasets>`.
It provides a higher-level API for Ray tasks and actors for such embarrassingly parallel compute,
internally handling operations like batching, pipelining, and memory management.

.. image:: images/dataset-compute-1.png
   :width: 500px
   :align: center

As part of the Ray ecosystem, Ray Datasets can leverage the full functionality of Ray's distributed scheduler,
e.g., using actors for optimizing setup time and GPU scheduling.
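
As a rough illustration of using actors to amortize setup time, the sketch below
passes a callable class to ``map_batches`` with ``compute="actors"``, which is the
spelling used in Ray 1.x (later releases may expose this option differently).
``BatchPredictor``, its ``threshold`` attribute, and ``main`` are hypothetical
stand-ins; a real model load would replace the constructor body.

```python
class BatchPredictor:
    """Callable class: one instance is created per actor and reused for every batch."""

    def __init__(self):
        # Hypothetical stand-in for expensive setup such as loading model weights.
        self.threshold = 500

    def __call__(self, batch):
        # Assumption: batches arrive as Python lists (simple objects in Ray 1.x).
        return [int(value >= self.threshold) for value in batch]


def main():
    # Imported here so the class above can be used without Ray installed.
    import ray

    ds = ray.data.range(1000)
    # compute="actors" runs the class in long-lived actors, so __init__ runs
    # once per actor rather than once per batch. On a GPU cluster, passing
    # num_gpus=1 as well would reserve one GPU for each actor.
    predictions = ds.map_batches(BatchPredictor, compute="actors", batch_size=256)
    print(predictions.take(5))
```

Call ``main()`` in an environment where Ray is installed to run the example.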

Data Loading and Preprocessing for ML Training
==============================================

Ray Datasets are designed to load and preprocess data for distributed :ref:`ML training pipelines <train-docs>`.
Compared to other loading solutions, Datasets are more flexible (e.g., they can express higher-quality `per-epoch global shuffles <examples/big_data_ingestion.html>`__) and provide `higher overall performance <https://www.anyscale.com/blog/why-third-generation-ml-platforms-are-more-performant>`__.
Ray Datasets is not intended as a replacement for more general data processing systems.
:ref:`Learn more about how Ray Datasets works with other ETL systems <datasets-ml-preprocessing>`.

----------------------
Where to Go from Here?
----------------------

As a new user of Ray Datasets, you may want to start with our :ref:`Getting Started guide<datasets_getting_started>`.
If you've already run your first examples, you might want to dive into Ray Datasets'
:ref:`key concepts <data_key_concepts>` or our :ref:`User Guide <data_user_guide>` instead.
Advanced users can refer directly to the Ray Datasets :ref:`API reference <data_api>` for their projects.

.. panels::
    :container: text-center
    :column: col-lg-6 px-2 py-2
    :card:

    **Getting Started**
    ^^^

    Start with our quick start tutorials for :ref:`working with Datasets<datasets_getting_started>`
    and :ref:`Dataset Pipelines<pipelining_datasets>`.
    These concrete examples will give you an idea of how to use Ray Datasets.

    +++
    .. link-button:: datasets_getting_started
        :type: ref
        :text: Get Started with Ray Datasets
        :classes: btn-outline-info btn-block
    ---

    **Key Concepts**
    ^^^

    Understand the key concepts behind Ray Datasets.
    Learn what :ref:`Datasets<dataset_concept>` and :ref:`Dataset Pipelines<dataset_pipeline_concept>` are
    and :ref:`how they get executed<dataset_execution_concept>` in Ray Datasets.

    +++
    .. link-button:: data_key_concepts
        :type: ref
        :text: Learn Key Concepts
        :classes: btn-outline-info btn-block
    ---

    **User Guide**
    ^^^

    Learn how to :ref:`create datasets<creating_datasets>`, :ref:`save
    datasets<saving_datasets>`, :ref:`transform datasets<transforming_datasets>`,
    :ref:`access and exchange datasets<accessing_datasets>`, :ref:`pipeline
    transformations<pipelining_datasets>`, :ref:`load and process data for ML<datasets-ml-preprocessing>`,
    work with :ref:`tensor data<datasets_tensor_support>`, or :ref:`use pipelines<data_pipeline_usage>`.

    +++
    .. link-button:: data_user_guide
        :type: ref
        :text: Start Using Ray Datasets
        :classes: btn-outline-info btn-block
    ---

    **Examples**
    ^^^

    Find both simple and scaling-out examples of using Ray Datasets for data
    processing and ML ingest.

    +++
    .. link-button:: datasets-recipes
        :type: ref
        :text: Ray Datasets Examples
        :classes: btn-outline-info btn-block
    ---

    **Ray Datasets FAQ**
    ^^^

    Find answers to commonly asked questions in our detailed FAQ.

    +++
    .. link-button:: datasets_faq
        :type: ref
        :text: Ray Datasets FAQ
        :classes: btn-outline-info btn-block
    ---

    **API**
    ^^^

    Get more in-depth information about the Ray Datasets API.

    +++
    .. link-button:: data_api
        :type: ref
        :text: Read the API Reference
        :classes: btn-outline-info btn-block
    ---

    **Other Data Processing Solutions**
    ^^^

    For running ETL pipelines, check out :ref:`Spark-on-Ray <spark-on-ray>`. For scaling
    up your data science workloads, check out :ref:`Dask-on-Ray <dask-on-ray>`,
    :ref:`Modin <modin-on-ray>`, and :ref:`Mars-on-Ray <mars-on-ray>`.

    +++
    .. link-button:: integrations
        :type: ref
        :text: Check Out Other Data Processing Options
        :classes: btn-outline-info btn-block

.. _data-compatibility:

------------------------
Datasource Compatibility
------------------------

Ray Datasets supports reading and writing many file formats.
The following compatibility matrices will help you understand which formats are currently available.

If none of these meet your needs, please reach out on `Discourse <https://discuss.ray.io/>`__ or open a feature
request on the `Ray GitHub repo <https://github.com/ray-project/ray>`__, and check out
our :ref:`guide for implementing a custom Datasets datasource <datasets_custom_datasource>`
if you're interested in rolling your own integration!

Supported Input Formats
=======================

.. list-table:: Input compatibility matrix
   :header-rows: 1

   * - Input Type
     - Read API
     - Status
   * - CSV File Format
     - :func:`ray.data.read_csv()`
     - ✅
   * - JSON File Format
     - :func:`ray.data.read_json()`
     - ✅
   * - Parquet File Format
     - :func:`ray.data.read_parquet()`
     - ✅
   * - NumPy File Format
     - :func:`ray.data.read_numpy()`
     - ✅
   * - Text Files
     - :func:`ray.data.read_text()`
     - ✅
   * - Binary Files
     - :func:`ray.data.read_binary_files()`
     - ✅
   * - Python Objects
     - :func:`ray.data.from_items()`
     - ✅
   * - Spark Dataframe
     - :func:`ray.data.from_spark()`
     - ✅
   * - Dask Dataframe
     - :func:`ray.data.from_dask()`
     - ✅
   * - Modin Dataframe
     - :func:`ray.data.from_modin()`
     - ✅
   * - MARS Dataframe
     - :func:`ray.data.from_mars()`
     - ✅
   * - Pandas Dataframe Objects
     - :func:`ray.data.from_pandas()`
     - ✅
   * - NumPy ndarray Objects
     - :func:`ray.data.from_numpy()`
     - ✅
   * - Arrow Table Objects
     - :func:`ray.data.from_arrow()`
     - ✅
   * - 🤗 (Hugging Face) Dataset
     - :func:`ray.data.from_huggingface()`
     - ✅
   * - Custom Datasource
     - :func:`ray.data.read_datasource()`
     - ✅
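
Two of the read APIs from the matrix above might be used as in this sketch, which
creates one dataset from in-memory Python objects and another from a CSV file.
The sample data, the file name ``sample.csv``, and the helpers ``write_sample_csv``
and ``main`` are made up for illustration.

```python
import csv
import os
import tempfile


def write_sample_csv(directory):
    """Write a tiny CSV file into `directory` and return its path (made-up data)."""
    path = os.path.join(directory, "sample.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "value"])
        writer.writerows([[i, i * i] for i in range(5)])
    return path


def main():
    # Imported here so the helper above can be used without Ray installed.
    import ray

    # Dataset built from in-memory Python objects.
    items = ray.data.from_items([{"id": i, "value": i * i} for i in range(5)])
    print(items.count())

    # Dataset built from a file on disk; consume it while the file still exists.
    with tempfile.TemporaryDirectory() as tmp:
        ds = ray.data.read_csv(write_sample_csv(tmp))
        print(ds.schema())
        print(ds.take(2))
```

Call ``main()`` in an environment where Ray is installed to run the example.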

Supported Output Formats
========================

.. list-table:: Output compatibility matrix
   :header-rows: 1

   * - Output Type
     - Dataset API
     - Status
   * - CSV File Format
     - :meth:`ds.write_csv() <ray.data.Dataset.write_csv>`
     - ✅
   * - JSON File Format
     - :meth:`ds.write_json() <ray.data.Dataset.write_json>`
     - ✅
   * - Parquet File Format
     - :meth:`ds.write_parquet() <ray.data.Dataset.write_parquet>`
     - ✅
   * - NumPy File Format
     - :meth:`ds.write_numpy() <ray.data.Dataset.write_numpy>`
     - ✅
   * - Spark Dataframe
     - :meth:`ds.to_spark() <ray.data.Dataset.to_spark>`
     - ✅
   * - Dask Dataframe
     - :meth:`ds.to_dask() <ray.data.Dataset.to_dask>`
     - ✅
   * - Modin Dataframe
     - :meth:`ds.to_modin() <ray.data.Dataset.to_modin>`
     - ✅
   * - MARS Dataframe
     - :meth:`ds.to_mars() <ray.data.Dataset.to_mars>`
     - ✅
   * - Arrow Table Objects
     - :meth:`ds.to_arrow_refs() <ray.data.Dataset.to_arrow_refs>`
     - ✅
   * - Arrow Table Iterator
     - :meth:`ds.iter_batches(batch_format="pyarrow") <ray.data.Dataset.iter_batches>`
     - ✅
   * - Single Pandas Dataframe
     - :meth:`ds.to_pandas() <ray.data.Dataset.to_pandas>`
     - ✅
   * - Pandas Dataframe Objects
     - :meth:`ds.to_pandas_refs() <ray.data.Dataset.to_pandas_refs>`
     - ✅
   * - NumPy ndarray Objects
     - :meth:`ds.to_numpy_refs() <ray.data.Dataset.to_numpy_refs>`
     - ✅
   * - Pandas Dataframe Iterator
     - :meth:`ds.iter_batches(batch_format="pandas") <ray.data.Dataset.iter_batches>`
     - ✅
   * - PyTorch Iterable Dataset
     - :meth:`ds.to_torch() <ray.data.Dataset.to_torch>`
     - ✅
   * - TensorFlow Iterable Dataset
     - :meth:`ds.to_tf() <ray.data.Dataset.to_tf>`
     - ✅
   * - Random Access Dataset
     - :meth:`ds.to_random_access_dataset() <ray.data.Dataset.to_random_access_dataset>`
     - ✅
   * - Custom Datasource
     - :meth:`ds.write_datasource() <ray.data.Dataset.write_datasource>`
     - ✅
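
A few of the output paths from the matrix above could look like the following
sketch: iterating over pandas batches, collecting a single pandas DataFrame, and
writing CSV files. It assumes a small tabular dataset that fits in memory; the
sample records and the helpers ``make_rows`` and ``main`` are made up for
illustration.

```python
def make_rows(n):
    """Build n small records (made-up sample data)."""
    return [{"id": i, "squared": i * i} for i in range(n)]


def main():
    # Imported here so the helper above can be used without Ray installed.
    import tempfile

    import ray

    ds = ray.data.from_items(make_rows(100))

    # Pandas DataFrame iterator: stream the dataset back one batch at a time.
    for batch in ds.iter_batches(batch_size=25, batch_format="pandas"):
        print(len(batch))

    # Single pandas DataFrame: only sensible when the data fits in memory.
    df = ds.to_pandas()
    print(df.head())

    # CSV output: files are written into the given directory.
    with tempfile.TemporaryDirectory() as tmp:
        ds.write_csv(tmp)
```

Call ``main()`` in an environment where Ray is installed to run the example.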

.. _data-talks:

----------
Learn More
----------

- [slides] `Talk given at PyData 2021 <https://docs.google.com/presentation/d/1zANPlmrxQkjPU62I-p92oFO3rJrmjVhs73hL4YbM4C4>`_
- [blog] `Data Ingest in a Third Generation ML Architecture <https://www.anyscale.com/blog/deep-dive-data-ingest-in-a-third-generation-ml-architecture>`_
- [blog] `Building an end-to-end ML pipeline using Mars and XGBoost on Ray <https://www.anyscale.com/blog/building-an-end-to-end-ml-pipeline-using-mars-and-xgboost-on-ray>`_
- [blog] `Ray Datasets for large-scale machine learning ingest and scoring <https://www.anyscale.com/blog/ray-datasets-for-machine-learning-training-and-scoring>`_

----------
Contribute
----------

Contributions to Ray Datasets are `welcome <https://docs.ray.io/en/master/development.html#python-develop>`__!
There are many potential improvements, including:

- Supporting more data sources and transforms.
- Integration with more ecosystem libraries.
- Adding features such as ``join()``.
- Performance optimizations.

.. include:: /_includes/data/announcement_bottom.rst