.. _datasets:

.. note::

    Before you proceed, note that Ray Data is available as **beta** in Ray 1.8+.
    Please file feature requests and bug reports on GitHub Issues or join the discussion
    on the `Ray Slack <https://forms.gle/9TSdDYUgxYs8SA9e8>`__.

==============================================
Ray Data: Distributed Data Loading and Compute
==============================================
Ray Data is a library for loading and processing large datasets.
Its key abstraction is that of a Dataset.
Datasets are the standard way to load and exchange data in Ray libraries and applications.
They provide basic distributed data transformations such as ``map``, ``filter``, and ``repartition``,
and are compatible with a variety of file formats, data sources, and distributed frameworks.
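
As a rough illustration of these transformations, the following minimal sketch builds a small in-memory Dataset and applies ``map``, ``filter``, and ``repartition`` to it.
The column names ``x`` and ``x_squared`` and the sample data are made up for this example, and it assumes Ray is installed with its data dependencies; exact row types can differ between Ray releases.

.. code-block:: python

    import ray

    # Hypothetical sample data: each row is a dict with a single column "x".
    ds = ray.data.from_items([{"x": i} for i in range(8)])

    # map: apply a function to every row.
    squared = ds.map(lambda row: {"x": row["x"], "x_squared": row["x"] ** 2})

    # filter: keep only the rows for which the predicate is true.
    even = squared.filter(lambda row: row["x"] % 2 == 0)

    # repartition: change the number of blocks the data is split into.
    even = even.repartition(2)

    # Collect the remaining rows on the driver and sort them for display.
    values = sorted(row["x_squared"] for row in even.take(10))
    print(values)

Each call returns a new Dataset, so the intermediate results stay available for further processing.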

Here's an overview of the integrations with other processing frameworks, file formats, and supported operations,
as well as a glimpse at the Ray Data API.
Check our :ref:`compatibility matrix<data-compatibility>` to see if your favorite format is supported already.

.. image:: images/dataset.svg

..
  https://docs.google.com/drawings/d/16AwJeBNR46_TsrkOmMbGaBK7u-OPsf_V8fHjU-d2PPQ/edit

Ray Data simplifies general-purpose parallel GPU and CPU compute in Ray,
for instance `GPU batch inference <dataset.html#transforming-datasets>`__.
It provides a higher-level API for Ray tasks and actors in such embarrassingly parallel compute situations,
internally handling operations like batching, pipelining, and memory management.

.. image:: images/dataset-compute-1.png
   :width: 500px
   :align: center

As part of the Ray ecosystem, Ray Data can leverage the full functionality of Ray's distributed scheduler,
e.g., using actors for optimizing setup time and GPU scheduling.
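
To make the batch-compute pattern more concrete, here is a minimal CPU-only sketch that uses ``map_batches`` to apply a function to batches of rows.
The ``predict`` function is a hypothetical stand-in for a real model, and the ``feature`` and ``score`` column names are invented for illustration.
Running on GPUs or on a pool of actors takes additional arguments that are not shown here, because their exact form may differ between Ray releases.

.. code-block:: python

    import pandas as pd
    import ray


    def predict(batch: pd.DataFrame) -> pd.DataFrame:
        # Hypothetical "model": a real use case would run a loaded model here.
        # assign() returns a new DataFrame, so the input batch is not mutated.
        return batch.assign(score=batch["feature"] * 0.5)


    # Hypothetical input: 100 rows with a single float column.
    ds = ray.data.from_items([{"feature": float(i)} for i in range(100)])

    # Ray Data groups rows into pandas batches and calls predict() on each
    # batch in parallel.
    scored = ds.map_batches(predict, batch_size=25, batch_format="pandas")

    print(scored.count())

The function only sees one batch at a time, so the same code can be applied to datasets that do not fit on a single machine.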

----------------------
Where to Go from Here?
----------------------

As a new user of Ray Data, you may want to start with our :ref:`Getting Started Guide<data_getting_started>`.
If you've run your first examples already, you might want to dive into Ray Data's key concepts or our User Guide instead.
Advanced users can use the Ray Data API reference for their projects.

.. panels::
    :container: text-center
    :column: col-lg-6 px-2 py-2
    :card:

    Getting Started
    ^^^

    Start with our quick start tutorials for :ref:`working with Datasets<ray_data_quick_start>`
    and :ref:`Dataset Pipelines<data_pipelines_quick_start>`.
    These concrete examples will give you an idea of how to use Ray Data.

    +++
    .. link-button:: data_getting_started
        :type: ref
        :text: Get Started with Ray Data
        :classes: btn-outline-info btn-block
    ---

    Key Concepts
    ^^^

    Understand the key concepts behind Ray Data.
    Learn what :ref:`Datasets<dataset_concept>` and :ref:`Dataset Pipelines<dataset_pipeline_concept>` are
    and :ref:`how they get executed<dataset_execution_concept>` in Ray Data.

    +++
    .. link-button:: data_key_concepts
        :type: ref
        :text: Learn Key Concepts
        :classes: btn-outline-info btn-block
    ---

    User Guide
    ^^^

    Learn how to :ref:`load and process data for ML<datasets-ml-preprocessing>`,
    work with :ref:`tensor data<datasets_tensor_support>`, or :ref:`use pipelines<data_pipeline_usage>`.
    Run your first :ref:`Dask <dask-on-ray>`, :ref:`Spark <spark-on-ray>`, :ref:`Mars <mars-on-ray>`
    and :ref:`Modin <modin-on-ray>` examples on Ray Data.

    +++
    .. link-button:: data_user_guide
        :type: ref
        :text: Start Using Ray Data
        :classes: btn-outline-info btn-block
    ---

    API
    ^^^

    Get more in-depth information about the Ray Data API.

    +++
    .. link-button:: data_api
        :type: ref
        :text: Read the API Reference
        :classes: btn-outline-info btn-block

.. _data-compatibility:

------------------------
Datasource Compatibility
------------------------

Ray Data supports reading and writing many formats.
The following two compatibility matrices will help you understand which formats are currently available.

Supported Input Formats
=======================

.. list-table:: Input compatibility matrix
   :header-rows: 1

   * - Input Type
     - Read API
     - Status
   * - CSV File Format
     - :func:`ray.data.read_csv()`
     - ✅
   * - JSON File Format
     - :func:`ray.data.read_json()`
     - ✅
   * - Parquet File Format
     - :func:`ray.data.read_parquet()`
     - ✅
   * - NumPy File Format
     - :func:`ray.data.read_numpy()`
     - ✅
   * - Text Files
     - :func:`ray.data.read_text()`
     - ✅
   * - Binary Files
     - :func:`ray.data.read_binary_files()`
     - ✅
   * - Python Objects
     - :func:`ray.data.from_items()`
     - ✅
   * - Spark Dataframe
     - :func:`ray.data.from_spark()`
     - ✅
   * - Dask Dataframe
     - :func:`ray.data.from_dask()`
     - ✅
   * - Modin Dataframe
     - :func:`ray.data.from_modin()`
     - ✅
   * - Mars Dataframe
     - :func:`ray.data.from_mars()`
     - (todo)
   * - Pandas Dataframe Objects
     - :func:`ray.data.from_pandas()`
     - ✅
   * - NumPy ndarray Objects
     - :func:`ray.data.from_numpy()`
     - ✅
   * - Arrow Table Objects
     - :func:`ray.data.from_arrow()`
     - ✅
   * - Custom Datasource
     - :func:`ray.data.read_datasource()`
     - ✅
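
As a small illustration of two of the read APIs listed above, the sketch below creates one Dataset from a CSV file and another from in-memory Python objects.
The file name, the temporary directory, and the ``id`` and ``name`` columns are assumptions made up for this example.

.. code-block:: python

    import csv
    import os
    import tempfile

    import ray

    # Write a tiny, hypothetical CSV file so the example is self-contained.
    tmp_dir = tempfile.mkdtemp()
    csv_path = os.path.join(tmp_dir, "example.csv")
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name"])
        writer.writerows([[1, "a"], [2, "b"], [3, "c"]])

    # Create a Dataset from a file on disk.
    ds_from_file = ray.data.read_csv(csv_path)

    # Create a Dataset from in-memory Python objects.
    ds_from_memory = ray.data.from_items([{"id": 4, "name": "d"}])

    print(ds_from_file.count(), ds_from_memory.count())

The other ``read_*`` and ``from_*`` functions follow the same pattern of returning a Dataset, although their arguments vary by source.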

Supported Output Formats
========================

.. list-table:: Output compatibility matrix
   :header-rows: 1

   * - Output Type
     - Dataset API
     - Status
   * - CSV File Format
     - :meth:`ds.write_csv() <ray.data.Dataset.write_csv>`
     - ✅
   * - JSON File Format
     - :meth:`ds.write_json() <ray.data.Dataset.write_json>`
     - ✅
   * - Parquet File Format
     - :meth:`ds.write_parquet() <ray.data.Dataset.write_parquet>`
     - ✅
   * - NumPy File Format
     - :meth:`ds.write_numpy() <ray.data.Dataset.write_numpy>`
     - ✅
   * - Spark Dataframe
     - :meth:`ds.to_spark() <ray.data.Dataset.to_spark>`
     - ✅
   * - Dask Dataframe
     - :meth:`ds.to_dask() <ray.data.Dataset.to_dask>`
     - ✅
   * - Modin Dataframe
     - :meth:`ds.to_modin() <ray.data.Dataset.to_modin>`
     - ✅
   * - Mars Dataframe
     - :meth:`ds.to_mars() <ray.data.Dataset.to_mars>`
     - (todo)
   * - Arrow Table Objects
     - :meth:`ds.to_arrow_refs() <ray.data.Dataset.to_arrow_refs>`
     - ✅
   * - Arrow Table Iterator
     - :meth:`ds.iter_batches(batch_format="pyarrow") <ray.data.Dataset.iter_batches>`
     - ✅
   * - Single Pandas Dataframe
     - :meth:`ds.to_pandas() <ray.data.Dataset.to_pandas>`
     - ✅
   * - Pandas Dataframe Objects
     - :meth:`ds.to_pandas_refs() <ray.data.Dataset.to_pandas_refs>`
     - ✅
   * - NumPy ndarray Objects
     - :meth:`ds.to_numpy_refs() <ray.data.Dataset.to_numpy_refs>`
     - ✅
   * - Pandas Dataframe Iterator
     - :meth:`ds.iter_batches(batch_format="pandas") <ray.data.Dataset.iter_batches>`
     - ✅
   * - PyTorch Iterable Dataset
     - :meth:`ds.to_torch() <ray.data.Dataset.to_torch>`
     - ✅
   * - TensorFlow Iterable Dataset
     - :meth:`ds.to_tf() <ray.data.Dataset.to_tf>`
     - ✅
   * - Custom Datasource
     - :meth:`ds.write_datasource() <ray.data.Dataset.write_datasource>`
     - ✅
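
The following minimal sketch shows two of the output paths listed above: writing a Dataset to Parquet files and converting it to a single pandas DataFrame.
The output directory is a temporary one created for the example, and the ``id`` and ``value`` columns are hypothetical.

.. code-block:: python

    import os
    import tempfile

    import ray

    # Hypothetical sample data with two columns.
    ds = ray.data.from_items([{"id": i, "value": i * 1.5} for i in range(10)])

    # Write the Dataset as one or more Parquet files into a directory.
    out_dir = tempfile.mkdtemp()
    ds.write_parquet(out_dir)
    written_files = sorted(os.listdir(out_dir))

    # Read the files back to check that the round trip preserves the rows.
    restored_count = ray.data.read_parquet(out_dir).count()

    # Convert the Dataset to a single pandas DataFrame on the driver.
    df = ds.to_pandas()

    print(written_files, restored_count, len(df))

Note that ``to_pandas()`` collects all rows on the driver, so it is only suitable for datasets that fit into local memory.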

.. _data-talks:

----------
Learn More
----------

- [slides] `Talk given at PyData 2021 <https://docs.google.com/presentation/d/1zANPlmrxQkjPU62I-p92oFO3rJrmjVhs73hL4YbM4C4>`_
- [blog] `Data Ingest in a Third Generation ML Architecture <https://www.anyscale.com/blog/deep-dive-data-ingest-in-a-third-generation-ml-architecture>`_
- [blog] `Building an end-to-end ML pipeline using Mars and XGBoost on Ray <https://www.anyscale.com/blog/building-an-end-to-end-ml-pipeline-using-mars-and-xgboost-on-ray>`_

----------
Contribute
----------

Contributions to Ray Data are `welcome <https://docs.ray.io/en/master/development.html#python-develop>`__!
There are many potential improvements, including:

- Supporting more data sources and transforms.
- Integration with more ecosystem libraries.
- Adding features that require partitioning such as ``groupby()`` and ``join()``.
- Performance optimizations.