.. _consuming_datasets:

==================
Consuming Datasets
==================

The data underlying a ``Dataset`` can be consumed in several ways:

* Retrieving a limited prefix of rows.
* Iterating over rows and batches.
* Saving to files.

Retrieving a limited set of rows
================================

A limited set of rows can be retrieved from a ``Dataset`` via the
:meth:`ds.take() <ray.data.Dataset.take>` API, along with its sibling helper APIs
:meth:`ds.take_all() <ray.data.Dataset.take_all>`, for retrieving **all** rows, and
:meth:`ds.show() <ray.data.Dataset.show>`, for printing a limited set of rows. These
methods are convenient for quickly inspecting a subset (prefix) of rows. They also have
the benefit that, when used right after reading, they only trigger reads of as many
files as are needed to produce the requested rows; when inspecting a small prefix of
rows, often only the first file needs to be read.

.. literalinclude:: ./doc_code/accessing_datasets.py
  :language: python
  :start-after: __take_begin__
  :end-before: __take_end__
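
For reference, a minimal sketch of what these calls look like, using a synthetic
dataset built with ``ray.data.range()`` (a real workload would typically start from a
``ray.data.read_*()`` call instead):

.. code-block:: python

    import ray

    # A small synthetic dataset of the integers [0, 1000).
    ds = ray.data.range(1000)

    # Retrieve the first five rows as a Python list.
    first_five = ds.take(5)

    # Retrieve *all* rows; avoid this on large datasets, since it pulls
    # the entire dataset into the driver's memory.
    all_rows = ds.take_all()

    # Print the first five rows to stdout.
    ds.show(5)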

Iterating over Datasets
=======================

Datasets can be consumed a row at a time using the
:meth:`ds.iter_rows() <ray.data.Dataset.iter_rows>` API

.. literalinclude:: ./doc_code/accessing_datasets.py
  :language: python
  :start-after: __iter_rows_begin__
  :end-before: __iter_rows_end__

or a batch at a time using the
:meth:`ds.iter_batches() <ray.data.Dataset.iter_batches>` API, where you can specify
the batch size as well as the desired batch format. By default, the batch format is
``"native"``, meaning that the format native to the underlying data type is returned.
For tabular data, the native format is a Pandas DataFrame; for Python objects, it's a
list.

.. literalinclude:: ./doc_code/accessing_datasets.py
  :language: python
  :start-after: __iter_batches_begin__
  :end-before: __iter_batches_end__
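
A rough sketch of both iteration patterns described above (the ``batch_size`` value is
arbitrary):

.. code-block:: python

    import ray

    ds = ray.data.range(1000)

    # Consume the dataset one row at a time.
    for row in ds.iter_rows():
        pass  # Process a single row here.

    # Consume the dataset 128 rows at a time, in the native batch format.
    for batch in ds.iter_batches(batch_size=128, batch_format="native"):
        pass  # Process a whole batch here.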

Datasets can be passed to Ray tasks or actors and accessed by these iteration methods.
This does not incur a copy, since the blocks of the Dataset are passed by reference as
Ray objects:

.. literalinclude:: ./doc_code/accessing_datasets.py
  :language: python
  :start-after: __remote_iterators_begin__
  :end-before: __remote_iterators_end__
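
A condensed sketch of this pattern, assuming a simple counting task (the task body is a
placeholder for real per-batch processing):

.. code-block:: python

    import ray

    @ray.remote
    def consume(data: ray.data.Dataset) -> int:
        # The dataset's blocks arrive by reference; iterating fetches
        # them as needed rather than copying the dataset up front.
        num_batches = 0
        for batch in data.iter_batches(batch_size=128):
            num_batches += 1
        return num_batches

    ds = ray.data.range(10000)
    print(ray.get(consume.remote(ds)))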

Splitting Into and Consuming Shards
===================================

Datasets can be split up into disjoint sub-datasets, or shards.
Locality-aware splitting is supported if you pass a list of actor handles to the
:meth:`ds.split() <ray.data.Dataset.split>` function along with the number of desired
splits. This is a common pattern, useful for loading and sharding data between
distributed training actors:

.. note::

  If using :ref:`Ray Train <train-docs>` for distributed training, you do not need to
  split the dataset; Ray Train will automatically do locality-aware splitting into
  per-trainer shards for you!

.. literalinclude:: ./doc_code/accessing_datasets.py
  :language: python
  :start-after: __split_begin__
  :end-before: __split_end__
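
A condensed sketch of the splitting pattern (the ``Worker`` actor and its "training"
method are placeholders; ``locality_hints`` is the keyword argument that accepts the
actor handles):

.. code-block:: python

    import ray

    @ray.remote
    class Worker:
        def train(self, shard: ray.data.Dataset) -> int:
            # Placeholder training loop: just count the rows in this shard.
            return sum(1 for _ in shard.iter_rows())

    workers = [Worker.remote() for _ in range(4)]
    ds = ray.data.range(10000)

    # One locality-aware shard per worker actor.
    shards = ds.split(len(workers), locality_hints=workers)
    print(ray.get([w.train.remote(s) for w, s in zip(workers, shards)]))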

.. _saving_datasets:

===============
Saving Datasets
===============

Datasets can be written to local or remote storage in the desired data format.
The supported formats include Parquet, CSV, JSON, and NumPy. To control the number
of output files, you may use :meth:`ds.repartition() <ray.data.Dataset.repartition>`
to repartition the Dataset before writing out.
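
For instance, a sketch of capping the number of output files before a write (the output
path and block count are arbitrary):

.. code-block:: python

    import ray

    ds = ray.data.range(1000)

    # Coalesce the dataset into two blocks so the write below
    # produces exactly two output files.
    ds = ds.repartition(2)

    # Write one Parquet file per block under the given directory.
    ds.write_parquet("/tmp/output")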

.. tabbed:: Parquet

  .. literalinclude:: ./doc_code/saving_datasets.py
    :language: python
    :start-after: __write_parquet_begin__
    :end-before: __write_parquet_end__

.. tabbed:: CSV

  .. literalinclude:: ./doc_code/saving_datasets.py
    :language: python
    :start-after: __write_csv_begin__
    :end-before: __write_csv_end__

.. tabbed:: JSON

  .. literalinclude:: ./doc_code/saving_datasets.py
    :language: python
    :start-after: __write_json_begin__
    :end-before: __write_json_end__

.. tabbed:: NumPy

  .. literalinclude:: ./doc_code/saving_datasets.py
    :language: python
    :start-after: __write_numpy_begin__
    :end-before: __write_numpy_end__