.. _transforming_datasets:

=====================
Transforming Datasets
=====================

Ray Datasets transformations take in datasets and produce new datasets. For
example, *map* is a transformation that applies a user-defined function (UDF)
to each row of the input dataset and returns a new dataset as the result.
Datasets transformations are **composable**: operations can be further applied
to the result dataset, forming a chain of transformations that expresses more
complex computations. Transformations are the core abstraction for expressing
business logic in Datasets.
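
For instance, here is a minimal sketch (assuming Ray is installed and a small
integer dataset) of chaining two transformations:

.. code-block:: python

    import ray

    # Create a simple dataset of the integers 0-999.
    ds = ray.data.range(1000)

    # Chain two transformations: double each record, then shuffle the dataset.
    ds = ds.map_batches(lambda batch: [x * 2 for x in batch]).random_shuffle()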

.. _transform_datasets_transformations:

---------------
Transformations
---------------

In general, there are two types of transformations:

* **One-to-one transformations:** each input block contributes to only one output
  block, such as :meth:`ds.map_batches() <ray.data.Dataset.map_batches>`. In other
  systems these are often called narrow transformations.
* **All-to-all transformations:** input blocks can contribute to multiple output blocks,
  such as :meth:`ds.random_shuffle() <ray.data.Dataset.random_shuffle>`. In other
  systems these are often called wide transformations.

Here is a table listing some common transformations supported by Ray Datasets.

.. list-table:: Common Ray Datasets transformations.
   :header-rows: 1

   * - Transformation
     - Type
     - Description
   * - :meth:`ds.map_batches() <ray.data.Dataset.map_batches>`
     - One-to-one
     - Apply a given function to batches of records of this dataset.
   * - :meth:`ds.split() <ray.data.Dataset.split>`
     - One-to-one
     - Split the dataset into N disjoint pieces.
   * - :meth:`ds.repartition(shuffle=False) <ray.data.Dataset.repartition>`
     - One-to-one
     - Repartition the dataset into N blocks, without shuffling the data.
   * - :meth:`ds.repartition(shuffle=True) <ray.data.Dataset.repartition>`
     - All-to-all
     - Repartition the dataset into N blocks, shuffling the data during repartition.
   * - :meth:`ds.random_shuffle() <ray.data.Dataset.random_shuffle>`
     - All-to-all
     - Randomly shuffle the elements of this dataset.
   * - :meth:`ds.sort() <ray.data.Dataset.sort>`
     - All-to-all
     - Sort the dataset by a sort key.
   * - :meth:`ds.groupby() <ray.data.Dataset.groupby>`
     - All-to-all
     - Group the dataset by a group key.

.. tip::

    Datasets also provides the convenience transformation methods
    :meth:`ds.map() <ray.data.Dataset.map>`,
    :meth:`ds.flat_map() <ray.data.Dataset.flat_map>`, and
    :meth:`ds.filter() <ray.data.Dataset.filter>`, which are not vectorized
    (and hence slower than :meth:`ds.map_batches() <ray.data.Dataset.map_batches>`),
    but may be useful for development.

The following is an example that uses these transformation APIs to process
the Iris dataset.

.. literalinclude:: ./doc_code/transforming_datasets.py
  :language: python
  :start-after: __dataset_transformation_begin__
  :end-before: __dataset_transformation_end__

.. _transform_datasets_writing_udfs:

------------
Writing UDFs
------------

User-defined functions (UDFs) are routines that apply to one row (e.g.
:meth:`.map() <ray.data.Dataset.map>`) or to one batch of rows (e.g.
:meth:`.map_batches() <ray.data.Dataset.map_batches>`) of a dataset. UDFs let you
express your customized business logic in transformations. Here we will focus on
:meth:`.map_batches() <ray.data.Dataset.map_batches>`, as it's the primary mapping
API in Datasets.

A batch UDF can be a function or, if using the
:ref:`actor compute strategy <transform_datasets_compute_strategy>`, a
:ref:`callable class <transform_datasets_callable_classes>`.
These UDFs have several :ref:`batch format options <transform_datasets_batch_formats>`,
which control the format of the batches that are passed to the provided batch UDF.
Depending on the underlying :ref:`dataset format <transform_datasets_dataset_formats>`,
using a particular batch format may or may not incur a data conversion cost
(e.g. converting an Arrow Table to a Pandas DataFrame, or creating an Arrow Table
from a Python list, both of which incur a full copy of the data).

.. _transform_datasets_dataset_formats:

Dataset Formats
===============

A **dataset format** refers to how Datasets represents data under the hood as data
**blocks**.

* **Tabular (Arrow or Pandas) Datasets:** Represented under the hood as
  `Arrow Tables <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
  or
  `Pandas DataFrames <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__.
  Tabular datasets are created by reading on-disk tabular data (Parquet, CSV, JSON),
  by converting from in-memory tabular data (Pandas DataFrames, Arrow Tables,
  dictionaries, Dask DataFrames, Modin DataFrames, Spark DataFrames, Mars DataFrames),
  and by returning tabular data (Arrow, Pandas, dictionaries) from previous batch
  UDFs.

  Datasets minimizes conversions between the Arrow and Pandas
  representations: tabular data read from disk is always read as Arrow Tables, while
  in-memory conversions of Pandas DataFrame data to a ``Dataset`` (e.g. converting from
  Pandas, Dask, Modin, Spark, and Mars) use Pandas DataFrames as the internal data
  representation in order to avoid extra data conversions/copies.

* **Tensor Datasets:** Represented under the hood as a single-column
  `Arrow Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
  or
  `Pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__,
  using our :ref:`tensor extension type <dataset-tensor-extension-api>` to embed the
  tensor in these tables under a single ``"__value__"`` column. Tensor datasets are
  created by reading on-disk tensor data (NPY), by converting from in-memory tensor
  data (NumPy), and by returning tensor data (NumPy) from previous batch UDFs.

  Note that converting the underlying tabular representation to a NumPy ndarray is
  zero-copy, whether converting the entire column to an ndarray or converting just a
  single column element to an ndarray.

* **Simple Datasets:** Represented under the hood as plain Python lists. Simple
  datasets are created by reading on-disk binary and plain-text data, by converting
  from in-memory simple data (Python lists), and by returning simple data (Python
  lists) from previous batch UDFs.

  Simple datasets are mostly used as an escape hatch for data that's not cleanly
  representable in Arrow Tables and Pandas DataFrames.
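
As a quick illustration, here is a minimal sketch (the column names, shapes, and
values are arbitrary, and the creation APIs reflect those available as of this
writing) of creating each dataset format from in-memory data:

.. code-block:: python

    import numpy as np
    import pandas as pd
    import ray

    # Tabular dataset, backed by Pandas DataFrame blocks.
    tabular_ds = ray.data.from_pandas(pd.DataFrame({"a": [1, 2, 3]}))

    # Tensor dataset, backed by a table with a single "__value__" column.
    tensor_ds = ray.data.from_numpy(np.ones((3, 4, 4)))

    # Simple dataset, backed by plain Python list blocks.
    simple_ds = ray.data.from_items(["spam", "ham", "eggs"])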

.. _transform_datasets_batch_formats:

Batch Formats
=============

The **batch format** is the format of the data that's given to batch UDFs, e.g.
Pandas DataFrames, Arrow Tables, NumPy ndarrays, or Python lists. The
:meth:`.map_batches() <ray.data.Dataset.map_batches>` API has a ``batch_format: str``
parameter that allows the user to dictate the batch format; we dig into the details
of each ``batch_format`` option below:
.. tabbed:: "native" (default)
|
|
|
|
The ``"native"`` batch format presents batches in the canonical format for each dataset
|
|
type:
|
|
|
|
* **Tabular (Arrow or Pandas) Datasets:** Each batch will be a
|
|
`pandas.DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__.
|
|
If the dataset is represented as Arrow Tables under-the-hood, the conversion to the
|
|
Pandas DataFrame batch format will incur a conversion cost (i.e., a full copy of
|
|
each batch will be created in order to convert it).
|
|
* **Tensor Datasets:** Each batch will be a
|
|
`numpy.ndarray <https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html>`__.
|
|
* **Simple Datasets:** Each batch will be a Python list.
|
|
|
|
.. literalinclude:: ./doc_code/transforming_datasets.py
|
|
:language: python
|
|
:start-after: __writing_native_udfs_begin__
|
|
:end-before: __writing_native_udfs_end__
|
|
|
|
.. tabbed:: "pandas"
|
|
|
|
The ``"pandas"`` batch format provides batches in a
|
|
`pandas.DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__
|
|
format. If converting a simple dataset to Pandas DataFrame batches, a single-column
|
|
dataframe with the column ``"value"`` will be created.
|
|
|
|
If the underlying datasets data is not already in Pandas DataFrame format, the
|
|
batch format conversion will incur a copy for each batch.
|
|
|
|
.. literalinclude:: ./doc_code/transforming_datasets.py
|
|
:language: python
|
|
:start-after: __writing_pandas_udfs_begin__
|
|
:end-before: __writing_pandas_udfs_end__
|
|
|
|
.. tabbed:: "pyarrow"
|
|
|
|
The ``"pyarrow"`` batch format provides batches in a
|
|
`pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
|
|
format. If converting a simple dataset to Arrow Table batches, a single-column table
|
|
with the column ``"value"`` will be created.
|
|
|
|
If the underlying datasets data is not already in Arrow Table format, the
|
|
batch format conversion will incur a copy for each batch.
|
|
|
|
.. literalinclude:: ./doc_code/transforming_datasets.py
|
|
:language: python
|
|
:start-after: __writing_arrow_udfs_begin__
|
|
:end-before: __writing_arrow_udfs_end__
|
|
|
|
.. tabbed:: "numpy"
|
|
|
|
The ``"numpy"`` batch format provides batches in a
|
|
`numpy.ndarray <https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html>`__
|
|
format. The following details how each dataset format is converted into a NumPy batch:
|
|
|
|
* **Tensor Datasets:** Each batch will be a single
|
|
`numpy.ndarray <https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html>`__
|
|
containing the single-tensor-column for this batch.
|
|
|
|
Note that this conversion should always be zero-copy.
|
|
|
|
* **Tabular (Arrow or Pandas) Datasets:** Each batch will be a dictionary of NumPy
|
|
ndarrays (``Dict[str, np.ndarray]``), with each key-value pair representing a column
|
|
in the table.
|
|
|
|
Note that this conversion should usually be zero-copy.
|
|
|
|
* **Simple Datasets:** Each batch will be a single NumPy ndarray, where Datasets will
|
|
attempt to convert each list-batch to an ndarray.
|
|
|
|
Note that this conversion will
|
|
incur a copy for primitive types that are representable as a NumPy dtype (since
|
|
NumPy will create a continguous ndarray in that case), and will not incur a copy if
|
|
the batch items aren't supported in the NumPy dtype system (since NumPy will create
|
|
an ndarray of ``np.object`` pointers to the batch items in that case).
|
|
|
|
.. literalinclude:: ./doc_code/transforming_datasets.py
|
|
:language: python
|
|
:start-after: __writing_numpy_udfs_begin__
|
|
:end-before: __writing_numpy_udfs_end__
|
|
|
|

The following table summarizes the conversion costs from a particular dataset format
to a particular batch format:

.. list-table:: Batch format conversion costs - is the conversion zero-copy?
   :header-rows: 1
   :stub-columns: 1
   :widths: 40 15 15 15 15
   :align: center

   * - Dataset Format -> Batch Format
     - ``"native"``
     - ``"pandas"``
     - ``"pyarrow"``
     - ``"numpy"``
   * - Tabular - Pandas
     - Zero-Copy
     - Zero-Copy
     - Copy
     - Zero-Copy
   * - Tabular - Arrow
     - Copy
     - Copy
     - Zero-Copy
     - Zero-Copy
   * - Tensor - Pandas
     - Zero-Copy
     - Zero-Copy
     - Copy
     - Zero-Copy
   * - Tensor - Arrow
     - Zero-Copy
     - Copy
     - Zero-Copy
     - Zero-Copy
   * - Simple - List
     - Zero-Copy
     - Copy
     - Copy
     - Copy

You should reference the `pyarrow.Table APIs
<https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__, the
`pandas.DataFrame APIs <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__,
or the `numpy.ndarray APIs <https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html>`__
when writing batch UDFs.

.. tip::

    Write your UDFs to leverage built-in vectorized operations on the
    ``pandas.DataFrame``, ``pyarrow.Table``, and ``numpy.ndarray`` abstractions for
    better performance. For example, suppose you want to compute the sum of a column
    in a ``pandas.DataFrame``: instead of iterating over each row of a batch and
    summing up values of that column, use ``df_batch["col_foo"].sum()``, as sketched
    below.
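
For instance, here is a minimal sketch of a vectorized batch UDF (the column name
``col_foo`` is a placeholder for illustration):

.. code-block:: python

    import pandas as pd

    # One vectorized reduction per batch instead of a Python-level row loop.
    def summarize(df_batch: pd.DataFrame) -> pd.DataFrame:
        return pd.DataFrame({"col_foo_sum": [df_batch["col_foo"].sum()]})

    # Applied with: ds.map_batches(summarize, batch_format="pandas")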

.. _transform_datasets_callable_classes:

Callable Class UDFs
===================

When using the actor compute strategy, per-row and per-batch UDFs can also be
**callable classes**, i.e. classes that implement the ``__call__`` magic method. The
constructor of the class can be used for stateful setup, and will only be invoked
once per actor worker.

.. note::
    These transformation APIs take the uninstantiated callable class as an argument,
    **not** an instance of the class!

.. literalinclude:: ./doc_code/transforming_datasets.py
  :language: python
  :start-after: __writing_callable_classes_udfs_begin__
  :end-before: __writing_callable_classes_udfs_end__
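
As another minimal sketch (the model and its loading helper are hypothetical
placeholders), a stateful callable class UDF looks like this:

.. code-block:: python

    import ray

    class ModelUDF:
        def __init__(self):
            # Expensive setup runs once per actor worker, not once per batch.
            self.model = load_model()  # hypothetical helper

        def __call__(self, batch):
            # Reuse the already-loaded model for every batch.
            return self.model.predict(batch)

    # Pass the class itself, not an instance:
    # ds.map_batches(ModelUDF, compute=ray.data.ActorPoolStrategy(2, 8))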

.. _transform_datasets_batch_output_types:

Batch UDF Output Types
======================

The return type of a batch UDF will determine the format of the resulting dataset.
This allows you to dynamically convert between data formats during batch
transformations.

The following details how each **batch** UDF (as used in
:meth:`ds.map_batches() <ray.data.Dataset.map_batches>`) return type will be
interpreted by Datasets when constructing its internal blocks:

.. tabbed:: pd.DataFrame

    Tabular dataset containing
    `Pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__
    blocks.

    .. literalinclude:: ./doc_code/transforming_datasets.py
      :language: python
      :start-after: __writing_pandas_out_udfs_begin__
      :end-before: __writing_pandas_out_udfs_end__

.. tabbed:: pa.Table

    Tabular dataset containing
    `pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
    blocks.

    .. literalinclude:: ./doc_code/transforming_datasets.py
      :language: python
      :start-after: __writing_arrow_out_udfs_begin__
      :end-before: __writing_arrow_out_udfs_end__

.. tabbed:: np.ndarray

    Tensor dataset containing
    `pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
    blocks, representing the tensor using our
    :ref:`tensor extension type <dataset-tensor-extension-api>` to embed the tensor
    in these tables under a single ``"__value__"`` column.

    .. literalinclude:: ./doc_code/transforming_datasets.py
      :language: python
      :start-after: __writing_numpy_out_udfs_begin__
      :end-before: __writing_numpy_out_udfs_end__

.. tabbed:: Dict[str, np.ndarray]

    Tabular dataset containing
    `pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
    blocks, where each key-value pair treats the key as the column name and the value
    as the column tensor.

    If a column tensor is 1-dimensional, then the native Arrow 1D list
    type is used; if a column tensor has 2 or more dimensions, then we use our
    :ref:`tensor extension type <dataset-tensor-extension-api>` to embed these
    n-dimensional tensors in the Arrow table.

    .. literalinclude:: ./doc_code/transforming_datasets.py
      :language: python
      :start-after: __writing_numpy_dict_out_udfs_begin__
      :end-before: __writing_numpy_dict_out_udfs_end__

.. tabbed:: list

    Simple dataset containing Python list blocks.

    .. literalinclude:: ./doc_code/transforming_datasets.py
      :language: python
      :start-after: __writing_simple_out_udfs_begin__
      :end-before: __writing_simple_out_udfs_end__
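
Putting this together, here is a minimal sketch (assuming a simple input dataset) of
converting a dataset's format by choosing the batch UDF's return type:

.. code-block:: python

    import numpy as np
    import ray

    # Simple dataset of Python integers.
    ds = ray.data.range(8)

    # Returning Dict[str, np.ndarray] from the UDF yields a tabular dataset.
    tabular_ds = ds.map_batches(lambda batch: {"value": np.array(batch)})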

.. _transform_datasets_row_output_types:

Row UDF Output Types
====================

The return type of a row UDF will determine the format of the resulting dataset.
This allows you to dynamically convert between data formats during row-based
transformations.

The following details how each **row** UDF (as used in
:meth:`ds.map() <ray.data.Dataset.map>`) return type will be interpreted by Datasets
when constructing its internal blocks:

.. tabbed:: dict

    Tabular dataset containing
    `pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
    blocks, treating the keys as the column names and the values as the column
    elements.

    If Arrow Table construction fails due to a value type in the dictionary not
    being supported by Arrow, then Datasets will fall back to a simple dataset
    containing Python list blocks.

    .. literalinclude:: ./doc_code/transforming_datasets.py
      :language: python
      :start-after: __writing_dict_out_row_udfs_begin__
      :end-before: __writing_dict_out_row_udfs_end__

.. tabbed:: PandasRow

    Tabular dataset containing
    `Pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__
    blocks.

    A ``PandasRow`` object is a zero-copy row view on an underlying Pandas DataFrame
    block that Datasets provides to per-row UDFs
    (:meth:`ds.map() <ray.data.Dataset.map>`) and returns in the row iterators
    (:meth:`ds.iter_rows() <ray.data.Dataset.iter_rows>`).
    This row view provides a dictionary interface and can be transparently used as a
    dictionary. See the :class:`TableRow API <ray.data.row.TableRow>` for more
    information on this row view object.

    Note that a ``PandasRow`` is immutable, so this row mapping cannot be updated
    in-place. If you want to update the row, copy this zero-copy row view into a
    plain Python dictionary with
    :meth:`TableRow.as_pydict() <ray.data.row.TableRow.as_pydict>` and then mutate
    and return that dictionary, as sketched below.

    .. literalinclude:: ./doc_code/transforming_datasets.py
      :language: python
      :start-after: __writing_table_row_out_row_udfs_begin__
      :end-before: __writing_table_row_out_row_udfs_end__
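
    For instance, here is a minimal sketch (the ``value`` column name is a
    placeholder) of this copy-then-mutate pattern:

    .. code-block:: python

        def increment(row):
            # Copy the zero-copy row view into a plain dict before mutating it.
            new_row = row.as_pydict()
            new_row["value"] += 1
            return new_row

        # Applied with: ds.map(increment)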

.. tabbed:: ArrowRow

    Tabular dataset containing
    `pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
    blocks.

    An ``ArrowRow`` object is a zero-copy row view on an underlying Arrow Table block
    that Datasets provides to per-row UDFs
    (:meth:`ds.map() <ray.data.Dataset.map>`) and returns in the row iterators
    (:meth:`ds.iter_rows() <ray.data.Dataset.iter_rows>`).
    This row view provides a dictionary interface and can be transparently used as a
    dictionary. See the :class:`TableRow API <ray.data.row.TableRow>` for more
    information on this row view object.

    Note that an ``ArrowRow`` is immutable, so this row mapping cannot be updated
    in-place. If you want to update the row, copy this zero-copy row view into a
    plain Python dictionary with
    :meth:`TableRow.as_pydict() <ray.data.row.TableRow.as_pydict>` and then mutate
    and return that dictionary.

    .. literalinclude:: ./doc_code/transforming_datasets.py
      :language: python
      :start-after: __writing_table_row_out_row_udfs_begin__
      :end-before: __writing_table_row_out_row_udfs_end__

.. tabbed:: np.ndarray

    Tensor dataset containing
    `pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
    blocks, representing the tensor using our
    :ref:`tensor extension type <dataset-tensor-extension-api>` to embed the tensor
    in these tables under a single ``"__value__"`` column. Each such ``ndarray`` will
    be treated as a row in this column.

    .. literalinclude:: ./doc_code/transforming_datasets.py
      :language: python
      :start-after: __writing_numpy_out_row_udfs_begin__
      :end-before: __writing_numpy_out_row_udfs_end__

.. tabbed:: Any

    All other row return types will result in a simple dataset containing Python
    list blocks.

    .. literalinclude:: ./doc_code/transforming_datasets.py
      :language: python
      :start-after: __writing_simple_out_row_udfs_begin__
      :end-before: __writing_simple_out_row_udfs_end__
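
As a quick recap, here is a minimal sketch (assuming a simple input dataset) of a
row UDF whose dict return type produces a tabular dataset:

.. code-block:: python

    import ray

    ds = ray.data.range(4)

    # Returning a dict per row yields a tabular (Arrow) dataset.
    tabular_ds = ds.map(lambda i: {"value": i})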

.. _transform_datasets_compute_strategy:

----------------
Compute Strategy
----------------

Datasets transformations are executed by either :ref:`Ray tasks <ray-remote-functions>`
or :ref:`Ray actors <actor-guide>` across a Ray cluster. By default, Ray tasks are
used (with ``compute="tasks"``). For transformations that require expensive setup,
it's preferable to use Ray actors, which are stateful and allow setup to be reused
for efficiency. You can specify ``compute=ray.data.ActorPoolStrategy(min, max)`` and
Ray will use an autoscaling actor pool of ``min`` to ``max`` actors to execute your
transforms. For a fixed-size actor pool, just specify ``ActorPoolStrategy(n, n)``,
as in the sketch below.
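
For example, a minimal sketch (the UDFs here are trivial stand-ins) of selecting
each compute strategy:

.. code-block:: python

    import ray

    # A stateless function UDF for the task-based strategy.
    def double(batch):
        return [x * 2 for x in batch]

    # A callable-class UDF for the actor-based strategy.
    class Double:
        def __call__(self, batch):
            return [x * 2 for x in batch]

    ds = ray.data.range(1000)

    # Default: execute the UDF with stateless Ray tasks.
    ds.map_batches(double, compute="tasks")

    # Execute the UDF on an autoscaling pool of 2 to 8 Ray actors.
    ds.map_batches(Double, compute=ray.data.ActorPoolStrategy(2, 8))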

The following is an example of using the Ray tasks and actors compute strategies
for batch inference:

.. literalinclude:: ./doc_code/transforming_datasets.py
  :language: python
  :start-after: __dataset_compute_strategy_begin__
  :end-before: __dataset_compute_strategy_end__