.. _transforming_datasets:
=====================
Transforming Datasets
=====================
Ray Datasets transformations take in datasets and produce new datasets.
For example, *map* is a transformation that applies a user-defined function (UDF)
to each row of the input dataset and returns a new dataset as the result. Datasets
transformations are **composable**: operations can be further applied to the result
dataset, forming a chain of transformations that expresses more complex computations.
Transformations are the core means of expressing business logic in Datasets.
.. _transform_datasets_transformations:
---------------
Transformations
---------------
In general, we have two types of transformations:
* **One-to-one transformations:** each input block contributes to only one output
block, such as :meth:`ds.map_batches() <ray.data.Dataset.map_batches>`. In other
systems these are often called narrow transformations.
* **All-to-all transformations:** input blocks can contribute to multiple output blocks,
such as :meth:`ds.random_shuffle() <ray.data.Dataset.random_shuffle>`. In other
systems these are often called wide transformations.
Here is a table listing some common transformations supported by Ray Datasets.
.. list-table:: Common Ray Datasets transformations.
:header-rows: 1
* - Transformation
- Type
- Description
* - :meth:`ds.map_batches() <ray.data.Dataset.map_batches>`
- One-to-one
- Apply a given function to batches of records of this dataset.
* - :meth:`ds.split() <ray.data.Dataset.split>`
- One-to-one
- | Split the dataset into N disjoint pieces.
* - :meth:`ds.repartition(shuffle=False) <ray.data.Dataset.repartition>`
- One-to-one
- | Repartition the dataset into N blocks, without shuffling the data.
* - :meth:`ds.repartition(shuffle=True) <ray.data.Dataset.repartition>`
- All-to-all
- | Repartition the dataset into N blocks, shuffling the data during repartition.
* - :meth:`ds.random_shuffle() <ray.data.Dataset.random_shuffle>`
- All-to-all
- | Randomly shuffle the elements of this dataset.
* - :meth:`ds.sort() <ray.data.Dataset.sort>`
- All-to-all
- | Sort the dataset by a sort key.
* - :meth:`ds.groupby() <ray.data.Dataset.groupby>`
- All-to-all
- | Group the dataset by a group key.
.. tip::
Datasets also provides the convenience transformation methods :meth:`ds.map() <ray.data.Dataset.map>`,
:meth:`ds.flat_map() <ray.data.Dataset.flat_map>`, and :meth:`ds.filter() <ray.data.Dataset.filter>`,
which are not vectorized (slower than :meth:`ds.map_batches() <ray.data.Dataset.map_batches>`), but
may be useful for development.
The following example makes use of these transformation APIs to process
the Iris dataset.
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __dataset_transformation_begin__
:end-before: __dataset_transformation_end__
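As a minimal, self-contained sketch of this composability (separate from the Iris
example above, and using a simple integer dataset rather than real data):

.. code-block:: python

    import ray

    # A simple dataset of the integers 0..999 (Python list blocks).
    ds = ray.data.range(1000)

    # Each transformation returns a new Dataset, so calls can be chained.
    doubled = ds.map_batches(lambda batch: [x * 2 for x in batch])
    shuffled = doubled.random_shuffle()
    print(shuffled.take(5))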
.. _transform_datasets_writing_udfs:
------------
Writing UDFs
------------
User-defined functions (UDFs) are routines that apply to one row (e.g.
:meth:`.map() <ray.data.Dataset.map>`) or a batch of rows (e.g.
:meth:`.map_batches() <ray.data.Dataset.map_batches>`) of a dataset. UDFs let you
express your custom business logic in transformations. Here we will focus on
:meth:`.map_batches() <ray.data.Dataset.map_batches>` as it's the primary mapping
API in Datasets.
A batch UDF can be a function or, if using the
:ref:`actor compute strategy <transform_datasets_compute_strategy>`, a
:ref:`callable class <transform_datasets_callable_classes>`.
These UDFs have several :ref:`batch format options <transform_datasets_batch_formats>`,
which control the format of the batches that are passed to the provided batch UDF.
Depending on the underlying :ref:`dataset format <transform_datasets_dataset_formats>`,
using a particular batch format may or may not incur a data conversion cost
(e.g. converting an Arrow Table to a Pandas DataFrame, or creating an Arrow Table from a
Python list, both of which would incur a full copy of the data).
.. _transform_datasets_dataset_formats:
Dataset Formats
===============
A **dataset format** refers to how Datasets represents data under the hood as data
**blocks**.
* **Tabular (Arrow or Pandas) Datasets:** Represented under the hood as
`Arrow Tables <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
or
`Pandas DataFrames <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__.
Tabular datasets are created by reading on-disk tabular data (Parquet, CSV, JSON),
by converting from in-memory tabular data (Pandas DataFrames, Arrow Tables,
dictionaries, Dask DataFrames, Modin DataFrames, Spark DataFrames, Mars DataFrames),
and by returning tabular data (Arrow, Pandas, dictionaries) from previous batch
UDFs.
Datasets will minimize conversions between the Arrow and Pandas
representations: tabular data read from disk will always be read as Arrow Tables, but
in-memory conversions of Pandas DataFrame data to a ``Dataset`` (e.g. converting from
Pandas, Dask, Modin, Spark, and Mars) will use Pandas DataFrames as the internal data
representation in order to avoid extra data conversions and copies.
* **Tensor Datasets:** Represented under the hood as a single-column
`Arrow Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
or
`Pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__,
using our :ref:`tensor extension type <dataset-tensor-extension-api>` to embed the
tensor in these tables under a single ``"__value__"`` column. Tensor datasets are
created by reading on-disk tensor data (NPY), converting from in-memory tensor data
(NumPy), and by returning tensor data (NumPy) from previous batch UDFs.
Note that converting the underlying tabular representation to a NumPy ndarray is
zero-copy, whether converting the entire tensor column or just a single column
element to an ndarray.
* **Simple Datasets:** Represented under the hood as plain Python lists. Simple
datasets are created by reading on-disk binary and plain-text data, converting from
in-memory simple data (Python lists), and by returning simple data (Python lists)
from previous batch UDFs.
Simple datasets are mostly used as an escape hatch for data that's not cleanly
representable as Arrow Tables or Pandas DataFrames. (A creation sketch for each of
these formats follows this list.)
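Here is that sketch: a minimal example of creating each dataset format from
in-memory data (the column names and values are illustrative):

.. code-block:: python

    import numpy as np
    import pandas as pd
    import ray

    # Tabular dataset: created from a Pandas DataFrame (Pandas blocks).
    tabular_ds = ray.data.from_pandas(
        pd.DataFrame({"sepal.length": [5.1, 4.9], "sepal.width": [3.5, 3.0]})
    )

    # Tensor dataset: created from a NumPy ndarray, stored under the
    # single "__value__" tensor column.
    tensor_ds = ray.data.from_numpy(np.ones((4, 2, 2)))

    # Simple dataset: created from a plain Python list.
    simple_ds = ray.data.from_items(["spam", "ham", "eggs"])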
.. _transform_datasets_batch_formats:
Batch Formats
=============
The **batch format** is the format of the data that's given to batch UDFs, e.g.
Pandas DataFrames, Arrow Tables, NumPy ndarrays, or Python lists. The
:meth:`.map_batches() <ray.data.Dataset.map_batches>` API has a ``batch_format: str``
parameter that lets you choose the batch format; we dig into the details of
each ``batch_format`` option below:
.. tabbed:: "native" (default)
The ``"native"`` batch format presents batches in the canonical format for each dataset
type:
* **Tabular (Arrow or Pandas) Datasets:** Each batch will be a
`pandas.DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__.
If the dataset is represented as Arrow Tables under the hood, the conversion to the
Pandas DataFrame batch format will incur a conversion cost (i.e., a full copy of
each batch is created in order to convert it).
* **Tensor Datasets:** Each batch will be a
`numpy.ndarray <https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html>`__.
* **Simple Datasets:** Each batch will be a Python list.
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __writing_native_udfs_begin__
:end-before: __writing_native_udfs_end__
.. tabbed:: "pandas"
The ``"pandas"`` batch format provides batches in a
`pandas.DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__
format. When converting a simple dataset to Pandas DataFrame batches, a single-column
DataFrame with the column ``"value"`` will be created.
If the underlying dataset's data is not already in Pandas DataFrame format, the
batch format conversion will incur a copy for each batch.
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __writing_pandas_udfs_begin__
:end-before: __writing_pandas_udfs_end__
.. tabbed:: "pyarrow"
The ``"pyarrow"`` batch format provides batches in a
`pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
format. When converting a simple dataset to Arrow Table batches, a single-column table
with the column ``"value"`` will be created.
If the underlying dataset's data is not already in Arrow Table format, the
batch format conversion will incur a copy for each batch.
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __writing_arrow_udfs_begin__
:end-before: __writing_arrow_udfs_end__
.. tabbed:: "numpy"
The ``"numpy"`` batch format provides batches in a
`numpy.ndarray <https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html>`__
format. The following details how each dataset format is converted into a NumPy batch:
* **Tensor Datasets:** Each batch will be a single
`numpy.ndarray <https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html>`__
containing the single tensor column for this batch.
Note that this conversion should always be zero-copy.
* **Tabular (Arrow or Pandas) Datasets:** Each batch will be a dictionary of NumPy
ndarrays (``Dict[str, np.ndarray]``), with each key-value pair representing a column
in the table.
Note that this conversion should usually be zero-copy.
* **Simple Datasets:** Each batch will be a single NumPy ndarray, where Datasets will
attempt to convert each list batch to an ndarray.
Note that this conversion will
incur a copy for primitive types that are representable as a NumPy dtype (since
NumPy will create a contiguous ndarray in that case), and will not incur a copy if
the batch items aren't supported in the NumPy dtype system (since NumPy will create
an ndarray of ``np.object`` pointers to the batch items in that case).
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __writing_numpy_udfs_begin__
:end-before: __writing_numpy_udfs_end__
The following table summarizes the conversion costs from a particular dataset format to
a particular batch format:
.. list-table:: Batch format conversion costs - is the conversion zero-copy?
:header-rows: 1
:stub-columns: 1
:widths: 40 15 15 15 15
:align: center
* - Dataset Format -> Batch Format
- ``"native"``
- ``"pandas"``
- ``"pyarrow"``
- ``"numpy"``
* - Tabular - Pandas
- Zero-Copy
- Zero-Copy
- Copy
- Zero-Copy
* - Tabular - Arrow
- Copy
- Copy
- Zero-Copy
- Zero-Copy
* - Tensor - Pandas
- Zero-Copy
- Zero-Copy
- Copy
- Zero-Copy
* - Tensor - Arrow
- Zero-Copy
- Copy
- Zero-Copy
- Zero-Copy
* - Simple - List
- Zero-Copy
- Copy
- Copy
- Copy
You should reference the `pyarrow.Table APIs
<https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__, the
`pandas.DataFrame APIs <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__,
or the `numpy.ndarray APIs <https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html>`__
when writing batch UDFs.
.. tip::
Write your UDFs to leverage built-in vectorized operations on the ``pandas.DataFrame``,
``pyarrow.Table``, and ``numpy.ndarray`` abstractions for better performance. For
example, suppose you want to compute the sum of a column in a ``pandas.DataFrame``:
instead of iterating over each row of a batch and summing up values of that column,
you should use ``df_batch["col_foo"].sum()``.
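A minimal sketch of such a vectorized batch UDF (the column name ``col_foo`` is
illustrative and assumes a tabular dataset that has that column):

.. code-block:: python

    import pandas as pd

    def add_total(df_batch: pd.DataFrame) -> pd.DataFrame:
        # Vectorized: sums the whole column at once instead of looping over rows.
        df_batch["col_foo_total"] = df_batch["col_foo"].sum()
        return df_batch

    # ds.map_batches(add_total, batch_format="pandas")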
.. _transform_datasets_callable_classes:
Callable Class UDFs
===================
When using the actor compute strategy, per-row and per-batch UDFs can also be
**callable classes**, i.e. classes that implement the ``__call__`` magic method. The
constructor of the class can be used for stateful setup, and will be invoked only once
per actor worker.
.. note::
These transformation APIs take the uninstantiated callable class as an argument,
**not** an instance of the class!
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __writing_callable_classes_udfs_begin__
:end-before: __writing_callable_classes_udfs_end__
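As an additional minimal sketch (the stateful setup here is a trivial stand-in for
something expensive, such as loading a model):

.. code-block:: python

    import ray

    class MultiplyUDF:
        def __init__(self):
            # Runs once per actor worker; put expensive setup here.
            self.factor = 2

        def __call__(self, batch):
            # Called once per batch; for a simple dataset the "native"
            # batch is a Python list.
            return [x * self.factor for x in batch]

    ds = ray.data.range(8)
    # Pass the class itself (not an instance) and use the actor compute strategy.
    out = ds.map_batches(MultiplyUDF, compute=ray.data.ActorPoolStrategy(1, 2))
    print(out.take(8))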
.. _transform_datasets_batch_output_types:
Batch UDF Output Types
======================
The return type of a batch UDF will determine the format of the resulting dataset. This
allows you to dynamically convert between data formats during batch transformations.
The following details how each **batch** UDF (as used in
:meth:`ds.map_batches() <ray.data.Dataset.map_batches>`) return type will be interpreted
by Datasets when constructing its internal blocks:
.. tabbed:: pd.DataFrame
Tabular dataset containing
`Pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__
blocks.
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __writing_pandas_out_udfs_begin__
:end-before: __writing_pandas_out_udfs_end__
.. tabbed:: pa.Table
Tabular dataset containing
`pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
blocks.
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __writing_arrow_out_udfs_begin__
:end-before: __writing_arrow_out_udfs_end__
.. tabbed:: np.ndarray
Tensor dataset containing
`pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
blocks, representing the tensor using our
:ref:`tensor extension type <dataset-tensor-extension-api>` to embed the tensor in
these tables under a single ``"__value__"`` column.
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __writing_numpy_out_udfs_begin__
:end-before: __writing_numpy_out_udfs_end__
.. tabbed:: Dict[str, np.ndarray]
Tabular dataset containing
`pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
blocks, where each key-value pair is treated as a column, with the key as the column
name and the value as the column tensor.
If a column tensor is 1-dimensional, then the native Arrow 1D list
type is used; if a column tensor has 2 or more dimensions, then we use our
:ref:`tensor extension type <dataset-tensor-extension-api>` to embed these
n-dimensional tensors in the Arrow table.
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __writing_numpy_dict_out_udfs_begin__
:end-before: __writing_numpy_dict_out_udfs_end__
.. tabbed:: list
Simple dataset containing Python list blocks.
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __writing_simple_out_udfs_begin__
:end-before: __writing_simple_out_udfs_end__
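Putting this together, a minimal sketch of converting a simple dataset into a tabular
(Pandas) dataset by returning a DataFrame from the batch UDF:

.. code-block:: python

    import pandas as pd
    import ray

    # A simple dataset with Python list blocks.
    ds = ray.data.from_items(["spam", "ham", "eggs"])

    # Returning a pandas.DataFrame from the UDF yields a tabular dataset.
    tabular = ds.map_batches(
        lambda batch: pd.DataFrame(
            {"text": batch, "length": [len(x) for x in batch]}
        )
    )
    print(tabular.take(3))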
.. _transform_datasets_row_output_types:
Row UDF Output Types
====================
The return type of a row UDF will determine the format of the resulting dataset. This
allows you to dynamically convert between data formats during row-based transformations.
The following details how each **row** UDF (as used in
:meth:`ds.map() <ray.data.Dataset.map>`) return type will be interpreted by Datasets
when constructing its internal blocks:
.. tabbed:: dict
Tabular dataset containing
`pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
blocks, treating the keys as the column names and the values as the column elements.
If Arrow Table construction fails due to a value type in the dictionary not
being supported by Arrow, then Datasets will fall back to a simple dataset containing
Python list blocks.
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __writing_dict_out_row_udfs_begin__
:end-before: __writing_dict_out_row_udfs_end__
.. tabbed:: PandasRow
Tabular dataset containing
`Pandas DataFrame <https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html>`__
blocks.
A ``PandasRow`` object is a zero-copy row view on an underlying Pandas DataFrame block
that Datasets provides to per-row UDFs (:meth:`ds.map() <ray.data.Dataset.map>`) and
returns in the row iterators (:meth:`ds.iter_rows <ray.data.Dataset.iter_rows>`).
This row view provides a dictionary interface and can be transparently used as a
dictionary. See the :class:`TableRow API <ray.data.row.TableRow>` for more information
on this row view object.
Note that a ``PandasRow`` is immutable, so this row mapping cannot be updated
in place. If you want to update the row, copy this zero-copy row view into a plain
Python dictionary with :meth:`TableRow.as_pydict() <ray.data.row.TableRow.as_pydict>`,
then mutate and return that dictionary.
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __writing_table_row_out_row_udfs_begin__
:end-before: __writing_table_row_out_row_udfs_end__
.. tabbed:: ArrowRow
Tabular dataset containing
`pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
blocks.
An ``ArrowRow`` object is a zero-copy row view on an underlying Arrow Table block
that Datasets provides to per-row UDFs (:meth:`ds.map() <ray.data.Dataset.map>`) and
returns in the row iterators (:meth:`ds.iter_rows <ray.data.Dataset.iter_rows>`).
This row view provides a dictionary interface and can be transparently used as a
dictionary. See the :class:`TableRow API <ray.data.row.TableRow>` for more information
on this row view object.
Note that an ``ArrowRow`` is immutable, so this row mapping cannot be updated
in place. If you want to update the row, copy this zero-copy row view into a plain
Python dictionary with :meth:`TableRow.as_pydict() <ray.data.row.TableRow.as_pydict>`,
then mutate and return that dictionary.
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __writing_table_row_out_row_udfs_begin__
:end-before: __writing_table_row_out_row_udfs_end__
.. tabbed:: np.ndarray
Tensor dataset containing
`pyarrow.Table <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html>`__
blocks, representing the tensor using our
:ref:`tensor extension type <dataset-tensor-extension-api>` to embed the tensor in
these tables under a single ``"__value__"`` column. Each such ``ndarray`` will be
treated as a row in this column.
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __writing_numpy_out_row_udfs_begin__
:end-before: __writing_numpy_out_row_udfs_end__
.. tabbed:: Any
All other row return types will result in a simple dataset containing list blocks.
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __writing_simple_out_row_udfs_begin__
:end-before: __writing_simple_out_row_udfs_end__
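As a compact illustration, a minimal sketch of a row UDF that returns a dictionary,
producing an Arrow-backed tabular dataset:

.. code-block:: python

    import ray

    ds = ray.data.range(4)

    # Returning a dict per row produces an Arrow-backed tabular dataset,
    # with the dict keys becoming column names.
    tabular = ds.map(lambda i: {"value": i, "square": i * i})
    print(tabular.take(2))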
.. _transform_datasets_compute_strategy:
----------------
Compute Strategy
----------------
Datasets transformations are executed by either :ref:`Ray tasks <ray-remote-functions>`
or :ref:`Ray actors <actor-guide>` across a Ray cluster. By default, Ray tasks are
used (with ``compute="tasks"``). For transformations that require expensive setup,
it's preferable to use Ray actors, which are stateful and allow setup to be reused
for efficiency. You can specify ``compute=ray.data.ActorPoolStrategy(min, max)``, and
Ray will use an autoscaling actor pool of ``min`` to ``max`` actors to execute your
transforms. For a fixed-size actor pool, specify ``ActorPoolStrategy(n, n)``.
The following is an example of using the Ray tasks and actors compute strategies
for batch inference:
.. literalinclude:: ./doc_code/transforming_datasets.py
:language: python
:start-after: __dataset_compute_strategy_begin__
:end-before: __dataset_compute_strategy_end__
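A condensed sketch contrasting the strategies (the UDFs here are trivial
placeholders):

.. code-block:: python

    import ray

    class AddOne:
        def __call__(self, batch):
            return [x + 1 for x in batch]

    ds = ray.data.range(1000)

    # Default: stateless Ray tasks.
    ds.map_batches(lambda batch: [x + 1 for x in batch])

    # Autoscaling actor pool of 2 to 8 actors.
    ds.map_batches(AddOne, compute=ray.data.ActorPoolStrategy(2, 8))

    # Fixed-size pool of 4 actors.
    ds.map_batches(AddOne, compute=ray.data.ActorPoolStrategy(4, 4))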