.. _transforming_datasets:

=====================
Transforming Datasets
=====================

Ray Datasets transformations take in datasets and produce new datasets. For example, *map* is a transformation that applies a user-defined function (UDF) to each row of the input dataset and returns a new dataset as the result. Datasets transformations are **composable**: operations can be further applied to the result dataset, forming a chain of transformations that expresses more complex computations. Transformations are the core tool for expressing business logic in Datasets.

.. _transform_datasets_transformations:

---------------
Transformations
---------------

In general, there are two types of transformations:

* **One-to-one transformations:** each input block contributes to only one output block, such as :meth:`ds.map_batches() `. In other systems these are often called narrow transformations.
* **All-to-all transformations:** input blocks can contribute to multiple output blocks, such as :meth:`ds.random_shuffle() `. In other systems these are often called wide transformations.

Here is a table listing some common transformations supported by Ray Datasets.

.. list-table:: Common Ray Datasets transformations.
   :header-rows: 1

   * - Transformation
     - Type
     - Description
   * - :meth:`ds.map_batches() `
     - One-to-one
     - Apply a given function to batches of records of this dataset.
   * - :meth:`ds.split() `
     - One-to-one
     - | Split the dataset into N disjoint pieces.
   * - :meth:`ds.repartition(shuffle=False) `
     - One-to-one
     - | Repartition the dataset into N blocks, without shuffling the data.
   * - :meth:`ds.repartition(shuffle=True) `
     - All-to-all
     - | Repartition the dataset into N blocks, shuffling the data during repartition.
   * - :meth:`ds.random_shuffle() `
     - All-to-all
     - | Randomly shuffle the elements of this dataset.
   * - :meth:`ds.sort() `
     - All-to-all
     - | Sort the dataset by a sort key.
   * - :meth:`ds.groupby() `
     - All-to-all
     - | Group the dataset by a group key.

.. tip::

   Datasets also provides the convenience transformation methods :meth:`ds.map() `, :meth:`ds.flat_map() `, and :meth:`ds.filter() `, which are not vectorized (slower than :meth:`ds.map_batches() `), but may be useful for development.

The following example uses these transformation APIs to process the Iris dataset.

.. literalinclude:: ./doc_code/transforming_datasets.py
    :language: python
    :start-after: __dataset_transformation_begin__
    :end-before: __dataset_transformation_end__

.. _transform_datasets_writing_udfs:

------------
Writing UDFs
------------

User-defined functions (UDFs) are routines that apply to one row (e.g. :meth:`.map() `) or a batch of rows (e.g. :meth:`.map_batches() `) of a dataset. UDFs let you express your customized business logic in transformations. Here we will focus on :meth:`.map_batches() ` as it's the primary mapping API in Datasets.

A batch UDF can be a function or, if using the :ref:`actor compute strategy `, a :ref:`callable class `. These UDFs have several :ref:`batch format options `, which control the format of the batches that are passed to the provided batch UDF. Depending on the underlying :ref:`dataset format `, using a particular batch format may or may not incur a data conversion cost (e.g. converting an Arrow Table to a Pandas DataFrame, or creating an Arrow Table from a Python list, both of which incur a full copy of the data).

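To ground the discussion, here is a minimal sketch of a function batch UDF; the dataset and the ``"col"`` column name are made up for illustration:

.. code-block:: python

    import ray
    import pandas as pd

    # A small in-memory tabular dataset (Pandas-backed blocks).
    ds = ray.data.from_pandas(pd.DataFrame({"col": list(range(8))}))

    # A function batch UDF: receives one batch at a time and returns the transformed batch.
    def double_col(batch: pd.DataFrame) -> pd.DataFrame:
        batch["col"] = batch["col"] * 2
        return batch

    # With the default ("native") batch format, tabular batches arrive as pandas DataFrames.
    ds = ds.map_batches(double_col)
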
.. _transform_datasets_dataset_formats:

Dataset Formats
===============

A **dataset format** refers to how Datasets represents data under-the-hood as data **blocks**.

* **Tabular (Arrow or Pandas) Datasets:** Represented under-the-hood as `Arrow Tables `__ or `Pandas DataFrames `__. Tabular datasets are created by reading on-disk tabular data (Parquet, CSV, JSON), by converting from in-memory tabular data (Pandas DataFrames, Arrow Tables, dictionaries, Dask DataFrames, Modin DataFrames, Spark DataFrames, Mars DataFrames), and by returning tabular data (Arrow, Pandas, dictionaries) from previous batch UDFs. Datasets minimizes conversions between the Arrow and Pandas representations: tabular data read from disk is always read as Arrow Tables, but in-memory conversions of Pandas DataFrame data to a ``Dataset`` (e.g. converting from Pandas, Dask, Modin, Spark, and Mars) use Pandas DataFrames as the internal data representation in order to avoid extra data conversions/copies.

* **Tensor Datasets:** Represented under-the-hood as a single-column `Arrow Table `__ or `Pandas DataFrame `__, using our :ref:`tensor extension type ` to embed the tensor in these tables under a single ``"__value__"`` column. Tensor datasets are created by reading on-disk tensor data (NPY), by converting from in-memory tensor data (NumPy), and by returning tensor data (NumPy) from previous batch UDFs. Note that converting the underlying tabular representation to a NumPy ndarray is zero-copy, whether converting the entire column to an ndarray or converting just a single column element to an ndarray.

* **Simple Datasets:** Represented under-the-hood as plain Python lists. Simple datasets are created by reading on-disk binary and plain-text data, by converting from in-memory simple data (Python lists), and by returning simple data (Python lists) from previous batch UDFs. Simple datasets are mostly used as an escape hatch for data that's not cleanly representable in Arrow Tables and Pandas DataFrames.

.. _transform_datasets_batch_formats:

Batch Formats
=============

The **batch format** is the format of the data that's given to batch UDFs, e.g. Pandas DataFrames, Arrow Tables, NumPy ndarrays, or Python lists. The :meth:`.map_batches() ` API has a ``batch_format: str`` parameter that lets the user dictate the batch format; we dig into the details of each ``batch_format`` option below.

.. tabbed:: "native" (default)

    The ``"native"`` batch format presents batches in the canonical format for each dataset type:

    * **Tabular (Arrow or Pandas) Datasets:** Each batch will be a `pandas.DataFrame `__. If the dataset is represented as Arrow Tables under-the-hood, the conversion to the Pandas DataFrame batch format will incur a conversion cost (i.e., a full copy of each batch will be created in order to convert it).

    * **Tensor Datasets:** Each batch will be a `numpy.ndarray `__.

    * **Simple Datasets:** Each batch will be a Python list.

    .. literalinclude:: ./doc_code/transforming_datasets.py
        :language: python
        :start-after: __writing_native_udfs_begin__
        :end-before: __writing_native_udfs_end__

.. tabbed:: "pandas"

    The ``"pandas"`` batch format provides batches in the `pandas.DataFrame `__ format. If converting a simple dataset to Pandas DataFrame batches, a single-column dataframe with the column ``"value"`` will be created. If the underlying dataset's data is not already in Pandas DataFrame format, the batch format conversion will incur a copy for each batch.

    .. literalinclude:: ./doc_code/transforming_datasets.py
        :language: python
        :start-after: __writing_pandas_udfs_begin__
        :end-before: __writing_pandas_udfs_end__

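    As an additional minimal sketch (assuming ``ray.data.range`` creates a simple dataset of Python ints), a simple dataset converted to ``"pandas"`` batches arrives as a single-column DataFrame named ``"value"``:

    .. code-block:: python

        import ray
        import pandas as pd

        ds = ray.data.range(8)  # assumed: a simple dataset of Python ints

        def add_one(batch: pd.DataFrame) -> pd.DataFrame:
            # Simple data is presented as a single-column DataFrame named "value".
            batch["value"] = batch["value"] + 1
            return batch

        ds = ds.map_batches(add_one, batch_format="pandas")
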
.. tabbed:: "pyarrow"

    The ``"pyarrow"`` batch format provides batches in the `pyarrow.Table `__ format. If converting a simple dataset to Arrow Table batches, a single-column table with the column ``"value"`` will be created. If the underlying dataset's data is not already in Arrow Table format, the batch format conversion will incur a copy for each batch.

    .. literalinclude:: ./doc_code/transforming_datasets.py
        :language: python
        :start-after: __writing_arrow_udfs_begin__
        :end-before: __writing_arrow_udfs_end__

.. tabbed:: "numpy"

    The ``"numpy"`` batch format provides batches in the `numpy.ndarray `__ format. The following details how each dataset format is converted into a NumPy batch:

    * **Tensor Datasets:** Each batch will be a single `numpy.ndarray `__ containing the single tensor column for this batch. Note that this conversion should always be zero-copy.

    * **Tabular (Arrow or Pandas) Datasets:** Each batch will be a dictionary of NumPy ndarrays (``Dict[str, np.ndarray]``), with each key-value pair representing a column in the table. Note that this conversion should usually be zero-copy.

    * **Simple Datasets:** Each batch will be a single NumPy ndarray, where Datasets will attempt to convert each list batch to an ndarray. Note that this conversion will incur a copy for primitive types that are representable as a NumPy dtype (since NumPy will create a contiguous ndarray in that case), and will not incur a copy if the batch items aren't supported in the NumPy dtype system (since NumPy will create an ndarray of ``np.object`` pointers to the batch items in that case).

    .. literalinclude:: ./doc_code/transforming_datasets.py
        :language: python
        :start-after: __writing_numpy_udfs_begin__
        :end-before: __writing_numpy_udfs_end__

The following table summarizes the conversion costs from a particular dataset format to a particular batch format:

.. list-table:: Batch format conversion costs - is the conversion zero-copy?
   :header-rows: 1
   :stub-columns: 1
   :widths: 40 15 15 15 15
   :align: center

   * - Dataset Format -> Batch Format
     - ``"native"``
     - ``"pandas"``
     - ``"pyarrow"``
     - ``"numpy"``
   * - Tabular - Pandas
     - Zero-Copy
     - Zero-Copy
     - Copy
     - Zero-Copy
   * - Tabular - Arrow
     - Copy
     - Copy
     - Zero-Copy
     - Zero-Copy
   * - Tensor - Pandas
     - Zero-Copy
     - Zero-Copy
     - Copy
     - Zero-Copy
   * - Tensor - Arrow
     - Zero-Copy
     - Copy
     - Zero-Copy
     - Zero-Copy
   * - Simple - List
     - Zero-Copy
     - Copy
     - Copy
     - Copy

You should reference the `pyarrow.Table APIs `__, the `pandas.DataFrame APIs `__, or the `numpy.ndarray APIs `__ when writing batch UDFs.

.. tip::

   Write your UDFs to leverage built-in vectorized operations on the ``pandas.DataFrame``, ``pyarrow.Table``, and ``numpy.ndarray`` abstractions for better performance. For example, suppose you want to compute the sum of a column in a ``pandas.DataFrame``: instead of iterating over each row of a batch and summing up the values of that column, use ``df_batch["col_foo"].sum()``.

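Following that tip, here is a minimal sketch of a vectorized batch UDF; the ``"col_foo"`` column is hypothetical:

.. code-block:: python

    import ray
    import pandas as pd

    ds = ray.data.from_pandas(pd.DataFrame({"col_foo": [1.0, 2.0, 3.0, 4.0]}))

    def append_batch_sum(batch: pd.DataFrame) -> pd.DataFrame:
        # One vectorized call over the whole batch instead of a Python-level row loop.
        # Note: this sums within each batch, not across the whole dataset.
        batch["col_foo_sum"] = batch["col_foo"].sum()
        return batch

    ds = ds.map_batches(append_batch_sum, batch_format="pandas")
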
.. _transform_datasets_callable_classes:

Callable Class UDFs
===================

When using the actor compute strategy, per-row and per-batch UDFs can also be **callable classes**, i.e. classes that implement the ``__call__`` magic method. The constructor of the class can be used for stateful setup, and will only be invoked once per actor worker.

.. note::

   These transformation APIs take the uninstantiated callable class as an argument, **not** an instance of the class!

.. literalinclude:: ./doc_code/transforming_datasets.py
    :language: python
    :start-after: __writing_callable_classes_udfs_begin__
    :end-before: __writing_callable_classes_udfs_end__

.. _transform_datasets_batch_output_types:

Batch UDF Output Types
======================

The return type of a batch UDF determines the format of the resulting dataset, which allows you to dynamically convert between data formats during batch transformations; a short sketch follows the tabbed examples below. The following details how each **batch** UDF (as used in :meth:`ds.map_batches() `) return type will be interpreted by Datasets when constructing its internal blocks:

.. tabbed:: pd.DataFrame

    Tabular dataset containing `Pandas DataFrame `__ blocks.

    .. literalinclude:: ./doc_code/transforming_datasets.py
        :language: python
        :start-after: __writing_pandas_out_udfs_begin__
        :end-before: __writing_pandas_out_udfs_end__

.. tabbed:: pa.Table

    Tabular dataset containing `pyarrow.Table `__ blocks.

    .. literalinclude:: ./doc_code/transforming_datasets.py
        :language: python
        :start-after: __writing_arrow_out_udfs_begin__
        :end-before: __writing_arrow_out_udfs_end__

.. tabbed:: np.ndarray

    Tensor dataset containing `pyarrow.Table `__ blocks, representing the tensor using our :ref:`tensor extension type ` to embed the tensor in these tables under a single ``"__value__"`` column.

    .. literalinclude:: ./doc_code/transforming_datasets.py
        :language: python
        :start-after: __writing_numpy_out_udfs_begin__
        :end-before: __writing_numpy_out_udfs_end__

.. tabbed:: Dict[str, np.ndarray]

    Tabular dataset containing `pyarrow.Table `__ blocks, where each key-value pair treats the key as the column name and the value as the column tensor. If a column tensor is 1-dimensional, then the native Arrow 1D list type is used; if a column tensor has 2 or more dimensions, then we use our :ref:`tensor extension type ` to embed these n-dimensional tensors in the Arrow table.

    .. literalinclude:: ./doc_code/transforming_datasets.py
        :language: python
        :start-after: __writing_numpy_dict_out_udfs_begin__
        :end-before: __writing_numpy_dict_out_udfs_end__

.. tabbed:: list

    Simple dataset containing Python list blocks.

    .. literalinclude:: ./doc_code/transforming_datasets.py
        :language: python
        :start-after: __writing_simple_out_udfs_begin__
        :end-before: __writing_simple_out_udfs_end__

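As a minimal sketch of how the batch UDF return type drives the output format (the ``"x"`` column name is made up):

.. code-block:: python

    import ray
    import numpy as np
    import pandas as pd

    ds = ray.data.range(8)  # assumed: a simple dataset whose "native" batches are Python lists

    # Returning a pandas.DataFrame produces a tabular dataset with Pandas blocks.
    pandas_ds = ds.map_batches(lambda batch: pd.DataFrame({"x": batch}))

    # Returning Dict[str, np.ndarray] produces a tabular dataset with Arrow blocks.
    arrow_ds = ds.map_batches(lambda batch: {"x": np.array(batch)})
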
.. _transform_datasets_row_output_types:

Row UDF Output Types
====================

The return type of a row UDF determines the format of the resulting dataset. This allows you to dynamically convert between data formats during row-based transformations. The following details how each **row** UDF (as used in :meth:`ds.map() `) return type will be interpreted by Datasets when constructing its internal blocks:

.. tabbed:: dict

    Tabular dataset containing `pyarrow.Table `__ blocks, treating the keys as the column names and the values as the column elements. If Arrow Table construction fails because a value type in the dictionary is not supported by Arrow, then Datasets falls back to a simple dataset containing Python list blocks.

    .. literalinclude:: ./doc_code/transforming_datasets.py
        :language: python
        :start-after: __writing_dict_out_row_udfs_begin__
        :end-before: __writing_dict_out_row_udfs_end__

.. tabbed:: PandasRow

    Tabular dataset containing `Pandas DataFrame `__ blocks. A ``PandasRow`` object is a zero-copy row view on an underlying Pandas DataFrame block that Datasets provides to per-row UDFs (:meth:`ds.map() `) and returns in the row iterators (:meth:`ds.iter_rows `). This row view provides a dictionary interface and can be used transparently as a dictionary. See the :class:`TableRow API ` for more information on this row view object.

    Note that a ``PandasRow`` is immutable, so this row mapping cannot be updated in-place. If you want to update the row, copy this zero-copy row view into a plain Python dictionary with :meth:`TableRow.as_pydict() `, then mutate and return that dictionary.

    .. literalinclude:: ./doc_code/transforming_datasets.py
        :language: python
        :start-after: __writing_table_row_out_row_udfs_begin__
        :end-before: __writing_table_row_out_row_udfs_end__

.. tabbed:: ArrowRow

    Tabular dataset containing `pyarrow.Table `__ blocks. An ``ArrowRow`` object is a zero-copy row view on an underlying Arrow Table block that Datasets provides to per-row UDFs (:meth:`ds.map() `) and returns in the row iterators (:meth:`ds.iter_rows `). This row view provides a dictionary interface and can be used transparently as a dictionary. See the :class:`TableRow API ` for more information on this row view object.

    Note that an ``ArrowRow`` is immutable, so this row mapping cannot be updated in-place. If you want to update the row, copy this zero-copy row view into a plain Python dictionary with :meth:`TableRow.as_pydict() `, then mutate and return that dictionary.

    .. literalinclude:: ./doc_code/transforming_datasets.py
        :language: python
        :start-after: __writing_table_row_out_row_udfs_begin__
        :end-before: __writing_table_row_out_row_udfs_end__

.. tabbed:: np.ndarray

    Tensor dataset containing `pyarrow.Table `__ blocks, representing the tensor using our :ref:`tensor extension type ` to embed the tensor in these tables under a single ``"__value__"`` column. Each such ``ndarray`` will be treated as a row in this column.

    .. literalinclude:: ./doc_code/transforming_datasets.py
        :language: python
        :start-after: __writing_numpy_out_row_udfs_begin__
        :end-before: __writing_numpy_out_row_udfs_end__

.. tabbed:: Any

    All other return row types will result in a simple dataset containing Python list blocks.

    .. literalinclude:: ./doc_code/transforming_datasets.py
        :language: python
        :start-after: __writing_simple_out_row_udfs_begin__
        :end-before: __writing_simple_out_row_udfs_end__

.. _transform_datasets_compute_strategy:

----------------
Compute Strategy
----------------

Datasets transformations are executed by either :ref:`Ray tasks ` or :ref:`Ray actors ` across a Ray cluster. By default, Ray tasks are used (with ``compute="tasks"``). For transformations that require expensive setup, it's preferable to use Ray actors, which are stateful and allow that setup to be reused for efficiency. You can specify ``compute=ray.data.ActorPoolStrategy(min, max)`` and Ray will use an autoscaling actor pool of ``min`` to ``max`` actors to execute your transforms. For a fixed-size actor pool, specify ``ActorPoolStrategy(n, n)``.

The following is an example of using the Ray tasks and Ray actors compute strategies for batch inference:

.. literalinclude:: ./doc_code/transforming_datasets.py
    :language: python
    :start-after: __dataset_compute_strategy_begin__
    :end-before: __dataset_compute_strategy_end__

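For reference, here is a minimal sketch of the two compute strategies with a callable class UDF; the "model" below is a stand-in coefficient, not a real model:

.. code-block:: python

    import ray

    class BatchInferModel:
        def __init__(self):
            # Expensive setup (e.g. loading a model) runs once per actor worker.
            self.coefficient = 2

        def __call__(self, batch):
            # For a simple dataset, the default batch format is a Python list.
            return [item * self.coefficient for item in batch]

    ds = ray.data.range(10)  # assumed: a simple dataset of Python ints

    # Default: stateless Ray tasks execute a function UDF.
    doubled = ds.map_batches(lambda batch: [item * 2 for item in batch])

    # Actor compute strategy: an autoscaling pool of 2 to 8 actors runs the callable
    # class UDF; note that the class itself is passed, not an instance.
    predictions = ds.map_batches(
        BatchInferModel, compute=ray.data.ActorPoolStrategy(2, 8)
    )
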