Tensor (multi-dimensional array) data is ubiquitous in ML workloads. However, popular data libraries and formats such as Pandas, Parquet, and Arrow don't natively support tensor data types. To bridge this gap, Datasets provides a unified tensor data type that can be used to represent and store tensor data:
* For Pandas, Datasets will transparently convert ``List[np.ndarray]`` columns to and from the :class:`TensorDtype <ray.data.extensions.tensor_extension.TensorDtype>` extension type.
* For Parquet, the Datasets Arrow extension :class:`ArrowTensorType <ray.data.extensions.tensor_extension.ArrowTensorType>` allows Tensors to be loaded and stored in Parquet format.
* In addition, single-column Tensor datasets can be created from NumPy (.npy) files.
Datasets automatically converts between the extension types/arrays above. This means you can just think of "Tensors" as a single first-class data type in Datasets.
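As a quick illustration, here is a minimal sketch that creates a single-column tensor dataset and converts it to Pandas (:func:`ray.data.range_tensor <ray.data.range_tensor>` is used only as a convenient source of synthetic tensor data; the exact schema output may vary by Ray version):

.. code-block:: python

    import ray

    # Create a dataset of 4 tensors, each of shape (2, 2). The column is backed
    # by the Arrow tensor extension type described above.
    ds = ray.data.range_tensor(4, shape=(2, 2))
    print(ds.schema())

    # Converting to Pandas yields a DataFrame whose tensor column uses the
    # TensorDtype extension type.
    df = ds.to_pandas()
    print(df.dtypes)
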
For tensors stored as raw NumPy ndarray bytes in C-contiguous order (e.g., via ``ndarray.tobytes()``), all you need to specify is the tensor column schema. The following is an end-to-end example:
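The sketch below illustrates that workflow. The ``tensor_column_schema`` argument to :func:`ray.data.read_parquet <ray.data.read_parquet>`, mapping column names to ``(dtype, shape)`` pairs, is assumed here based on the description above; check the API reference for your Ray version.

.. code-block:: python

    import numpy as np
    import pandas as pd
    import ray

    path = "/tmp/tensor_bytes_parquet"  # placeholder output path

    # Serialize fixed-shape tensors to raw C-contiguous bytes and store them in
    # an ordinary binary Parquet column, next to a plain integer column.
    arr = np.arange(24).reshape((3, 2, 2, 2))
    df = pd.DataFrame({
        "one": [1, 2, 3],
        "two": [t.tobytes() for t in arr],
    })
    ray.data.from_pandas(df).write_parquet(path)

    # Read the Parquet data back, providing the dtype and shape of the
    # serialized tensor column so it can be cast to the tensor extension type.
    # NOTE: the ``tensor_column_schema`` keyword is assumed from the prose above.
    ds = ray.data.read_parquet(
        path, tensor_column_schema={"two": (np.int64, (2, 2, 2))}
    )
    print(ds.schema())
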
By convention, single-column Tensor datasets are represented with a single ``__value__`` column.
This kind of dataset will be converted automatically to/from NumPy array format in all transformation and consumption APIs.
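As a hedged sketch of that behavior (the exact batch layout can differ across Ray versions), a single-column tensor dataset can be transformed and consumed directly as NumPy ndarrays:

.. code-block:: python

    import ray

    ds = ray.data.range_tensor(4, shape=(2, 2))

    # With the NumPy batch format, a single-column tensor dataset hands the UDF
    # a plain ndarray (no column handling needed) and expects an ndarray back.
    doubled = ds.map_batches(lambda batch: batch * 2, batch_format="numpy")

    # Consumption works the same way: each batch arrives as a stacked ndarray.
    for batch in doubled.iter_batches(batch_format="numpy", batch_size=2):
        print(batch.shape)  # (2, 2, 2)
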
Transforming / Consuming Tensor Data
------------------------------------
Like any other Dataset, Datasets with tensor columns can be transformed and consumed in batches via the :meth:`ds.iter_batches(batch_format=\<format\>) <ray.data.Dataset.iter_batches>` and :meth:`ds.map_batches(fn, batch_format=\<format\>) <ray.data.Dataset.map_batches>` APIs. This section shows the available batch formats and their behavior:

.. tabbed:: "native" (default)

  **Single-column**:

  .. literalinclude:: ./doc_code/tensor.py
    :language: python
    :start-after: __consume_native_begin__
    :end-before: __consume_native_end__

  **Multi-column**:

  .. literalinclude:: ./doc_code/tensor.py
    :language: python
    :start-after: __consume_native_2_begin__
    :end-before: __consume_native_2_end__

.. tabbed:: "pandas"

  **Single-column**:

  .. literalinclude:: ./doc_code/tensor.py
    :language: python
    :start-after: __consume_pandas_begin__
    :end-before: __consume_pandas_end__

  **Multi-column**:

  .. literalinclude:: ./doc_code/tensor.py
    :language: python
    :start-after: __consume_pandas_2_begin__
    :end-before: __consume_pandas_2_end__

.. tabbed:: "pyarrow"

  **Single-column**:

  .. literalinclude:: ./doc_code/tensor.py
    :language: python
    :start-after: __consume_pyarrow_begin__
    :end-before: __consume_pyarrow_end__

  **Multi-column**:

  .. literalinclude:: ./doc_code/tensor.py
    :language: python
    :start-after: __consume_pyarrow_2_begin__
    :end-before: __consume_pyarrow_2_end__

.. tabbed:: "numpy"

  **Single-column**:

  .. literalinclude:: ./doc_code/tensor.py
    :language: python
    :start-after: __consume_numpy_begin__
    :end-before: __consume_numpy_end__

  **Multi-column**:

  .. literalinclude:: ./doc_code/tensor.py
    :language: python
    :start-after: __consume_numpy_2_begin__
    :end-before: __consume_numpy_2_end__

Saving Tensor Datasets
----------------------
Because Tensor datasets rely on Datasets-specific extension types, they can only be saved in formats that preserve Arrow metadata (currently only Parquet). In addition, single-column Tensor datasets can be saved in NumPy format.
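For example, a minimal sketch (output paths are placeholders; :meth:`write_numpy <ray.data.Dataset.write_numpy>` with its default column is assumed for the single-column case):

.. code-block:: python

    import ray

    ds = ray.data.range_tensor(8, shape=(2, 2))

    # Parquet preserves the Arrow extension metadata, so the tensor column
    # round-trips intact.
    ds.write_parquet("/tmp/tensor_dataset_parquet")

    # A single-column tensor dataset can also be written out as .npy files.
    ds.write_numpy("/tmp/tensor_dataset_npy")
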
* All tensors in a tensor column currently must have the same shape; ragged tensors will raise an error (see GitHub issue `#18316 <https://github.com/ray-project/ray/issues/18316>`__). To work with ragged tensors, disable automatic casting to the tensor extension type with ``ray.data.context.DatasetContext.get_current().enable_tensor_extension_cast = False``.