ray/doc/source/ray-air
Clark Zinzow df124d0ad5
[AIR - Datasets] Hide tensor extension from UDFs. (#27019)
We previously added automatic tensor extension casting on Datasets transformation outputs so that users would not have to worry about tensor column casting; however, the current state creates several issues:

1. Not all tensors are supported, which means that we’ll need to have an opaque object dtype (i.e. ndarray of ndarray pointers) fallback for the Pandas-only case. Known unsupported tensor use cases:
   a. Heterogeneous-shaped (i.e. ragged) tensors
   b. Struct arrays
2. UDFs will expect a NumPy column and won’t know what to do with our TensorArray type. E.g., torchvision transforms don’t respect the array protocol (which they should), and instead only support Torch tensors and NumPy ndarrays; passing a TensorArray column or a TensorArrayElement (a single item in the TensorArray column) fails.
3. Implicit casting with object dtype fallback on UDF outputs can make the input type to downstream UDFs nondeterministic, since the user won’t know whether they’ll get a TensorArray column or an object dtype column.
4. The tensor extension cast fallback warning spams the logs.

This PR:

1. Adds automatic casting of tensor extension columns to NumPy ndarray columns for Datasets UDF inputs, meaning the UDFs will never have to see tensor extensions and that the UDF input column types will be consistent and deterministic; this fixes both (2) and (3).
2. No longer implicitly falls back to an opaque object dtype when TensorArray casting fails (e.g. for ragged tensors), and instead raises an error; this fixes (4) but removes our support for (1).
3. Adds a global enable_tensor_extension_casting config flag, which is True by default, that controls whether we perform this automatic casting. Turning off the implicit casting provides a path for (1), where the tensor extension can be avoided if working with ragged tensors in Pandas land. Turning off this flag also allows the user to explicitly control their tensor extension casting, if they want to work with it in their UDFs in order to reap the benefits of fewer data copies, more efficient slicing, stronger column typing, etc.
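
The behavior described in the three points above can be modeled with a small pure-Python sketch. It is not Ray's implementation: `SketchContext`, `tensor_shape`, and `cast_udf_output` are hypothetical names introduced for illustration, nested lists stand in for tensor columns, and only the flag name `enable_tensor_extension_casting` comes from this commit message.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class SketchContext:
    # Hypothetical stand-in for the global config; only the flag name
    # is taken from the commit message. Defaults to True, as described.
    enable_tensor_extension_casting: bool = True


def tensor_shape(value: Any) -> Tuple[int, ...]:
    """Return the shape of a nested list; raise ValueError if it is ragged."""
    if not isinstance(value, list):
        return ()  # scalars have an empty shape
    child_shapes = {tensor_shape(item) for item in value}
    if len(child_shapes) > 1:
        raise ValueError("ragged tensor: child shapes differ")
    child = next(iter(child_shapes), ())
    return (len(value),) + child


def cast_udf_output(column: List[Any], ctx: SketchContext) -> Tuple[str, List[Any]]:
    """Tag a UDF output column as 'tensor' or 'object' (illustrative only)."""
    if not ctx.enable_tensor_extension_casting:
        # Casting disabled: the column is left as an opaque object column,
        # which is the path that keeps ragged tensors usable.
        return ("object", column)
    # Casting enabled: a ragged column raises instead of silently
    # falling back to an object column.
    tensor_shape(column)
    return ("tensor", column)
```

With the flag left at its default, `cast_udf_output([[1, 2], [3]], SketchContext())` raises `ValueError`, while passing `SketchContext(enable_tensor_extension_casting=False)` returns the ragged column untouched with the `"object"` tag.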
2022-07-28 10:37:45 -07:00
..
doc_code [air][data] move train_test_split to ray.data.Dataset (#27065) 2022-07-27 09:53:37 -07:00
examples [AIR - Datasets] Hide tensor extension from UDFs. (#27019) 2022-07-28 10:37:45 -07:00
images [docs] Update the AIR data ingest guide (#26909) 2022-07-24 09:59:29 -07:00
benchmarks.rst [air/docs] add tensorflow benchmarks into table (#26800) 2022-07-20 17:12:40 -07:00
check-ingest.rst [docs] Update the AIR data ingest guide (#26909) 2022-07-24 09:59:29 -07:00
checkpoints.rst [AIR DOC] minor tweaks to checkpoint user guide for clarity and consistency subheadings (#26937) 2022-07-25 14:21:29 -07:00
config-scaling.rst [docs] Improve AIR table of contents titles (#26858) 2022-07-22 17:17:49 -07:00
deployment.rst [docs] Improve AIR table of contents titles (#26858) 2022-07-22 17:17:49 -07:00
getting-started.rst [docs] Add ecosystem map to AIR guide (#26859) 2022-07-21 19:06:47 -07:00
key-concepts.rst [AIR/Docs] Small improvements to Train user guide (#26577) 2022-07-16 16:51:17 -07:00
package-ref.rst [air][data] move train_test_split to ray.data.Dataset (#27065) 2022-07-27 09:53:37 -07:00
predictors.rst [docs] Improve AIR table of contents titles (#26858) 2022-07-22 17:17:49 -07:00
preprocessors.rst [air][data] move train_test_split to ray.data.Dataset (#27065) 2022-07-27 09:53:37 -07:00
user-guides.rst [docs] Improve AIR table of contents titles (#26858) 2022-07-22 17:17:49 -07:00