[Datasets] Add example of using map_batches to filter (#24202)

The documentation says 

> Consider using .map_batches() for better performance (you can implement filter by dropping records).

but there aren't any examples of how to do so.
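For context, here is a minimal, runnable sketch of the technique the new docstring demonstrates, contrasted with the row-wise `.filter()` the quoted docs refer to. It assumes the Ray 1.x Datasets API that this commit targets (where batches of a `range()` dataset are plain lists); the dataset size and predicate mirror the docstring example below.

```python
# Hedged sketch (not part of the diff): implementing filter by dropping
# records inside map_batches, versus the row-wise filter() the docs mention.
# Assumes Ray 1.x Datasets, where batches of a range() dataset are lists.
import ray

ds = ray.data.range(10000)

# Row-wise filter: invokes the predicate once per record.
evens_rowwise = ds.filter(lambda x: x % 2 == 0)

# Batch-wise filter: one Python call per batch; drop the odd records.
evens_batchwise = ds.map_batches(lambda batch: [x for x in batch if x % 2 == 0])

# Both approaches keep the same 5000 even records.
assert evens_rowwise.count() == evens_batchwise.count() == 5000
```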
Balaji Veeramani 2022-05-18 16:28:15 -07:00 committed by GitHub
parent 0b6505e8c6
commit ebe2929d4c

@@ -250,6 +250,16 @@ class Dataset(Generic[T]):
... compute=ActorPoolStrategy(2, 8), # doctest: +SKIP
... num_gpus=1) # doctest: +SKIP
You can use ``map_batches`` to efficiently filter records.

>>> import ray
>>> ds = ray.data.range(10000) # doctest: +SKIP
>>> ds.count() # doctest: +SKIP
10000
>>> ds = ds.map_batches(lambda batch: [x for x in batch if x % 2 == 0]) # doctest: +SKIP # noqa: E501
>>> ds.count() # doctest: +SKIP
5000

Time complexity: O(dataset size / parallelism)
Args:
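
The same dropping-records pattern applies to tabular datasets by returning a filtered batch. A minimal sketch, assuming `batch_format="pandas"` is available on `map_batches` (as in Ray 1.x); the column name `x` and the use of `ray.data.from_pandas` are illustrative and not taken from the diff:

```python
# Hedged sketch: batch-level filtering on a tabular dataset. The column
# name "x" and the from_pandas construction are illustrative assumptions.
import pandas as pd
import ray

ds = ray.data.from_pandas(pd.DataFrame({"x": range(10000)}))

# Each batch arrives as a pandas DataFrame; return the filtered subset.
evens = ds.map_batches(lambda df: df[df["x"] % 2 == 0], batch_format="pandas")

assert evens.count() == 5000
```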