[Datasets] Add example of using map_batches to filter (#24202)

The documentation says 

> Consider using .map_batches() for better performance (you can implement filter by dropping records).

but there aren't any examples of how to do so.
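For context, here is a minimal, runnable sketch of the technique the new docstring demonstrates, contrasted with the row-wise `.filter()` the quoted docs refer to. It assumes the Ray 1.x Datasets API that this commit targets (where batches of a `range()` dataset are plain lists); the dataset size and predicate mirror the docstring example below.

```python
# Hedged sketch (not part of the diff): implementing filter by dropping
# records inside map_batches, versus the row-wise filter() the docs mention.
# Assumes Ray 1.x Datasets, where batches of a range() dataset are lists.
import ray

ds = ray.data.range(10000)

# Row-wise filter: invokes the predicate once per record.
evens_rowwise = ds.filter(lambda x: x % 2 == 0)

# Batch-wise filter: one Python call per batch; drop the odd records.
evens_batchwise = ds.map_batches(lambda batch: [x for x in batch if x % 2 == 0])

# Both approaches keep the same 5000 even records.
assert evens_rowwise.count() == evens_batchwise.count() == 5000
```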
Balaji Veeramani 2022-05-18 16:28:15 -07:00 committed by GitHub
parent 0b6505e8c6
commit ebe2929d4c

@@ -250,6 +250,16 @@ class Dataset(Generic[T]):
... compute=ActorPoolStrategy(2, 8), # doctest: +SKIP
... num_gpus=1) # doctest: +SKIP
You can use ``map_batches`` to efficiently filter records.

>>> import ray
>>> ds = ray.data.range(10000) # doctest: +SKIP
>>> ds.count() # doctest: +SKIP
10000
>>> ds = ds.map_batches(lambda batch: [x for x in batch if x % 2 == 0]) # doctest: +SKIP # noqa: E501
>>> ds.count() # doctest: +SKIP
5000

Time complexity: O(dataset size / parallelism)
Args:
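
The same dropping-records pattern applies to tabular datasets by returning a filtered batch. A minimal sketch, assuming `batch_format="pandas"` is available on `map_batches` (as in Ray 1.x); the column name `x` and the use of `ray.data.from_pandas` are illustrative and not taken from the diff:

```python
# Hedged sketch: batch-level filtering on a tabular dataset. The column
# name "x" and the from_pandas construction are illustrative assumptions.
import pandas as pd
import ray

ds = ray.data.from_pandas(pd.DataFrame({"x": range(10000)}))

# Each batch arrives as a pandas DataFrame; return the filtered subset.
evens = ds.map_batches(lambda df: df[df["x"] % 2 == 0], batch_format="pandas")

assert evens.count() == 5000
```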