Proofread some of the datasets docs (#25068)
Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
This commit is contained in:
parent cd16dc4dae
commit 9dd30d5f77
3 changed files with 17 additions and 4 deletions

@@ -276,7 +276,7 @@ out of such a gradient rut. In the distributed data-parallel training case, the
 status quo solution is typically to have a per-shard in-memory shuffle buffer that you
 fill up and pop random batches from, without mixing data across shards between epochs.
 Ray Datasets also offers fully global random shuffling via
-:meth:`ds.random_shuffle() <ray.data.Dataset.random_shuffle()`, and doing so on an
+:meth:`ds.random_shuffle() <ray.data.Dataset.random_shuffle()>`, and doing so on an
 epoch-repeated dataset pipeline to provide global per-epoch shuffling is as simple as
 ``ray.data.read().repeat().random_shuffle_each_window()``. But when should you opt for
 global per-epoch shuffling instead of local shuffle buffer shuffling?
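
A hedged sketch of the global per-epoch shuffling pattern described above, with
``ray.data.range()`` assumed as a stand-in for a real datasource:

.. code-block:: python

    import ray

    # Build an epoch-repeated pipeline and globally reshuffle each epoch (window).
    # ray.data.range() stands in for a real datasource such as read_parquet().
    pipe = ray.data.range(100000).repeat(2).random_shuffle_each_window()

    # Consume globally shuffled batches, e.g. inside a training loop.
    for batch in pipe.iter_batches(batch_size=1024):
        pass  # train on the batch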

@@ -90,8 +90,21 @@ Parquet Column Pruning
 ~~~~~~~~~~~~~~~~~~~~~~

 Current Datasets will read all Parquet columns into memory.
-If you only need a subset of the columns, make sure to specify the list of columns explicitly when
-calling ``ray.data.read_parquet()`` to avoid loading unnecessary data.
+If you only need a subset of the columns, make sure to specify the list of columns
+explicitly when calling ``ray.data.read_parquet()`` to avoid loading unnecessary
+data (projection pushdown).
+For example, use ``ray.data.read_parquet("example://iris.parquet", columns=["sepal.length", "variety"])`` to read
+just two of the five columns of the Iris dataset.
+
+Parquet Row Pruning
+~~~~~~~~~~~~~~~~~~~
+
+Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (selection pushdown),
+which will be applied at the file scan so only rows that match the filter predicate
+will be returned.
+For example, use ``ray.data.read_parquet("example://iris.parquet", filter=pa.dataset.field("sepal.length") > 5.0)``
+to read rows with sepal.length greater than 5.0.
+This can be used in conjunction with column pruning when appropriate to get the benefits of both.
 
 Tuning Read Parallelism
 ~~~~~~~~~~~~~~~~~~~~~~~
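
A hedged sketch combining the two pruning examples above, assuming the bundled
``example://iris.parquet`` file and ``pyarrow.dataset`` for the filter expression:

.. code-block:: python

    import pyarrow.dataset as pds
    import ray

    # Projection pushdown: read only two of the five Iris columns.
    # Selection pushdown: the filter is applied during the file scan, so only
    # rows with sepal.length > 5.0 are returned.
    ds = ray.data.read_parquet(
        "example://iris.parquet",
        columns=["sepal.length", "variety"],
        filter=pds.field("sepal.length") > 5.0,
    )
    print(ds.schema())
    print(ds.count())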

@@ -74,7 +74,7 @@ Compute Strategy
 Datasets transformations are executed by either :ref:`Ray tasks <ray-remote-functions>`
 or :ref:`Ray actors <actor-guide>` across a Ray cluster. By default, Ray tasks are
 used (with ``compute="tasks"``). For transformations that require expensive setup,
-it's preferrable to use Ray actors, which are stateful and allows setup to be reused
+it's preferable to use Ray actors, which are stateful and allow setup to be reused
 for efficiency. You can specify ``compute=ray.data.ActorPoolStrategy(min, max)`` and
 Ray will use an autoscaling actor pool of ``min`` to ``max`` actors to execute your
 transforms. For a fixed-size actor pool, just specify ``ActorPoolStrategy(n, n)``.
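
A hedged sketch of the two compute strategies described above; the ``normalize``
function and the ``ExpensiveSetup`` class are hypothetical placeholders:

.. code-block:: python

    import ray

    def normalize(batch):
        # Cheap, stateless transform: runs as Ray tasks (the default,
        # i.e. compute="tasks").
        return batch

    class ExpensiveSetup:
        def __init__(self):
            # Expensive one-time setup (e.g. loading a model) happens once
            # per actor and is reused across batches.
            self.model = lambda batch: batch

        def __call__(self, batch):
            return self.model(batch)

    ds = ray.data.range(1000)
    ds = ds.map_batches(normalize)
    # Stateful transform on an autoscaling pool of 2 to 8 actors.
    ds = ds.map_batches(ExpensiveSetup, compute=ray.data.ActorPoolStrategy(2, 8))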