Proofread some of the datasets docs (#25068)
Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
This commit is contained in:
parent cd16dc4dae
commit 9dd30d5f77
3 changed files with 17 additions and 4 deletions

@@ -276,7 +276,7 @@ out of such a gradient rut. In the distributed data-parallel training case, the
 status quo solution is typically to have a per-shard in-memory shuffle buffer that you
 fill up and pop random batches from, without mixing data across shards between epochs.
 Ray Datasets also offers fully global random shuffling via
-:meth:`ds.random_shuffle() <ray.data.Dataset.random_shuffle()`, and doing so on an
+:meth:`ds.random_shuffle() <ray.data.Dataset.random_shuffle()>`, and doing so on an
 epoch-repeated dataset pipeline to provide global per-epoch shuffling is as simple as
 ``ray.data.read().repeat().random_shuffle_each_window()``. But when should you opt for
 global per-epoch shuffling instead of local shuffle buffer shuffling?
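
A hedged sketch of the global per-epoch shuffling pattern described above, with
``ray.data.range()`` assumed as a stand-in for a real datasource:

.. code-block:: python

    import ray

    # Build an epoch-repeated pipeline and globally reshuffle each epoch (window).
    # ray.data.range() stands in for a real datasource such as read_parquet().
    pipe = ray.data.range(100000).repeat(2).random_shuffle_each_window()

    # Consume globally shuffled batches, e.g. inside a training loop.
    for batch in pipe.iter_batches(batch_size=1024):
        pass  # train on the batch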

@@ -90,8 +90,21 @@ Parquet Column Pruning
 ~~~~~~~~~~~~~~~~~~~~~~

 Current Datasets will read all Parquet columns into memory.
-If you only need a subset of the columns, make sure to specify the list of columns explicitly when
-calling ``ray.data.read_parquet()`` to avoid loading unnecessary data.
+If you only need a subset of the columns, make sure to specify the list of columns
+explicitly when calling ``ray.data.read_parquet()`` to avoid loading unnecessary
+data (projection pushdown).
+For example, use ``ray.data.read_parquet("example://iris.parquet", columns=["sepal.length", "variety"])`` to read
+just two of the five columns of the Iris dataset.
+
+Parquet Row Pruning
+~~~~~~~~~~~~~~~~~~~
+
+Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (selection pushdown),
+which will be applied at the file scan so only rows that match the filter predicate
+will be returned.
+For example, use ``ray.data.read_parquet("example://iris.parquet", filter=pa.dataset.field("sepal.length") > 5.0)``
+to read rows with sepal.length greater than 5.0.
+This can be used in conjunction with column pruning when appropriate to get the benefits of both.
 
 Tuning Read Parallelism
 ~~~~~~~~~~~~~~~~~~~~~~~
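
A hedged sketch combining the two pruning examples above, assuming the bundled
``example://iris.parquet`` file and ``pyarrow.dataset`` for the filter expression:

.. code-block:: python

    import pyarrow.dataset as pds
    import ray

    # Projection pushdown: read only two of the five Iris columns.
    # Selection pushdown: the filter is applied during the file scan, so only
    # rows with sepal.length > 5.0 are returned.
    ds = ray.data.read_parquet(
        "example://iris.parquet",
        columns=["sepal.length", "variety"],
        filter=pds.field("sepal.length") > 5.0,
    )
    print(ds.schema())
    print(ds.count())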

@@ -74,7 +74,7 @@ Compute Strategy
 Datasets transformations are executed by either :ref:`Ray tasks <ray-remote-functions>`
 or :ref:`Ray actors <actor-guide>` across a Ray cluster. By default, Ray tasks are
 used (with ``compute="tasks"``). For transformations that require expensive setup,
-it's preferrable to use Ray actors, which are stateful and allows setup to be reused
+it's preferable to use Ray actors, which are stateful and allow setup to be reused
 for efficiency. You can specify ``compute=ray.data.ActorPoolStrategy(min, max)`` and
 Ray will use an autoscaling actor pool of ``min`` to ``max`` actors to execute your
 transforms. For a fixed-size actor pool, just specify ``ActorPoolStrategy(n, n)``.
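
A hedged sketch of the two compute strategies described above; the ``normalize``
function and the ``ExpensiveSetup`` class are hypothetical placeholders:

.. code-block:: python

    import ray

    def normalize(batch):
        # Cheap, stateless transform: runs as Ray tasks (the default,
        # i.e. compute="tasks").
        return batch

    class ExpensiveSetup:
        def __init__(self):
            # Expensive one-time setup (e.g. loading a model) happens once
            # per actor and is reused across batches.
            self.model = lambda batch: batch

        def __call__(self, batch):
            return self.model(batch)

    ds = ray.data.range(1000)
    ds = ds.map_batches(normalize)
    # Stateful transform on an autoscaling pool of 2 to 8 actors.
    ds = ds.map_batches(ExpensiveSetup, compute=ray.data.ActorPoolStrategy(2, 8))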