mirror of
https://github.com/vale981/ray
synced 2025-03-10 21:36:39 -04:00
![]() This PR defaults the parallelism of Dataset reads to `-1`. The parallelism is determined according to the following rule in this case: - The number of available CPUs is estimated. If in a placement group, the number of CPUs in the cluster is scaled by the size of the placement group compared to the cluster size. If not in a placement group, this is the number of CPUs in the cluster. If the estimated CPUs is less than 8, it is set to 8. - The parallelism is set to the estimated number of CPUs multiplied by 2. - The in-memory data size is estimated. If the parallelism would create in-memory blocks larger than the target block size (512MiB), the parallelism is increased until the blocks are < 512MiB in size. These rules fix two common user problems: 1. Insufficient parallelism in a large cluster, or too much parallelism on a small cluster. 2. Overly large block sizes leading to OOMs when processing a single block. TODO: - [x] Unit tests - [x] Docs update Supercedes part of: https://github.com/ray-project/ray/pull/25708 Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal> |
||
---|---|---|
.. | ||
doc_code | ||
examples | ||
images | ||
modin | ||
accessing-datasets.rst | ||
advanced-pipelines.rst | ||
big_data_ingestion.yaml | ||
creating-datasets.rst | ||
custom-data.rst | ||
dask-on-ray.rst | ||
dataset-ml-preprocessing.rst | ||
dataset-tensor-support.rst | ||
dataset.rst | ||
faq.rst | ||
getting-started.rst | ||
integrations.rst | ||
key-concepts.rst | ||
mars-on-ray.rst | ||
memory-management.rst | ||
package-ref.rst | ||
performance-tips.rst | ||
pipelining-compute.rst | ||
random-access.rst | ||
raydp.rst | ||
saving-datasets.rst | ||
transforming-datasets.rst | ||
user-guide.rst |