ray/doc/source/data
Eric Liang 9de1add073
[Datasets] Autodetect dataset parallelism based on available resources and data size (#25883)
This PR defaults the parallelism of Dataset reads to `-1`. In this case, the parallelism is determined by the following rules:
- The number of available CPUs is estimated. If in a placement group, the number of CPUs in the cluster is scaled by the size of the placement group relative to the cluster size. If not in a placement group, this is the number of CPUs in the cluster. If the estimate is less than 8, it is set to 8.
- The parallelism is set to the estimated number of CPUs multiplied by 2.
- The in-memory data size is estimated. If the parallelism would create in-memory blocks larger than the target block size (512MiB), the parallelism is increased until the blocks are smaller than 512MiB.
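
The rules above can be sketched in plain Python. This is a minimal illustration, not the code from this PR: the function name `estimate_parallelism`, its parameters, and the constant names are hypothetical, and the placement group scaling is modeled as a simple fraction of the cluster. Only the values 8, 2, and 512MiB come from the description above.

```python
from typing import Optional

# Values taken from the commit description; the names are hypothetical.
TARGET_MAX_BLOCK_SIZE = 512 * 1024 * 1024  # 512MiB target block size
MIN_ESTIMATED_CPUS = 8                     # floor applied to the CPU estimate
CPU_MULTIPLIER = 2                         # parallelism = estimated CPUs * 2


def estimate_parallelism(
    cluster_cpus: int,
    placement_group_fraction: Optional[float] = None,
    estimated_data_size: Optional[int] = None,
) -> int:
    """Sketch of the autodetection rules (hypothetical helper, not Ray's API).

    cluster_cpus: total number of CPUs in the cluster.
    placement_group_fraction: assumed ratio of placement group size to
        cluster size, or None when not running in a placement group.
    estimated_data_size: estimated in-memory size in bytes, or None if unknown.
    """
    # Rule 1: estimate available CPUs, scaled down inside a placement group.
    cpus = float(cluster_cpus)
    if placement_group_fraction is not None:
        cpus *= placement_group_fraction
    cpus = max(MIN_ESTIMATED_CPUS, int(cpus))

    # Rule 2: start from twice the estimated CPU count.
    parallelism = cpus * CPU_MULTIPLIER

    # Rule 3: raise the parallelism until each block is under the target size.
    if estimated_data_size is not None:
        # Smallest p such that estimated_data_size / p < TARGET_MAX_BLOCK_SIZE.
        min_for_size = estimated_data_size // TARGET_MAX_BLOCK_SIZE + 1
        parallelism = max(parallelism, min_for_size)

    return parallelism
```

Under these assumptions, a 4-CPU cluster is treated as having 8 CPUs and gets a parallelism of 16, while a dataset estimated at 100 times the target block size pushes the parallelism above 100 regardless of the CPU count.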

These rules fix two common user problems:
1. Insufficient parallelism in a large cluster, or too much parallelism on a small cluster.
2. Overly large block sizes leading to OOMs when processing a single block.

TODO:
- [x] Unit tests
- [x] Docs update

Supersedes part of: https://github.com/ray-project/ray/pull/25708

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
2022-07-12 21:08:49 -07:00
..
doc_code [Datasets] Update docs for drop_columns and fix typos (#26317) 2022-07-07 17:17:33 -07:00
examples [air/tune] Documentation testing fixes (#26409) 2022-07-09 19:47:21 -07:00
images [minor] Fix incorrect link to ray core user guide (#23316) 2022-03-17 20:58:56 -07:00
modin Fix broken links in documentation and put linkcheck linter in place on CI (#23340) 2022-03-18 21:02:52 -07:00
accessing-datasets.rst [Datasets] Overhaul "Accessing Datasets" feature guide. (#24963) 2022-05-19 12:50:00 -07:00
advanced-pipelines.rst [data] [docs] Doc audit-- rebalance basic vs advanced materials (#25262) 2022-06-01 13:50:46 -07:00
big_data_ingestion.yaml Revert "[docs] Clean up doc structure (first part) (#21667)" (#21763) 2022-01-20 15:30:56 -08:00
creating-datasets.rst [Datasets] Autodetect dataset parallelism based on available resources and data size (#25883) 2022-07-12 21:08:49 -07:00
custom-data.rst [Datasets] Overhaul of "Creating Datasets" feature guide. (#24831) 2022-05-17 16:23:42 -07:00
dask-on-ray.rst Update dask version for Ray 1.12.0 (#23197) 2022-03-15 19:22:19 -07:00
dataset-ml-preprocessing.rst [Datasets] Update docs for drop_columns and fix typos (#26317) 2022-07-07 17:17:33 -07:00
dataset-tensor-support.rst [Datasets] Unrevert "[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets. (#25031)" (#25531) 2022-06-08 10:33:25 -07:00
dataset.rst [Docs] [Serve] Has a consistent landing page style (#26029) 2022-07-08 11:58:21 -07:00
faq.rst Proofread the some datasets docs (#25068) 2022-05-22 12:11:51 -07:00
getting-started.rst [Datasets] [Tensor Story - 2/2] Add "numpy" batch format for batch mapping and batch consumption. (#24870) 2022-06-17 16:01:02 -07:00
integrations.rst Revamp the Getting Started page for Dataset (#24860) 2022-05-18 13:46:23 -07:00
key-concepts.rst [Datasets] Autodetect dataset parallelism based on available resources and data size (#25883) 2022-07-12 21:08:49 -07:00
mars-on-ray.rst [Datasets] Integrate Mars-on-Ray with Datasets; improve docs and add tests (#23402) 2022-04-29 09:43:52 -07:00
memory-management.rst [data] [docs] Doc audit-- rebalance basic vs advanced materials (#25262) 2022-06-01 13:50:46 -07:00
package-ref.rst [Datasets] [Tensor Story - 2/2] Add "numpy" batch format for batch mapping and batch consumption. (#24870) 2022-06-17 16:01:02 -07:00
performance-tips.rst [Datasets] Autodetect dataset parallelism based on available resources and data size (#25883) 2022-07-12 21:08:49 -07:00
pipelining-compute.rst [data] [docs] Doc audit-- rebalance basic vs advanced materials (#25262) 2022-06-01 13:50:46 -07:00
random-access.rst [Datasets] Overhaul "Accessing Datasets" feature guide. (#24963) 2022-05-19 12:50:00 -07:00
raydp.rst [Docs] Ray Data docs target state (#21931) 2022-01-27 13:14:36 -08:00
saving-datasets.rst Revamp the Transforming Datasets user guide (#25033) 2022-05-20 19:25:06 -07:00
transforming-datasets.rst [Datasets] [Tensor Story - 2/2] Add "numpy" batch format for batch mapping and batch consumption. (#24870) 2022-06-17 16:01:02 -07:00
user-guide.rst [data] [docs] Doc audit-- rebalance basic vs advanced materials (#25262) 2022-06-01 13:50:46 -07:00