ray/doc/source/data at b606169cb5dfd784144a61b46aa7e87ede592a95 - hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-10 21:36:39 -04:00

Eric Liang 9de1add073 [Datasets] Autodetect dataset parallelism based on available resources and data size (#25883)

This PR defaults the parallelism of Dataset reads to `-1`. In this case, the parallelism is determined according to the following rules:

- The number of available CPUs is estimated. If in a placement group, the number of CPUs in the cluster is scaled by the size of the placement group compared to the cluster size. If not in a placement group, this is the number of CPUs in the cluster. If the estimated number of CPUs is less than 8, it is set to 8.
- The parallelism is set to the estimated number of CPUs multiplied by 2.
- The in-memory data size is estimated. If the parallelism would create in-memory blocks larger than the target block size (512MiB), the parallelism is increased until the blocks are < 512MiB in size.

These rules fix two common user problems:

1. Insufficient parallelism in a large cluster, or too much parallelism on a small cluster.
2. Overly large block sizes leading to OOMs when processing a single block.

TODO:

- [x] Unit tests
- [x] Docs update

Supersedes part of: https://github.com/ray-project/ray/pull/25708

Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>

2022-07-12 21:08:49 -07:00
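The parallelism rules described in the commit message above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from the PR: the function name `estimate_parallelism`, its parameters, and the constant names are hypothetical, and the placement group scaling is passed in as a precomputed fraction rather than derived from the cluster.

```python
# Hypothetical sketch of the autodetection rules from the commit message.
# None of these names come from the Ray codebase.

MIN_CPUS = 8                           # floor on the estimated CPU count
TARGET_BLOCK_SIZE = 512 * 1024 * 1024  # 512MiB target block size, in bytes


def estimate_parallelism(cluster_cpus, data_size_bytes=None,
                         placement_group_fraction=None):
    """Return a read parallelism following the three rules above.

    cluster_cpus: number of CPUs in the cluster.
    data_size_bytes: estimated in-memory data size, or None if unknown.
    placement_group_fraction: assumed ratio of placement group size to
        cluster size, or None when not running in a placement group.
    """
    # Rule 1: estimate available CPUs, scaled down inside a placement group,
    # and never below the floor of 8.
    cpus = cluster_cpus
    if placement_group_fraction is not None:
        cpus = cluster_cpus * placement_group_fraction
    cpus = max(MIN_CPUS, int(cpus))

    # Rule 2: parallelism is twice the estimated CPU count.
    parallelism = cpus * 2

    # Rule 3: raise parallelism until each block is strictly smaller than
    # the target block size.
    if data_size_bytes is not None:
        min_blocks = data_size_bytes // TARGET_BLOCK_SIZE + 1
        parallelism = max(parallelism, min_blocks)

    return parallelism
```

For example, under these assumptions a 4-CPU cluster yields `estimate_parallelism(4) == 16` because of the 8-CPU floor, while a 100GiB dataset on the same cluster is split into 201 blocks so that every block stays under 512MiB.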
doc_code	[Datasets] Update docs for drop_columns and fix typos (#26317)	2022-07-07 17:17:33 -07:00
examples	[air/tune] Documentation testing fixes (#26409)	2022-07-09 19:47:21 -07:00
images	[minor] Fix incorrect link to ray core user guide (#23316)	2022-03-17 20:58:56 -07:00
modin	Fix broken links in documentation and put linkcheck linter in place on CI (#23340)	2022-03-18 21:02:52 -07:00
accessing-datasets.rst	[Datasets] Overhaul "Accessing Datasets" feature guide. (#24963)	2022-05-19 12:50:00 -07:00
advanced-pipelines.rst	[data] [docs] Doc audit-- rebalance basic vs advanced materials (#25262)	2022-06-01 13:50:46 -07:00
big_data_ingestion.yaml	Revert "[docs] Clean up doc structure (first part) (#21667)" (#21763)	2022-01-20 15:30:56 -08:00
creating-datasets.rst	[Datasets] Autodetect dataset parallelism based on available resources and data size (#25883)	2022-07-12 21:08:49 -07:00
custom-data.rst	[Datasets] Overhaul of "Creating Datasets" feature guide. (#24831)	2022-05-17 16:23:42 -07:00
dask-on-ray.rst	Update dask version for Ray 1.12.0 (#23197)	2022-03-15 19:22:19 -07:00
dataset-ml-preprocessing.rst	[Datasets] Update docs for drop_columns and fix typos (#26317)	2022-07-07 17:17:33 -07:00
dataset-tensor-support.rst	[Datasets] Unrevert "[Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets. (#25031)" (#25531)	2022-06-08 10:33:25 -07:00
dataset.rst	[Docs] [Serve] Has a consistent landing page style (#26029)	2022-07-08 11:58:21 -07:00
faq.rst	Proofread the some datasets docs (#25068)	2022-05-22 12:11:51 -07:00
getting-started.rst	[Datasets] [Tensor Story - 2/2] Add `"numpy"` batch format for batch mapping and batch consumption. (#24870)	2022-06-17 16:01:02 -07:00
integrations.rst	Revamp the Getting Started page for Dataset (#24860)	2022-05-18 13:46:23 -07:00
key-concepts.rst	[Datasets] Autodetect dataset parallelism based on available resources and data size (#25883)	2022-07-12 21:08:49 -07:00
mars-on-ray.rst	[Datasets] Integrate Mars-on-Ray with Datasets; improve docs and add tests (#23402)	2022-04-29 09:43:52 -07:00
memory-management.rst	[data] [docs] Doc audit-- rebalance basic vs advanced materials (#25262)	2022-06-01 13:50:46 -07:00
package-ref.rst	[Datasets] [Tensor Story - 2/2] Add `"numpy"` batch format for batch mapping and batch consumption. (#24870)	2022-06-17 16:01:02 -07:00
performance-tips.rst	[Datasets] Autodetect dataset parallelism based on available resources and data size (#25883)	2022-07-12 21:08:49 -07:00
pipelining-compute.rst	[data] [docs] Doc audit-- rebalance basic vs advanced materials (#25262)	2022-06-01 13:50:46 -07:00
random-access.rst	[Datasets] Overhaul "Accessing Datasets" feature guide. (#24963)	2022-05-19 12:50:00 -07:00
raydp.rst	[Docs] Ray Data docs target state (#21931)	2022-01-27 13:14:36 -08:00
saving-datasets.rst	Revamp the Transforming Datasets user guide (#25033)	2022-05-20 19:25:06 -07:00
transforming-datasets.rst	[Datasets] [Tensor Story - 2/2] Add `"numpy"` batch format for batch mapping and batch consumption. (#24870)	2022-06-17 16:01:02 -07:00
user-guide.rst	[data] [docs] Doc audit-- rebalance basic vs advanced materials (#25262)	2022-06-01 13:50:46 -07:00