Warn if the dataset's parallelism is limited by the number of files (#23573)

Users are often confused when their dataset's parallelism is capped by the number of input files. Add a warning when the available parallelism is much less than the requested parallelism, and tell the user to call repartition() in that case.
Eric Liang 2022-03-29 18:54:54 -07:00 committed by GitHub
parent 6443f3da84
commit 5aead0bb91

@@ -242,6 +242,16 @@ def read_datasource(
             )
         )
 
+    if len(read_tasks) < parallelism and (
+        len(read_tasks) < ray.available_resources().get("CPU", parallelism) // 2
+    ):
+        logger.warning(
+            "The number of blocks in this dataset ({}) limits its parallelism to {} "
+            "concurrent tasks. This is much less than the number of available "
+            "CPU slots in the cluster. Use `.repartition(n)` to increase the number of "
+            "dataset blocks.".format(len(read_tasks), len(read_tasks))
+        )
+
     context = DatasetContext.get_current()
     stats_actor = get_or_create_stats_actor()
     stats_uuid = uuid.uuid4()
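
The condition added in this hunk can be read in isolation as a small predicate. The following is a minimal sketch, not code from the commit: the function name `should_warn_low_parallelism` and its parameters are hypothetical, and the cluster's CPU count is passed in as a plain number so the sketch runs without Ray. It assumes, as the diff does, that the requested parallelism stands in for the CPU count when that count is unknown.

```python
from typing import Optional


def should_warn_low_parallelism(
    num_read_tasks: int,
    parallelism: int,
    available_cpus: Optional[float] = None,
) -> bool:
    """Return True when the block count caps parallelism well below the CPU count.

    Mirrors the condition in the diff: fewer read tasks than requested, and
    fewer than half of the available CPU slots. When the CPU count is unknown,
    fall back to the requested parallelism, as `.get("CPU", parallelism)` does.
    """
    cpus = parallelism if available_cpus is None else available_cpus
    return num_read_tasks < parallelism and num_read_tasks < cpus // 2


# Hypothetical scenarios; the numbers are chosen for illustration only.
print(should_warn_low_parallelism(4, 200, 64.0))    # True: 4 blocks, 64 CPUs
print(should_warn_low_parallelism(40, 200, 64.0))   # False: 40 blocks is over half of 64
print(should_warn_low_parallelism(200, 200, 64.0))  # False: as many blocks as requested
```

Both clauses must hold, so a dataset with fewer blocks than requested still stays quiet as long as it can keep at least half of the CPU slots busy. When the predicate is true, the remedy named in the warning text is to call `.repartition(n)` on the dataset to raise its block count.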