mirror of https://github.com/vale981/ray (synced 2025-03-05 10:01:43 -05:00)
Warn if the dataset's parallelism is limited by the number of files (#23573)
Users are often confused when a dataset's parallelism is limited by the number of input files. Add a warning when the available parallelism (the number of read tasks) is much less than the requested parallelism, and tell the user to call repartition() in that case.
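To see why the file count caps parallelism, here is a minimal sketch. It is not Ray's actual implementation. It assumes a file-based reader that assigns whole files to read tasks and never splits a single file; `plan_read_tasks` is a hypothetical helper introduced only for illustration.

```python
from typing import List


def plan_read_tasks(paths: List[str], parallelism: int) -> List[List[str]]:
    """Hypothetical helper: group whole files into at most `parallelism` read tasks.

    Assumption: a file is never split across tasks, so the number of
    non-empty tasks can never exceed the number of files.
    """
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    buckets: List[List[str]] = [[] for _ in range(parallelism)]
    for i, path in enumerate(paths):
        buckets[i % parallelism].append(path)  # round-robin assignment
    return [b for b in buckets if b]  # drop tasks that received no file


# Requesting 200-way parallelism over 3 files still yields only 3 tasks.
tasks = plan_read_tasks(["a.csv", "b.csv", "c.csv"], parallelism=200)
print(len(tasks))  # 3
```

Under that assumption, asking for more parallelism than there are files has no effect on the read itself, which is the situation the new warning is meant to surface.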
parent 6443f3da84
commit 5aead0bb91
1 changed file with 10 additions and 0 deletions
@@ -242,6 +242,16 @@ def read_datasource(
         )
     )
 
+    if len(read_tasks) < parallelism and (
+        len(read_tasks) < ray.available_resources().get("CPU", parallelism) // 2
+    ):
+        logger.warning(
+            "The number of blocks in this dataset ({}) limits its parallelism to {} "
+            "concurrent tasks. This is much less than the number of available "
+            "CPU slots in the cluster. Use `.repartition(n)` to increase the number of "
+            "dataset blocks.".format(len(read_tasks), len(read_tasks))
+        )
+
     context = DatasetContext.get_current()
     stats_actor = get_or_create_stats_actor()
     stats_uuid = uuid.uuid4()
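The condition added in this hunk can be isolated as a pure function to make its behavior easier to check. This is a sketch rather than code from the commit: `should_warn_low_parallelism` and its `available_cpus` parameter are hypothetical names, with `available_cpus` standing in for the value of `ray.available_resources().get("CPU", parallelism)`.

```python
from typing import Optional


def should_warn_low_parallelism(
    num_read_tasks: int,
    parallelism: int,
    available_cpus: Optional[float] = None,
) -> bool:
    """Hypothetical stand-in for the warning condition in the diff.

    `available_cpus=None` models a missing "CPU" entry, in which case the
    diff falls back to the requested parallelism.
    """
    cpus = parallelism if available_cpus is None else available_cpus
    # Warn only when the read produced fewer tasks than requested AND
    # fewer than half of the available CPU slots.
    return num_read_tasks < parallelism and num_read_tasks < cpus // 2


print(should_warn_low_parallelism(2, 200, 16.0))    # True: 2 < 200 and 2 < 8
print(should_warn_low_parallelism(8, 200, 16.0))    # False: 8 is not < 8
print(should_warn_low_parallelism(200, 200, 16.0))  # False: request was met
print(should_warn_low_parallelism(2, 200))          # True: falls back to 200 // 2
```

The second clause keeps the warning quiet on small clusters: a dataset with fewer blocks than requested only triggers it when those blocks would leave more than half of the CPU slots idle.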