ray/python
Stephanie Wang 93aae48b80
[dataset] Pipeline task submission during reduce stage in push-based shuffle (#25795)
Reduce stage in push-based shuffle fails to complete at 100k output partitions or more. This is likely because of driver or raylet load from having too many tasks in flight at once.

We can fix this from ray core too, but for now, this PR adds pipelining for the reduce stage, to limit the total number of reduce tasks in flight at the same time. This is currently set to 2 * available parallelism in the cluster. We have to pick which reduce tasks to submit carefully since these are pinned to specific nodes. The PR does this by assigning tasks round-robin according to the corresponding merge task (which get spread throughout the cluster).

In addition, this PR refactors the map, merge, and reduce stages to use a common pipelined iterator pattern, since they all have a similar pattern of submitting a round of tasks at a time, then waiting for a previous round to finish before submitting more.
Related issue number

Closes #25412.
2022-06-17 17:33:16 -07:00
..
ray [dataset] Pipeline task submission during reduce stage in push-based shuffle (#25795) 2022-06-17 17:33:16 -07:00
requirements [air] Move python/ray/ml to python/ray/air (#25449) 2022-06-03 21:53:44 +01:00
asv.conf.json [docs] Move all /latest links to /master (#11897) 2020-11-10 10:53:28 -08:00
build-wheel-macos-arm64.sh [python3.10] build python310 wheels (#24829) 2022-05-16 12:36:33 -07:00
build-wheel-macos.sh [python3.10] build python310 wheels (#24829) 2022-05-16 12:36:33 -07:00
build-wheel-manylinux2014.sh [python3.10] build python310 wheels (#24829) 2022-05-16 12:36:33 -07:00
build-wheel-windows.sh [python3.10] build python310 wheels (#24829) 2022-05-16 12:36:33 -07:00
MANIFEST.in Exclude Bazel build files from Ray wheels (#25679) 2022-06-11 16:05:59 -07:00
README-building-wheels.md [build] Build wheels with manylinux2014 (#11621) 2020-11-03 19:36:32 -08:00
requirements.txt [dataset] Use polars for sorting (#25454) 2022-06-17 12:26:46 -07:00
requirements_linters.txt Add import sorting to format.sh (#25678) 2022-06-13 14:08:51 -07:00
requirements_ml_docker.txt [AIR] Add distributed torch_geometric example (#23580) 2022-04-21 09:48:43 -07:00
setup.py apply isort uniformly for a subset of directories (#25824) 2022-06-17 13:40:32 -07:00