ray/python
Stephanie Wang c1054a0baa
[Datasets] Implement push-based shuffle (#23758)
The simple shuffle currently implemented in Datasets does not scale reliably past about 1000 partitions due to metadata and I/O overhead.

This PR adds an experimental implementation of "push-based shuffle", as described in this paper draft. This algorithm should perform better at larger data scales. It works by merging intermediate map outputs on the reducer side while other map tasks are still executing. A final reduce task then combines these merged outputs.
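The map, merge, and reduce stages described above can be sketched as a sequential, single-process simulation. This is an illustration only, not the code added in this PR: the function names, the `maps_per_round` parameter, and the modulo partitioning are all assumptions. The real implementation would run these stages as distributed tasks and overlap each merge with the next round of maps.

```python
from typing import List


def map_task(block: List[int], num_reducers: int) -> List[List[int]]:
    """Partition one input block into one bucket per reducer (hypothetical helper)."""
    buckets: List[List[int]] = [[] for _ in range(num_reducers)]
    for x in block:
        buckets[x % num_reducers].append(x)
    return buckets


def merge_task(map_outputs: List[List[List[int]]]) -> List[List[int]]:
    """Combine the buckets from one round of map tasks, per reducer."""
    num_reducers = len(map_outputs[0])
    merged: List[List[int]] = [[] for _ in range(num_reducers)]
    for buckets in map_outputs:
        for r, bucket in enumerate(buckets):
            merged[r].extend(bucket)
    return merged


def reduce_task(merged_parts: List[List[int]]) -> List[int]:
    """Final reduce: combine the already-merged outputs for one reducer."""
    out: List[int] = []
    for part in merged_parts:
        out.extend(part)
    return sorted(out)


def push_based_shuffle(
    blocks: List[List[int]], num_reducers: int, maps_per_round: int
) -> List[List[int]]:
    merged_rounds = []
    for start in range(0, len(blocks), maps_per_round):
        round_blocks = blocks[start:start + maps_per_round]
        map_outputs = [map_task(b, num_reducers) for b in round_blocks]
        # In a distributed setting, this merge would run while the next
        # round of map tasks executes. Here it runs sequentially.
        merged_rounds.append(merge_task(map_outputs))
    # Each reducer reads one merged part per round instead of one part per map.
    return [
        reduce_task([merged[r] for merged in merged_rounds])
        for r in range(num_reducers)
    ]


result = push_based_shuffle([[5, 2], [9, 4], [1, 8], [3, 6]], num_reducers=2, maps_per_round=2)
```

With four input blocks and two maps per round, each reducer reads two merged parts rather than four map outputs. That reduction in the number of intermediate objects is the property the merge stage is meant to provide.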

Currently, the PR exposes this option through the DatasetContext. It can also be set through a hidden OS environment variable (RAY_DATASET_PUSH_BASED_SHUFFLE). Once we have more comprehensive benchmarks, we can better document this option and allow the algorithm to be chosen at run time.
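A minimal sketch of enabling the option through the environment variable named above. The value `"1"` and the need to set it before Ray is imported are assumptions; this description does not state how the variable is parsed or when it is read.

```python
import os

# Assumption: the variable is read when Ray Datasets is first imported,
# so set it before `import ray`. The value "1" is also an assumption.
os.environ["RAY_DATASET_PUSH_BASED_SHUFFLE"] = "1"

# The DatasetContext route would be set in code after importing Ray.
# The attribute name is not given in this description, so it is not shown.
```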

Related issue number

Closes #23758.
2022-04-27 11:59:41 -07:00
Name | Last commit | Date
ray | [Datasets] Implement push-based shuffle (#23758) | 2022-04-27 11:59:41 -07:00
requirements | [AIR] Add distributed torch_geometric example (#23580) | 2022-04-21 09:48:43 -07:00
asv.conf.json | [docs] Move all /latest links to /master (#11897) | 2020-11-10 10:53:28 -08:00
build-wheel-macos-arm64.sh | [ci] Clean up ci/ directory (refactor ci/travis) (#23866) | 2022-04-13 18:11:30 +01:00
build-wheel-macos.sh | [ci] Clean up ci/ directory (refactor ci/travis) (#23866) | 2022-04-13 18:11:30 +01:00
build-wheel-manylinux2014.sh | [ci] Clean up ci/ directory (refactor ci/travis) (#23866) | 2022-04-13 18:11:30 +01:00
build-wheel-windows.sh | [ci] Clean up ci/ directory (refactor ci/travis) (#23866) | 2022-04-13 18:11:30 +01:00
MANIFEST.in | Includes .pyi files in package data. (#21247) | 2021-12-27 11:50:02 -08:00
README-building-wheels.md | [build] Build wheels with manylinux2014 (#11621) | 2020-11-03 19:36:32 -08:00
requirements.txt | [core] Fix internal storage S3 bugs (#24167) | 2022-04-27 09:57:14 -07:00
requirements_linters.txt | Remove yapf dependency (#23656) | 2022-04-04 21:50:04 -07:00
requirements_ml_docker.txt | [AIR] Add distributed torch_geometric example (#23580) | 2022-04-21 09:48:43 -07:00
setup.py | [air] Move storage handling to pyarrow.fs.FileSystem (#23370) | 2022-04-13 14:31:30 -07:00