ray/release/nightly_tests/dataset
Latest commit c62e00ed6d by Stephanie Wang
[dataset] Use polars for sorting (#24523)
Polars is significantly faster than the current pyarrow-based sort. This PR uses polars for the internal sort implementation when it is available; no API changes are needed.

On my laptop, this makes sorting 1GB about 2x faster:

Without polars:

$ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100
Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total
Finished in 50.23415923118591
...
Stage 2 sort: executed in 38.59s

        Substage 0 sort_map: 100/100 blocks executed
        * Remote wall time: 864.21ms min, 1.94s max, 1.4s mean, 140.39s total
        * Remote cpu time: 634.07ms min, 825.47ms max, 719.87ms mean, 71.99s total
        * Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total
        * Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total
        * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used

        Substage 1 sort_reduce: 100/100 blocks executed
        * Remote wall time: 125.66ms min, 2.3s max, 1.09s mean, 109.26s total
        * Remote cpu time: 96.17ms min, 1.34s max, 725.43ms mean, 72.54s total
        * Output num rows: 178073 min, 2313038 max, 1250000 mean, 125000000 total
        * Output size bytes: 1446844 min, 18793434 max, 10156250 mean, 1015625046 total
        * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used

With polars:

$ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100
Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total
Finished in 24.097432136535645
...
Stage 2 sort: executed in 14.02s

        Substage 0 sort_map: 100/100 blocks executed
        * Remote wall time: 165.15ms min, 595.46ms max, 398.01ms mean, 39.8s total
        * Remote cpu time: 349.75ms min, 423.81ms max, 383.29ms mean, 38.33s total
        * Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total
        * Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total
        * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used

        Substage 1 sort_reduce: 100/100 blocks executed
        * Remote wall time: 21.21ms min, 472.34ms max, 232.1ms mean, 23.21s total
        * Remote cpu time: 29.81ms min, 460.67ms max, 238.1ms mean, 23.81s total
        * Output num rows: 114079 min, 2591410 max, 1250000 mean, 125000000 total
        * Output size bytes: 912632 min, 20731280 max, 10000000 mean, 1000000000 total
        * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used

Related issue number

Closes #23612.
2022-05-12 18:35:50 -07:00
app_config.yaml [Dataset][nighlyt-test] pin pyarrow==4.0.1 for dataset related tests (#22277) 2022-02-10 14:22:41 -08:00
dataset_ingest_400G_compute.yaml [dataset][cuj2] add another single node ingestion example (#20754) 2021-12-07 02:50:17 -08:00
dataset_random_access.py Add random access support for Datasets (experimental feature) (#22749) 2022-03-17 15:01:12 -07:00
dataset_shuffle_data_loader.py [CI] Format Python code with Black (#21975) 2022-01-29 18:41:57 -08:00
inference.py [CI] Format Python code with Black (#21975) 2022-01-29 18:41:57 -08:00
inference.yaml [Dataset] imagenet nightly test (#17069) 2021-07-16 14:15:49 -07:00
parquet_metadata_resolution.py [Datasets] Patch Parquet file fragment serialization to prevent metadata fetching. (#22665) 2022-02-28 15:15:30 -08:00
pipelined_ingestion_app.yaml [Dataset][nighlyt-test] pin pyarrow==4.0.1 for dataset related tests (#22277) 2022-02-10 14:22:41 -08:00
pipelined_ingestion_compute.yaml Don't advertise cpus on gpu nodes for pipelined ingestion tests (#21899) 2022-01-27 09:17:01 -08:00
pipelined_training.py Enable stage fusion by default for dataset pipelines (#22476) 2022-02-23 17:34:05 -08:00
pipelined_training_app.yaml [Dataset][nighlyt-test] pin pyarrow==4.0.1 for dataset related tests (#22277) 2022-02-10 14:22:41 -08:00
pipelined_training_compute.yaml [Dataset][nighlyt-test] spend less money #19488 2021-10-18 18:53:50 -07:00
ray_sgd_runner.py Round robin during spread scheduling (#21303) 2022-02-18 15:05:35 -08:00
ray_sgd_training.py Round robin during spread scheduling (#21303) 2022-02-18 15:05:35 -08:00
ray_sgd_training_app.yaml [Dataset][nighlyt-test] pin pyarrow==4.0.1 for dataset related tests (#22277) 2022-02-10 14:22:41 -08:00
ray_sgd_training_compute.yaml [CUJ2] add nightly tests for running 500GB ray train (#20195) 2021-11-21 20:04:45 -08:00
ray_sgd_training_compute_no_gpu.yaml [nighly-test] update cuj2 to reflect latest change #20889 2021-12-06 09:59:21 -08:00
ray_sgd_training_smoke_compute.yaml [Dataset][nighlytest] use latest ray for running test #21148 2021-12-17 23:48:44 -08:00
shuffle_app_config.yaml [Dataset][nighlyt-test] pin pyarrow==4.0.1 for dataset related tests (#22277) 2022-02-10 14:22:41 -08:00
shuffle_compute.yaml add dataset shuffle data loader (#17917) 2021-08-20 11:26:01 -07:00
sort.py [dataset] Use polars for sorting (#24523) 2022-05-12 18:35:50 -07:00