mirror of
https://github.com/vale981/ray
synced 2025-03-10 05:16:49 -04:00
This PR adds experimental support for random access to datasets. A Dataset can be enabled for random access by calling `ds.to_random_access_dataset(key, num_workers=N)`, which creates a RandomAccessDataset. A RandomAccessDataset partitions the dataset across the cluster by the given sort key and provides efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset. Performance-wise, you can expect each worker to provide ~3000 records / second via `get_async()`, and ~10000 records / second via `multiget()`. Since Ray actor calls go directly from worker to worker, throughput scales linearly with the number of workers.

| File |
|---|
| app_config.yaml |
| dataset_ingest_400G_compute.yaml |
| dataset_random_access.py |
| dataset_shuffle_data_loader.py |
| inference.py |
| inference.yaml |
| parquet_metadata_resolution.py |
| pipelined_ingestion_app.yaml |
| pipelined_ingestion_compute.yaml |
| pipelined_training.py |
| pipelined_training_app.yaml |
| pipelined_training_compute.yaml |
| ray_sgd_runner.py |
| ray_sgd_training.py |
| ray_sgd_training_app.yaml |
| ray_sgd_training_compute.yaml |
| ray_sgd_training_compute_no_gpu.yaml |
| ray_sgd_training_smoke_compute.yaml |
| shuffle_app_config.yaml |
| shuffle_compute.yaml |
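
The commit message describes partitioning a dataset by a sort key and serving lookups via binary search. The following is a minimal single-process sketch of that idea using only the Python standard library. It is not Ray's implementation: the class `SortedBlockIndex`, its `num_blocks` parameter, and its synchronous `get` method are hypothetical names chosen for illustration, and plain lists stand in for the worker actors and zero-copy data blocks.

```python
import bisect
from typing import Any, Dict, List, Optional, Sequence

Record = Dict[str, Any]


class SortedBlockIndex:
    """Toy stand-in for a key-partitioned, sorted dataset (hypothetical)."""

    def __init__(self, records: Sequence[Record], key: str, num_blocks: int):
        if num_blocks < 1:
            raise ValueError("num_blocks must be at least 1")
        # Sort once by the chosen key, then cut the result into contiguous blocks.
        ordered = sorted(records, key=lambda r: r[key])
        size = max(1, -(-len(ordered) // num_blocks))  # ceiling division
        self.blocks: List[List[Record]] = [
            ordered[i:i + size] for i in range(0, len(ordered), size)
        ]
        # Cache each block's keys and its largest key for the two binary searches.
        self.block_keys = [[r[key] for r in block] for block in self.blocks]
        self.upper_bounds = [keys[-1] for keys in self.block_keys]

    def get(self, k: Any) -> Optional[Record]:
        # First binary search: find the first block whose largest key is >= k.
        b = bisect.bisect_left(self.upper_bounds, k)
        if b == len(self.blocks):
            return None
        # Second binary search: locate k inside that block.
        keys = self.block_keys[b]
        i = bisect.bisect_left(keys, k)
        if i < len(keys) and keys[i] == k:
            return self.blocks[b][i]
        return None

    def multiget(self, ks: Sequence[Any]) -> List[Optional[Record]]:
        # Batched lookup; in this sketch it simply repeats the single lookup.
        return [self.get(k) for k in ks]


# Usage: build the index from unsorted records and look up a few keys.
records = [{"id": i, "value": i * i} for i in reversed(range(10))]
index = SortedBlockIndex(records, key="id", num_blocks=3)
found = index.get(5)
batch = index.multiget([0, 9, 42])
```

Each lookup costs two binary searches, one over the block boundaries and one inside the selected block, so it stays logarithmic in the number of records. In the design described in the commit message, the blocks would be held by separate worker actors rather than by one in-process list.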