mirror of
https://github.com/vale981/ray
synced 2025-03-08 19:41:38 -05:00
This PR adds experimental support for random access to datasets. A Dataset can be enabled for random access by calling `ds.to_random_access_dataset(key, num_workers=N)`, which creates a RandomAccessDataset. A RandomAccessDataset partitions the dataset across the cluster by the given sort key and provides efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset. Performance-wise, you can expect each worker to provide ~3000 records / second via `get_async()`, and ~10000 records / second via `multiget()`. Since Ray actor calls go directly from worker to worker, throughput scales linearly with the number of workers. |
||
---|---|---|
.. | ||
doc_code | ||
examples | ||
images | ||
modin | ||
advanced-pipelines.rst | ||
big_data_ingestion.yaml | ||
custom-data.rst | ||
dask-on-ray.rst | ||
dataset-ml-preprocessing.rst | ||
dataset-tensor-support.rst | ||
dataset.rst | ||
getting-started.rst | ||
integrations.rst | ||
key-concepts.rst | ||
mars-on-ray.rst | ||
package-ref.rst | ||
performance-tips.rst | ||
random-access.rst | ||
raydp.rst | ||
user-guide.rst |
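
The commit message above describes lookups over data that is partitioned by a sort key and searched with binary search. The following is a minimal single-process sketch of that idea in plain Python. It does not use Ray and is not the RandomAccessDataset implementation; the class `SortedBlockIndex` and its methods are hypothetical names chosen for illustration, and records are assumed to be dictionaries with comparable key values.

```python
from bisect import bisect_left


class SortedBlockIndex:
    """Hypothetical stand-in: records sorted by a key and split into blocks."""

    def __init__(self, records, key, num_blocks):
        rows = sorted(records, key=lambda r: r[key])
        # Ceiling division so that at most num_blocks blocks are created.
        size = max(1, -(-len(rows) // num_blocks))
        self.blocks = [rows[i:i + size] for i in range(0, len(rows), size)]
        # Per-block key columns, kept sorted for binary search.
        self.block_keys = [[r[key] for r in block] for block in self.blocks]
        # Largest key in each block, used to route a lookup to one block.
        self.upper_bounds = [keys[-1] for keys in self.block_keys]

    def get(self, k):
        """Return the first record whose key equals k, or None if absent."""
        # First binary search: find the block whose key range can contain k.
        b = bisect_left(self.upper_bounds, k)
        if b == len(self.blocks):
            return None
        # Second binary search: find k inside that block.
        keys = self.block_keys[b]
        i = bisect_left(keys, k)
        if i < len(keys) and keys[i] == k:
            return self.blocks[b][i]
        return None

    def multiget(self, keys):
        """Look up several keys in one call; missing keys map to None."""
        return [self.get(k) for k in keys]
```

In this sketch each block plays the role of a data block that a worker would serve, and `multiget` only batches calls to `get`. Any throughput benefit of batching in a distributed setting is not modeled here.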