mirror of
https://github.com/vale981/ray
synced 2025-03-09 04:46:38 -04:00

This PR adds experimental support for random access to datasets. A Dataset can be enabled for random access by calling ``ds.to_random_access_dataset(key, num_workers=N)``, which creates a RandomAccessDataset. The RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset. Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()`` and ~10000 records / second via ``multiget()``. Since Ray actor calls go directly from worker to worker, throughput scales linearly with the number of workers.
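The lookup idea described above (sort by key, split into blocks, binary-search to the record) can be illustrated with a minimal pure-Python sketch. This is not Ray's implementation and does not use Ray: the class ``SortedBlockIndex``, its parameters, and the sample records are hypothetical names chosen for illustration, and the blocks live in one process rather than being served by worker actors.

```python
import bisect


class SortedBlockIndex:
    """Toy, single-process illustration of random access over sorted blocks.

    Hypothetical helper, not part of Ray. Records are sorted by `key` and
    split into blocks; a lookup binary-searches the block boundaries first,
    then binary-searches inside the chosen block.
    """

    def __init__(self, records, key, num_blocks):
        if num_blocks < 1:
            raise ValueError("num_blocks must be at least 1")
        rows = sorted(records, key=lambda r: r[key])
        size = max(1, -(-len(rows) // num_blocks))  # ceiling division
        self._blocks = [rows[i:i + size] for i in range(0, len(rows), size)]
        # Per-block sorted key lists, plus the largest key of each block.
        self._block_keys = [[r[key] for r in b] for b in self._blocks]
        self._upper_bounds = [keys[-1] for keys in self._block_keys]

    def get(self, k):
        """Return the record whose key equals k, or None if absent."""
        # First block whose largest key is >= k.
        b = bisect.bisect_left(self._upper_bounds, k)
        if b == len(self._blocks):
            return None
        keys = self._block_keys[b]
        i = bisect.bisect_left(keys, k)
        if i < len(keys) and keys[i] == k:
            return self._blocks[b][i]
        return None

    def multiget(self, keys):
        """Look up several keys; missing keys yield None."""
        return [self.get(k) for k in keys]


# Hypothetical sample data: 100 records keyed by "id", split into 4 blocks.
index = SortedBlockIndex(
    [{"id": i, "value": i * i} for i in range(100)], key="id", num_blocks=4
)
```

In this sketch each lookup costs two binary searches, one over block boundaries and one within a block. In the design the PR describes, the blocks would instead be held by worker actors and queried through ``get_async()`` or ``multiget()``.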
.. _data_user_guide:

===========
User Guides
===========

If you’re new to Ray Datasets, we recommend starting with the :ref:`Ray Datasets Quick Start <ray_datasets_quick_start>`.
This user guide will help you navigate the Ray Datasets project and show you how to accomplish several tasks. For instance,
you will learn:

- how to load data and preprocess it for machine learning applications,
- how to use Tensors with Ray Datasets,
- how to run Dataset Pipelines in common scenarios,
- and how to tune your Ray Datasets applications for performance.

.. toctree::
   :maxdepth: 2

   dataset-ml-preprocessing
   dataset-tensor-support
   advanced-pipelines
   random-access
   custom-data
   performance-tips