ray/doc/source/data/user-guide.rst
Eric Liang 015181ab9a
Add random access support for Datasets (experimental feature) (#22749)
This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset.

RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.

Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``.

Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.
2022-03-17 15:01:12 -07:00


.. _data_user_guide:

===========
User Guides
===========

If you're new to Ray Datasets, we recommend starting with the :ref:`Ray Datasets Quick Start <ray_datasets_quick_start>`.
This user guide helps you navigate the Ray Datasets project and shows you how to accomplish several tasks. For instance,
you will learn

- how to load data and preprocess it for machine learning applications,
- how to use Tensors with Ray Datasets,
- how to run Dataset Pipelines in common scenarios,
- and how to tune your Ray Datasets applications for performance.

.. toctree::
   :maxdepth: 2

   dataset-ml-preprocessing
   dataset-tensor-support
   advanced-pipelines
   random-access
   custom-data
   performance-tips