.. _datasets:

Datasets: Flexible Distributed Data Loading
===========================================

.. tip::

  Datasets is available as **beta** in Ray 1.8+. Please file feature requests and bug reports on GitHub Issues or join the discussion on the `Ray Slack <https://forms.gle/9TSdDYUgxYs8SA9e8>`__.

Ray Datasets are the standard way to load and exchange data in Ray libraries and applications. Datasets provide basic distributed data transformations such as ``map``, ``filter``, and ``repartition``, and are compatible with a variety of file formats, datasources, and distributed frameworks.

.. image:: dataset.svg

..
  https://docs.google.com/drawings/d/16AwJeBNR46_TsrkOmMbGaBK7u-OPsf_V8fHjU-d2PPQ/edit

Concepts
--------
Ray Datasets implement `Distributed Arrow <https://arrow.apache.org/>`__. A Dataset consists of a list of Ray object references to *blocks*. Each block holds a set of items in either an `Arrow table <https://arrow.apache.org/docs/python/data.html#tables>`__ or a Python list (for Arrow-incompatible objects). Having multiple blocks in a Dataset allows the data to be transformed and ingested in parallel (e.g., into :ref:`Ray Train <train-docs>` for ML training).

The following figure visualizes a Dataset that has three Arrow table blocks, each holding 1000 rows:

.. image:: dataset-arch.svg

..
  https://docs.google.com/drawings/d/1PmbDvHRfVthme9XD7EYM-LIHPXtHdOfjCbc1SCsM64k/edit

Since a Ray Dataset is just a list of Ray object references, it can be freely passed between Ray tasks, actors, and libraries like any other object reference. This flexibility is a unique characteristic of Ray Datasets.
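
To make the block model concrete, the following is a minimal pure-Python sketch, not Ray's implementation. A dataset is modeled as a plain list of blocks, and a transformation is applied to each block independently. The names ``make_blocks`` and ``map_blocks`` are hypothetical helpers, and a thread pool stands in for Ray tasks:

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor
    from typing import Callable, List

    Block = List[int]  # stand-in for an Arrow table or Python list block


    def make_blocks(items: List[int], num_blocks: int) -> List[Block]:
        """Split items into num_blocks contiguous blocks of near-equal size."""
        size, extra = divmod(len(items), num_blocks)
        blocks, start = [], 0
        for i in range(num_blocks):
            end = start + size + (1 if i < extra else 0)
            blocks.append(items[start:end])
            start = end
        return blocks


    def map_blocks(blocks: List[Block], fn: Callable[[int], int]) -> List[Block]:
        """Apply fn to every item, processing each block as a separate job."""
        with ThreadPoolExecutor() as pool:
            return list(pool.map(lambda block: [fn(x) for x in block], blocks))


    blocks = make_blocks(list(range(3000)), num_blocks=3)
    doubled = map_blocks(blocks, lambda x: x * 2)
    print([len(b) for b in doubled])  # -> [1000, 1000, 1000]
    print(doubled[1][:3])             # -> [2000, 2002, 2004]

Because each block is processed on its own, the number of blocks bounds how much of the work can run in parallel.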

Compared to `Spark RDDs <https://spark.apache.org/docs/latest/rdd-programming-guide.html>`__ and `Dask Bags <https://docs.dask.org/en/latest/bag.html>`__, Datasets offer a more basic set of features and execute operations eagerly for simplicity. Users are expected to convert Datasets into more featureful dataframe types (e.g., ``ds.to_dask()``) for advanced operations.

Datasource Compatibility Matrices
---------------------------------


.. list-table:: Input compatibility matrix
   :header-rows: 1

   * - Input Type
     - Read API
     - Status
   * - CSV File Format
     - ``ray.data.read_csv()``
     - ✅
   * - JSON File Format
     - ``ray.data.read_json()``
     - ✅
   * - Parquet File Format
     - ``ray.data.read_parquet()``
     - ✅
   * - Numpy File Format
     - ``ray.data.read_numpy()``
     - ✅
   * - Text Files
     - ``ray.data.read_text()``
     - ✅
   * - Binary Files
     - ``ray.data.read_binary_files()``
     - ✅
   * - Python Objects
     - ``ray.data.from_items()``
     - ✅
   * - Spark Dataframe
     - ``ray.data.from_spark()``
     - ✅
   * - Dask Dataframe
     - ``ray.data.from_dask()``
     - ✅
   * - Modin Dataframe
     - ``ray.data.from_modin()``
     - ✅
   * - MARS Dataframe
     - ``ray.data.from_mars()``
     - (todo)
   * - Pandas Dataframe Objects
     - ``ray.data.from_pandas()``
     - ✅
   * - NumPy ndarray Objects
     - ``ray.data.from_numpy()``
     - ✅
   * - Arrow Table Objects
     - ``ray.data.from_arrow()``
     - ✅
   * - Custom Datasource
     - ``ray.data.read_datasource()``
     - ✅


.. list-table:: Output compatibility matrix
   :header-rows: 1

   * - Output Type
     - Dataset API
     - Status
   * - CSV File Format
     - ``ds.write_csv()``
     - ✅
   * - JSON File Format
     - ``ds.write_json()``
     - ✅
   * - Parquet File Format
     - ``ds.write_parquet()``
     - ✅
   * - Numpy File Format
     - ``ds.write_numpy()``
     - ✅
   * - Spark Dataframe
     - ``ds.to_spark()``
     - ✅
   * - Dask Dataframe
     - ``ds.to_dask()``
     - ✅
   * - Modin Dataframe
     - ``ds.to_modin()``
     - ✅
   * - MARS Dataframe
     - ``ds.to_mars()``
     - (todo)
   * - Arrow Table Objects
     - ``ds.to_arrow_refs()``
     - ✅
   * - Arrow Table Iterator
     - ``ds.iter_batches(batch_format="pyarrow")``
     - ✅
   * - Single Pandas Dataframe
     - ``ds.to_pandas()``
     - ✅
   * - Pandas Dataframe Objects
     - ``ds.to_pandas_refs()``
     - ✅
   * - NumPy ndarray Objects
     - ``ds.to_numpy_refs()``
     - ✅
   * - Pandas Dataframe Iterator
     - ``ds.iter_batches(batch_format="pandas")``
     - ✅
   * - PyTorch Iterable Dataset
     - ``ds.to_torch()``
     - ✅
   * - TensorFlow Iterable Dataset
     - ``ds.to_tf()``
     - ✅
   * - Custom Datasource
     - ``ds.write_datasource()``
     - ✅


Creating Datasets
-----------------

.. tip::

   Run ``pip install "ray[data]"`` to get started!

Get started by creating Datasets from synthetic data using ``ray.data.range()`` and ``ray.data.from_items()``. Datasets can hold either plain Python objects (schema is a Python type), or Arrow records (schema is Arrow).

.. code-block:: python

    import ray
    
    # Create a Dataset of Python objects.
    ds = ray.data.range(10000)
    # -> Dataset(num_blocks=200, num_rows=10000, schema=<class 'int'>)

    ds.take(5)
    # -> [0, 1, 2, 3, 4]

    ds.count()
    # -> 10000

    # Create a Dataset of Arrow records.
    ds = ray.data.from_items([{"col1": i, "col2": str(i)} for i in range(10000)])
    # -> Dataset(num_blocks=200, num_rows=10000, schema={col1: int64, col2: string})

    ds.show(5)
    # -> {'col1': 0, 'col2': '0'}
    # -> {'col1': 1, 'col2': '1'}
    # -> {'col1': 2, 'col2': '2'}
    # -> {'col1': 3, 'col2': '3'}
    # -> {'col1': 4, 'col2': '4'}

    ds.schema()
    # -> col1: int64
    # -> col2: string

Datasets can be created from files on local disk or remote datasources such as S3. Any filesystem `supported by pyarrow <http://arrow.apache.org/docs/python/generated/pyarrow.fs.FileSystem.html>`__ can be used to specify file locations:

.. code-block:: python

    # Read a directory of files in remote storage.
    ds = ray.data.read_csv("s3://bucket/path")

    # Read multiple local files.
    ds = ray.data.read_csv(["/path/to/file1", "/path/to/file2"])

    # Read multiple directories.
    ds = ray.data.read_csv(["s3://bucket/path1", "s3://bucket/path2"])

Finally, you can create a ``Dataset`` from existing data in the Ray object store or Ray-compatible distributed DataFrames:

.. code-block:: python

    import pandas as pd
    import dask.dataframe as dd

    # Create a Dataset from a list of Pandas DataFrame objects.
    pdf = pd.DataFrame({"one": [1, 2, 3], "two": ["a", "b", "c"]})
    ds = ray.data.from_pandas([pdf])

    # Create a Dataset from a Dask-on-Ray DataFrame.
    dask_df = dd.from_pandas(pdf, npartitions=10)
    ds = ray.data.from_dask(dask_df)

Saving Datasets
---------------

Datasets can be written to local or remote storage using ``.write_csv()``, ``.write_json()``, and ``.write_parquet()``.

.. code-block:: python

    # Write to csv files in /tmp/output.
    ray.data.range(10000).write_csv("/tmp/output")
    # -> /tmp/output/data0.csv, /tmp/output/data1.csv, ...

    # Use repartition to control the number of output files:
    ray.data.range(10000).repartition(1).write_csv("/tmp/output2")
    # -> /tmp/output2/data0.csv

You can also convert a ``Dataset`` to Ray-compatible distributed DataFrames:

.. code-block:: python

    # Convert a Ray Dataset into a Dask-on-Ray DataFrame.
    dask_df = ds.to_dask()

Transforming Datasets
---------------------

Datasets can be transformed in parallel using ``.map()``. Transformations are executed *eagerly* and block until the operation is finished. Datasets also support ``.filter()`` and ``.flat_map()``.

.. code-block:: python

    ds = ray.data.range(10000)
    ds = ds.map(lambda x: x * 2)
    # -> Map Progress: 100%|████████████████████| 200/200 [00:00<00:00, 1123.54it/s]
    # -> Dataset(num_blocks=200, num_rows=10000, schema=<class 'int'>)
    ds.take(5)
    # -> [0, 2, 4, 6, 8]

    ds.filter(lambda x: x > 5).take(5)
    # -> Map Progress: 100%|████████████████████| 200/200 [00:00<00:00, 1859.63it/s]
    # -> [6, 8, 10, 12, 14]

    ds.flat_map(lambda x: [x, -x]).take(5)
    # -> Map Progress: 100%|████████████████████| 200/200 [00:00<00:00, 1568.10it/s]
    # -> [0, 0, 2, -2, 4]

To take advantage of vectorized functions, use ``.map_batches()``. Note that you can also implement ``filter`` and ``flat_map`` using ``.map_batches()``, since your map function can return an output batch of any size.

.. code-block:: python

    ds = ray.data.range_arrow(10000)
    ds = ds.map_batches(
        lambda df: df.applymap(lambda x: x * 2), batch_format="pandas")
    # -> Map Progress: 100%|████████████████████| 200/200 [00:00<00:00, 1927.62it/s]
    ds.take(5)
    # -> [{'value': 0}, {'value': 2}, ...]
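
As a sketch of the last point, the batch functions below implement a filter and a flat map by returning batches that are smaller or larger than their input. They are written against plain Python lists so that they run without a cluster. ``apply_batches`` is a hypothetical local stand-in for ``.map_batches()``, not part of the Ray API:

.. code-block:: python

    from typing import Callable, Iterable, List


    def keep_large(batch: List[int]) -> List[int]:
        """Filter: the output batch may be smaller than the input batch."""
        return [x for x in batch if x > 5]


    def with_negation(batch: List[int]) -> List[int]:
        """Flat map: the output batch may be larger than the input batch."""
        out = []
        for x in batch:
            out.extend([x, -x])
        return out


    def apply_batches(
        items: Iterable[int],
        fn: Callable[[List[int]], List[int]],
        batch_size: int,
    ) -> List[int]:
        """Hypothetical stand-in for map_batches: feed fn fixed-size batches."""
        items = list(items)
        out = []
        for start in range(0, len(items), batch_size):
            out.extend(fn(items[start:start + batch_size]))
        return out


    data = range(0, 20, 2)  # 0, 2, 4, ..., 18
    print(apply_batches(data, keep_large, batch_size=4)[:5])     # -> [6, 8, 10, 12, 14]
    print(apply_batches(data, with_negation, batch_size=4)[:5])  # -> [0, 0, 2, -2, 4]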

By default, transformations are executed using Ray tasks. For transformations that require setup, specify ``compute="actors"`` and Ray will use an autoscaling actor pool to execute your transforms instead. The following is an end-to-end example of reading, transforming, and saving batch inference results using Datasets:

.. code-block:: python

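    import pandas as pd
    import ray

    # Example of GPU batch inference on an ImageNet model.
    # ``ImageNetModel`` is a placeholder for your own model class.
    def preprocess(image: bytes) -> bytes:
        return image

    class BatchInferModel:
        def __init__(self):
            # Runs once per actor, so the model is loaded only once.
            self.model = ImageNetModel()

        def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
            return self.model(batch)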

    ds = ray.data.read_binary_files("s3://bucket/image-dir")

    # Preprocess the data.
    ds = ds.map(preprocess)
    # -> Map Progress: 100%|████████████████████| 200/200 [00:00<00:00, 1123.54it/s]

    # Apply GPU batch inference with actors, and assign each actor a GPU using
    # ``num_gpus=1`` (any Ray remote decorator argument can be used here).
    ds = ds.map_batches(BatchInferModel, compute="actors", batch_size=256, num_gpus=1)
    # -> Map Progress (16 actors 4 pending): 100%|██████| 200/200 [00:07, 27.60it/s]

    # Save the results.
    ds.repartition(1).write_json("s3://bucket/inference-results")
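
The reason a callable class suits transformations that require setup can be seen in a small pure-Python sketch. This illustrates the pattern only: ``ExpensiveModel`` is a hypothetical placeholder, and the list comprehension stands in for a single long-lived actor processing several batches:

.. code-block:: python

    from typing import List


    class ExpensiveModel:
        """Hypothetical model whose construction is costly (e.g., loading weights)."""

        loads = 0  # counts how many times the model was constructed

        def __init__(self) -> None:
            ExpensiveModel.loads += 1

        def predict(self, batch: List[int]) -> List[int]:
            return [x % 2 for x in batch]


    class BatchPredictor:
        """Callable class: setup runs once in __init__, __call__ runs per batch."""

        def __init__(self) -> None:
            self.model = ExpensiveModel()

        def __call__(self, batch: List[int]) -> List[int]:
            return self.model.predict(batch)


    # One long-lived worker handles many batches with a single model load.
    worker = BatchPredictor()
    batches = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    results = [worker(batch) for batch in batches]
    print(results)               # -> [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    print(ExpensiveModel.loads)  # -> 1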

Exchanging Datasets
-------------------

Datasets can be passed to Ray tasks or actors and read with ``.iter_batches()`` or ``.iter_rows()``. This does not incur a copy, since the blocks of the Dataset are passed by reference as Ray objects:

.. code-block:: python

    @ray.remote
    def consume(data: ray.data.Dataset[int]) -> int:
        num_batches = 0
        for batch in data.iter_batches():
            num_batches += 1
        return num_batches

    ds = ray.data.range(10000)
    ray.get(consume.remote(ds))
    # -> 200

Datasets can be split up into disjoint sub-datasets. Locality-aware splitting is supported if you pass in a list of actor handles to the ``split()`` function along with the number of desired splits. This pattern is commonly used to load and split data between distributed training actors:

.. code-block:: python

    @ray.remote(num_gpus=1)
    class Worker:
        def __init__(self, rank: int):
            pass

        def train(self, shard: ray.data.Dataset[int]) -> int:
            for batch in shard.iter_batches(batch_size=256):
                pass
            return shard.count()

    workers = [Worker.remote(i) for i in range(16)]
    # -> [Actor(Worker, ...), Actor(Worker, ...), ...]

    ds = ray.data.range(10000)
    # -> Dataset(num_blocks=200, num_rows=10000, schema=<class 'int'>)

    shards = ds.split(n=16, locality_hints=workers)
    # -> [Dataset(num_blocks=13, num_rows=650, schema=<class 'int'>),
    #     Dataset(num_blocks=13, num_rows=650, schema=<class 'int'>), ...]

    ray.get([w.train.remote(s) for s in shards])
    # -> [650, 650, ...]
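
The shard sizes above follow from dividing 200 blocks among 16 workers: 200 is not a multiple of 16, so some shards receive 13 blocks and others 12. The sketch below assumes that blocks are dealt out as evenly as possible, with the larger shards first. That ordering is an assumption for illustration, and locality-aware splitting may assign blocks differently. ``split_blocks`` is a hypothetical helper:

.. code-block:: python

    from typing import List


    def split_blocks(num_blocks: int, n: int) -> List[int]:
        """Return the number of blocks per shard when dealing blocks out evenly."""
        base, extra = divmod(num_blocks, n)
        # The first `extra` shards take one additional block.
        return [base + 1 if i < extra else base for i in range(n)]


    rows_per_block = 10000 // 200  # 50 rows in each of the 200 blocks
    blocks_per_shard = split_blocks(num_blocks=200, n=16)
    rows_per_shard = [b * rows_per_block for b in blocks_per_shard]

    print(blocks_per_shard[:2], blocks_per_shard[-2:])  # -> [13, 13] [12, 12]
    print(rows_per_shard[:2], rows_per_shard[-2:])      # -> [650, 650] [600, 600]
    print(sum(rows_per_shard))                          # -> 10000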

Custom Datasources
------------------

Datasets can read and write in parallel to `custom datasources <package-ref.html#custom-datasource-api>`__ defined in Python.

.. code-block:: python

    # Read from a custom datasource.
    ds = ray.data.read_datasource(YourCustomDatasource(), **read_args)

    # Write to a custom datasource.
    ds.write_datasource(YourCustomDatasource(), **write_args)

Contributing
------------

Contributions to Datasets are `welcome <https://docs.ray.io/en/master/development.html#python-develop>`__! There are many potential improvements, including:

- Supporting more datasources and transforms.
- Integration with more ecosystem libraries.
- Adding features that require partitioning such as ``groupby()`` and ``join()``.
- Performance optimizations.