ray/doc/source/data/advanced-pipelines.rst

.. _data_pipeline_usage:

--------------------------
Advanced Pipeline Examples
--------------------------

This page covers more advanced examples for dataset pipelines.

.. _dataset-pipeline-per-epoch-shuffle:

Pre-repeat vs post-repeat transforms
====================================

Transformations prior to the call to ``.repeat()`` will be cached. However, note that the initial read will not be cached unless there is a subsequent transformation or ``.fully_executed()`` call. Transformations made to the DatasetPipeline after the repeat will always be executed once for each repetition of the Dataset.

For example, in the following pipeline, the ``map(func)`` transformation only occurs once. However, the random shuffle is applied to each repetition in the pipeline. However, if we omitted the map transformation, then the pipeline would re-read from the base data on each repetition.

.. note::
  Global per-epoch shuffling is an expensive operation that will slow down your ML
  ingest pipeline, prevents you from using a fully-streaming ML ingest pipeline, and
  can cause large increases in memory utilization and spilling to disk; only use
  global per-epoch shuffling if your model benefits from it! If your model doesn't
  benefit from global per-epoch shuffling and/or you run into performance or stability
  issues, you should try out windowed or local per-epoch shuffling.

**Code**:

.. code-block:: python

    # Create a pipeline that loops over its source dataset indefinitely.
    pipe: DatasetPipeline = ray.data \
        .read_datasource(...) \
        .map(func) \
        .repeat() \
        .random_shuffle_each_window()

    @ray.remote(num_gpus=1)
    def train_func(pipe: DatasetPipeline):
        model = MyModel()
        for batch in pipe.iter_torch_batches():
            model.fit(batch)

    # Read from the pipeline in a remote training function.
    ray.get(train_func.remote(pipe))


**Pipeline**:

.. image:: images/dataset-repeat-1.svg

.. important::

    Result caching only applies if there are *transformation* stages prior to the pipelining operation. If you ``repeat()`` or ``window()`` a Dataset right after the read call (e.g., ``ray.data.read_parquet(...).repeat()``), then the read will still be re-executed on each repetition. This optimization saves memory, at the cost of repeated reads from the datasource. To force result caching in all cases, use ``.fully_executed().repeat()``.

Changing Pipeline Structure
===========================

Sometimes, you may want to change the structure of an existing pipeline. For example, after generating a pipeline with ``ds.window(k)``, you may want to repeat that windowed pipeline ``n`` times. This can be done with ``ds.window(k).repeat(n)``. As another example, suppose you have a repeating pipeline generated with ``ds.repeat(n)``. The windowing of that pipeline can be changed with ``ds.repeat(n).rewindow(k)``. Note the subtle difference in the two examples: the former is repeating a windowed pipeline that has a base window size of ``k``, while the latter is re-windowing a pipeline of initial window size of ``ds.num_blocks()``. The latter may produce windows that span multiple copies of the same original data if ``preserve_epoch=False`` is set:

.. code-block:: python

    # Window followed by repeat.
    ray.data.from_items([0, 1, 2, 3, 4]) \
        .window(blocks_per_window=2) \
        .repeat(2) \
        .show_windows()
    # ->
    # ------ Epoch 0 ------
    # === Window 0 ===
    # 0
    # 1
    # === Window 1 ===
    # 2
    # 3
    # === Window 2 ===
    # 4
    # ------ Epoch 1 ------
    # === Window 3 ===
    # 0
    # 1
    # === Window 4 ===
    # 2
    # 3
    # === Window 5 ===
    # 4

    # Repeat followed by window. Since preserve_epoch=True, at epoch boundaries
    # windows may be smaller than the target size. If it was set to False, all
    # windows except the last would be the target size.
    ray.data.from_items([0, 1, 2, 3, 4]) \
        .repeat(2) \
        .rewindow(blocks_per_window=2, preserve_epoch=True) \
        .show_windows()
    # ->
    # ------ Epoch 0 ------
    # === Window 0 ===
    # 0
    # 1
    # === Window 1 ===
    # 2
    # 3
    # === Window 2 ===
    # 4
    # ------ Epoch 1 ------
    # === Window 3 ===
    # 0
    # 1
    # === Window 4 ===
    # 2
    # 3
    # === Window 5 ===
    # 4
[Docs] Ray Data docs target state (#21931) Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html) The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular we have - [x] A Getting Started Guide - [x] An explicit User / How-To Guide - [x] A dedicated Key Concepts page - [x] A consistent naming convention in `Ray Data` whenever is is referred to the project. This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838. 2022-01-27 22:14:36 +01:00			`.. _data_pipeline_usage:`
[SGD] Sgd v2 Dataset Integration (#17626) * wip * wip * wip * draft * disable tf autosharding * wip * wip * wip * wip * add example * wip * wip * wip * use dataset.split * add unit tests * add linear example * concatenate tensors and fix example * WIP tune example * add tensorflow example * wip * random_shuffle_each_window * fault tolerance test * GPU, examples, CI * formatting * fix * Update python/ray/util/sgd/v2/tests/test_trainer.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * wip * type hints * wip * update user guide * fix * fix immediate issues * update example * update * fix tune gpu test * fix resources for smoke test - 1 CPU for dataset tasks * update tests, docs, examples * Apply suggestions from code review Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> * address comments * add warning * fix tests * minor doc updates * update example in doc * configure tests * Update doc/source/raysgd/v2/user_guide.rst Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> * Update python/ray/data/dataset.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * fix docstring Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> 2021-10-12 14:03:10 -07:00
[docs] Editing pass over Dataset docs (#26935) 2022-07-24 19:48:29 -07:00			`--------------------------`
			`Advanced Pipeline Examples`
			`--------------------------`
Add documentation for DatasetPipeline.from_iterable (#18106) 2021-08-25 22:31:23 -07:00
[docs] Editing pass over Dataset docs (#26935) 2022-07-24 19:48:29 -07:00			`This page covers more advanced examples for dataset pipelines.`
Support iter_epochs for Datasets (#19217) 2021-10-12 11:05:00 -07:00
[SGD] Sgd v2 Dataset Integration (#17626) * wip * wip * wip * draft * disable tf autosharding * wip * wip * wip * wip * add example * wip * wip * wip * use dataset.split * add unit tests * add linear example * concatenate tensors and fix example * WIP tune example * add tensorflow example * wip * random_shuffle_each_window * fault tolerance test * GPU, examples, CI * formatting * fix * Update python/ray/util/sgd/v2/tests/test_trainer.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * wip * type hints * wip * update user guide * fix * fix immediate issues * update example * update * fix tune gpu test * fix resources for smoke test - 1 CPU for dataset tasks * update tests, docs, examples * Apply suggestions from code review Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> * address comments * add warning * fix tests * minor doc updates * update example in doc * configure tests * Update doc/source/raysgd/v2/user_guide.rst Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> * Update python/ray/data/dataset.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * fix docstring Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> 2021-10-12 14:03:10 -07:00			`.. _dataset-pipeline-per-epoch-shuffle:`

Initial implementation of Dataset pipelining and docs (#17309) 2021-07-28 21:12:01 -07:00			`Pre-repeat vs post-repeat transforms`
[Docs] Ray Data docs target state (#21931) Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html) The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular we have - [x] A Getting Started Guide - [x] An explicit User / How-To Guide - [x] A dedicated Key Concepts page - [x] A consistent naming convention in `Ray Data` whenever is is referred to the project. This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838. 2022-01-27 22:14:36 +01:00			`====================================`
Initial implementation of Dataset pipelining and docs (#17309) 2021-07-28 21:12:01 -07:00
[data] Fix pipeline pre-repeat caching, and improve the documentation (#25265) Currently the canonical way to cache a pipeline and repeat it: ds.fully_executed().repeat() crashes. Add a test, fix the docs and stats printing here. 2022-05-31 16:01:00 -07:00			Transformations prior to the call to ``.repeat()`` will be cached. However, note that the initial read will not be cached unless there is a subsequent transformation or ``.fully_executed()`` call. Transformations made to the DatasetPipeline after the repeat will always be executed once for each repetition of the Dataset.
[SGD] Sgd v2 Dataset Integration (#17626) * wip * wip * wip * draft * disable tf autosharding * wip * wip * wip * wip * add example * wip * wip * wip * use dataset.split * add unit tests * add linear example * concatenate tensors and fix example * WIP tune example * add tensorflow example * wip * random_shuffle_each_window * fault tolerance test * GPU, examples, CI * formatting * fix * Update python/ray/util/sgd/v2/tests/test_trainer.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * wip * type hints * wip * update user guide * fix * fix immediate issues * update example * update * fix tune gpu test * fix resources for smoke test - 1 CPU for dataset tasks * update tests, docs, examples * Apply suggestions from code review Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> * address comments * add warning * fix tests * minor doc updates * update example in doc * configure tests * Update doc/source/raysgd/v2/user_guide.rst Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> * Update python/ray/data/dataset.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * fix docstring Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com> Co-authored-by: matthewdeng <matt@anyscale.com> Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com> 2021-10-12 14:03:10 -07:00
[data] Fix pipeline pre-repeat caching, and improve the documentation (#25265) Currently the canonical way to cache a pipeline and repeat it: ds.fully_executed().repeat() crashes. Add a test, fix the docs and stats printing here. 2022-05-31 16:01:00 -07:00			For example, in the following pipeline, the ``map(func)`` transformation only occurs once. However, the random shuffle is applied to each repetition in the pipeline. However, if we omitted the map transformation, then the pipeline would re-read from the base data on each repetition.
Initial implementation of Dataset pipelining and docs (#17309) 2021-07-28 21:12:01 -07:00
[Datasets] Miscellaneous GA docs P0s. (#24891) This PR knocks off a few miscellaneous GA docs P0s given in our docs tracker. Namely: - Documents Datasets resource allocation model. - De-emphasizes global/windowed shuffling. - Documents lazy execution mode, and expands our execution model docs in general. 2022-05-18 16:17:48 -07:00			`.. note::`
			`Global per-epoch shuffling is an expensive operation that will slow down your ML`
			`ingest pipeline, prevents you from using a fully-streaming ML ingest pipeline, and`
			`can cause large increases in memory utilization and spilling to disk; only use`
			`global per-epoch shuffling if your model benefits from it! If your model doesn't`
			`benefit from global per-epoch shuffling and/or you run into performance or stability`
			`issues, you should try out windowed or local per-epoch shuffling.`

Initial implementation of Dataset pipelining and docs (#17309) 2021-07-28 21:12:01 -07:00			`Code:`

			`.. code-block:: python`

			`# Create a pipeline that loops over its source dataset indefinitely.`
			`pipe: DatasetPipeline = ray.data \`
			`.read_datasource(...) \`
[minor] Fix incorrect link to ray core user guide (#23316) 2022-03-17 20:58:56 -07:00			`.map(func) \`
Initial implementation of Dataset pipelining and docs (#17309) 2021-07-28 21:12:01 -07:00			`.repeat() \`
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00			`.random_shuffle_each_window()`
Initial implementation of Dataset pipelining and docs (#17309) 2021-07-28 21:12:01 -07:00
			`@ray.remote(num_gpus=1)`
			`def train_func(pipe: DatasetPipeline):`
			`model = MyModel()`
[AIR] Replace references of `to_torch` with `iter_torch_batches` (#27574) 2022-08-07 20:14:12 -07:00			`for batch in pipe.iter_torch_batches():`
Initial implementation of Dataset pipelining and docs (#17309) 2021-07-28 21:12:01 -07:00			`model.fit(batch)`

			`# Read from the pipeline in a remote training function.`
			`ray.get(train_func.remote(pipe))`


			`Pipeline:`

[Docs] Ray Data docs target state (#21931) Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html) The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular we have - [x] A Getting Started Guide - [x] An explicit User / How-To Guide - [x] A dedicated Key Concepts page - [x] A consistent naming convention in `Ray Data` whenever is is referred to the project. This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838. 2022-01-27 22:14:36 +01:00			`.. image:: images/dataset-repeat-1.svg`
Initial implementation of Dataset pipelining and docs (#17309) 2021-07-28 21:12:01 -07:00
[minor] Fix incorrect link to ray core user guide (#23316) 2022-03-17 20:58:56 -07:00			`.. important::`

[data] Fix pipeline pre-repeat caching, and improve the documentation (#25265) Currently the canonical way to cache a pipeline and repeat it: ds.fully_executed().repeat() crashes. Add a test, fix the docs and stats printing here. 2022-05-31 16:01:00 -07:00			Result caching only applies if there are transformation stages prior to the pipelining operation. If you ``repeat()`` or ``window()`` a Dataset right after the read call (e.g., ``ray.data.read_parquet(...).repeat()``), then the read will still be re-executed on each repetition. This optimization saves memory, at the cost of repeated reads from the datasource. To force result caching in all cases, use ``.fully_executed().repeat()``.
[minor] Fix incorrect link to ray core user guide (#23316) 2022-03-17 20:58:56 -07:00
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00			`Changing Pipeline Structure`
[Docs] Ray Data docs target state (#21931) Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html) The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular we have - [x] A Getting Started Guide - [x] An explicit User / How-To Guide - [x] A dedicated Key Concepts page - [x] A consistent naming convention in `Ray Data` whenever is is referred to the project. This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838. 2022-01-27 22:14:36 +01:00			`===========================`
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00
[data] Preserve epoch by default when using rewindow() (#19359) 2021-10-14 09:17:36 -07:00			Sometimes, you may want to change the structure of an existing pipeline. For example, after generating a pipeline with ``ds.window(k)``, you may want to repeat that windowed pipeline ``n`` times. This can be done with ``ds.window(k).repeat(n)``. As another example, suppose you have a repeating pipeline generated with ``ds.repeat(n)``. The windowing of that pipeline can be changed with ``ds.repeat(n).rewindow(k)``. Note the subtle difference in the two examples: the former is repeating a windowed pipeline that has a base window size of ``k``, while the latter is re-windowing a pipeline of initial window size of ``ds.num_blocks()``. The latter may produce windows that span multiple copies of the same original data if ``preserve_epoch=False`` is set:
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00
			`.. code-block:: python`

			`# Window followed by repeat.`
[data] Preserve epoch by default when using rewindow() (#19359) 2021-10-14 09:17:36 -07:00			`ray.data.from_items([0, 1, 2, 3, 4]) \`
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00			`.window(blocks_per_window=2) \`
			`.repeat(2) \`
			`.show_windows()`
			`# ->`
Support iter_epochs for Datasets (#19217) 2021-10-12 11:05:00 -07:00			`# ------ Epoch 0 ------`
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00			`# === Window 0 ===`
			`# 0`
			`# 1`
			`# === Window 1 ===`
			`# 2`
			`# 3`
			`# === Window 2 ===`
			`# 4`
Support iter_epochs for Datasets (#19217) 2021-10-12 11:05:00 -07:00			`# ------ Epoch 1 ------`
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00			`# === Window 3 ===`
			`# 0`
			`# 1`
			`# === Window 4 ===`
			`# 2`
			`# 3`
			`# === Window 5 ===`
			`# 4`

[data] Preserve epoch by default when using rewindow() (#19359) 2021-10-14 09:17:36 -07:00			`# Repeat followed by window. Since preserve_epoch=True, at epoch boundaries`
			`# windows may be smaller than the target size. If it was set to False, all`
			`# windows except the last would be the target size.`
			`ray.data.from_items([0, 1, 2, 3, 4]) \`
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00			`.repeat(2) \`
[data] Preserve epoch by default when using rewindow() (#19359) 2021-10-14 09:17:36 -07:00			`.rewindow(blocks_per_window=2, preserve_epoch=True) \`
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00			`.show_windows()`
			`# ->`
Support iter_epochs for Datasets (#19217) 2021-10-12 11:05:00 -07:00			`# ------ Epoch 0 ------`
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00			`# === Window 0 ===`
			`# 0`
			`# 1`
			`# === Window 1 ===`
			`# 2`
			`# 3`
			`# === Window 2 ===`
			`# 4`
[data] Preserve epoch by default when using rewindow() (#19359) 2021-10-14 09:17:36 -07:00			`# ------ Epoch 1 ------`
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00			`# === Window 3 ===`
[data] Preserve epoch by default when using rewindow() (#19359) 2021-10-14 09:17:36 -07:00			`# 0`
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00			`# 1`
			`# === Window 4 ===`
[data] Preserve epoch by default when using rewindow() (#19359) 2021-10-14 09:17:36 -07:00			`# 2`
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00			`# 3`
[data] Preserve epoch by default when using rewindow() (#19359) 2021-10-14 09:17:36 -07:00			`# === Window 5 ===`
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00			`# 4`
[Docs] Ray Data docs target state (#21931) Preview: [docs](https://ray--21931.org.readthedocs.build/en/21931/data/dataset.html) The Ray Data project's docs now have a clearer structure and have partly been rewritten/modified. In particular we have - [x] A Getting Started Guide - [x] An explicit User / How-To Guide - [x] A dedicated Key Concepts page - [x] A consistent naming convention in `Ray Data` whenever is is referred to the project. This surfaces quite clearly that, apart from the "Getting Started" sections, we really only have one real example. Once we have more, we can create an "Example" section like many other sub-projects have. This will be addressed in https://github.com/ray-project/ray/issues/21838. 2022-01-27 22:14:36 +01:00