Parallel Iterators
=====================

.. _`issue on GitHub`: https://github.com/ray-project/ray/issues

``ray.util.iter`` provides a parallel iterator API for simple data ingest and
processing. It can be thought of as syntactic sugar around Ray actors and
``ray.wait`` loops.

Parallel iterators are lazy and can operate over infinite sequences of items.
Iterator transformations are only executed when the user calls ``next()`` to
fetch the next output item from the iterator.

.. note::

  This API is new and may be revised in future Ray releases. If you encounter
  any bugs, please file an `issue on GitHub`_.

Concepts
--------

**Parallel Iterators**: You can create a ``ParallelIterator`` object from an
existing set of items, a range of numbers, a set of iterators, or a set of
worker actors. Ray will create a worker actor that produces the data for each
shard of the iterator:

.. code-block:: python

    # Create an iterator with 2 worker actors over the list [1, 2, 3, 4].
    >>> it = ray.util.iter.from_items([1, 2, 3, 4], num_shards=2)
    ParallelIterator[from_items[int, 4, shards=2]]

    # Create an iterator with 32 worker actors over range(1000000).
    >>> it = ray.util.iter.from_range(1000000, num_shards=32)
    ParallelIterator[from_range[1000000, shards=32]]

    # Create an iterator over two range(10) generators.
    >>> it = ray.util.iter.from_iterators([range(10), range(10)])
    ParallelIterator[from_iterators[shards=2]]

    # Create an iterator from existing worker actors. These actors must
    # implement the ParallelIteratorWorker interface.
    >>> it = ray.util.iter.from_actors([a1, a2, a3, a4])
    ParallelIterator[from_actors[shards=4]]
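
For ``from_actors``, any Ray actor can serve as a shard as long as it
subclasses ``ParallelIteratorWorker``. A minimal sketch of such a worker
(the class name and the data passed in are placeholders, not part of the API):

.. code-block:: python

    from ray.util.iter import ParallelIteratorWorker

    @ray.remote
    class IterableWorker(ParallelIteratorWorker):
        # Each actor serves the items it was constructed with as one shard.
        def __init__(self, data):
            ParallelIteratorWorker.__init__(self, data, repeat=False)

    # Four actors -> an iterator with four shards.
    actors = [IterableWorker.remote(range(10)) for _ in range(4)]
    it = ray.util.iter.from_actors(actors)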

Simple transformations can be chained on the iterator, such as mapping,
filtering, and batching. These will be executed in parallel on the workers:

.. code-block:: python

    # Apply a transformation to each element of the iterator.
    >>> it = it.for_each(lambda x: x ** 2)
    ParallelIterator[...].for_each()

    # Batch together items into lists of 32 elements.
    >>> it = it.batch(32)
    ParallelIterator[...].for_each().batch(32)

    # Filter out items with odd values.
    >>> it = it.filter(lambda x: x % 2 == 0)
    ParallelIterator[...].for_each().batch(32).filter()

**Local Iterators**: To read elements from a parallel iterator, it has to be
converted to a ``LocalIterator`` by calling ``gather_sync()`` or
``gather_async()``. These correspond to ``ray.get`` and ``ray.wait`` loops
over the actors, respectively (a conceptual sketch of both loops follows the
example below):

.. code-block:: python

    # Gather items synchronously (deterministic round robin across shards):
    >>> it = ray.util.iter.from_range(1000000, 1)
    >>> it = it.gather_sync()
    LocalIterator[ParallelIterator[from_range[1000000, shards=1]].gather_sync()]

    # Local iterators can be used like any other Python iterator.
    >>> it.take(5)
    [0, 1, 2, 3, 4]

    # They also support chaining of transformations. Unlike transformations
    # applied on a ParallelIterator, they will be executed in the current
    # process.
    >>> it.filter(lambda x: x % 2 == 0).take(5)
    [0, 2, 4, 6, 8]

    # Async gather can be used for better performance, but it is
    # non-deterministic.
    >>> it = ray.util.iter.from_range(1000, 4).gather_async()
    >>> it.take(5)
    [0, 250, 500, 750, 1]
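
As a mental model, the two gather modes roughly correspond to the loops below.
This is a simplified conceptual sketch, not the actual implementation;
``actors`` and the ``next_item()`` method are hypothetical stand-ins for the
internal shard protocol:

.. code-block:: python

    def gather_sync_loop(actors):
        # Deterministic round robin: block on each shard in turn via ray.get.
        while True:
            for actor in actors:
                yield ray.get(actor.next_item.remote())

    def gather_async_loop(actors):
        # Non-deterministic: ray.wait yields whichever shard finishes first.
        pending = {actor.next_item.remote(): actor for actor in actors}
        while pending:
            [ready], _ = ray.wait(list(pending), num_returns=1)
            actor = pending.pop(ready)
            yield ray.get(ready)
            pending[actor.next_item.remote()] = actor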

**Passing iterators to remote functions**: Both ``ParallelIterator`` and
``LocalIterator`` are serializable. They can be passed to any Ray remote
function. However, note that each shard should only be read by one process at
a time:

.. code-block:: python

    # Get local iterators representing the shards of this ParallelIterator:
    >>> it = ray.util.iter.from_range(10000, 3)
    >>> [s0, s1, s2] = it.shards()
    [LocalIterator[from_range[10000, shards=3].shard[0]],
     LocalIterator[from_range[10000, shards=3].shard[1]],
     LocalIterator[from_range[10000, shards=3].shard[2]]]

    # Iterator shards can be passed to remote functions.
    >>> @ray.remote
    ... def do_sum(it):
    ...     return sum(it)
    ...
    >>> ray.get([do_sum.remote(s) for s in it.shards()])
    [5552778, 16661667, 27780555]

Semantic Guarantees
~~~~~~~~~~~~~~~~~~~

The parallel iterator API guarantees the following semantics:

**Fetch ordering**: When using ``it.gather_sync().for_each(fn)`` or
``it.gather_async().for_each(fn)`` (or any other transformation after a
gather), ``fn(x_i)`` will be called on the element ``x_i`` before the next
element ``x_{i+1}`` is fetched from the source actor. This is useful if you
need to update the source actor between iterator steps, as the sketch below
illustrates. Note that for async gather, this ordering only applies per shard.
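
The following minimal sketch relies on this guarantee (the ``Source`` actor
and its ``set_multiplier`` method are made up for illustration): an update
issued between two fetches is visible to the next element.

.. code-block:: python

    from ray.util.iter import ParallelIteratorWorker

    @ray.remote
    class Source(ParallelIteratorWorker):
        def __init__(self):
            self.multiplier = 1

            def gen():
                # Yield the current multiplier forever.
                while True:
                    yield self.multiplier

            ParallelIteratorWorker.__init__(self, gen(), repeat=False)

        def set_multiplier(self, m):
            self.multiplier = m

    source = Source.remote()
    it = ray.util.iter.from_actors([source]).gather_sync()

    print(it.take(1))  # -> [1]
    # The next element has not been fetched yet, so this update is
    # guaranteed to be visible to the following item.
    ray.get(source.set_multiplier.remote(10))
    print(it.take(1))  # -> [10]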

**Operator state**: Operator state is preserved for each shard. This means
that you can pass a stateful callable to ``.for_each()``; each shard applies
its own copy of the callable:

.. code-block:: python

    class CumulativeSum:
        def __init__(self):
            self.total = 0

        def __call__(self, x):
            self.total += x
            return (self.total, x)

    it = ray.util.iter.from_range(5, 1)
    for x in it.for_each(CumulativeSum()).gather_sync():
        print(x)

    # This prints:
    # (0, 0)
    # (1, 1)
    # (3, 2)
    # (6, 3)
    # (10, 4)

Example: Streaming word frequency count
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Parallel iterators can be used for simple data processing use cases such as
streaming grep:

.. code-block:: python

    import ray
    import glob
    import gzip
    import numpy as np

    ray.init()

    file_list = glob.glob("/var/log/syslog*.gz")
    it = (
        ray.util.iter.from_items(file_list, num_shards=4)
        .for_each(lambda f: gzip.open(f).readlines())
        .flatten()
        .for_each(lambda line: line.decode("utf-8"))
        .for_each(lambda line: 1 if "cron" in line else 0)
        .batch(1024)
        .for_each(np.mean)
    )

    # Show the fraction of log lines containing "cron", computed over
    # consecutive batches of 1024 lines.
    for freq in it.gather_async():
        print(freq)

Example: Passing iterator shards to remote functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both parallel iterators and local iterators are fully serializable, so once
created you can pass them to Ray tasks and actors. This can be useful for
distributed training:

.. code-block:: python

    import ray
    import numpy as np

    ray.init()

    @ray.remote
    def train(data_shard):
        for batch in data_shard:
            print("train on", batch)  # perform model update with batch

    it = (
        ray.util.iter.from_range(1000000, num_shards=4, repeat=True)
        .batch(1024)
        .for_each(np.array)
    )

    # Note: repeat=True makes each shard infinite, so these training
    # tasks run until they are cancelled or the driver exits.
    work = [train.remote(shard) for shard in it.shards()]
    ray.get(work)

.. tip:: Using ``ParallelIterator`` built-in functions is typically most
   efficient. For example, if you find yourself using list comprehensions
   like ``[foo(x) for x in iter.gather_async()]``, consider using
   ``iter.for_each(foo)`` instead!
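
To see why, compare the two forms below. This is a minimal sketch, assuming
``foo`` is any serializable user-defined function and ``it`` is a
``ParallelIterator``:

.. code-block:: python

    # Serial: every raw item is shipped to the driver, and foo runs in
    # the driver process, one element at a time.
    results = [foo(x) for x in it.gather_async()]

    # Parallel: foo runs on the shard workers; the driver only collects
    # the already-transformed results.
    results = list(it.for_each(foo).gather_async())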

API Reference
-------------

.. automodule:: ray.util.iter
    :members:
    :show-inheritance: