Parallel Iterators
=====================

.. _`issue on GitHub`: https://github.com/ray-project/ray/issues

``ray.util.iter`` provides a parallel iterator API for simple data ingest and
processing. It can be thought of as syntactic sugar around Ray actors and
``ray.wait`` loops.

Parallel iterators are lazy and can operate over infinite sequences of items.
Iterator transformations are only executed when the user calls ``next()`` to
fetch the next output item from the iterator.

.. note::

  This API is new and may be revised in future Ray releases. If you encounter
  any bugs, please file an `issue on GitHub`_.

Concepts
--------

**Parallel Iterators**: You can create a ``ParallelIterator`` object from an
existing set of items, a range of numbers, a set of iterators, or a set of
worker actors. Ray will create a worker actor that produces the data for each
shard of the iterator:

.. code-block:: python

    # Create an iterator with 2 worker actors over the list [1, 2, 3, 4].
    >>> it = ray.util.iter.from_items([1, 2, 3, 4], num_shards=2)
    ParallelIterator[from_items[int, 4, shards=2]]

    # Create an iterator with 32 worker actors over range(1000000).
    >>> it = ray.util.iter.from_range(1000000, num_shards=32)
    ParallelIterator[from_range[1000000, shards=32]]

    # Create an iterator over two range(10) generators.
    >>> it = ray.util.iter.from_iterators([range(10), range(10)])
    ParallelIterator[from_iterators[shards=2]]

    # Create an iterator from existing worker actors. These actors must
    # implement the ParallelIteratorWorker interface.
    >>> it = ray.util.iter.from_actors([a1, a2, a3, a4])
    ParallelIterator[from_actors[shards=4]]
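
For ``from_actors``, any Ray actor can serve as a shard as long as it
subclasses ``ParallelIteratorWorker``. A minimal sketch of such a worker
(the class name and the data passed in are placeholders, not part of the API):

.. code-block:: python

    from ray.util.iter import ParallelIteratorWorker

    @ray.remote
    class IterableWorker(ParallelIteratorWorker):
        # Each actor serves the items it was constructed with as one shard.
        def __init__(self, data):
            ParallelIteratorWorker.__init__(self, data, repeat=False)

    # Four actors -> an iterator with four shards.
    actors = [IterableWorker.remote(range(10)) for _ in range(4)]
    it = ray.util.iter.from_actors(actors)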

Simple transformations can be chained on the iterator, such as mapping,
filtering, and batching. These will be executed in parallel on the workers:

.. code-block:: python

    # Apply a transformation to each element of the iterator.
    >>> it = it.for_each(lambda x: x ** 2)
    ParallelIterator[...].for_each()

    # Batch together items into lists of 32 elements.
    >>> it = it.batch(32)
    ParallelIterator[...].for_each().batch(32)

    # Filter out items with odd values.
    >>> it = it.filter(lambda x: x % 2 == 0)
    ParallelIterator[...].for_each().batch(32).filter()

**Local Iterators**: To read elements from a parallel iterator, it has to be
converted to a ``LocalIterator`` by calling ``gather_sync()`` or
``gather_async()``. These correspond to ``ray.get`` and ``ray.wait`` loops
over the actors, respectively (a conceptual sketch of both loops follows the
example below):

.. code-block:: python

    # Gather items synchronously (deterministic round robin across shards):
    >>> it = ray.util.iter.from_range(1000000, 1)
    >>> it = it.gather_sync()
    LocalIterator[ParallelIterator[from_range[1000000, shards=1]].gather_sync()]

    # Local iterators can be used like any other Python iterator.
    >>> it.take(5)
    [0, 1, 2, 3, 4]

    # They also support chaining of transformations. Unlike transformations
    # applied on a ParallelIterator, they will be executed in the current
    # process.
    >>> it.filter(lambda x: x % 2 == 0).take(5)
    [0, 2, 4, 6, 8]

    # Async gather can be used for better performance, but it is
    # non-deterministic.
    >>> it = ray.util.iter.from_range(1000, 4).gather_async()
    >>> it.take(5)
    [0, 250, 500, 750, 1]
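
As a mental model, the two gather modes roughly correspond to the loops below.
This is a simplified conceptual sketch, not the actual implementation;
``actors`` and the ``next_item()`` method are hypothetical stand-ins for the
internal shard protocol:

.. code-block:: python

    def gather_sync_loop(actors):
        # Deterministic round robin: block on each shard in turn via ray.get.
        while True:
            for actor in actors:
                yield ray.get(actor.next_item.remote())

    def gather_async_loop(actors):
        # Non-deterministic: ray.wait yields whichever shard finishes first.
        pending = {actor.next_item.remote(): actor for actor in actors}
        while pending:
            [ready], _ = ray.wait(list(pending), num_returns=1)
            actor = pending.pop(ready)
            yield ray.get(ready)
            pending[actor.next_item.remote()] = actor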

**Passing iterators to remote functions**: Both ``ParallelIterator`` and
``LocalIterator`` are serializable. They can be passed to any Ray remote
function. However, note that each shard should only be read by one process at
a time:

.. code-block:: python

    # Get local iterators representing the shards of this ParallelIterator:
    >>> it = ray.util.iter.from_range(10000, 3)
    >>> [s0, s1, s2] = it.shards()
    [LocalIterator[from_range[10000, shards=3].shard[0]],
     LocalIterator[from_range[10000, shards=3].shard[1]],
     LocalIterator[from_range[10000, shards=3].shard[2]]]

    # Iterator shards can be passed to remote functions.
    >>> @ray.remote
    ... def do_sum(it):
    ...     return sum(it)
    ...
    >>> ray.get([do_sum.remote(s) for s in it.shards()])
    [5552778, 16661667, 27780555]

Semantic Guarantees
~~~~~~~~~~~~~~~~~~~

The parallel iterator API guarantees the following semantics:

**Fetch ordering**: When using ``it.gather_sync().for_each(fn)`` or
``it.gather_async().for_each(fn)`` (or any other transformation after a
gather), ``fn(x_i)`` will be called on the element ``x_i`` before the next
element ``x_{i+1}`` is fetched from the source actor. This is useful if you
need to update the source actor between iterator steps, as the sketch below
illustrates. Note that for async gather, this ordering only applies per shard.
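
The following minimal sketch relies on this guarantee (the ``Source`` actor
and its ``set_multiplier`` method are made up for illustration): an update
issued between two fetches is visible to the next element.

.. code-block:: python

    from ray.util.iter import ParallelIteratorWorker

    @ray.remote
    class Source(ParallelIteratorWorker):
        def __init__(self):
            self.multiplier = 1

            def gen():
                # Yield the current multiplier forever.
                while True:
                    yield self.multiplier

            ParallelIteratorWorker.__init__(self, gen(), repeat=False)

        def set_multiplier(self, m):
            self.multiplier = m

    source = Source.remote()
    it = ray.util.iter.from_actors([source]).gather_sync()

    print(it.take(1))  # -> [1]
    # The next element has not been fetched yet, so this update is
    # guaranteed to be visible to the following item.
    ray.get(source.set_multiplier.remote(10))
    print(it.take(1))  # -> [10]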

**Operator state**: Operator state is preserved for each shard. This means
that you can pass a stateful callable to ``.for_each()``; each shard applies
its own copy of the callable:

.. code-block:: python

    class CumulativeSum:
        def __init__(self):
            self.total = 0

        def __call__(self, x):
            self.total += x
            return (self.total, x)

    it = ray.util.iter.from_range(5, 1)
    for x in it.for_each(CumulativeSum()).gather_sync():
        print(x)

    # This prints:
    # (0, 0)
    # (1, 1)
    # (3, 2)
    # (6, 3)
    # (10, 4)

Example: Streaming word frequency count
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Parallel iterators can be used for simple data processing use cases such as
streaming grep:

.. code-block:: python

    import ray
    import glob
    import gzip
    import numpy as np

    ray.init()

    file_list = glob.glob("/var/log/syslog*.gz")
    it = (
        ray.util.iter.from_items(file_list, num_shards=4)
        .for_each(lambda f: gzip.open(f).readlines())
        .flatten()
        .for_each(lambda line: line.decode("utf-8"))
        .for_each(lambda line: 1 if "cron" in line else 0)
        .batch(1024)
        .for_each(np.mean)
    )

    # Show the fraction of log lines containing "cron", computed over
    # consecutive batches of 1024 lines.
    for freq in it.gather_async():
        print(freq)

Example: Passing iterator shards to remote functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both parallel iterators and local iterators are fully serializable, so once
created you can pass them to Ray tasks and actors. This can be useful for
distributed training:

.. code-block:: python

    import ray
    import numpy as np

    ray.init()

    @ray.remote
    def train(data_shard):
        for batch in data_shard:
            print("train on", batch)  # perform model update with batch

    it = (
        ray.util.iter.from_range(1000000, num_shards=4, repeat=True)
        .batch(1024)
        .for_each(np.array)
    )

    # Note: repeat=True makes each shard infinite, so these training
    # tasks run until they are cancelled or the driver exits.
    work = [train.remote(shard) for shard in it.shards()]
    ray.get(work)

.. tip:: Using ``ParallelIterator`` built-in functions is typically most
   efficient. For example, if you find yourself using list comprehensions
   like ``[foo(x) for x in iter.gather_async()]``, consider using
   ``iter.for_each(foo)`` instead!
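
To see why, compare the two forms below. This is a minimal sketch, assuming
``foo`` is any serializable user-defined function and ``it`` is a
``ParallelIterator``:

.. code-block:: python

    # Serial: every raw item is shipped to the driver, and foo runs in
    # the driver process, one element at a time.
    results = [foo(x) for x in it.gather_async()]

    # Parallel: foo runs on the shard workers; the driver only collects
    # the already-transformed results.
    results = list(it.for_each(foo).gather_async())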

API Reference
-------------

.. automodule:: ray.util.iter
    :members:
    :show-inheritance: