ray/doc/source/example-streaming.rst
Philipp Moritz 2c0d5544ac Add streaming MapReduce example (#1251)
Add streaming MapReduce example.
2017-11-27 21:38:35 -08:00

151 lines
3.7 KiB
ReStructuredText

Streaming MapReduce
===================
This document walks through how to implement a simple streaming application
using Ray's actor capabilities. It implements a streaming MapReduce which
computes word counts on wikipedia articles.
You can view the `code for this example`_.
.. _`code for this example`: https://github.com/ray-project/ray/tree/master/examples/streaming
To run the example, you need to install the dependencies
.. code-block:: bash
pip install wikipedia
and then execute the script as follows:
.. code-block:: bash
python ray/examples/streaming/streaming.py
For each round of articles read, the script will output
the top 10 words in these articles together with their word count:
.. code-block:: text
article index = 0
the 2866
of 1688
and 1448
in 1101
to 593
a 553
is 509
as 325
are 284
by 261
article index = 1
the 3597
of 1971
and 1735
in 1429
to 670
a 623
is 578
as 401
by 293
for 285
article index = 2
the 3910
of 2123
and 1890
in 1468
to 658
a 653
is 488
as 364
by 362
for 297
article index = 3
the 2962
of 1667
and 1472
in 1220
a 546
to 538
is 516
as 307
by 253
for 243
article index = 4
the 3523
of 1866
and 1690
in 1475
to 645
a 583
is 572
as 352
by 318
for 306
...
Note that this examples uses `distributed actor handles`_, which are still
considered experimental.
.. _`distributed actor handles`: http://ray.readthedocs.io/en/latest/actors.html
There is a ``Mapper`` actor, which has a method ``get_range`` used to retrieve
word counts for words in a certain range:
.. code-block:: python
@ray.remote
class Mapper(object):
def __init__(self, title_stream):
# Constructor, the title stream parameter is a stream of wikipedia
# article titles that will be read by this mapper
def get_range(self, article_index, keys):
# Return counts of all the words with first
# letter between keys[0] and keys[1] in the
# articles that haven't been read yet with index
# up to article_index
The ``Reducer`` actor holds a list of mappers, calls ``get_range`` on them
and accumulates the results.
.. code-block:: python
@ray.remote
class Reducer(object):
def __init__(self, keys, *mappers):
# Constructor for a reducer that gets input from the list of mappers
# in the argument and accumulates word counts for words with first
# letter between keys[0] and keys[1]
def next_reduce_result(self, article_index):
# Get articles up to article_index that haven't been read yet,
# accumulate the word counts and return them
On the driver, we then create a number of mappers and reducers and run the
streaming MapReduce:
.. code-block:: python
streams = # Create list of num_mappers streams
keys = # Partition the keys among the reducers.
# Create a number of mappers.
mappers = [Mapper.remote(stream) for stream in streams]
# Create a number of reduces, each responsible for a different range of keys.
# This gives each Reducer actor a handle to each Mapper actor.
reducers = [Reducer.remote(key, *mappers) for key in keys]
article_index = 0
while True:
counts = ray.get([reducer.next_reduce_result.remote(article_index)
for reducer in reducers])
article_index += 1
The actual example reads a list of articles and creates a stream object which
produces an infinite stream of articles from the list. This is a toy example
meant to illustrate the idea. In practice we would produce a stream of
non-repeating items for each mapper.