Tutorial
========

To use Ray, you need to understand the following:

- How Ray executes tasks asynchronously to achieve parallelism.
- How Ray uses object IDs to represent immutable remote objects.

Overview
--------

Ray is a Python-based distributed execution engine. The same code can be run on
a single machine to achieve efficient multiprocessing, and it can be used on a
cluster for large computations.

When using Ray, several processes are involved.

- Multiple **worker** processes execute tasks and store results in object
  stores. Each worker is a separate process.
- One **object store** per node stores immutable objects in shared memory and
  allows workers to efficiently share objects on the same node with minimal
  copying and deserialization.
- One **local scheduler** per node assigns tasks to workers on the same node.
- A **global scheduler** receives tasks from local schedulers and assigns them
  to other local schedulers.
- A **driver** is the Python process that the user controls. For example, if the
  user is running a script or using a Python shell, then the driver is the Python
  process that runs the script or the shell. A driver is similar to a worker in
  that it can submit tasks to its local scheduler and get objects from the object
  store, but it is different in that the local scheduler will not assign tasks to
  the driver to be executed.
- A **Redis server** maintains much of the system's state. For example, it keeps
  track of which objects live on which machines and of the task specifications
  (but not data). It can also be queried directly for debugging purposes.

Starting Ray
------------

To start Ray, start Python and run the following commands.

.. code-block:: python

    import ray
    ray.init()

This starts Ray.

Immutable remote objects
------------------------

In Ray, we can create and compute on objects. We refer to these objects as
**remote objects**, and we use **object IDs** to refer to them. Remote objects
are stored in **object stores**, and there is one object store per node in the
cluster. In the cluster setting, we may not actually know which machine each
object lives on.

An **object ID** is essentially a unique ID that can be used to refer to a
remote object. If you're familiar with Futures, our object IDs are conceptually
similar.

We assume that remote objects are immutable. That is, their values cannot be
changed after creation. This allows remote objects to be replicated in multiple
object stores without needing to synchronize the copies.

Put and Get
~~~~~~~~~~~

The commands ``ray.get`` and ``ray.put`` can be used to convert between Python
objects and object IDs, as shown in the example below.

.. code-block:: python

    x = "example"
    ray.put(x)  # ObjectID(b49a32d72057bdcfc4dda35584b3d838aad89f5d)

The command ``ray.put(x)`` would be run by a worker process or by the driver
process (the driver process is the one running your script). It takes a Python
object and copies it to the local object store (here *local* means *on the same
node*). Once the object has been stored in the object store, its value cannot be
changed.

In addition, ``ray.put(x)`` returns an object ID, which is essentially an ID that
can be used to refer to the newly created remote object. If we save the object
ID in a variable with ``x_id = ray.put(x)``, then we can pass ``x_id`` into remote
functions, and those remote functions will operate on the corresponding remote
object.

The command ``ray.get(x_id)`` takes an object ID and creates a Python object from
the corresponding remote object. For some objects like arrays, we can use shared
memory and avoid copying the object. For other objects, this copies the object
from the object store to the worker process's heap. If the remote object
corresponding to the object ID ``x_id`` does not live on the same node as the
worker that calls ``ray.get(x_id)``, then the remote object will first be
transferred from an object store that has it to the object store that needs it.

.. code-block:: python

    x_id = ray.put("example")
    ray.get(x_id)  # "example"

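Because remote objects are immutable, ``ray.get`` does not hand back a reference
into the object store for regular Python objects; as described above, it copies
the object into the calling process, so mutating the result does not affect the
stored object. A minimal sketch (the variable names are ours):

.. code-block:: python

    values_id = ray.put([1, 2, 3])

    # ray.get copies the stored list into this process's heap.
    values = ray.get(values_id)
    values.append(4)  # This mutates only the local copy.

    ray.get(values_id)  # Still [1, 2, 3]; the remote object is unchanged.
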
If the remote object corresponding to the object ID ``x_id`` has not been created
yet, the command ``ray.get(x_id)`` will wait until the remote object has been
created.

A very common use case of ``ray.get`` is to get a list of object IDs. In this
case, you can call ``ray.get(object_ids)`` where ``object_ids`` is a list of object
IDs.

.. code-block:: python

    result_ids = [ray.put(i) for i in range(10)]
    ray.get(result_ids)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Asynchronous Computation in Ray
-------------------------------

Ray enables arbitrary Python functions to be executed asynchronously. This is
done by designating a Python function as a **remote function**.

For example, a normal Python function looks like this.

.. code-block:: python

    def add1(a, b):
        return a + b

A remote function looks like this.

.. code-block:: python

    @ray.remote
    def add2(a, b):
        return a + b

Remote functions
~~~~~~~~~~~~~~~~

Whereas calling ``add1(1, 2)`` returns ``3`` and causes the Python interpreter to
block until the computation has finished, calling ``add2.remote(1, 2)``
immediately returns an object ID and creates a **task**. The task will be
scheduled by the system and executed asynchronously (potentially on a different
machine). When the task finishes executing, its return value will be stored in
the object store.

.. code-block:: python

    x_id = add2.remote(1, 2)
    ray.get(x_id)  # 3

The following simple example demonstrates how asynchronous tasks can be used
to parallelize computation.

.. code-block:: python

    import time

    def f1():
        time.sleep(1)

    @ray.remote
    def f2():
        time.sleep(1)

    # The following takes ten seconds.
    [f1() for _ in range(10)]

    # The following takes one second (assuming the system has at least ten CPUs).
    ray.get([f2.remote() for _ in range(10)])

There is a sharp distinction between *submitting a task* and *executing the
task*. When a remote function is called, the task of executing that function is
submitted to a local scheduler, and object IDs for the outputs of the task are
immediately returned. However, the task will not be executed until the system
actually schedules the task on a worker. Task execution is **not** done lazily.
The system moves the input data to the task, and the task will execute as soon
as its input dependencies are available and there are enough resources for the
computation.

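To make the distinction concrete, the following sketch (the function name is
ours, and the timings assume an idle worker is available) shows that submission
returns almost immediately, while ``ray.get`` blocks until the task has
actually run.

.. code-block:: python

    import time

    @ray.remote
    def slow_identity(x):
        time.sleep(5)
        return x

    # Submitting the task returns an object ID almost instantly.
    start = time.time()
    x_id = slow_identity.remote(1)
    print(time.time() - start)  # A small fraction of a second.

    # Getting the result blocks until the task has executed.
    start = time.time()
    ray.get(x_id)  # 1
    print(time.time() - start)  # Roughly five seconds.
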
**When a task is submitted, each argument may be passed in by value or by object
ID.** For example, these lines have the same behavior.

.. code-block:: python

    add2.remote(1, 2)
    add2.remote(1, ray.put(2))
    add2.remote(ray.put(1), ray.put(2))

Remote functions never return actual values; they always return object IDs.

When the remote function is actually executed, it operates on Python objects.
That is, if the remote function was called with any object IDs, the system will
retrieve the corresponding objects from the object store.

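In other words, by the time the function body runs, any object ID arguments
have already been replaced by the objects they refer to. A small sketch (the
function name is ours):

.. code-block:: python

    @ray.remote
    def type_name(x):
        # Inside the task, x is the actual Python list, not an object ID.
        return type(x).__name__

    ray.get(type_name.remote(ray.put([1, 2, 3])))  # 'list'
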
Note that a remote function can return multiple object IDs.

.. code-block:: python

    @ray.remote(num_return_vals=3)
    def return_multiple():
        return 1, 2, 3

    a_id, b_id, c_id = return_multiple.remote()

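Each of the returned object IDs can then be retrieved independently; for
example, ``ray.get(b_id)`` returns ``2``.
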
Expressing dependencies between tasks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Programmers can express dependencies between tasks by passing the object ID
output of one task as an argument to another task. For example, we can launch
three tasks as follows, each of which depends on the previous task.

.. code-block:: python

    @ray.remote
    def f(x):
        return x + 1

    x = f.remote(0)
    y = f.remote(x)
    z = f.remote(y)
    ray.get(z)  # 3

The second task above will not execute until the first has finished, and the
third will not execute until the second has finished. In this example, there are
no opportunities for parallelism.

The ability to compose tasks makes it easy to express interesting dependencies.
Consider the following implementation of a tree reduce.

.. code-block:: python

    import numpy as np

    @ray.remote
    def generate_data():
        return np.random.normal(size=1000)

    @ray.remote
    def aggregate_data(x, y):
        return x + y

    # Generate some random data. This launches 100 tasks that will be scheduled
    # on various nodes. The resulting data will be distributed around the
    # cluster.
    data = [generate_data.remote() for _ in range(100)]

    # Perform a tree reduce.
    while len(data) > 1:
        data.append(aggregate_data.remote(data.pop(0), data.pop(0)))

    # Fetch the result.
    ray.get(data)

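Each pass through the loop replaces the two object IDs at the front of ``data``
with a single new one, so the list shrinks by one ID per iteration and ends
with exactly one. ``ray.get(data)`` therefore returns a list containing a
single aggregated array.
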
Remote Functions Within Remote Functions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

So far, we have been calling remote functions only from the driver. But worker
processes can also call remote functions. To illustrate this, consider the
following example.

.. code-block:: python

    @ray.remote
    def sub_experiment(i, j):
        # Run the jth sub-experiment for the ith experiment.
        return i + j

    @ray.remote
    def run_experiment(i):
        sub_results = []
        # Launch tasks to perform 10 sub-experiments in parallel.
        for j in range(10):
            sub_results.append(sub_experiment.remote(i, j))
        # Return the sum of the results of the sub-experiments.
        return sum(ray.get(sub_results))

    results = [run_experiment.remote(i) for i in range(5)]
    ray.get(results)  # [45, 55, 65, 75, 85]

When the remote function ``run_experiment`` is executed on a worker, it calls the
remote function ``sub_experiment`` a number of times. This is an example of how
multiple experiments, each of which takes advantage of parallelism internally,
can all be run in parallel.