mirror of
https://github.com/vale981/ray
synced 2025-03-06 10:31:39 -05:00
568 lines
20 KiB
ReStructuredText
568 lines
20 KiB
ReStructuredText
Profiling for Ray Users
|
|
=======================
|
|
|
|
This document is intended for users of Ray who want to know how to evaluate
|
|
the performance of their code while running on Ray. Profiling the
|
|
performance of your code can be very helpful to determine performance
|
|
bottlenecks or to find out where your code may not be parallelized properly.
|
|
If you are interested in pinpointing why your Ray application may not be
|
|
achieving the expected speedup, read on!
|
|
|
|
|
|
Visualizing Tasks in the Ray Timeline
|
|
-------------------------------------
|
|
|
|
The most important tool is the timeline visualization tool. To visualize tasks
|
|
in the Ray timeline, you can dump the timeline as a JSON file using the
|
|
following command.
|
|
|
|
.. code-block:: python
|
|
|
|
ray.timeline(filename="/tmp/timeline.json")
|
|
|
|
Then open `chrome://tracing`_ in the Chrome web browser, and load
|
|
``timeline.json``.
|
|
|
|
.. _`chrome://tracing`: chrome://tracing
|
|
|
|
|
|
A Basic Example to Profile
|
|
--------------------------
|
|
|
|
Let's try to profile a simple example, and compare how different ways to
|
|
write a simple loop can affect performance.
|
|
|
|
As a proxy for a computationally intensive and possibly slower function,
|
|
let's define our remote function to just sleep for 0.5 seconds:
|
|
|
|
.. code-block:: python
|
|
|
|
import ray
|
|
import time
|
|
|
|
# Our time-consuming remote function
|
|
@ray.remote
|
|
def func():
|
|
time.sleep(0.5)
|
|
|
|
In our example setup, we wish to call our remote function ``func()`` five
|
|
times, and store the result of each call into a list. To compare the
|
|
performance of different ways of looping our calls to our remote function,
|
|
we can define each loop version as a separate function on the driver script.
|
|
|
|
For the first version **ex1**, each iteration of the loop calls the remote
|
|
function, then calls ``ray.get`` in an attempt to store the current result
|
|
into the list, as follows:
|
|
|
|
.. code-block:: python
|
|
|
|
# This loop is suboptimal in Ray, and should only be used for the sake of this example
|
|
def ex1():
|
|
list1 = []
|
|
for i in range(5):
|
|
list1.append(ray.get(func.remote()))
|
|
|
|
For the second version **ex2**, each iteration of the loop calls the remote
|
|
function, and stores it into the list **without** calling ``ray.get`` each time.
|
|
``ray.get`` is used after the loop has finished, in preparation for processing
|
|
``func()``'s results:
|
|
|
|
.. code-block:: python
|
|
|
|
# This loop is more proper in Ray
|
|
def ex2():
|
|
list2 = []
|
|
for i in range(5):
|
|
list2.append(func.remote())
|
|
ray.get(list2)
|
|
|
|
Finally, for an example that's not so parallelizable, let's create a
|
|
third version **ex3** where the driver has to call a local
|
|
function in between each call to the remote function ``func()``:
|
|
|
|
.. code-block:: python
|
|
|
|
# A local function executed on the driver, not on Ray
|
|
def other_func():
|
|
time.sleep(0.3)
|
|
|
|
def ex3():
|
|
list3 = []
|
|
for i in range(5):
|
|
other_func()
|
|
list3.append(func.remote())
|
|
ray.get(list3)
|
|
|
|
|
|
Timing Performance Using Python's Timestamps
|
|
--------------------------------------------
|
|
|
|
One way to sanity-check the performance of the three loops is simply to
|
|
time how long it takes to complete each loop version. We can do this using
|
|
python's built-in ``time`` `module`_.
|
|
|
|
.. _`module`: https://docs.python.org/3/library/time.html
|
|
|
|
The ``time`` module contains a useful ``time()`` function that returns the
|
|
current timestamp in unix time whenever it's called. We can create a generic
|
|
function wrapper to call ``time()`` right before and right after each loop
|
|
function to print out how long each loop takes overall:
|
|
|
|
.. code-block:: python
|
|
|
|
# This is a generic wrapper for any driver function you want to time
|
|
def time_this(f):
|
|
def timed_wrapper(*args, **kw):
|
|
start_time = time.time()
|
|
result = f(*args, **kw)
|
|
end_time = time.time()
|
|
|
|
# Time taken = end_time - start_time
|
|
print('| func:%r args:[%r, %r] took: %2.4f seconds |' % \
|
|
(f.__name__, args, kw, end_time - start_time))
|
|
return result
|
|
return timed_wrapper
|
|
|
|
To always print out how long the loop takes to run each time the loop
|
|
function ``ex1()`` is called, we can evoke our ``time_this`` wrapper with
|
|
a function decorator. This can similarly be done to functions ``ex2()``
|
|
and ``ex3()``:
|
|
|
|
.. code-block:: python
|
|
|
|
@time_this # Added decorator
|
|
def ex1():
|
|
list1 = []
|
|
for i in range(5):
|
|
list1.append(ray.get(func.remote()))
|
|
|
|
def main():
|
|
ray.init()
|
|
ex1()
|
|
ex2()
|
|
ex3()
|
|
|
|
if __name__ == "__main__":
|
|
main()
|
|
|
|
Then, running the three timed loops should yield output similar to this:
|
|
|
|
.. code-block:: bash
|
|
|
|
| func:'ex1' args:[(), {}] took: 2.5083 seconds |
|
|
| func:'ex2' args:[(), {}] took: 1.0032 seconds |
|
|
| func:'ex3' args:[(), {}] took: 2.0039 seconds |
|
|
|
|
Let's interpret these results.
|
|
|
|
Here, ``ex1()`` took substantially more time than ``ex2()``, where
|
|
their only difference is that ``ex1()`` calls ``ray.get`` on the remote
|
|
function before adding it to the list, while ``ex2()`` waits to fetch the
|
|
entire list with ``ray.get`` at once.
|
|
|
|
.. code-block:: python
|
|
|
|
@ray.remote
|
|
def func(): # A single call takes 0.5 seconds
|
|
time.sleep(0.5)
|
|
|
|
def ex1(): # Took Ray 2.5 seconds
|
|
list1 = []
|
|
for i in range(5):
|
|
list1.append(ray.get(func.remote()))
|
|
|
|
def ex2(): # Took Ray 1 second
|
|
list2 = []
|
|
for i in range(5):
|
|
list2.append(func.remote())
|
|
ray.get(list2)
|
|
|
|
Notice how ``ex1()`` took 2.5 seconds, exactly five times 0.5 seconds, or
|
|
the time it would take to wait for our remote function five times in a row.
|
|
|
|
By calling ``ray.get`` after each call to the remote function, ``ex1()``
|
|
removes all ability to parallelize work, by forcing the driver to wait for
|
|
each ``func()``'s result in succession. We are not taking advantage of Ray
|
|
parallelization here!
|
|
|
|
Meanwhile, ``ex2()`` takes about 1 second, much faster than it would normally
|
|
take to call ``func()`` five times iteratively. Ray is running each call to
|
|
``func()`` in parallel, saving us time.
|
|
|
|
``ex1()`` is actually a common user mistake in Ray. ``ray.get`` is not
|
|
necessary to do before adding the result of ``func()`` to the list. Instead,
|
|
the driver should send out all parallelizable calls to the remote function
|
|
to Ray before waiting to receive their results with ``ray.get``. ``ex1()``'s
|
|
suboptimal behavior can be noticed just using this simple timing test.
|
|
|
|
Realistically, however, many applications are not as highly parallelizable
|
|
as ``ex2()``, and the application includes sections where the code must run in
|
|
serial. ``ex3()`` is such an example, where the local function ``other_func()``
|
|
must run first before each call to ``func()`` can be submitted to Ray.
|
|
|
|
.. code-block:: python
|
|
|
|
# A local function that must run in serial
|
|
def other_func():
|
|
time.sleep(0.3)
|
|
|
|
def ex3(): # Took Ray 2 seconds, vs. ex1 taking 2.5 seconds
|
|
list3 = []
|
|
for i in range(5):
|
|
other_func()
|
|
list2.append(func.remote())
|
|
ray.get(list3)
|
|
|
|
What results is that while ``ex3()`` still gained 0.5 seconds of speedup
|
|
compared to the completely serialized ``ex1()`` version, this speedup is
|
|
still nowhere near the ideal speedup of ``ex2()``.
|
|
|
|
The dramatic speedup of ``ex2()`` is possible because ``ex2()`` is
|
|
theoretically completely parallelizable: if we were given 5 CPUs, all 5 calls
|
|
to ``func()`` can be run in parallel. What is happening with ``ex3()``,
|
|
however, is that each parallelized call to ``func()`` is staggered by a wait
|
|
of 0.3 seconds for the local ``other_func()`` to finish.
|
|
|
|
``ex3()`` is thus a manifestation of `Amdahls Law`_: the fastest theoretically
|
|
possible execution time from parallelizing an application is limited to be
|
|
no better than the time it takes to run all serial parts in serial.
|
|
|
|
.. _`Amdahls Law`: https://en.wikipedia.org/wiki/Amdahl%27s_law
|
|
|
|
Due to Amdahl's Law, ``ex3()`` must take at least 1.5
|
|
seconds -- the time it takes for 5 serial calls to ``other_func()`` to finish!
|
|
After an additional 0.5 seconds to execute func and get the result, the
|
|
computation is done.
|
|
|
|
|
|
Profiling Using An External Profiler (Line Profiler)
|
|
----------------------------------------------------
|
|
|
|
One way to profile the performance of our code using Ray is to use a third-party
|
|
profiler such as `Line_profiler`_. Line_profiler is a useful line-by-line
|
|
profiler for pure Python applications that formats its output side-by-side with
|
|
the profiled code itself.
|
|
|
|
Alternatively, another third-party profiler (not covered in this documentation)
|
|
that you could use is `Pyflame`_, which can generate profiling graphs.
|
|
|
|
.. _`Line_profiler`: https://github.com/rkern/line_profiler
|
|
.. _`Pyflame`: https://github.com/uber/pyflame
|
|
|
|
First install ``line_profiler`` with pip:
|
|
|
|
.. code-block:: bash
|
|
|
|
pip install line_profiler
|
|
|
|
``line_profiler`` requires each section of driver code that you want to profile as
|
|
its own independent function. Conveniently, we have already done so by defining
|
|
each loop version as its own function. To tell ``line_profiler`` which functions
|
|
to profile, just add the ``@profile`` decorator to ``ex1()``, ``ex2()`` and
|
|
``ex3()``. Note that you do not need to import ``line_profiler`` into your Ray
|
|
application:
|
|
|
|
.. code-block:: python
|
|
|
|
@profile # Added decorator
|
|
def ex1():
|
|
list1 = []
|
|
for i in range(5):
|
|
list1.append(ray.get(func.remote()))
|
|
|
|
def main():
|
|
ray.init()
|
|
ex1()
|
|
ex2()
|
|
ex3()
|
|
|
|
if __name__ == "__main__":
|
|
main()
|
|
|
|
Then, when we want to execute our Python script from the command line, instead
|
|
of ``python your_script_here.py``, we use the following shell command to run the
|
|
script with ``line_profiler`` enabled:
|
|
|
|
.. code-block:: bash
|
|
|
|
kernprof -l your_script_here.py
|
|
|
|
This command runs your script and prints only your script's output as usual.
|
|
``Line_profiler`` instead outputs its profiling results to a corresponding
|
|
binary file called ``your_script_here.py.lprof``.
|
|
|
|
To read ``line_profiler``'s results to terminal, use this shell command:
|
|
|
|
.. code-block:: bash
|
|
|
|
python -m line_profiler your_script_here.py.lprof
|
|
|
|
In our loop example, this command outputs results for ``ex1()`` as follows.
|
|
Note that execution time is given in units of 1e-06 seconds:
|
|
|
|
.. code-block:: bash
|
|
|
|
Timer unit: 1e-06 s
|
|
|
|
Total time: 2.50883 s
|
|
File: your_script_here.py
|
|
Function: ex1 at line 28
|
|
|
|
Line # Hits Time Per Hit % Time Line Contents
|
|
==============================================================
|
|
29 @profile
|
|
30 def ex1():
|
|
31 1 3.0 3.0 0.0 list1 = []
|
|
32 6 18.0 3.0 0.0 for i in range(5):
|
|
33 5 2508805.0 501761.0 100.0 list1.append(ray.get(func.remote()))
|
|
|
|
|
|
Notice that each hit to ``list1.append(ray.get(func.remote()))`` at line 33
|
|
takes the full 0.5 seconds waiting for ``func()`` to finish. Meanwhile, in
|
|
``ex2()`` below, each call of ``func.remote()`` at line 40 only takes 0.127 ms,
|
|
and the majority of the time (about 1 second) is spent on waiting for ``ray.get()``
|
|
at the end:
|
|
|
|
|
|
.. code-block:: bash
|
|
|
|
Total time: 1.00357 s
|
|
File: your_script_here.py
|
|
Function: ex2 at line 35
|
|
|
|
Line # Hits Time Per Hit % Time Line Contents
|
|
==============================================================
|
|
36 @profile
|
|
37 def ex2():
|
|
38 1 2.0 2.0 0.0 list2 = []
|
|
39 6 13.0 2.2 0.0 for i in range(5):
|
|
40 5 637.0 127.4 0.1 list2.append(func.remote())
|
|
41 1 1002919.0 1002919.0 99.9 ray.get(list2)
|
|
|
|
|
|
And finally, ``line_profiler``'s output for ``ex3()``. Each call to
|
|
``func.remote()`` at line 50 still take magnitudes faster than 0.5 seconds,
|
|
showing that Ray is successfully parallelizing the remote calls. However, each
|
|
call to the local function ``other_func()`` takes the full 0.3 seconds,
|
|
totalling up to the guaranteed minimum application execution time of 1.5
|
|
seconds:
|
|
|
|
.. code-block:: bash
|
|
|
|
Total time: 2.00446 s
|
|
File: basic_kernprof.py
|
|
Function: ex3 at line 44
|
|
|
|
Line # Hits Time Per Hit % Time Line Contents
|
|
==============================================================
|
|
44 @profile
|
|
45 #@time_this
|
|
46 def ex3():
|
|
47 1 2.0 2.0 0.0 list3 = []
|
|
48 6 13.0 2.2 0.0 for i in range(5):
|
|
49 5 1501934.0 300386.8 74.9 other_func()
|
|
50 5 917.0 183.4 0.0 list3.append(func.remote())
|
|
51 1 501589.0 501589.0 25.0 ray.get(list3)
|
|
|
|
|
|
Profiling Using Python's CProfile
|
|
---------------------------------
|
|
|
|
A second way to profile the performance of your Ray application is to
|
|
use Python's native cProfile `profiling module`_. Rather than tracking
|
|
line-by-line of your application code, cProfile can give the total runtime
|
|
of each loop function, as well as list the number of calls made and
|
|
execution time of all function calls made within the profiled code.
|
|
|
|
.. _`profiling module`: https://docs.python.org/3/library/profile.html#module-cProfile
|
|
|
|
Unlike ``line_profiler`` above, this detailed list of profiled function calls
|
|
**includes** internal function calls and function calls made within Ray!
|
|
|
|
However, similar to ``line_profiler``, cProfile can be enabled with minimal
|
|
changes to your application code (given that each section of the code you want
|
|
to profile is defined as its own function). To use cProfile, add an import
|
|
statement, then replace calls to the loop functions as follows:
|
|
|
|
.. code-block:: python
|
|
|
|
import cProfile # Added import statement
|
|
|
|
def ex1():
|
|
list1 = []
|
|
for i in range(5):
|
|
list1.append(ray.get(func.remote()))
|
|
|
|
def main():
|
|
ray.init()
|
|
cProfile.run('ex1()') # Modified call to ex1
|
|
cProfile.run('ex2()')
|
|
cProfile.run('ex3()')
|
|
|
|
if __name__ == "__main__":
|
|
main()
|
|
|
|
Now, when executing your Python script, a cProfile list of profiled function
|
|
calls will be outputted to terminal for each call made to ``cProfile.run()``.
|
|
At the very top of cProfile's output gives the total execution time for
|
|
``'ex1()'``:
|
|
|
|
.. code-block:: bash
|
|
|
|
601 function calls (595 primitive calls) in 2.509 seconds
|
|
|
|
Following is a snippet of profiled function calls for ``'ex1()'``. Most of
|
|
these calls are quick and take around 0.000 seconds, so the functions of
|
|
interest are the ones with non-zero execution times:
|
|
|
|
.. code-block:: bash
|
|
|
|
ncalls tottime percall cumtime percall filename:lineno(function)
|
|
...
|
|
1 0.000 0.000 2.509 2.509 your_script_here.py:31(ex1)
|
|
5 0.000 0.000 0.001 0.000 remote_function.py:103(remote)
|
|
5 0.000 0.000 0.001 0.000 remote_function.py:107(_submit)
|
|
...
|
|
10 0.000 0.000 0.000 0.000 worker.py:2459(__init__)
|
|
5 0.000 0.000 2.508 0.502 worker.py:2535(get)
|
|
5 0.000 0.000 0.000 0.000 worker.py:2695(get_global_worker)
|
|
10 0.000 0.000 2.507 0.251 worker.py:374(retrieve_and_deserialize)
|
|
5 0.000 0.000 2.508 0.502 worker.py:424(get_object)
|
|
5 0.000 0.000 0.000 0.000 worker.py:514(submit_task)
|
|
...
|
|
|
|
The 5 separate calls to Ray's ``get``, taking the full 0.502 seconds each call,
|
|
can be noticed at ``worker.py:2535(get)``. Meanwhile, the act of calling the
|
|
remote function itself at ``remote_function.py:103(remote)`` only takes 0.001
|
|
seconds over 5 calls, and thus is not the source of the slow performance of
|
|
``ex1()``.
|
|
|
|
|
|
Profiling Ray Actors with cProfile
|
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
|
|
|
Considering that the detailed output of cProfile can be quite different depending
|
|
on what Ray functionalities we use, let us see what cProfile's output might look
|
|
like if our example involved Actors (for an introduction to Ray actors, see our
|
|
`Actor documentation here`_).
|
|
|
|
.. _`Actor documentation here`: http://ray.readthedocs.io/en/latest/actors.html
|
|
|
|
Now, instead of looping over five calls to a remote function like in ``ex1``,
|
|
let's create a new example and loop over five calls to a remote function
|
|
**inside an actor**. Our actor's remote function again just sleeps for 0.5
|
|
seconds:
|
|
|
|
.. code-block:: python
|
|
|
|
# Our actor
|
|
@ray.remote
|
|
class Sleeper(object):
|
|
def __init__(self):
|
|
self.sleepValue = 0.5
|
|
|
|
# Equivalent to func(), but defined within an actor
|
|
def actor_func(self):
|
|
time.sleep(self.sleepValue)
|
|
|
|
Recalling the suboptimality of ``ex1``, let's first see what happens if we
|
|
attempt to perform all five ``actor_func()`` calls within a single actor:
|
|
|
|
.. code-block:: python
|
|
|
|
def ex4():
|
|
# This is suboptimal in Ray, and should only be used for the sake of this example
|
|
actor_example = Sleeper.remote()
|
|
|
|
five_results = []
|
|
for i in range(5):
|
|
five_results.append(actor_example.actor_func.remote())
|
|
|
|
# Wait until the end to call ray.get()
|
|
ray.get(five_results)
|
|
|
|
We enable cProfile on this example as follows:
|
|
|
|
.. code-block:: python
|
|
|
|
def main():
|
|
ray.init()
|
|
cProfile.run('ex4()')
|
|
|
|
if __name__ == "__main__":
|
|
main()
|
|
|
|
Running our new Actor example, cProfile's abbreviated output is as follows:
|
|
|
|
.. code-block:: bash
|
|
|
|
12519 function calls (11956 primitive calls) in 2.525 seconds
|
|
|
|
ncalls tottime percall cumtime percall filename:lineno(function)
|
|
...
|
|
1 0.000 0.000 0.015 0.015 actor.py:546(remote)
|
|
1 0.000 0.000 0.015 0.015 actor.py:560(_submit)
|
|
1 0.000 0.000 0.000 0.000 actor.py:697(__init__)
|
|
...
|
|
1 0.000 0.000 2.525 2.525 your_script_here.py:63(ex4)
|
|
...
|
|
9 0.000 0.000 0.000 0.000 worker.py:2459(__init__)
|
|
1 0.000 0.000 2.509 2.509 worker.py:2535(get)
|
|
9 0.000 0.000 0.000 0.000 worker.py:2695(get_global_worker)
|
|
4 0.000 0.000 2.508 0.627 worker.py:374(retrieve_and_deserialize)
|
|
1 0.000 0.000 2.509 2.509 worker.py:424(get_object)
|
|
8 0.000 0.000 0.001 0.000 worker.py:514(submit_task)
|
|
...
|
|
|
|
It turns out that the entire example still took 2.5 seconds to execute, or the
|
|
time for five calls to ``actor_func()`` to run in serial. We remember in ``ex1``
|
|
that this behavior was because we did not wait until after submitting all five
|
|
remote function tasks to call ``ray.get()``, but we can verify on cProfile's
|
|
output line ``worker.py:2535(get)`` that ``ray.get()`` was only called once at
|
|
the end, for 2.509 seconds. What happened?
|
|
|
|
It turns out Ray cannot parallelize this example, because we have only
|
|
initialized a single ``Sleeper`` actor. Because each actor is a single,
|
|
stateful worker, our entire code is submitted and ran on a single worker the
|
|
whole time.
|
|
|
|
To better parallelize the actors in ``ex4``, we can take advantage
|
|
that each call to ``actor_func()`` is independent, and instead
|
|
create five ``Sleeper`` actors. That way, we are creating five workers
|
|
that can run in parallel, instead of creating a single worker that
|
|
can only handle one call to ``actor_func()`` at a time.
|
|
|
|
.. code-block:: python
|
|
|
|
def ex4():
|
|
# Modified to create five separate Sleepers
|
|
five_actors = [Sleeper.remote() for i in range(5)]
|
|
|
|
# Each call to actor_func now goes to a different Sleeper
|
|
five_results = []
|
|
for actor_example in five_actors:
|
|
five_results.append(actor_example.actor_func.remote())
|
|
|
|
ray.get(five_results)
|
|
|
|
Our example in total now takes only 1.5 seconds to run:
|
|
|
|
.. code-block:: bash
|
|
|
|
1378 function calls (1363 primitive calls) in 1.567 seconds
|
|
|
|
ncalls tottime percall cumtime percall filename:lineno(function)
|
|
...
|
|
5 0.000 0.000 0.002 0.000 actor.py:546(remote)
|
|
5 0.000 0.000 0.002 0.000 actor.py:560(_submit)
|
|
5 0.000 0.000 0.000 0.000 actor.py:697(__init__)
|
|
...
|
|
1 0.000 0.000 1.566 1.566 your_script_here.py:71(ex4)
|
|
...
|
|
21 0.000 0.000 0.000 0.000 worker.py:2459(__init__)
|
|
1 0.000 0.000 1.564 1.564 worker.py:2535(get)
|
|
25 0.000 0.000 0.000 0.000 worker.py:2695(get_global_worker)
|
|
3 0.000 0.000 1.564 0.521 worker.py:374(retrieve_and_deserialize)
|
|
1 0.000 0.000 1.564 1.564 worker.py:424(get_object)
|
|
20 0.001 0.000 0.001 0.000 worker.py:514(submit_task)
|
|
...
|