---
layout: post
title: "Fast Python Serialization with Ray and Apache Arrow"
excerpt: "This post describes how serialization works in Ray."
date: 2017-10-15 14:00:00
author: Philipp Moritz, Robert Nishihara
---

This post elaborates on the integration between [Ray][1] and [Apache Arrow][2].
The main problem this addresses is [data serialization][3].

From [Wikipedia][3], **serialization** is

> ... the process of translating data structures or object state into a format
> that can be stored ... or transmitted ... and reconstructed later (possibly
> in a different computer environment).

Why is any translation necessary? Well, when you create a Python object, it may
have pointers to other Python objects, and these objects are all allocated in
different regions of memory, and all of this has to make sense when unpacked by
another process on another machine.

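As a minimal illustration using only the standard library, a pickle round trip
translates a nested object graph into a flat byte string and reconstructs an
equivalent graph from it:

```python
import pickle

# A nested object graph: the dict, list, tuple, and string all live in
# different regions of memory and reference one another by pointer.
x = {'a': [1, 2, 3], 'b': ('hello', 4.0)}

blob = pickle.dumps(x)  # a flat byte string that can be stored or transmitted
y = pickle.loads(blob)  # an equivalent object graph, rebuilt from the bytes

assert x == y and x is not y
```
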
Serialization and deserialization are **bottlenecks in parallel and distributed
computing**, especially in machine learning applications with large objects and
large quantities of data.

## Design Goals

As Ray is optimized for machine learning and AI applications, we have focused a
lot on serialization and data handling, with the following design goals:

1. It should be very efficient with **large numerical data** (this includes
NumPy arrays and Pandas DataFrames, as well as objects that recursively contain
NumPy arrays and Pandas DataFrames).
2. It should be about as fast as pickle for **general Python types**.
3. It should be compatible with **shared memory**, allowing multiple processes
to use the same data without copying it.
4. **Deserialization** should be extremely fast (when possible, it should not
require reading the entire serialized object).
5. It should be **language independent** (eventually we'd like to enable Python
workers to use objects created by workers in Java or other languages and vice
versa).

## Our Approach and Alternatives

The go-to serialization approach in Python is the **pickle** module. Pickle is
very general, especially if you use variants like [cloudpickle][4]. However, it
does not satisfy requirements 1, 3, 4, or 5. Alternatives like **json** satisfy
requirement 5, but not 1-4.

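For example, cloudpickle can serialize closures and lambdas that the standard
pickle module rejects. A small sketch (cloudpickle is a separate package):

```python
import pickle
import cloudpickle

f = lambda x: x + 1

# pickle.dumps(f) raises an error: plain pickle cannot serialize lambdas.
blob = cloudpickle.dumps(f)  # cloudpickle serializes the function by value
g = pickle.loads(blob)       # the result can be loaded with plain pickle
assert g(1) == 2
```
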
**Our Approach:** To satisfy requirements 1-5, we chose to use the
[Apache Arrow][2] format as our underlying data representation. In collaboration
with the Apache Arrow team, we built [libraries][7] for mapping general Python
objects to and from the Arrow format. Some properties of this approach:

- The data layout is language independent (requirement 5).
- Offsets into a serialized data blob can be computed in constant time without
reading the full object (requirements 1 and 4).
- Arrow supports **zero-copy reads**, so objects can naturally be stored in
shared memory and used by multiple processes (requirements 1 and 3; see the
sketch after this list).
- We can naturally fall back to pickle for anything we can't handle well
(requirement 2).

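As a sketch of what zero-copy reads look like in practice, the snippet below
writes a serialized object into a memory-mapped file and deserializes it from
the mapping. It assumes a pyarrow version that provides the `pyarrow.serialize`
API (as in the versions contemporary with this post), and the file path and
array size are illustrative:

```python
import numpy as np
import pyarrow

x = {'weights': np.random.randn(10 ** 6)}

# Write the serialized object into a file that can be memory-mapped.
with pyarrow.OSFile('/tmp/serialized.dat', 'wb') as f:
    f.write(pyarrow.serialize(x).to_buffer())

# Any process can map the same file and deserialize from it; the large
# array is read directly out of the mapping rather than copied.
mmap = pyarrow.memory_map('/tmp/serialized.dat')
y = pyarrow.deserialize(mmap.read_buffer())
```
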
**Alternatives to Arrow:** We could have built on top of
[**Protocol Buffers**][5], but Protocol Buffers isn't really designed for
numerical data, and that approach wouldn't satisfy requirements 1, 3, or 4.
Building on top of [**Flatbuffers**][6] could be made to work, but it would have
required implementing a lot of the facilities that Arrow already has, and we
preferred a columnar data layout more optimized for big data.

## Speedups

Here we show some performance improvements over Python's pickle module. The
experiments were done using `pickle.HIGHEST_PROTOCOL`. Code for generating these
plots is included at the end of the post.

**With NumPy arrays:** In machine learning and AI applications, data (e.g.,
images, neural network weights, text documents) are typically represented as
data structures containing NumPy arrays. When using NumPy arrays, the speedups
are impressive.

The fact that the Ray bars for deserialization are barely visible is not a
mistake. This is a consequence of the support for zero-copy reads (the savings
largely come from the lack of memory movement).

<div align="center">
<img src="{{ site.base-url }}/assets/fast_python_serialization_with_ray_and_arrow/speedups0.png" width="365" height="255">
<img src="{{ site.base-url }}/assets/fast_python_serialization_with_ray_and_arrow/speedups1.png" width="365" height="255">
</div>

Note that the biggest wins are with deserialization. The speedups here are
multiple orders of magnitude and get better as the NumPy arrays get larger
(thanks to design goals 1, 3, and 4). Making **deserialization** fast is
important for two reasons. First, an object may be serialized once and then
deserialized many times (e.g., an object that is broadcast to all workers).
Second, a common pattern is for many objects to be serialized in parallel and
then aggregated and deserialized one at a time on a single worker, making
deserialization the bottleneck.

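The first pattern looks roughly like the following in Ray (a small sketch: the
task body, array size, and number of tasks are illustrative):

```python
import numpy as np
import ray

ray.init()

@ray.remote
def use_weights(weights, i):
    # Each task deserializes `weights` from the shared-memory object
    # store; with zero-copy reads, the array data is not copied.
    return i

# Serialize once with ray.put, then deserialize in many tasks.
weights_id = ray.put(np.random.randn(10 ** 6))
results = ray.get([use_weights.remote(weights_id, i) for i in range(10)])
```
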
**Without NumPy arrays:** When using regular Python objects, for which we
cannot take advantage of shared memory, the results are comparable to pickle.

<div align="center">
<img src="{{ site.base-url }}/assets/fast_python_serialization_with_ray_and_arrow/speedups2.png" width="365" height="255">
<img src="{{ site.base-url }}/assets/fast_python_serialization_with_ray_and_arrow/speedups3.png" width="365" height="255">
</div>

These are just a few examples of interesting Python objects. The most important
case is when NumPy arrays are nested within other objects. Note that our
serialization library works with very general Python types, including custom
Python classes and deeply nested objects.

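Custom classes can be handled by registering serialization callbacks on a
pyarrow `SerializationContext`. A sketch, where the `Foo` class and its
handlers are illustrative (the API is from the pyarrow versions contemporary
with this post):

```python
import pyarrow

class Foo:
    def __init__(self, data):
        self.data = data

context = pyarrow.SerializationContext()
# Map Foo to and from a dict of Arrow-friendly values.
context.register_type(
    Foo, 'Foo',
    custom_serializer=lambda obj: {'data': obj.data},
    custom_deserializer=lambda d: Foo(d['data']))

buf = pyarrow.serialize(Foo([1, 2, 3]), context=context).to_buffer()
foo = pyarrow.deserialize(buf, context=context)
assert foo.data == [1, 2, 3]
```
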
## The API

The serialization library can be used directly through pyarrow as follows. More
documentation is available [here][7].

```python
import numpy as np
import pyarrow

x = [(1, 2), 'hello', 3, 4, np.array([5.0, 6.0])]
serialized_x = pyarrow.serialize(x).to_buffer()
deserialized_x = pyarrow.deserialize(serialized_x)
```

It can be used directly through the Ray API as follows.

```python
import numpy as np
import ray

ray.init()

x = [(1, 2), 'hello', 3, 4, np.array([5.0, 6.0])]
x_id = ray.put(x)
deserialized_x = ray.get(x_id)
```

## Data Representation

We use Apache Arrow as the underlying language-independent data layout. Objects
are stored in two parts: a **schema** and a **data blob**. At a high level, the
data blob is roughly a flattened concatenation of all of the data values
recursively contained in the object, and the schema defines the types and
nesting structure of the data blob.

**Technical Details:** Python sequences (e.g., dictionaries, lists, tuples,
sets) are encoded as Arrow [UnionArrays][8] of other types (e.g., bools, ints,
strings, bytes, floats, doubles, date64s, tensors (i.e., NumPy arrays), lists,
tuples, dicts, and sets). Nested sequences are encoded using Arrow
[ListArrays][9]. All tensors are collected and appended to the end of the
serialized object, and the UnionArray contains references to these tensors.

To give a concrete example, consider the following object.

```python
[(1, 2), 'hello', 3, 4, np.array([5.0, 6.0])]
```

It would be represented in Arrow with the following structure.

```
UnionArray(type_ids=[tuple, string, int, int, ndarray],
           tuples=ListArray(offsets=[0, 2],
                            UnionArray(type_ids=[int, int],
                                       ints=[1, 2])),
           strings=['hello'],
           ints=[3, 4],
           ndarrays=[<offset of numpy array>])
```

Arrow uses Flatbuffers to encode serialized schemas. **Using only the schema, we
can compute the offsets of each value in the data blob without scanning through
the data blob** (unlike with pickle; this is what enables fast deserialization).
This means that we can avoid copying or otherwise converting large arrays and
other values during deserialization. Tensors are appended at the end of the
UnionArray and can be efficiently shared and accessed using shared memory.

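One way to observe the no-copy behavior (a sketch that assumes a pyarrow
version exposing `Buffer.address` alongside `pyarrow.serialize`, and uses
NumPy's `np.byte_bounds` helper) is to check that a deserialized array's memory
lies inside the serialized buffer rather than in freshly allocated storage:

```python
import numpy as np
import pyarrow

buf = pyarrow.serialize(np.arange(10 ** 6)).to_buffer()
arr = pyarrow.deserialize(buf)

# The array's data lies inside the buffer: deserialization made no copy.
lo, hi = np.byte_bounds(arr)
assert buf.address <= lo and hi <= buf.address + buf.size
```
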
Note that the actual object would be laid out in memory as shown below.

<div align="center">
<img src="{{ site.base-url }}/assets/fast_python_serialization_with_ray_and_arrow/python_object.png" width="600">
</div>
<div><i>The layout of a Python object in the heap. Each box is allocated in a
different memory region, and arrows between boxes represent pointers.</i></div>
<br />

The Arrow-serialized representation would be as follows.

<div align="center">
<img src="{{ site.base-url }}/assets/fast_python_serialization_with_ray_and_arrow/arrow_object.png" width="400">
</div>
<div><i>The memory layout of the Arrow-serialized object.</i></div>
<br />

## Getting Involved

We welcome contributions, especially in the following areas.

- Use the C++ and Java implementations of Arrow to implement versions of this
serialization library for C++ and Java.
- Implement support for more Python types and better test coverage.

## Reproducing the Figures Above

For reference, the figures can be reproduced with the following code.
Benchmarking `ray.put` and `ray.get` instead of `pyarrow.serialize` and
`pyarrow.deserialize` gives similar figures. The plots were generated at this
[commit][10].

```python
import pickle
import pyarrow
import matplotlib.pyplot as plt
import numpy as np
import timeit


def benchmark_object(obj, number=10):
    # Time serialization and deserialization for pickle.
    pickle_serialize = timeit.timeit(
        lambda: pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL),
        number=number)
    serialized_obj = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
    pickle_deserialize = timeit.timeit(lambda: pickle.loads(serialized_obj),
                                       number=number)

    # Time serialization and deserialization for Ray (via pyarrow).
    ray_serialize = timeit.timeit(
        lambda: pyarrow.serialize(obj).to_buffer(), number=number)
    serialized_obj = pyarrow.serialize(obj).to_buffer()
    ray_deserialize = timeit.timeit(
        lambda: pyarrow.deserialize(serialized_obj), number=number)

    return [[pickle_serialize, pickle_deserialize],
            [ray_serialize, ray_deserialize]]


def plot(pickle_times, ray_times, title, i):
    fig, ax = plt.subplots()
    fig.set_size_inches(3.8, 2.7)

    bar_width = 0.35
    index = np.arange(2)
    opacity = 0.6

    plt.bar(index, pickle_times, bar_width,
            alpha=opacity, color='r', label='Pickle')

    plt.bar(index + bar_width, ray_times, bar_width,
            alpha=opacity, color='c', label='Ray')

    plt.title(title, fontweight='bold')
    plt.ylabel('Time (seconds)', fontsize=10)
    labels = ['serialization', 'deserialization']
    plt.xticks(index + bar_width / 2, labels, fontsize=10)
    plt.legend(fontsize=10, bbox_to_anchor=(1, 1))
    plt.tight_layout()
    plt.yticks(fontsize=10)
    plt.savefig('plot-' + str(i) + '.png', format='png')


test_objects = [
    [np.random.randn(50000) for i in range(100)],
    {'weight-' + str(i): np.random.randn(50000) for i in range(100)},
    {i: set(['string1' + str(i), 'string2' + str(i)]) for i in range(100000)},
    [str(i) for i in range(200000)]
]

titles = [
    'List of large numpy arrays',
    'Dictionary of large numpy arrays',
    'Large dictionary of small sets',
    'Large list of strings'
]

for i in range(len(test_objects)):
    plot(*benchmark_object(test_objects[i]), titles[i], i)
```

[1]: http://docs.ray.io/en/master/index.html
[2]: https://arrow.apache.org/
[3]: https://en.wikipedia.org/wiki/Serialization
[4]: https://github.com/cloudpipe/cloudpickle/
[5]: https://developers.google.com/protocol-buffers/
[6]: https://google.github.io/flatbuffers/
[7]: https://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization
[8]: http://arrow.apache.org/docs/memory_layout.html#dense-union-type
[9]: http://arrow.apache.org/docs/memory_layout.html#list-type
[10]: https://github.com/apache/arrow/tree/894f7400977693b4e0e8f4b9845fd89481f6bf29