Add plasma in-memory object store blog post (#895)

* add plasma in-memory object store blog post
* modifications for Ray blog
* add arrow blog reference
* update
* rename
* Improve formatting.

parent 7496c98010
commit 4e4a4e4e06

3 changed files with 139 additions and 5 deletions

site/_posts/2017-08-08-plasma-in-memory-object-store.markdown (new file, 125 lines)

@@ -0,0 +1,125 @@

---
layout: post
title: "The Plasma In-Memory Object Store"
excerpt: "This post announces Plasma, an in-memory object store for communicating data between processes."
date: 2017-08-08 00:00:00
---

*This was originally posted on the [Apache Arrow blog][1]*.

This blog post presents Plasma, an in-memory object store that is being developed as part of Apache Arrow. **Plasma holds immutable objects in shared memory so that they can be accessed efficiently by many clients across process boundaries.** In light of the trend toward larger and larger multicore machines, Plasma enables critical performance optimizations in the big data regime.

Plasma was initially developed as part of [Ray][2], and has recently been moved to Apache Arrow in the hopes that it will be broadly useful.

One of the goals of Apache Arrow is to serve as a common data layer enabling zero-copy data exchange between multiple frameworks. A key component of this vision is the use of off-heap memory management (via Plasma) for storing and sharing Arrow-serialized objects between applications.

**Expensive serialization and deserialization, as well as data copying, are a common performance bottleneck in distributed computing.** For example, a Python-based execution framework that wishes to distribute computation across multiple Python “worker” processes and then aggregate the results in a single “driver” process may choose to serialize data using the built-in `pickle` library. Assuming one Python process per core, each worker process would have to copy and deserialize the data, resulting in excessive memory usage. The driver process would then have to deserialize results from each of the workers, resulting in a bottleneck.

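To make the bottleneck concrete, here is a minimal sketch of the pickle-based pattern (our illustration, not from the original post); every simulated worker pays for its own full copy of the data:

```python
import pickle

import numpy as np

# A "large" dataset the driver wants to share with workers.
data = np.zeros(10_000_000)

# With pickle, the driver serializes the data once...
payload = pickle.dumps(data)

# ...and every worker deserializes its own private copy, so N workers
# hold N full copies of the data in memory.
def worker(payload):
    local_copy = pickle.loads(payload)  # full copy per worker
    return local_copy.sum()

results = [worker(payload) for _ in range(4)]  # simulate 4 workers
```
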
Using Plasma plus Arrow, the data being operated on would be placed in the Plasma store once, and all of the workers would read the data without copying or deserializing it (the workers would map the relevant region of memory into their own address spaces). The workers would then put the results of their computation back into the Plasma store, which the driver could then read and aggregate without copying or deserializing the data.

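The post stops at the description, but the Arrow documentation of the era illustrated this pattern roughly as follows; treat the socket path, the object ID, and the one-argument `plasma.connect` as assumptions that may differ across pyarrow versions:

```python
import pyarrow as pa
import pyarrow.plasma as plasma

# Connect to a running Plasma store (socket path is an example).
client = plasma.connect("/tmp/plasma")

# Driver: serialize a record batch into shared memory exactly once.
batch = pa.RecordBatch.from_arrays([pa.array(range(1000))], names=["x"])

# Measure the serialized size with a mock sink, then allocate a buffer.
mock_sink = pa.MockOutputStream()
with pa.RecordBatchStreamWriter(mock_sink, batch.schema) as writer:
    writer.write_batch(batch)
object_id = plasma.ObjectID(20 * b"b")
buf = client.create(object_id, mock_sink.size())

# Write the batch directly into the shared-memory buffer and seal it.
stream = pa.FixedSizeBufferWriter(buf)
with pa.RecordBatchStreamWriter(stream, batch.schema) as writer:
    writer.write_batch(batch)
client.seal(object_id)

# Worker: map the same buffer and read the batch without copying it.
[data] = client.get_buffers([object_id])
shared_batch = pa.RecordBatchStreamReader(pa.BufferReader(data)).read_next_batch()
```
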
### The Plasma API

Below we illustrate a subset of the API. The C++ API is documented more fully [here][5], and the Python API is documented [here][6].

**Object IDs:** Each object is associated with an object ID, a string of 20 bytes that identifies the object.

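For example, a fresh ID can be built from 20 random bytes; a minimal sketch (the use of `os.urandom` is our illustration, not from the original post):

```python
import os
import pyarrow.plasma as plasma

# Plasma object IDs are 20 bytes long; random bytes give a fresh ID.
object_id = plasma.ObjectID(os.urandom(20))
```
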
**Creating an object:** Objects are stored in Plasma in two stages. First, the object store *creates* the object by allocating a buffer for it. At this point, the client can write to the buffer and construct the object within the allocated buffer. When the client is done, it *seals* the buffer, making the object immutable and available to other Plasma clients.

```python
import pyarrow.plasma as plasma

# Connect to a running Plasma store (socket path is an example; older
# pyarrow versions also required extra arguments here).
client = plasma.connect("/tmp/plasma")

# Create an object by allocating a 1000-byte buffer in the store.
object_id = plasma.ObjectID(20 * b'a')
object_size = 1000
buffer = memoryview(client.create(object_id, object_size))

# Write to the buffer.
for i in range(object_size):
    buffer[i] = 0

# Seal the object, making it immutable and available to other clients.
client.seal(object_id)
```

**Getting an object:** After an object has been sealed, any client that knows the object ID can get the object.

```python
# Get the object from the store. This blocks until the object has been sealed.
object_id = plasma.ObjectID(20 * b'a')
[buf] = client.get([object_id])
buffer = memoryview(buf)
```

If the object has not been sealed yet, the call to `client.get` blocks until the object is sealed.

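When blocking is undesirable, the client can first test for the object; a small sketch (the `contains` method is part of the pyarrow Plasma client API, but treat its exact behavior across versions as an assumption):

```python
# Non-blocking alternative: only fetch the object once it has been sealed.
if client.contains(object_id):
    [buf] = client.get([object_id])
```
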
### A sorting application

To illustrate the benefits of Plasma, we demonstrate an **11x speedup** (on a machine with 20 physical cores) for sorting a large pandas DataFrame (one billion entries). The baseline is the built-in pandas sort function, which sorts the DataFrame in 477 seconds. To leverage multiple cores, we implement the following standard distributed sorting scheme (a rough single-process sketch follows the list).

* We assume that the data is partitioned across K pandas DataFrames and that each one already lives in the Plasma store.
* We subsample the data, sort the subsampled data, and use the result to define L non-overlapping buckets.
* For each of the K data partitions and each of the L buckets, we find the subset of the data partition that falls in the bucket, and we sort that subset.
* For each of the L buckets, we gather all of the K sorted subsets that fall in that bucket.
* For each of the L buckets, we merge the corresponding K sorted subsets.
* We turn each bucket into a pandas DataFrame and place it in the Plasma store.

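The post itself does not include the implementation; the following is a hedged, single-process illustration of the same bucketing scheme using NumPy and pandas (no Plasma, so the K partitions are plain in-memory DataFrames):

```python
import numpy as np
import pandas as pd

K, L = 4, 4  # number of data partitions and buckets
rng = np.random.default_rng(0)

# The data is partitioned across K DataFrames (stand-ins for Plasma objects).
partitions = [pd.DataFrame({"key": rng.random(25_000)}) for _ in range(K)]

# Subsample, sort, and pick L - 1 splitters defining L non-overlapping buckets.
sample = np.concatenate([p["key"].to_numpy()[:1000] for p in partitions])
splitters = np.quantile(sample, [i / L for i in range(1, L)])

# For each partition and bucket, extract the subset that falls in the bucket
# and sort it.
sorted_subsets = [[] for _ in range(L)]
for part in partitions:
    keys = part["key"].to_numpy()
    bucket_ids = np.searchsorted(splitters, keys)
    for b in range(L):
        sorted_subsets[b].append(np.sort(keys[bucket_ids == b]))

# Gather and merge each bucket's K sorted subsets (np.sort of the
# concatenation is a simple stand-in for a k-way merge), then turn
# every bucket into a DataFrame.
buckets = [
    pd.DataFrame({"key": np.sort(np.concatenate(sorted_subsets[b]))})
    for b in range(L)
]

result = pd.concat(buckets, ignore_index=True)
assert result["key"].is_monotonic_increasing
```
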
Using this scheme, we can sort the DataFrame (with the data both starting and ending in the Plasma store) in 44 seconds, giving an 11x speedup over the baseline.

### Design

The Plasma store runs as a separate process. It is written in C++ and is designed as a single-threaded event loop based on the [Redis][3] event loop library. The Plasma client library can be linked into applications. Clients communicate with the Plasma store via messages serialized using [Google Flatbuffers][4].

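Since the store is a separate process, it must be running before any client connects. A hedged sketch of the setup (the `plasma_store` flags and socket path follow the Arrow docs of the time and may differ in your build):

```python
# Start the store in another terminal first, for example:
#   plasma_store -m 1000000000 -s /tmp/plasma
# (-m: bytes of shared memory to reserve, -s: Unix domain socket path)
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/plasma")
```
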
### Call for contributions

Plasma is a work in progress, and the API is currently unstable. Today Plasma is primarily used in [Ray][2] as an in-memory cache for Arrow-serialized objects. We are looking for a broader set of use cases to help refine Plasma’s API. In addition, we are looking for contributions in a variety of areas, including improving performance and building other language bindings. Please let us know if you are interested in getting involved with the project.

[1]: http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/
[2]: https://github.com/ray-project/ray
[3]: https://redis.io/
[4]: https://google.github.io/flatbuffers/
[5]: https://github.com/apache/arrow/blob/master/cpp/apidoc/tutorials/plasma.md
[6]: https://github.com/apache/arrow/blob/master/python/doc/source/plasma.rst

site/css/main.css (new file, 3 lines)

@@ -0,0 +1,3 @@

.posts { list-style-type: none; }

.posts li { margin-bottom: 30px; }

@@ -2,12 +2,18 @@
 layout: default
 ---
 
+<link rel="stylesheet" href="{{ "/css/main.css" | prepend: site.baseurl }}">
+
 <div class="home">
 
-{% for post in site.posts %}
-  <span>{{ post.date | date: "%b %-d, %Y" }}</span>
-  <a class="post-link" href="{{ post.url | prepend: site.baseurl }}">{{ post.title }}</a>
-  {{ post.excerpt }}
-{% endfor %}
+<ul class="posts">
+  {% for post in site.posts %}
+    <li>
+      <span>{{ post.date | date: "%b %-d, %Y" }}</span>
+      <a class="post-link" href="{{ post.url | prepend: site.baseurl }}">{{ post.title }}</a>
+      {{ post.excerpt }}
+    </li>
+  {% endfor %}
+</ul>
 
 </div>