From 4e4a4e4e062d37f3fb1c518ea5b0d0d7a32e5a60 Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Wed, 30 Aug 2017 23:40:46 -0700 Subject: [PATCH] Add plasma in-memory object store blog post (#895) * add plasma in-memory object store blog post * modifications for Ray blog * add arrow blog reference * update * rename * Improve formatting. --- ...-08-plasma-in-memory-object-store.markdown | 125 ++++++++++++++++++ site/css/main.css | 3 + site/index.html | 16 ++- 3 files changed, 139 insertions(+), 5 deletions(-) create mode 100644 site/_posts/2017-08-08-plasma-in-memory-object-store.markdown create mode 100644 site/css/main.css diff --git a/site/_posts/2017-08-08-plasma-in-memory-object-store.markdown b/site/_posts/2017-08-08-plasma-in-memory-object-store.markdown new file mode 100644 index 000000000..72bebb038 --- /dev/null +++ b/site/_posts/2017-08-08-plasma-in-memory-object-store.markdown @@ -0,0 +1,125 @@ +--- +layout: post +title: "The Plasma In-Memory Object Store" +excerpt: "This post announces Plasma, an in-memory object store for communicating data between processes." +date: 2017-08-08 00:00:00 +--- + +*This was originally posted on the [Apache Arrow blog][1]*. + +This blog post presents Plasma, an in-memory object store that is being +developed as part of Apache Arrow. **Plasma holds immutable objects in shared +memory so that they can be accessed efficiently by many clients across process +boundaries.** In light of the trend toward larger and larger multicore machines, +Plasma enables critical performance optimizations in the big data regime. + +Plasma was initially developed as part of [Ray][2], and has recently been moved +to Apache Arrow in the hopes that it will be broadly useful. + +One of the goals of Apache Arrow is to serve as a common data layer enabling +zero-copy data exchange between multiple frameworks. A key component of this +vision is the use of off-heap memory management (via Plasma) for storing and +sharing Arrow-serialized objects between applications. + +**Expensive serialization and deserialization as well as data copying are a +common performance bottleneck in distributed computing.** For example, a +Python-based execution framework that wishes to distribute computation across +multiple Python “worker” processes and then aggregate the results in a single +“driver” process may choose to serialize data using the built-in `pickle` +library. Assuming one Python process per core, each worker process would have to +copy and deserialize the data, resulting in excessive memory usage. The driver +process would then have to deserialize results from each of the workers, +resulting in a bottleneck. + +Using Plasma plus Arrow, the data being operated on would be placed in the +Plasma store once, and all of the workers would read the data without copying or +deserializing it (the workers would map the relevant region of memory into their +own address spaces). The workers would then put the results of their computation +back into the Plasma store, which the driver could then read and aggregate +without copying or deserializing the data. + +### The Plasma API: + +Below we illustrate a subset of the API. The C++ API is documented more fully +[here][5], and the Python API is documented [here][6]. + +**Object IDs:** Each object is associated with a string of bytes. + +**Creating an object:** Objects are stored in Plasma in two stages. First, the +object store *creates* the object by allocating a buffer for it. At this point, +the client can write to the buffer and construct the object within the allocated +buffer. When the client is done, the client *seals* the buffer making the object +immutable and making it available to other Plasma clients. + +```python +# Create an object. +object_id = pyarrow.plasma.ObjectID(20 * b'a') +object_size = 1000 +buffer = memoryview(client.create(object_id, object_size)) + +# Write to the buffer. +for i in range(1000): + buffer[i] = 0 + +# Seal the object making it immutable and available to other clients. +client.seal(object_id) +``` + +**Getting an object:** After an object has been sealed, any client who knows the +object ID can get the object. + +```python +# Get the object from the store. This blocks until the object has been sealed. +object_id = pyarrow.plasma.ObjectID(20 * b'a') +[buff] = client.get([object_id]) +buffer = memoryview(buff) +``` + +If the object has not been sealed yet, then the call to `client.get` will block +until the object has been sealed. + +### A sorting application + +To illustrate the benefits of Plasma, we demonstrate an **11x speedup** (on a +machine with 20 physical cores) for sorting a large pandas DataFrame (one +billion entries). The baseline is the built-in pandas sort function, which sorts +the DataFrame in 477 seconds. To leverage multiple cores, we implement the +following standard distributed sorting scheme. + +* We assume that the data is partitioned across K pandas DataFrames and that + each one already lives in the Plasma store. +* We subsample the data, sort the subsampled data, and use the result to define + L non-overlapping buckets. +* For each of the K data partitions and each of the L buckets, we find the + subset of the data partition that falls in the bucket, and we sort that + subset. +* For each of the L buckets, we gather all of the K sorted subsets that fall in + that bucket. +* For each of the L buckets, we merge the corresponding K sorted subsets. +* We turn each bucket into a pandas DataFrame and place it in the Plasma store. + +Using this scheme, we can sort the DataFrame (the data starts and ends in the +Plasma store), in 44 seconds, giving an 11x speedup over the baseline. + +### Design + +The Plasma store runs as a separate process. It is written in C++ and is +designed as a single-threaded event loop based on the [Redis][3] event loop library. +The plasma client library can be linked into applications. Clients communicate +with the Plasma store via messages serialized using [Google Flatbuffers][4]. + +### Call for contributions + +Plasma is a work in progress, and the API is currently unstable. Today Plasma is +primarily used in [Ray][2] as an in-memory cache for Arrow serialized objects. +We are looking for a broader set of use cases to help refine Plasma’s API. In +addition, we are looking for contributions in a variety of areas including +improving performance and building other language bindings. Please let us know +if you are interested in getting involved with the project. + +[1]: http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-object-store/ +[2]: https://github.com/ray-project/ray +[3]: https://redis.io/ +[4]: https://google.github.io/flatbuffers/ +[5]: https://github.com/apache/arrow/blob/master/cpp/apidoc/tutorials/plasma.md +[6]: https://github.com/apache/arrow/blob/master/python/doc/source/plasma.rst diff --git a/site/css/main.css b/site/css/main.css new file mode 100644 index 000000000..4542b078f --- /dev/null +++ b/site/css/main.css @@ -0,0 +1,3 @@ +.posts { list-style-type: none; } + +.posts li { margin-bottom: 30px; } diff --git a/site/index.html b/site/index.html index bbb7a0329..e6c877b7e 100644 --- a/site/index.html +++ b/site/index.html @@ -2,12 +2,18 @@ layout: default --- + +
-{% for post in site.posts %} - {{ post.date | date: "%b %-d, %Y" }} - {{ post.title }} - {{ post.excerpt }} -{% endfor %} +