185 commits

Author SHA1 Message Date
Melih Elibol
cff37765b1 Addresses missed comments from multichunk object transfer PR. (#1908)
* Move object manager parameters to ray config,
object manager config bug fix.
addresses other comments from #1827.

* linting and uint?

* typos

* remove uint.
2018-04-15 21:35:51 -07:00
Robert Nishihara
6ca2c2a609 Allow numpy arrays to be passed by value into tasks (and inlined in the task spec). (#1816)
* Allow numpy arrays and larger objects to be passed by value in task specifications.

* Fix bug.

* Fix bug. Inline all bug numpy object arrays.

* Increase size limit for inlining args in task spec.

* Give numpy init different signatures in Python 2 and Python 3.

* Simplify code.

* Fix test.

* Use import_array1 instead of import_array.
2018-04-15 20:36:01 -07:00
Robert Nishihara
256389dc59 Use new task spec for computing IDs in raylet code path. (#1830)
* Use new task spec for computing IDs in raylet code path.

* Fix linting.

* Fixes

* Fix test.
2018-04-08 13:31:55 -07:00
Stephanie Wang
bf194db4bc [xray] Basic actor support (#1835) 2018-04-06 00:17:14 -07:00
Melih Elibol
6e06a9e338 XRay Task Forwarding Milestone (#1785)
Summary:
Able to run 1000 tasks with object dependencies on a set of distributed Raylets.

Raylet Changes:

* Finalized ClientConnection class.
* Task forwarding.
* NM-to-NM heartbeats.
* NM resource accounting for tasks.
* Simple scheduling policy with task forwarding.
* Creating and maintaining long-lived NM-to-NM connections and reusing them for task forwarding.

LineageCache Changes:

* LineageCache without cleanup of tasks committed by remote nodes.
* Lineage cache writeback and cleanup implementation.

ObjectManager Changes:

* Object manager event loop/ClientConnection refactor.
* Multithreaded object manager (disabled in this PR).

Testing Changes:

* Integration tests for task submission on multiple Raylets.
* Stress tests for object manager (with GCS and object store integration).


Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Alexey Tumanov <atumanov@gmail.com>
2018-03-31 18:02:58 -07:00
Stephanie Wang
925e392b2d Add an Append call to the GCS Log that checks for current length (#1788)
* TABLE_APPEND call

* Convert callbacks back to taking in a string...

* GCS returns flatbuffers, define Log class

* Cleanups

* Modify client table to use the Log interface

* Fix bug where we replied twice from redis

* Fixes

* lint

* Compile and test raylet TaskTable

* Modify GCS tables to handle unique_ptrs from nested flatbuffers

* Add raylet::TaskTable unit tests to replace ObjectTable ones

* Convert ObjectTable to a log

* Convert ObjectTable tests to the Log

* AppendAt Redis and gcs Log command

* unit test for AppendAt

* Add a Log for task reconstruction data

* Add check for unique entries in TABLE_APPEND

* Documentation
2018-03-27 13:04:43 -07:00
Stephanie Wang
0fd4112354 Introduce a log interface for the new GCS (#1771)
* TABLE_APPEND call

* Convert callbacks back to taking in a string...

* GCS returns flatbuffers, define Log class

* Cleanups

* Modify client table to use the Log interface

* Fix bug where we replied twice from redis

* Fixes

* lint
2018-03-26 16:00:43 -07:00
Stephanie Wang
0ad1054b8b Add a GCS table for the xray task flatbuffer (#1775)
* Introduce Task flatbuffer into xray, add to GCS

* Compile and test raylet TaskTable
2018-03-23 13:18:23 -07:00
Stephanie Wang
8704c8618c Request and cancel notifications in the new GCS API (#1758)
* Add TableRequestNotifications and TableCancelNotifications to Redis modules

* Add RequestNotifications and CancelNotifications to generic GCS Table

* Add tests for subscribing to specific keys

* Remove TODO!

* Return the current value at the key directly from RequestNotifications instead of through publish

* Add unit test for Lookup failure callback

* Modify tests to account for empty subscription response

* Remove ObjectTable notification methods

* Clean up message parsing and doc in redis context

* Use vectors of DataT in all GCS callbacks

* Clean up SubscriptionCallback

* Move Table definitions into tables.cc

* Refactor and document redis modules

* doc

* Fix new GCS build

* Cleanups

* Revert "Fix new GCS build"

This reverts commit 6e3e69090c67ef60aaf22a9cf62be0290d989e96.

* Use vectors for internal callback interface, user-facing interface takes a reference to a single item

* Fix new GCS build

* Add unit test for Lookup failure callback

* Fix compiler errors

* Cleanup

* Publish the entry ID with the notification

* Check that the ID for a notification matches in client tests
2018-03-22 10:31:07 -07:00
Stephanie Wang
5c7ef34b05 Define string prefixes for all tables in the new GCS API (#1755)
* Define string prefixes for all tables in the new GCS API

* Extra check for TablePrefix enum

* Remove unused field and add doc for existing fields
2018-03-20 20:27:11 -07:00
Robert Nishihara
4658d0a180 Print error when actor takes too long to start, and refactor error me… (#1747)
* Print error when actor takes too long to start, and refactor error message pushing.

* Print warning every ten seconds.

* Fix linting and tests.

* Fix tests.
2018-03-19 20:24:35 -07:00
Robert Nishihara
96913be939 Treat actor creation like a regular task. (#1668)
* Treat actor creation like a regular task.

* Small cleanups.

* Change semantics of actor resource handling.

* Bug fix.

* Minor linting

* Bug fix

* Fix jenkins test.

* Fix actor tests

* Some cleanups

* Bug fix

* Fix bug.

* Remove cached actor tasks when a driver is removed.

* Add more info to taskspec in global state API.

* Fix cyclic import bug in tune.

* Fix

* Fix linting.

* Fix linting.

* Don't schedule any tasks (especially actor creation tasks) on local schedulers with 0 CPUs.

* Bug fix.

* Add test for 0 CPU case

* Fix linting

* Address comments.

* Fix typos and add comment.

* Add assertion and fix test.
2018-03-16 11:18:07 -07:00
Melih Elibol
3c080f4baa Add a callback for gcs table lookup failures. (#1702)
* Add callback to gcs client for table lookup failures.

* update plasma_manager reflecting changes to gcs callback.
2018-03-15 22:25:01 -07:00
Stephanie Wang
6114b6d20e Implement the client table for the new GCS (#1674)
* Add subscription callback to CallbackData

* Implement ClientTable

* Hook up ClientTable to AsyncGCSClient

* Add client_info to GCSClient Connect interface

* client table callbacks

* Unit test for client table

* Doc

* Fix idempotency check

* Fix mac build

* Fix memory issues in gcs client test

* Fix disconnection bug

* lint
2018-03-11 19:17:18 -07:00
Stephanie Wang
0a6edb55a8 Implement the Subscribe call for the new GCS API (#1652)
* Implement the Subscribe call for the new GCS API

* Document tests

* Upper case function name

* Fix build errors

* lint
2018-03-06 09:56:12 -08:00
Zhenyu Guo
f1e5789c26 restructure how to organize 3rd party libs (#1630)
* restructure how to organize 3rd party libs

* Minor whitespace changes.

* Fix compilation on Linux.

* Pass around Python executable so that the correct version of Python is used.
2018-03-01 14:29:56 -08:00
Robert Nishihara
0fcceef772 Update logging and check macros. (#1627)
* Update logging and check macros.

* Fix linting.

* Fix RAY_DCHECK and unused variable.

* Fix linting
2018-02-28 15:13:00 -08:00
Robert Nishihara
ba1ce85f58 Download Redis and flatbuffers differently. (#1602)
* Download Redis differently.

* Get flatbuffers with curl
2018-02-25 20:32:33 -08:00
Alexey Tumanov
844a6afcdd Implement simple random spillback policy. (#1493)
* spillback policy implementation: global + local scheduler

* modernize global scheduler policy state; factor out random number engine and generator

* Minimal version.

* Fix test.

* Make load balancing test less strenuous.
2018-02-13 00:09:35 -08:00
Philipp Moritz
1ab2e63dbd Tune transfer buffer size (#1363)
Increase buffsize from `4096` to `80*1024`.
2018-02-09 14:56:36 -08:00
Philipp Moritz
a3f8fa426b Start integrating new GCS APIs (#1379)
* Start integrating new GCS calls

* fixes

* tests

* cleanup

* cleanup and valgrind fix

* update tests

* fix valgrind

* fix more valgrind

* fixes

* add separate tests for GCS

* fix linting

* update tests

* cleanup

* fix python linting

* more fixes

* fix linting

* add plasma manager callback

* add some documentation

* fix linting

* fix linting

* fixes

* update

* fix linting

* fix

* add spillback count

* fixes

* linting

* fixes

* fix linting

* fix

* fix

* fix
2018-01-31 11:01:12 -08:00
Robert Nishihara
3195c6aa63 Fix local scheduler crash when driver creates actor and exits. (#1474)
* Make check failures in redis.cc more informative.

* Fix bug by calling task_table_add_task.

* Add test.
2018-01-26 14:29:53 -08:00
Alexey Tumanov
f1303291b4 Ray scheduler spillback plumbing + mechanism (#1362)
* spillback mechanism and plumbing : adding spillback counter + timestamp

* linting fix

* documentation

* Fix argument name.
2018-01-23 20:18:12 -08:00
Melih Elibol
4b1c8be4fe Fix setting log-level to debug. (#1432) 2018-01-21 21:51:05 -08:00
Stephanie Wang
74718efa73 Nondeterministic reconstruction for actors (#1344)
* Add failing unit test for nondeterministic reconstruction

* Retry scheduling actor tasks if reassigned to local scheduler

* Update execution edges asynchronously upon dispatch for nondeterministic reconstruction

* Fix bug for updating checkpoint task execution dependencies

* Update comments for deterministic reconstruction

* cleanup

* Add (and skip) failing test case for nondeterministic reconstruction

* Suppress test output
2018-01-21 13:44:13 -08:00
Robert Nishihara
088f01496c Remove unused object info table code. (#1388) 2018-01-05 11:00:06 -08:00
Philipp Moritz
3d224c4edf Second Part of Internal API Refactor (#1326) 2017-12-26 16:22:04 -08:00
Stephanie Wang
12fdb3f53a Convert actor dummy objects to task execution edges. (#1281)
* Define execution dependencies flatbuffer and add to Redis commands

* Convert TaskSpec to TaskExecutionSpec

* Add execution dependencies to Python bindings

* Submitting actor tasks uses execution dependency API instead of dummy argument

* Fix dependency getters and some cleanup for fetching missing dependencies

* C++ convention

* Make TaskExecutionSpec a C++ class

* Convert local scheduler to use TaskExecutionSpec class

* Convert some pointers to references

* Finish conversion to TaskExecutionSpec class

* fix

* Fix

* Fix memory errors?

* Cast flatbuffers GetSize to size_t

* Fixes

* add more retries in global scheduler unit test

* fix linting and cast fbb.GetSize to size_t

* Style and doc

* Fix linting and simplify from_flatbuf.
2017-12-14 20:47:54 -08:00
Philipp Moritz
cac5f47600 First Part of Internal Ray API Refactor (#1173)
* add Ray status class

* add C++ util files

* add ID types

* more APIs

* build system integration

* add test infrastructure and implement some APIs

* add more tests

* fix bugs

* add task table tests

* update

* add toolchain file

* fix

* test

* link with pthread

* update

* fix

* more fixes

* fixes

* always vendor gtest and gflags

* linting

* fixes

* add constants file

* comments

* more fixes

* fix linting
2017-12-14 14:54:09 -08:00
Stephanie Wang
bac39a134e Define a wrapper class for callback_data.data (#1301) 2017-12-08 11:48:21 -08:00
Robert Nishihara
c21e189371 Allow scheduling with arbitrary user-defined resource labels. (#1236)
* Enable scheduling with custom resource labels.

* Fix.

* Minor fixes and ref counting fix.

* Linting

* Use .data() instead of .c_str().

* Fix linting.

* Fix ResourcesTest.testGPUIDs test by waiting for workers to start up.

* Sleep in test so that all tasks are submitted before any completes.
2017-12-01 11:41:40 -08:00
Robert Nishihara
e0a340ee7e Allow actors to pin at most 1000 dummy objects at a time. (#1241)
* Allow actors to pin at most 1000 dummy objects at a time.

* Fix linting.
2017-11-22 13:38:01 -08:00
Eric Liang
9233e496cc Raise exception when getting the task results of workers that died (#1224)
* wip

* with test

* add timeout

* also add test for f

* remove on cleanup

* update

* wip

* fix tests

* mark actor removed in redis

* clang-format

* fix bug when no-inprogress tasks

* try to set task status done

* Add comment.
2017-11-20 15:18:39 -08:00
Peter Schafhalter
e0360eb429 Remove UT libraries and clean up remaining UT datastructures (#1230)
* Remove UT string include from redis

* Remove UT string include from DB tests

* Modify TaskSpec_print to remove UT string

* Remove UT libraries
2017-11-19 15:01:33 -08:00
Peter Schafhalter
4cbc2b1978 Clean up UT datastructures in Python extension (#1227) 2017-11-17 01:07:12 -08:00
Peter Schafhalter
9a7b15447b Replace UT string in redis tests (#1211)
* Replace UT arg formatting with vsnprintf

* Fix bug with va_list usage
2017-11-15 22:21:56 -08:00
Peter Schafhalter
428858c1ff Convert UT string to std::string (#1210) 2017-11-12 21:00:36 -08:00
Peter Schafhalter
9a6a056609 Convert UT datastructures in tests (#1203)
* bind_ipc_sock_retry returns std::string

* snprintf -> std::snprintf

* Fix formatting

* Use stringstream instead of snprintf

* Fix typo
2017-11-11 16:55:05 -08:00
Philipp Moritz
e798a652bc Change TaskSpec to allow multiple object IDs per argument. (#1204)
* Implement object ID bags

* linting

* fix tests

* fix linting

* fix comments
2017-11-10 16:33:34 -08:00
Stephanie Wang
07f0532b9b Local scheduler filters out dead clients during reconstruction (#1182)
* Object table lookup returns vector of DBClientID instead of address strings

* Add node IP address to DBClient notification

* DB client cache stores entire DB client, convert addresses to std::string

* get cached db client returns the client

* Expose a call to initialize the redis cache

* Local scheduler filters out dead clients during reconstruction

* Remove node ip address from dbclient, use aux_address for plasma managers

* Get entire db client entry when not found in cache

* Fix common tests

* Fix address in tests

* Push error to driver if driver task did the put

* Address Robert's comments and cleanup

* Remove unused Redis command

* Fix db test
2017-11-10 11:29:24 -08:00
Robert Nishihara
d3c082d325 More checking in redis.cc. (#1057) 2017-11-08 23:25:19 -08:00
Robert Nishihara
1c6b30b5e2 Move all config constants into single file. (#1192)
* Initial pass at factoring out C++ configuration into a single file.

* Expose config through Python.

* Forward declarations.

* Fixes with Python extensions

* Remove old code.

* Consistent naming for constants.

* Fixes

* Fix linting.

* More linting.

* Whitespace

* rename config -> _config.

* Move config inside a class.

* update naming convention

* Fix linting.

* More linting

* More linting.

* Add in some more constants.

* Fix linting
2017-11-08 11:10:38 -08:00
Peter Schafhalter
a8032b9ca1 Convert connections from UT_array to std::vector (#1190) 2017-11-07 20:59:41 -08:00
Peter Schafhalter
7215f7d228 Remove UT String from logging (#1184)
* Removed unnecessary utarray include

* Removed ut_string from logging

* Fix formatting
2017-11-05 14:05:20 -08:00
Peter Schafhalter
ad4cbd4016 Updated outstanding_callbacks to unordered_map (#1108)
* Updated outstanding_callbacks to unordered_map

* Fix bug in destroy_outstanding_callbacks and comments
2017-10-20 10:06:22 -07:00
Stephanie Wang
af47737bd5 Prototype distributed actor handles (#1137)
* Add actor handle ID to the task spec

* Local scheduler dispatches actor tasks according to a task counter per handle

* Fix python test

* Allow passing actor handles into tasks. Not completely working yet. Also this is very messy.

* Fixes, should be roughly working now.

* Refactor actor handle wrapper

* Fix __init__ tests

* Terminate actor when the original handle goes out of scope

* TODO and a couple test cases

* Make tests for unsupported cases

* Fix Python mode tests

* Linting.

* Cache actor definitions that occur before ray.init() is called.

* Fix export actor class

* Deterministically compute actor handle ID

* Fix __getattribute__

* Fix string encoding for python3

* doc

* Add comment and assertion.
2017-10-19 23:49:59 -07:00
Robert Nishihara
f3e3c7ec71 Add is_actor_checkpoint_method to TaskSpec. (#1117)
* Add is_actor_checkpoint_method to TaskSpec.

* Fix linting.

* Fix rebase error.

* Fix errors from rebase.
2017-10-15 16:52:10 -07:00
Robert Nishihara
d6062ef8f6 Compile with -rdynamic for better debugging symbols. (#1123)
* Compile with -rdynamic.

* Only use -rdynamic on Linux.

* Add comment.
2017-10-13 21:39:11 -07:00
Stephanie Wang
15486a14a0 Refactor actor task queues (#1118)
* Refactor add_task_to_actor_queue into queue_actor_task and insert_actor_task_queue

* Refactor actor task queue to share the waiting task queue

* Fix
2017-10-13 20:52:11 -07:00
Robert Nishihara
486cb64e3f Compile with -Werror and -Wall (#1116)
* Compile global scheduler with -Werror -Wall.

* Compile plasma manager with -Werror -Wall.

* Compile local scheduler with -Werror -Wall.

* Compile common code with -Werror -Wall.

* Signed/unsigned comparisons.

* More signed/unsigned fixes.

* More signed/unsigned fixes and added extern keyword.

* Fix linting.

* Don't check strict-aliasing because Python.h doesn't pass.
2017-10-12 21:00:23 -07:00