Commit graph

209 commits

Author SHA1 Message Date
Hao Chen
c1575e98c1 Make local scheduler client thread-safe (#2386)
* Make local scheduler client thread-safe for python

* lock write_messages

* remove allow-threads

* fix linter

* rename _write_message to do_write_message
2018-07-13 16:19:00 -07:00
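Note on the "lock write_messages" step above: a minimal sketch of the general technique, not the code from this commit. It assumes a hypothetical `LockedWriter` wrapper and a list-like sink standing in for the socket; the point is only that header and body are written under one lock so messages from different threads cannot interleave.

```python
import threading


class LockedWriter:
    """Hypothetical wrapper that serializes whole-message writes to a shared sink."""

    def __init__(self, sink):
        self._sink = sink              # assumption: any object with append()
        self._lock = threading.Lock()

    def write_message(self, payload):
        # Hold the lock across header and body so the pair is written atomically
        # with respect to other threads calling write_message.
        with self._lock:
            self._sink.append(("header", len(payload)))
            self._sink.append(("body", payload))


def run_demo(num_threads=8, per_thread=200):
    """Write from several threads and return the resulting sink."""
    sink = []
    writer = LockedWriter(sink)

    def worker(tag):
        for i in range(per_thread):
            writer.write_message("%d-%d" % (tag, i))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink
```

Calling `run_demo()` returns a sink in which every header entry is immediately followed by its own body entry, regardless of thread scheduling.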
Stephanie Wang
c50f1966e0 Publish a notification for empty keys in the GCS (#2347)
* Publish an empty notification for empty keys

* Add failure callback to Table::Subscribe, add unit test for new behavior
2018-07-05 13:39:07 -07:00
Robert Nishihara
b90e551b41 [xray] Implement timeline and profiling API. (#2306)
* Add profile table and store profiling information there.

* Code for dumping timeline.

* Improve color scheme.

* Push timeline events on driver only for raylet.

* Improvements to profiling and timeline visualization

* Some linting

* Small fix.

* Linting

* Propagate node IP address through profiling events.

* Fix test.

* object_id.hex() should return byte string in python 2.

* Include gcs.fbs in node_manager.fbs.

* Remove flatbuffer definition duplication.

* Decode to unicode in Python 3 and bytes in Python 2.

* Minor

* Submit profile events in a batch. Revert some CMake changes.

* Fix

* Workaround test failure.

* Fix linting

* Linting

* Don't return anything from chrome_tracing_dump when filename is provided.

* Remove some redundancy from profile table.

* Linting

* Move TODOs out of docstring.

* Minor
2018-07-04 23:23:48 -07:00
Zongheng Yang
ba28dddf6f Make xray object table credis-managed and hence flushable. (#2338)
* monitor.py: issue flushes to data shard

* ResultTableAdd & ObjectTableAdd: add credis-managed versions

* Fix return codes

* Credis-manage xray object table & associated ray.table_append cmd

* Fix incorrect return code from TableAppend_DoWrite()

* Revert "ResultTableAdd & ObjectTableAdd: add credis-managed versions"

This reverts commit 628c2ea190df4c861dda0c284fab7ca6faa1ea24.

* Address comments

* Lint: fix indent

* Address comment
2018-07-03 17:32:44 -07:00
Philipp Moritz
f21d783e6d Remove new gcs code from legacy Ray codepath (#2329) 2018-07-03 11:48:50 -07:00
Robert Nishihara
ff2217251f [xray] Add error table and push error messages to driver through node manager. (#2256)
* Fix documentation indentation.

* Add error table to GCS and push error messages through node manager.

* Add type to error data.

* Linting

* Fix failure_test bug.

* Linting.

* Enable one more test.

* Attempt to fix doc building.

* Restructuring

* Fixes

* More fixes.

* Move current_time_ms function into util.h.
2018-06-20 21:29:28 -07:00
Zongheng Yang
8190ff1fd0 Experimental: enable automatic GCS flushing with configurable policy. (#2266)
* build_credis.sh: use an up-to-date credis commit.

* build_credis.sh: leveldb is updated, so update build cmds for it

* WIP: make monitor.py issue flush; switch gcs client to use credis

* Experimental: enable automatic GCS flushing with configurable policy.

* Fix linux compilation error

* Fix leveldb build

* Use optimized build for credis

* Address comments

* Attempt to fix tests
2018-06-20 14:40:57 -07:00
Melih Elibol
60bc3a014f [xray] Sets good object manager defaults. (#2255)
* better object manager defaults. added max for number of chunks.

* change source of cores.
2018-06-20 14:10:57 -07:00
Hao Chen
8efd0f7b1b [xray] support multi-workers per process (#2244)
* support multi-workers per process

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* use RayConfig

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* fix

Signed-off-by: Hao Chen <chenh1024@gmail.com>

* fix

* remove clear

* address comments

* fix lint

* fix bug

* make WorkerPool and WorkerPoolMock more consistent
2018-06-13 10:14:05 -07:00
Yuhong Guo
0a34bea0b0 Use scoped enums in C++ and flatbuffers. (#2194)
* Enable --scoped-enums in flatbuffer compiler.

* Change enum to c++11 style (enum class).

* Resolve conflicts.

* Solve building failure when RAY_USE_NEW_GCS=on and remove ERROR_INDEX suffix.

* Merge with master and fix CI failure.
2018-06-07 01:01:21 -07:00
songqing
4dd4698564 unify build dir for Python and Java (#2171)
* unify build dir for Python and Java

* enable executables auto installed when just running 'make'

* fix plasma_store copy error

* fix cmake error about copying executables

* lint fix

* recover python/setup.py

* enable to copy optional file automatically

* a small fix of path

* lint fix

* lint fix

* lint fix

* Add comment.
2018-06-01 16:28:27 -07:00
Yuhong Guo
c1de03acac Add timeout mechanism to Push function instead of retries (#2148)
Use timer instead of retries in Push when objects are not local.
2018-06-01 01:21:05 -07:00
Stephanie Wang
117107cb15 [xray] Evict tasks from the lineage cache (#2152) 2018-05-31 00:24:39 -07:00
Robert Nishihara
6172f94c04 Implement Python global state API for xray. (#2125)
* Implement global state API for xray.

* Fix object table.

* Fixes for log structure.

* Implement cluster_resources.

* Add driver task to task table.

* Remove python flatbuffers code

* Get some global state API tests running.

* Python linting.

* Fix linting.

* Fix mock modules for doc

* Copy over flatbuffer bindings.

* Fix for tests.

* Linting

* Fix monitor crash.
2018-05-29 16:25:54 -07:00
Yuhong Guo
a8517cc82a Fix infinite retry in Push function. (#2133) 2018-05-25 01:16:44 -07:00
Yujie Liu
5c2b2c7b49 [JavaWorker] Changes to the directory under src for support java worker (#2093)
* Changes to the directory under src for support java worker
--------------------------
This commit includes changes to the directory under src, which is part of the java worker support of Ray.
It consists of the following changes:
 src/common/task.cc - just fix null pointer problem
 org_ray_spi_impl_DefaultLocalSchedulerClient.* - JNI support for local scheduler client, and the org_ray_spi_impl_DefaultLocalSchedulerClient.cc file is not autogenerated
2018-05-25 00:59:05 -07:00
Zongheng Yang
fa97acbc89 Integrate credis with Ray & route task table entries into credis. (#1841) 2018-05-24 23:35:25 -07:00
yuyiming
9ff3d57429 do not fetch from dead Plasma Manager (#2116) 2018-05-23 16:13:09 -07:00
Robert Nishihara
9b9ff19dd0 Use automatic memory management in Redis modules. (#1797) 2018-05-22 01:05:09 -07:00
eric-jj
eb078766d8 Performance fix (#2110) 2018-05-20 18:07:55 -07:00
Yujie Liu
5918776dd4 [JavaWorker] Changes to the build system for support java worker (#2092)
* Changes to the build system for support java worker
--------------------------
This commit includes changes to the build system, which is part of the java worker support of Ray.
It consists of the following changes:
 - the changes of CMakeLists.txt files
 - the changes of the python setup.py and init files for the adaptation of the changed build system
 - move the location of local_scheduler_extension.cc for the adaptation of the changed build system which may better support multi-language workers

* minor whitespace

* Linting
2018-05-18 19:09:23 -07:00
eric-jj
34bc6ce6ea remove UniqueIDHasher (#1957)
* remove UniqueIDHasher

* Format the change

* remove unused line

* Fix format

* fix lint error

* fix linting whitespace
2018-04-30 06:31:23 -07:00
Philipp Moritz
af88fdefcf Incorporate C++ Buffer management and Seal global threadpool fix from arrow (#1950) 2018-04-25 22:53:44 -07:00
Robert Nishihara
cffda73da1 Allow task_table_update to fail when tasks are finished. (#1927)
* Allow task_table_update to fail when tasks are finished.

* Add comment.
2018-04-20 11:34:29 -07:00
Melih Elibol
cff37765b1 Addresses missed comments from multichunk object transfer PR. (#1908)
* Move object manager parameters to ray config,
object manager config bug fix.
addresses other comments from #1827.

* linting and uint?

* typos

* remove uint.
2018-04-15 21:35:51 -07:00
Robert Nishihara
6ca2c2a609 Allow numpy arrays to be passed by value into tasks (and inlined in the task spec). (#1816)
* Allow numpy arrays and larger objects to be passed by value in task specifications.

* Fix bug.

* Fix bug. Inline all bug numpy object arrays.

* Increase size limit for inlining args in task spec.

* Give numpy init different signatures in Python 2 and Python 3.

* Simplify code.

* Fix test.

* Use import_array1 instead of import_array.
2018-04-15 20:36:01 -07:00
Robert Nishihara
256389dc59 Use new task spec for computing IDs in raylet code path. (#1830)
* Use new task spec for computing IDs in raylet code path.

* Fix linting.

* Fixes

* Fix test.
2018-04-08 13:31:55 -07:00
Stephanie Wang
bf194db4bc [xray] Basic actor support (#1835) 2018-04-06 00:17:14 -07:00
Melih Elibol
6e06a9e338 XRay Task Forwarding Milestone (#1785)
Summary:
Able to run 1000 tasks with object dependencies on a set of distributed Raylets.

Raylet Changes:

Finalized ClientConnection class.
Task forwarding.
NM-to-NM heartbeats.
NM resource accounting for tasks.
Simple scheduling policy with task forwarding.
Creating and maintaining NM 2 NM long-lived connections and reusing them for task forwarding.
LineageCache Changes:

LineageCache without cleanup of tasks committed by remote nodes.
Lineage cache writeback and cleanup implementation.
ObjectManager Changes:

Object manager event loop/ClientConnection refactor.
Multithreaded object manager (disabled in this PR).
Testing Changes:

Integration tests for task submission on multiple Raylets.
Stress tests for object manager (with GCS and object store integration).


Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Alexey Tumanov <atumanov@gmail.com>
2018-03-31 18:02:58 -07:00
Stephanie Wang
925e392b2d Add an Append call to the GCS Log that checks for current length (#1788)
* TABLE_APPEND call

* Convert callbacks back to taking in a string...

* GCS returns flatbuffers, define Log class

* Cleanups

* Modify client table to use the Log interface

* Fix bug where we replied twice from redis

* Fixes

* lint

* Compile and test raylet TaskTable

* Modify GCS tables to handle unique_ptrs from nested flatbuffers

* Add raylet::TaskTable unit tests to replace ObjectTable ones

* Convert ObjectTable to a log

* Convert ObjectTable tests to the Log

* AppendAt Redis and gcs Log command

* unit test for AppendAt

* Add a Log for task reconstruction data

* Add check for unique entries in TABLE_APPEND

* Documentation
2018-03-27 13:04:43 -07:00
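Note on the "Append call that checks for current length" described above: a minimal in-memory sketch of the idea, assuming a hypothetical `Log` class with `append`, `append_at`, and `lookup` methods. It is not the Redis module or the GCS client API from this commit; it only illustrates an append that succeeds when the log for a key currently has the length the caller expects.

```python
class Log:
    """Hypothetical in-memory append-only log, keyed by ID."""

    def __init__(self):
        self._entries = {}

    def append(self, key, value):
        # Unconditional append; returns the index the value was stored at.
        entries = self._entries.setdefault(key, [])
        entries.append(value)
        return len(entries) - 1

    def append_at(self, key, value, expected_index):
        # Conditional append: succeed only if the log for this key currently
        # holds exactly expected_index entries. Otherwise another writer got
        # there first (or the caller's view is stale) and nothing is written.
        entries = self._entries.setdefault(key, [])
        if len(entries) != expected_index:
            return False
        entries.append(value)
        return True

    def lookup(self, key):
        # Return a copy so callers cannot mutate the log in place.
        return list(self._entries.get(key, []))
```

For example, two writers that both try `append_at(key, value, 0)` on an empty log cannot both succeed; the second sees a length of 1 and is rejected.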
Stephanie Wang
0fd4112354 Introduce a log interface for the new GCS (#1771)
* TABLE_APPEND call

* Convert callbacks back to taking in a string...

* GCS returns flatbuffers, define Log class

* Cleanups

* Modify client table to use the Log interface

* Fix bug where we replied twice from redis

* Fixes

* lint
2018-03-26 16:00:43 -07:00
Stephanie Wang
0ad1054b8b Add a GCS table for the xray task flatbuffer (#1775)
* Introduce Task flatbuffer into xray, add to GCS

* Compile and test raylet TaskTable
2018-03-23 13:18:23 -07:00
Stephanie Wang
8704c8618c Request and cancel notifications in the new GCS API (#1758)
* Add TableRequestNotifications and TableCancelNotifications to Redis modules

* Add RequestNotifications and CancelNotifications to generic GCS Table

* Add tests for subscribing to specific keys

* Remove TODO!

* Return the current value at the key directly from RequestNotifications instead of through publish

* Add unit test for Lookup failure callback

* Modify tests to account for empty subscription response

* Remove ObjectTable notification methods

* Clean up message parsing and doc in redis context

* Use vectors of DataT in all GCS callbacks

* Clean up SubscriptionCallback

* Move Table definitions into tables.cc

* Refactor and document redis modules

* doc

* Fix new GCS build

* Cleanups

* Revert "Fix new GCS build"

This reverts commit 6e3e69090c67ef60aaf22a9cf62be0290d989e96.

* Use vectors for internal callback interface, user-facing interface takes a reference to a single item

* Fix new GCS build

* Add unit test for Lookup failure callback

* Fix compiler errors

* Cleanup

* Publish the entry ID with the notification

* Check that the ID for a notification matches in client tests
2018-03-22 10:31:07 -07:00
Stephanie Wang
5c7ef34b05 Define string prefixes for all tables in the new GCS API (#1755)
* Define string prefixes for all tables in the new GCS API

* Extra check for TablePrefix enum

* Remove unused field and add doc for existing fields
2018-03-20 20:27:11 -07:00
Robert Nishihara
4658d0a180 Print error when actor takes too long to start, and refactor error me… (#1747)
* Print error when actor takes too long to start, and refactor error message pushing.

* Print warning every ten seconds.

* Fix linting and tests.

* Fix tests.
2018-03-19 20:24:35 -07:00
Robert Nishihara
96913be939 Treat actor creation like a regular task. (#1668)
* Treat actor creation like a regular task.

* Small cleanups.

* Change semantics of actor resource handling.

* Bug fix.

* Minor linting

* Bug fix

* Fix jenkins test.

* Fix actor tests

* Some cleanups

* Bug fix

* Fix bug.

* Remove cached actor tasks when a driver is removed.

* Add more info to taskspec in global state API.

* Fix cyclic import bug in tune.

* Fix

* Fix linting.

* Fix linting.

* Don't schedule any tasks (especially actor creation tasks) on local schedulers with 0 CPUs.

* Bug fix.

* Add test for 0 CPU case

* Fix linting

* Address comments.

* Fix typos and add comment.

* Add assertion and fix test.
2018-03-16 11:18:07 -07:00
Melih Elibol
3c080f4baa Add a callback for gcs table lookup failures. (#1702)
* Add callback to gcs client for table lookup failures.

* update plasma_manager reflecting changes to gcs callback.
2018-03-15 22:25:01 -07:00
Stephanie Wang
6114b6d20e Implement the client table for the new GCS (#1674)
* Add subscription callback to CallbackData

* Implement ClientTable

* Hook up ClientTable to AsyncGCSClient

* Add client_info to GCSClient Connect interface

* client table callbacks

* Unit test for client table

* Doc

* Fix idempotency check

* Fix mac build

* Fix memory issues in gcs client test

* Fix disconnection bug

* lint
2018-03-11 19:17:18 -07:00
Stephanie Wang
0a6edb55a8 Implement the Subscribe call for the new GCS API (#1652)
* Implement the Subscribe call for the new GCS API

* Document tests

* Upper case function name

* Fix build errors

* lint
2018-03-06 09:56:12 -08:00
Zhenyu Guo
f1e5789c26 restructure how to organize 3rd party libs (#1630)
* restructure how to organize 3rd party libs

* Minor whitespace changes.

* Fix compilation on Linux.

* Pass around Python executable so that the correct version of Python is used.
2018-03-01 14:29:56 -08:00
Robert Nishihara
0fcceef772 Update logging and check macros. (#1627)
* Update logging and check macros.

* Fix linting.

* Fix RAY_DCHECK and unused variable.

* Fix linting
2018-02-28 15:13:00 -08:00
Robert Nishihara
ba1ce85f58 Download Redis and flatbuffers differently. (#1602)
* Download Redis differently.

* Get flatbuffers with curl
2018-02-25 20:32:33 -08:00
Alexey Tumanov
844a6afcdd Implement simple random spillback policy. (#1493)
* spillback policy implementation: global + local scheduler

* modernize global scheduler policy state; factor out random number engine and generator

* Minimal version.

* Fix test.

* Make load balancing test less strenuous.
2018-02-13 00:09:35 -08:00
Philipp Moritz
1ab2e63dbd Tune transfer buffer size (#1363)
Increase buffsize from `4096` to `80*1024`.
2018-02-09 14:56:36 -08:00
Philipp Moritz
a3f8fa426b Start integrating new GCS APIs (#1379)
* Start integrating new GCS calls

* fixes

* tests

* cleanup

* cleanup and valgrind fix

* update tests

* fix valgrind

* fix more valgrind

* fixes

* add separate tests for GCS

* fix linting

* update tests

* cleanup

* fix python linting

* more fixes

* fix linting

* add plasma manager callback

* add some documentation

* fix linting

* fix linting

* fixes

* update

* fix linting

* fix

* add spillback count

* fixes

* linting

* fixes

* fix linting

* fix

* fix

* fix
2018-01-31 11:01:12 -08:00
Robert Nishihara
3195c6aa63 Fix local scheduler crash when driver creates actor and exits. (#1474)
* Make check failures in redis.cc more informative.

* Fix bug by calling task_table_add_task.

* Add test.
2018-01-26 14:29:53 -08:00
Alexey Tumanov
f1303291b4 Ray scheduler spillback plumbing + mechanism (#1362)
* spillback mechanism and plumbing : adding spillback counter + timestamp

* linting fix

* documentation

* Fix argument name.
2018-01-23 20:18:12 -08:00
Melih Elibol
4b1c8be4fe Fix setting log-level to debug. (#1432) 2018-01-21 21:51:05 -08:00
Stephanie Wang
74718efa73 Nondeterministic reconstruction for actors (#1344)
* Add failing unit test for nondeterministic reconstruction

* Retry scheduling actor tasks if reassigned to local scheduler

* Update execution edges asynchronously upon dispatch for nondeterministic reconstruction

* Fix bug for updating checkpoint task execution dependencies

* Update comments for deterministic reconstruction

* cleanup

* Add (and skip) failing test case for nondeterministic reconstruction

* Suppress test output
2018-01-21 13:44:13 -08:00
Robert Nishihara
088f01496c Remove unused object info table code. (#1388) 2018-01-05 11:00:06 -08:00