Commit graph

1323 commits

Author SHA1 Message Date
Stephanie Wang
a292d7ba32 [xray] Fix UniqueID hashing for object and task IDs. (#2017)
* Skip object prefix in UniqueIDHasher, choose shard based on hash

* lint
2018-05-10 21:56:12 -07:00
alonamid
32fa862408 add pthread linking (#1986) 2018-05-02 21:50:29 -07:00
eric-jj
34bc6ce6ea remove UniqueIDHasher (#1957)
* remove UniqueIDHasher

* Format the change

* remove unused line

* Fix format

* fix lint error

* fix linting whitespace
2018-04-30 06:31:23 -07:00
Philipp Moritz
af88fdefcf Incorporate C++ Buffer management and Seal global threadpool fix from arrow (#1950) 2018-04-25 22:53:44 -07:00
Philipp Moritz
dad465a2bf [XRay] Add consistency check for protocol between node_manager and local_scheduler_client (#1944) 2018-04-23 23:51:25 -07:00
Melih Elibol
8264e64b18 Handle interrupts correctly for ASIO synchronous reads and writes. (#1929)
* handle interrupts correctly.

* linting

* handle interrupts on read_some/write_some.
2018-04-20 22:55:40 -07:00
Robert Nishihara
cffda73da1 Allow task_table_update to fail when tasks are finished. (#1927)
* Allow task_table_update to fail when tasks are finished.

* Add comment.
2018-04-20 11:34:29 -07:00
Stephanie Wang
aa07f1ce4e [xray] Workers blocked in a ray.get release their resources (#1920)
* [xray] Throttle task dispatch by required resources
* Pass in number of initial workers into raylet command
* Workers blocked in a ray.get release resources
2018-04-18 20:59:58 -07:00
Alexey Tumanov
1c965fcfeb Raylet task dispatch and throttling worker startup (#1912)
* separate task placement and task dispatch; throttle task dispatch with locally available resources

* keep track of workers being started/in flight and suppress starting extraneous workers

* cleanup comments

* remove early termination in task dispatch to support zero-resource actor tasks

* info -> debug

* add documentation

* linting

* mock the worker pool for testing

* some linting

* kill all workers in flight; clear the worker pool in dtor

* remove fixed todo

* lint
2018-04-18 10:58:11 -07:00
Eric Liang
7ab890f4a1 [tune] [rllib] Automatically determine RLlib resources and add queueing mechanism for autoscaling (#1848) 2018-04-16 16:58:15 -07:00
Stephanie Wang
2e25972d4d Preemptively push local arguments for actor tasks (#1901) 2018-04-16 16:26:59 -07:00
Melih Elibol
ddfc875149 Multithreading refactor for ObjectManager. (#1911)
* removes transfer service. adds separate pool for sends and receives.

* get rid of send/receive transfer counts.

* update comment.

* remove clang formatting.

* clang formatting.
2018-04-16 15:51:53 -07:00
Melih Elibol
cff37765b1 Addresses missed comments from multichunk object transfer PR. (#1908)
* Move object manager parameters to ray config,
object manager config bug fix.
addresses other comments from #1827.

* linting and uint?

* typos

* remove uint.
2018-04-15 21:35:51 -07:00
Robert Nishihara
6ca2c2a609 Allow numpy arrays to be passed by value into tasks (and inlined in the task spec). (#1816)
* Allow numpy arrays and larger objects to be passed by value in task specifications.

* Fix bug.

* Fix bug. Inline all bug numpy object arrays.

* Increase size limit for inlining args in task spec.

* Give numpy init different signatures in Python 2 and Python 3.

* Simplify code.

* Fix test.

* Use import_array1 instead of import_array.
2018-04-15 20:36:01 -07:00
Stephanie Wang
6bd944ae0d [xray] Lineage cache requests notifications from the GCS about remote tasks (#1834)
* Add PubsubInterface to GCS tables

* Add task table PubsubInterface to lineage cache and tests

* Request notifications for remote tasks in the lineage cache

* Add RegisterGCS method to node manager

* Fix NodeManager member initialization order, subscribe to task table notifications

* Comments

* Use returned statuses.

* Fix double commit bug in lineage cache

* lint

* More linting.

* Fix pure virtual method declarations
2018-04-15 20:16:55 -07:00
Robert Nishihara
3383553dc0 Remove unnecessary calls to .hex() for object IDs. (#1910) 2018-04-15 13:52:51 -07:00
Stephanie Wang
4b655b0ff6 [xray] Turn on flushing to the GCS for the lineage cache (#1907) 2018-04-14 23:40:56 -07:00
Melih Elibol
fcd30444a8 Single Big Object Parallel Transfer. (#1827)
* cache all object info from object added store notification.

* Adds parallel transfer for big objects.

* documentation and clean up.

* compare objects...

* merge buffer_state with chunk vec. Make separate buffer state for get and create.

* use references for Get. Allow partial failure of Create.

* single plasma client.

* changes based on review.

* update documentation and add parameters for object manager in main.cc.

* review feedback.

* use vector constructor.

* linting

* remove profile visualizations.

* test fixes.

* linting.

* kill specific pids and use less memory.

* linting.

* simplify tests.

* Asynchronous IO for ObjectManager messages and object transfer.

* Revert "Asynchronous IO for ObjectManager messages and object transfer."

This reverts commit 4af43b159babc04daf80d1543e27c2cb46b7b19d.

* update test configuration to reflect changes in #1891

* review feedback.

* linting.
2018-04-14 17:08:19 -07:00
Melih Elibol
6a84b1f26e Remove num_threads as a parameter. (#1891)
* remove num_threads as a parameter.

* linting.

* add additional checks.

* Invoke TransferCompleted on failures.

* Fix issue with failed Gets on store.

* ray check status of writing object headers.

* fix mac issues.
2018-04-14 15:22:59 -07:00
Melih Elibol
6be73350c6 Adds Valgrind tests for multi-threaded object manager. (#1890)
* adds valgrind to new object manager.

* Add some comments.

* Update run_object_manager_valgrind.sh

typo

* Update run_object_manager_tests.sh

* update tests to reflect changes in #1891.

* reduce # tests.
2018-04-13 21:56:12 -07:00
Robert Nishihara
d0fffec2d0 Update arrow and parquet-cpp. (#1875)
* Update arrow.

* Fix bug.

* Cherry-pick commit for fixing parquet segfault.

* Update arrow and revert auto-releasing buffer commit.

* Remove parquet cherry-pick.
2018-04-12 16:17:12 -07:00
Alexey Tumanov
39cf6ff6e1 raylet command line resource configuration plumbing (#1882)
* raylet command line resource configuration plumbing

* Small changes.
2018-04-12 02:37:15 -07:00
Philipp Moritz
834e594709 [XRay] Register object store and raylet with the GCS (#1860) 2018-04-09 18:56:33 -07:00
Robert Nishihara
256389dc59 Use new task spec for computing IDs in raylet code path. (#1830)
* Use new task spec for computing IDs in raylet code path.

* Fix linting.

* Fixes

* Fix test.
2018-04-08 13:31:55 -07:00
Robert Nishihara
0b7ad668ff Fix unused lambda capture compilation error. (#1844)
* Fix unused lambda capture compilation error.

* Fix linting.
2018-04-07 14:54:21 -07:00
Stephanie Wang
bef1d872b4 [xray] Cleanup Raylet processes on exit (#1839)
* Add raylet monitor script to timeout Raylet heartbeats

* Unit test for removing a different client from the client table

* Set node manager heartbeat according to global config

* Doc and fixes

* Add regression test for client table disconnect, refactor client table

* Convert 'Terminate' methods to destructors

* Destroy the Raylet on a SIGTERM

* Clean up workers on a SIGTERM
2018-04-06 17:21:51 -07:00
Melih Elibol
3bf80839cb Remove all runtime errors. (#1840) 2018-04-06 17:20:52 -07:00
Melih Elibol
c7e11e9057 lint fix. (#1842) 2018-04-06 13:28:52 -07:00
Melih Elibol
24a8cede88 Cache object info from store notification. (#1815)
Cache all object info from object added store notification & submit to GCS via object directory.
2018-04-06 02:33:23 -07:00
Stephanie Wang
bf194db4bc [xray] Basic actor support (#1835) 2018-04-06 00:17:14 -07:00
Melih Elibol
313b864e66 disconnect bug fix. (#1837) 2018-04-05 22:10:51 -07:00
Stephanie Wang
cbf3181fd2 [xray] Monitor for Raylet processes (#1831)
* Add raylet monitor script to timeout Raylet heartbeats

* Unit test for removing a different client from the client table

* Set node manager heartbeat according to global config

* Doc and fixes

* Add regression test for client table disconnect, refactor client table

* Fix linting.
2018-04-05 20:45:38 -07:00
Alexey Tumanov
5a9e83761d fix unused-lambda-capture on clang version 9.1 (#1823)
* fix unused-lambda-capture on clang9.1

* unused lambda capture fix continued

* lambda capture: NM

* lambda capture

* Fix linting.
2018-04-04 11:04:10 -07:00
Robert Nishihara
e0193a5501 Print backtrace for RAY_LOG(FATAL) and also add file and line number … (#1805)
* Print backtrace for RAY_LOG(FATAL) and also add file and line number in common case.

* Fix linting.
2018-04-03 10:12:46 -07:00
Robert Nishihara
fbfbb1c079 [xray] Integrate worker.py with raylet. (#1810)
* Integrate worker with raylet.

* Begin allowing worker to attach to cluster.

* Fix linting and documentation.

* Fix linting.

* Comment tests back in.

* Fix type of worker command.

* Remove xray python files and tests.

* Fix from rebase.

* Add test.

* Copy over raylet executable.

* Small cleanup.
2018-04-03 02:38:56 -07:00
Robert Nishihara
27a0d58e54 Include resource string in error message for infeasible actors. (#1768) 2018-04-02 00:31:30 -07:00
Philipp Moritz
71829a2af9 [XRay] Pass in node IP address to Raylet (#1808) 2018-04-02 00:21:19 -07:00
Philipp Moritz
0bda11e009 [XRay] Fix linting (#1809) 2018-04-01 23:11:06 -07:00
Melih Elibol
6e06a9e338 XRay Task Forwarding Milestone (#1785)
Summary:
Able to run 1000 tasks with object dependencies on a set of distributed Raylets.

Raylet Changes:

* Finalized ClientConnection class.
* Task forwarding.
* NM-to-NM heartbeats.
* NM resource accounting for tasks.
* Simple scheduling policy with task forwarding.
* Creating and maintaining NM 2 NM long-lived connections and reusing them for task forwarding.

LineageCache Changes:

* LineageCache without cleanup of tasks committed by remote nodes.
* Lineage cache writeback and cleanup implementation.

ObjectManager Changes:

* Object manager event loop/ClientConnection refactor.
* Multithreaded object manager (disabled in this PR).

Testing Changes:

* Integration tests for task submission on multiple Raylets.
* Stress tests for object manager (with GCS and object store integration).


Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Alexey Tumanov <atumanov@gmail.com>
2018-03-31 18:02:58 -07:00
Stephanie Wang
925e392b2d Add an Append call to the GCS Log that checks for current length (#1788)
* TABLE_APPEND call

* Convert callbacks back to taking in a string...

* GCS returns flatbuffers, define Log class

* Cleanups

* Modify client table to use the Log interface

* Fix bug where we replied twice from redis

* Fixes

* lint

* Compile and test raylet TaskTable

* Modify GCS tables to handle unique_ptrs from nested flatbuffers

* Add raylet::TaskTable unit tests to replace ObjectTable ones

* Convert ObjectTable to a log

* Convert ObjectTable tests to the Log

* AppendAt Redis and gcs Log command

* unit test for AppendAt

* Add a Log for task reconstruction data

* Add check for unique entries in TABLE_APPEND

* Documentation
2018-03-27 13:04:43 -07:00
Stephanie Wang
51fdbe3867 Convert the ObjectTable implementation to a Log (#1779)
* TABLE_APPEND call

* Convert callbacks back to taking in a string...

* GCS returns flatbuffers, define Log class

* Cleanups

* Modify client table to use the Log interface

* Fix bug where we replied twice from redis

* Fixes

* lint

* Compile and test raylet TaskTable

* Modify GCS tables to handle unique_ptrs from nested flatbuffers

* Add raylet::TaskTable unit tests to replace ObjectTable ones

* Convert ObjectTable to a log

* Convert ObjectTable tests to the Log
2018-03-26 20:36:48 -07:00
Stephanie Wang
0fd4112354 Introduce a log interface for the new GCS (#1771)
* TABLE_APPEND call

* Convert callbacks back to taking in a string...

* GCS returns flatbuffers, define Log class

* Cleanups

* Modify client table to use the Log interface

* Fix bug where we replied twice from redis

* Fixes

* lint
2018-03-26 16:00:43 -07:00
Stephanie Wang
0ad1054b8b Add a GCS table for the xray task flatbuffer (#1775)
* Introduce Task flatbuffer into xray, add to GCS

* Compile and test raylet TaskTable
2018-03-23 13:18:23 -07:00
Stephanie Wang
8704c8618c Request and cancel notifications in the new GCS API (#1758)
* Add TableRequestNotifications and TableCancelNotifications to Redis modules

* Add RequestNotifications and CancelNotifications to generic GCS Table

* Add tests for subscribing to specific keys

* Remove TODO!

* Return the current value at the key directly from RequestNotifications instead of through publish

* Add unit test for Lookup failure callback

* Modify tests to account for empty subscription response

* Remove ObjectTable notification methods

* Clean up message parsing and doc in redis context

* Use vectors of DataT in all GCS callbacks

* Clean up SubscriptionCallback

* Move Table definitions into tables.cc

* Refactor and document redis modules

* doc

* Fix new GCS build

* Cleanups

* Revert "Fix new GCS build"

This reverts commit 6e3e69090c67ef60aaf22a9cf62be0290d989e96.

* Use vectors for internal callback interface, user-facing interface takes a reference to a single item

* Fix new GCS build

* Add unit test for Lookup failure callback

* Fix compiler errors

* Cleanup

* Publish the entry ID with the notification

* Check that the ID for a notification matches in client tests
2018-03-22 10:31:07 -07:00
Robert Nishihara
0c835a379f Fix resource bookkeeping for blocked actor methods. (#1766) 2018-03-21 20:48:04 -07:00
Stephanie Wang
5c7ef34b05 Define string prefixes for all tables in the new GCS API (#1755)
* Define string prefixes for all tables in the new GCS API

* Extra check for TablePrefix enum

* Remove unused field and add doc for existing fields
2018-03-20 20:27:11 -07:00
Robert Nishihara
4bccabd910 Redirect output of all processes by default. (#1752)
* Redirect output of all processes by default.

* Add separate flag for redirecting worker output.

* Fix tests.
2018-03-20 18:14:54 -07:00
Robert Nishihara
794f547d0a Always send actor creation tasks to the global scheduler. (#1757) 2018-03-20 14:55:20 -07:00
Robert Nishihara
4658d0a180 Print error when actor takes too long to start, and refactor error me… (#1747)
* Print error when actor takes too long to start, and refactor error message pushing.

* Print warning every ten seconds.

* Fix linting and tests.

* Fix tests.
2018-03-19 20:24:35 -07:00
Robert Nishihara
96913be939 Treat actor creation like a regular task. (#1668)
* Treat actor creation like a regular task.

* Small cleanups.

* Change semantics of actor resource handling.

* Bug fix.

* Minor linting

* Bug fix

* Fix jenkins test.

* Fix actor tests

* Some cleanups

* Bug fix

* Fix bug.

* Remove cached actor tasks when a driver is removed.

* Add more info to taskspec in global state API.

* Fix cyclic import bug in tune.

* Fix

* Fix linting.

* Fix linting.

* Don't schedule any tasks (especially actor creation tasks) on local schedulers with 0 CPUs.

* Bug fix.

* Add test for 0 CPU case

* Fix linting

* Address comments.

* Fix typos and add comment.

* Add assertion and fix test.
2018-03-16 11:18:07 -07:00