* [xray] Throttle task dispatch by required resources
* Pass in number of initial workers into raylet command
* Workers blocked in a ray.get release resources
* separate task placement and task dispatch; throttle task dispatch with locally available resources
* keep track of workers being started/in flight and suppress starting extraneous workers
* cleanup comments
* remove early termination in task dispatch to support zero-resource actor tasks
* info -> debug
* add documentation
* linting
* mock the worker pool for testing
* some linting
* kill all workers in flight; clear the worker pool in dtor
* remove fixed todo
* lint
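
The dispatch-throttling items above can be illustrated with a small sketch. This is not the raylet implementation: `ResourcePool` and `dispatch_ready_tasks` are hypothetical names, and the resource model (a flat dict of numeric capacities) is an assumption made for illustration. It shows dispatch gated on locally available resources, no early termination so that zero-resource tasks still run, and resources returned to the pool when a worker blocks.

```python
from collections import deque


class ResourcePool:
    """Hypothetical tracker of locally available resources (not a Ray API)."""

    def __init__(self, total):
        self.available = dict(total)

    def acquire(self, demand):
        # Succeed only if every requested resource currently fits.
        if any(self.available.get(k, 0) < v for k, v in demand.items()):
            return False
        for k, v in demand.items():
            self.available[k] -= v
        return True

    def release(self, demand):
        # Called when a task finishes or, by assumption, when its worker blocks.
        for k, v in demand.items():
            self.available[k] = self.available.get(k, 0) + v


def dispatch_ready_tasks(queue, pool):
    """Dispatch every queued task whose demand fits; keep the rest queued.

    The loop does not stop at the first task that fails to fit, so
    zero-resource tasks queued behind it are still dispatched.
    """
    dispatched = []
    remaining = deque()
    while queue:
        name, demand = queue.popleft()
        if pool.acquire(demand):
            dispatched.append(name)
        else:
            remaining.append((name, demand))
    queue.extend(remaining)
    return dispatched


pool = ResourcePool({"CPU": 2})
queue = deque([("a", {"CPU": 2}), ("b", {"CPU": 1}), ("actor_call", {})])
first = dispatch_ready_tasks(queue, pool)
pool.release({"CPU": 2})  # e.g. the worker running "a" blocks waiting on an object
second = dispatch_ready_tasks(queue, pool)
```

Here `first` contains `a` and the zero-resource `actor_call`, while `b` waits until the release and is dispatched in `second`.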
* removes transfer service. adds separate pool for sends and receives.
* get rid of send/receive transfer counts.
* update comment.
* remove clang formatting.
* clang formatting.
* Allow numpy arrays and larger objects to be passed by value in task specifications.
* Fix bug.
* Fix bug. Inline all but numpy object arrays.
* Increase size limit for inlining args in task spec.
* Give numpy init different signatures in Python 2 and Python 3.
* Simplify code.
* Fix test.
* Use import_array1 instead of import_array.
* Add PubsubInterface to GCS tables
* Add task table PubsubInterface to lineage cache and tests
* Request notifications for remote tasks in the lineage cache
* Add RegisterGCS method to node manager
* Fix NodeManager member initialization order, subscribe to task table notifications
* Comments
* Use returned statuses.
* Fix double commit bug in lineage cache
* lint
* More linting.
* Fix pure virtual method declarations
* cache all object info from object added store notification.
* Adds parallel transfer for big objects.
* documentation and clean up.
* compare objects...
* merge buffer_state with chunk vec. Make separate buffer state for get and create.
* use references for Get. Allow partial failure of Create.
* single plasma client.
* changes based on review.
* update documentation and add parameters for object manager in main.cc.
* review feedback.
* use vector constructor.
* linting
* remove profile visualizations.
* test fixes.
* linting.
* kill specific pids and use less memory.
* linting.
* simplify tests.
* Asynchronous IO for ObjectManager messages and object transfer.
* Revert "Asynchronous IO for ObjectManager messages and object transfer."
This reverts commit 4af43b159babc04daf80d1543e27c2cb46b7b19d.
* update test configuration to reflect changes in #1891
* review feedback.
* linting.
* remove num_threads as a parameter.
* linting.
* add additional checks.
* Invoke TransferCompleted on failures.
* Fix issue with failed Gets on store.
* ray check status of writing object headers.
* fix mac issues.
* Add raylet monitor script to timeout Raylet heartbeats
* Unit test for removing a different client from the client table
* Set node manager heartbeat according to global config
* Doc and fixes
* Add regression test for client table disconnect, refactor client table
* Convert 'Terminate' methods to destructors
* Destroy the Raylet on a SIGTERM
* Clean up workers on a SIGTERM
* Fix linting.
* Integrate worker with raylet.
* Begin allowing worker to attach to cluster.
* Fix linting and documentation.
* Fix linting.
* Comment tests back in.
* Fix type of worker command.
* Remove xray python files and tests.
* Fix from rebase.
* Add test.
* Copy over raylet executable.
* Small cleanup.
Summary:
Able to run 1000 tasks with object dependencies on a set of distributed Raylets.
Raylet Changes:
Finalized ClientConnection class.
Task forwarding.
NM-to-NM heartbeats.
NM resource accounting for tasks.
Simple scheduling policy with task forwarding.
Creating and maintaining long-lived NM-to-NM connections and reusing them for task forwarding.
LineageCache Changes:
LineageCache without cleanup of tasks committed by remote nodes.
Lineage cache writeback and cleanup implementation.
ObjectManager Changes:
Object manager event loop/ClientConnection refactor.
Multithreaded object manager (disabled in this PR).
Testing Changes:
Integration tests for task submission on multiple Raylets.
Stress tests for object manager (with GCS and object store integration).
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Alexey Tumanov <atumanov@gmail.com>
* TABLE_APPEND call
* Convert callbacks back to taking in a string...
* GCS returns flatbuffers, define Log class
* Cleanups
* Modify client table to use the Log interface
* Fix bug where we replied twice from redis
* Fixes
* lint
* Compile and test raylet TaskTable
* Modify GCS tables to handle unique_ptrs from nested flatbuffers
* Add raylet::TaskTable unit tests to replace ObjectTable ones
* Convert ObjectTable to a log
* Convert ObjectTable tests to the Log
* AppendAt Redis and gcs Log command
* unit test for AppendAt
* Add a Log for task reconstruction data
* Add check for unique entries in TABLE_APPEND
* Documentation
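
A rough sketch of the append-only log semantics named above, under stated assumptions: the `Log` class here is an in-memory stand-in, not the GCS or Redis module code, and its method names are hypothetical. It assumes that a plain append rejects an entry already present for that key, and that an append at an index succeeds only when the log currently holds exactly that many entries.

```python
class Log:
    """Hypothetical in-memory stand-in for an append-only log keyed by ID."""

    def __init__(self):
        self._entries = {}

    def append(self, key, entry):
        # Assumption: entries for one key must be unique.
        entries = self._entries.setdefault(key, [])
        if entry in entries:
            raise ValueError("duplicate entry for key %r" % (key,))
        entries.append(entry)
        return len(entries) - 1

    def append_at(self, key, entry, index):
        # Assumption: the append fails if the log is shorter or longer
        # than the requested index, which detects concurrent writers.
        if len(self._entries.get(key, [])) != index:
            return False
        self.append(key, entry)
        return True

    def lookup(self, key):
        # Return a copy so callers cannot mutate the log in place.
        return list(self._entries.get(key, []))


log = Log()
first_index = log.append("object_1", "node_a")
ok = log.append_at("object_1", "node_b", 1)     # log has 1 entry, so this fits
stale = log.append_at("object_1", "node_c", 1)  # log now has 2 entries
```

The second `append_at` is rejected because another entry was written at that position first, which is the property a caller would rely on when only one writer should win.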
* TABLE_APPEND call
* Convert callbacks back to taking in a string...
* GCS returns flatbuffers, define Log class
* Cleanups
* Modify client table to use the Log interface
* Fix bug where we replied twice from redis
* Fixes
* lint
* Compile and test raylet TaskTable
* Modify GCS tables to handle unique_ptrs from nested flatbuffers
* Add raylet::TaskTable unit tests to replace ObjectTable ones
* Convert ObjectTable to a log
* Convert ObjectTable tests to the Log
* TABLE_APPEND call
* Convert callbacks back to taking in a string...
* GCS returns flatbuffers, define Log class
* Cleanups
* Modify client table to use the Log interface
* Fix bug where we replied twice from redis
* Fixes
* lint
* Add TableRequestNotifications and TableCancelNotifications to Redis modules
* Add RequestNotifications and CancelNotifications to generic GCS Table
* Add tests for subscribing to specific keys
* Remove TODO!
* Return the current value at the key directly from RequestNotifications instead of through publish
* Add unit test for Lookup failure callback
* Modify tests to account for empty subscription response
* Remove ObjectTable notification methods
* Clean up message parsing and doc in redis context
* Use vectors of DataT in all GCS callbacks
* Clean up SubscriptionCallback
* Move Table definitions into tables.cc
* Refactor and document redis modules
* doc
* Fix new GCS build
* Cleanups
* Revert "Fix new GCS build"
This reverts commit 6e3e69090c67ef60aaf22a9cf62be0290d989e96.
* Use vectors for internal callback interface, user-facing interface takes a reference to a single item
* Fix new GCS build
* Add unit test for Lookup failure callback
* Fix compiler errors
* Cleanup
* Publish the entry ID with the notification
* Check that the ID for a notification matches in client tests
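
The per-key notification behavior described above might look roughly like this. It is a sketch under assumptions, not the Redis module or GCS client code: `Table`, its method names, and the `delivered` list are hypothetical. It shows a request returning the current value directly (or nothing when the key is empty), and later writes being published with the entry ID only to clients that asked for that key.

```python
from collections import defaultdict


class Table:
    """Hypothetical table with per-key notification requests."""

    def __init__(self):
        self._data = {}
        self._subscribers = defaultdict(set)
        # Records (client, entry_id, value) for each published notification.
        self.delivered = []

    def request_notifications(self, key, client):
        # Subscribe and return the current value directly, rather than
        # publishing it; None models an empty subscription response.
        self._subscribers[key].add(client)
        return self._data.get(key)

    def cancel_notifications(self, key, client):
        self._subscribers[key].discard(client)

    def add(self, key, value):
        self._data[key] = value
        # Publish the entry ID along with the value to interested clients only.
        for client in sorted(self._subscribers[key]):
            self.delivered.append((client, key, value))


table = Table()
table.add("task_1", "v0")
current = table.request_notifications("task_1", "client_a")
empty = table.request_notifications("task_3", "client_a")
table.add("task_1", "v1")  # published to client_a
table.add("task_2", "x")   # nobody requested this key
table.cancel_notifications("task_1", "client_a")
table.add("task_1", "v2")  # no longer published
```

Because the entry ID travels with each notification, a client can check that what it received matches the key it requested.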
* Print error when actor takes too long to start, and refactor error message pushing.
* Print warning every ten seconds.
* Fix linting and tests.
* Fix tests.
* Treat actor creation like a regular task.
* Small cleanups.
* Change semantics of actor resource handling.
* Bug fix.
* Minor linting
* Bug fix
* Fix jenkins test.
* Fix actor tests
* Some cleanups
* Bug fix
* Fix bug.
* Remove cached actor tasks when a driver is removed.
* Add more info to taskspec in global state API.
* Fix cyclic import bug in tune.
* Fix
* Fix linting.
* Fix linting.
* Don't schedule any tasks (especially actor creation tasks) on local schedulers with 0 CPUs.
* Bug fix.
* Add test for 0 CPU case
* Fix linting
* Address comments.
* Fix typos and add comment.
* Add assertion and fix test.