* Allow numpy arrays and larger objects to be passed by value in task specifications.
* Fix bug.
* Fix bug. Inline all but numpy object arrays.
* Increase size limit for inlining args in task spec.
* Give numpy init different signatures in Python 2 and Python 3.
* Simplify code.
* Fix test.
* Use import_array1 instead of import_array.
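The last two bullets involve a NumPy C-API subtlety: `import_array()` expands to a bare `return` statement on failure, so the helper that calls it needs a different signature under Python 2 and Python 3. A minimal sketch, assuming a hypothetical `init_numpy` helper (not Ray's actual code):

```cpp
#include <Python.h>
#include <numpy/arrayobject.h>

#if PY_MAJOR_VERSION >= 3
// Python 3: return an int so import_array1() can `return -1` on failure.
static int init_numpy() {
  import_array1(-1);
  return 0;
}
#else
// Python 2: import_array() expands to a bare `return;` on failure,
// which only compiles inside a function returning void.
static void init_numpy() {
  import_array();
}
#endif
```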
* Add PubsubInterface to GCS tables
* Add task table PubsubInterface to lineage cache and tests
* Request notifications for remote tasks in the lineage cache
* Add RegisterGCS method to node manager
* Fix NodeManager member initialization order, subscribe to task table notifications
* Comments
* Use returned statuses.
* Fix double commit bug in lineage cache
* lint
* More linting.
* Fix pure virtual method declarations
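Taken together, these commits describe one flow: a task forwarded to a remote node stays in the local lineage cache, which subscribes to task table notifications and evicts the entry once the remote node commits it to the GCS. A hedged sketch of that bookkeeping (the class and method names are illustrative, not Ray's actual interface):

```cpp
#include <string>
#include <unordered_set>

// Illustrative stand-in for the lineage cache's notification handling.
class LineageCache {
 public:
  // Called when a task is forwarded to a remote node: keep its entry
  // and request task table notifications for it.
  void ForwardTask(const std::string &task_id) {
    uncommitted_.insert(task_id);
    // RequestNotifications(task_id) would be issued here.
  }

  // Called on a task table notification: the remote node committed the
  // task, so the local entry can be evicted.
  void HandleEntryCommitted(const std::string &task_id) {
    uncommitted_.erase(task_id);
    // CancelNotifications(task_id) would be issued here.
  }

 private:
  std::unordered_set<std::string> uncommitted_;
};
```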
* cache all object info from the object-added store notification.
* Adds parallel transfer for big objects.
* documentation and clean up.
* compare objects...
* merge buffer_state with chunk vec. Make separate buffer state for get and create.
* use references for Get. Allow partial failure of Create.
* single plasma client.
* changes based on review.
* update documentation and add parameters for object manager in main.cc.
* review feedback.
* use vector constructor.
* linting
* remove profile visualizations.
* test fixes.
* linting.
* kill specific pids and use less memory.
* linting.
* simplify tests.
* Asynchronous IO for ObjectManager messages and object transfer.
* Revert "Asynchronous IO for ObjectManager messages and object transfer."
This reverts commit 4af43b159babc04daf80d1543e27c2cb46b7b19d.
* update test configuration to reflect changes in #1891
* review feedback.
* linting.
* remove num_threads as a parameter.
* linting.
* add additional checks.
* Invoke TransferCompleted on failures.
* Fix issue with failed Gets on store.
* RAY_CHECK the status of writing object headers.
* fix mac issues.
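The big-object transfer work above splits a large object into fixed-size chunks that can be sent concurrently. A minimal sketch of the chunking arithmetic (the function name and layout are assumptions, not the object manager's actual code):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Split an object of `object_size` bytes into (offset, length) chunks of
// at most `chunk_size` bytes; each chunk can be transferred independently.
std::vector<std::pair<uint64_t, uint64_t>> ComputeChunks(uint64_t object_size,
                                                         uint64_t chunk_size) {
  std::vector<std::pair<uint64_t, uint64_t>> chunks;
  for (uint64_t offset = 0; offset < object_size; offset += chunk_size) {
    chunks.emplace_back(offset, std::min(chunk_size, object_size - offset));
  }
  return chunks;
}
```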
* Add raylet monitor script to timeout Raylet heartbeats
* Unit test for removing a different client from the client table
* Set node manager heartbeat according to global config
* Doc and fixes
* Add regression test for client table disconnect, refactor client table
* Convert 'Terminate' methods to destructors
* Destroy the Raylet on a SIGTERM
* Clean up workers on a SIGTERM
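The monitor bullets above describe timing out Raylets that stop heartbeating. A hedged sketch of the bookkeeping involved, assuming a hypothetical `HeartbeatMonitor` (the commits say the real monitor is a script whose timeout comes from the global config):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

class HeartbeatMonitor {
 public:
  explicit HeartbeatMonitor(int num_heartbeats_timeout)
      : num_heartbeats_timeout_(num_heartbeats_timeout) {}

  // Each heartbeat resets the client's remaining-beats counter.
  void HandleHeartbeat(const std::string &client_id) {
    remaining_[client_id] = num_heartbeats_timeout_;
  }

  // Called once per heartbeat period; returns the clients that timed out.
  std::vector<std::string> Tick() {
    std::vector<std::string> dead;
    for (auto &entry : remaining_) {
      if (--entry.second <= 0) dead.push_back(entry.first);
    }
    for (const auto &id : dead) remaining_.erase(id);
    return dead;
  }

 private:
  int num_heartbeats_timeout_;
  std::unordered_map<std::string, int> remaining_;
};
```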
* Fix linting.
* Integrate worker with raylet.
* Begin allowing worker to attach to cluster.
* Fix linting and documentation.
* Fix linting.
* Comment tests back in.
* Fix type of worker command.
* Remove xray python files and tests.
* Fix from rebase.
* Add test.
* Copy over raylet executable.
* Small cleanup.
Summary:
Able to run 1000 tasks with object dependencies on a set of distributed Raylets.
Raylet Changes:
Finalized ClientConnection class.
Task forwarding.
NM-to-NM heartbeats.
NM resource accounting for tasks.
Simple scheduling policy with task forwarding.
Creating and maintaining long-lived NM-to-NM connections and reusing them for task forwarding.
LineageCache Changes:
LineageCache without cleanup of tasks committed by remote nodes.
Lineage cache writeback and cleanup implementation.
ObjectManager Changes:
Object manager event loop/ClientConnection refactor.
Multithreaded object manager (disabled in this PR).
Testing Changes:
Integration tests for task submission on multiple Raylets.
Stress tests for object manager (with GCS and object store integration).
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Alexey Tumanov <atumanov@gmail.com>
* TABLE_APPEND call
* Convert callbacks back to taking in a string...
* GCS returns flatbuffers, define Log class
* Cleanups
* Modify client table to use the Log interface
* Fix bug where we replied twice from redis
* Fixes
* lint
* Compile and test raylet TaskTable
* Modify GCS tables to handle unique_ptrs from nested flatbuffers
* Add raylet::TaskTable unit tests to replace ObjectTable ones
* Convert ObjectTable to a log
* Convert ObjectTable tests to the Log
* AppendAt Redis and gcs Log command
* unit test for AppendAt
* Add a Log for task reconstruction data
* Add check for unique entries in TABLE_APPEND
* Documentation
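The commits above replace table-specific GCS calls with an append-only Log abstraction; `AppendAt` appends at an expected index so concurrent appends can be detected. A toy sketch of those semantics (in-memory, nothing like the Redis-backed implementation):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// In-memory stand-in for the append-only Log semantics.
class Log {
 public:
  // Unconditionally append an entry.
  void Append(const std::string &entry) { entries_.push_back(entry); }

  // Append only if the entry would land at `index`; returning false
  // signals that another writer appended first.
  bool AppendAt(size_t index, const std::string &entry) {
    if (entries_.size() != index) {
      return false;
    }
    entries_.push_back(entry);
    return true;
  }

 private:
  std::vector<std::string> entries_;
};
```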
* Add TableRequestNotifications and TableCancelNotifications to Redis modules
* Add RequestNotifications and CancelNotifications to generic GCS Table
* Add tests for subscribing to specific keys
* Remove TODO!
* Return the current value at the key directly from RequestNotifications instead of through publish
* Add unit test for Lookup failure callback
* Modify tests to account for empty subscription response
* Remove ObjectTable notification methods
* Clean up message parsing and doc in redis context
* Use vectors of DataT in all GCS callbacks
* Clean up SubscriptionCallback
* Move Table definitions into tables.cc
* Refactor and document redis modules
* doc
* Fix new GCS build
* Cleanups
* Revert "Fix new GCS build"
This reverts commit 6e3e69090c67ef60aaf22a9cf62be0290d989e96.
* Use vectors for internal callback interface, user-facing interface takes a reference to a single item
* Fix new GCS build
* Add unit test for Lookup failure callback
* Fix compiler errors
* Cleanup
* Publish the entry ID with the notification
* Check that the ID for a notification matches in client tests
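These commits add per-key subscriptions to the generic table: `RequestNotifications` first delivers the key's current values (possibly empty) and then every later write, until `CancelNotifications`. A toy in-process model of that contract (method names follow the commits; the bodies are illustrative):

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using Callback = std::function<void(const std::string &key,
                                    const std::vector<std::string> &values)>;

class NotificationTable {
 public:
  // Deliver the current values immediately (empty if none), then keep
  // notifying on every later append until cancelled.
  void RequestNotifications(const std::string &key, Callback callback) {
    callback(key, values_[key]);
    subscribers_[key] = std::move(callback);
  }

  void CancelNotifications(const std::string &key) { subscribers_.erase(key); }

  void Append(const std::string &key, const std::string &value) {
    values_[key].push_back(value);
    auto it = subscribers_.find(key);
    if (it != subscribers_.end()) {
      it->second(key, {value});
    }
  }

 private:
  std::unordered_map<std::string, std::vector<std::string>> values_;
  std::unordered_map<std::string, Callback> subscribers_;
};
```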
* Print error when actor takes too long to start, and refactor error message pushing.
* Print warning every ten seconds.
* Fix linting and tests.
* Fix tests.
* Treat actor creation like a regular task.
* Small cleanups.
* Change semantics of actor resource handling.
* Bug fix.
* Minor linting
* Bug fix
* Fix jenkins test.
* Fix actor tests
* Some cleanups
* Bug fix
* Fix bug.
* Remove cached actor tasks when a driver is removed.
* Add more info to taskspec in global state API.
* Fix cyclic import bug in tune.
* Fix
* Fix linting.
* Fix linting.
* Don't schedule any tasks (especially actor creation tasks) on local schedulers with 0 CPUs.
* Bug fix.
* Add test for 0 CPU case
* Fix linting
* Address comments.
* Fix typos and add comment.
* Add assertion and fix test.
* directory for raylet
* some initial class scaffolding -- in progress
* node_manager build code and test stub files.
* class scaffolding for resources, workers, and the worker pool
* Node manager server loop
* raylet policy and queue - wip checkpoint
* fix dependencies
* add gen_nm_fbs as target.
* object manager build, stub, and test code.
* Start integrating WorkerPool into node manager
* fix build on mac
* tmp
* adding LsResources boilerplate
* add/build Task spec boilerplate
* checkpoint ActorInformation and LsQueue
* Worker pool maintains started and removed workers
* todos for e2e task assignment
* fix build on mac
* build/add lsqueue interface
* channel resource config through from NodeServer to LsResources; prep LsResources to replace/provide worker_pool
* progress on LsResources class: resource availability check implementation
* Read task submission messages from a client
* Submit tasks from the client to the local scheduler
* Assign a task to a worker from the WorkerPool
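The three commits above form the first end-to-end path: a client message becomes a queued task, which is handed to an idle worker from the pool. A hedged sketch of the pool side (illustrative interface, not the real class):

```cpp
#include <deque>
#include <memory>

// Illustrative worker-pool core: started workers wait idle until a task
// needs one, and are pushed back when the task finishes.
template <typename Worker>
class WorkerPool {
 public:
  // Register an idle worker (e.g., one that just finished a task).
  void PushWorker(std::shared_ptr<Worker> worker) {
    idle_.push_back(std::move(worker));
  }

  // Take an idle worker for task assignment; null if none is available,
  // in which case the task stays queued until a worker frees up.
  std::shared_ptr<Worker> PopWorker() {
    if (idle_.empty()) {
      return nullptr;
    }
    auto worker = std::move(idle_.front());
    idle_.pop_front();
    return worker;
  }

 private:
  std::deque<std::shared_ptr<Worker>> idle_;
};
```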
* change the way node_manager is built to prevent build issues for object_manager.
* add namespaces. fix build.
* Move ClientConnection message handling into server, remove reference to
WorkerPool
* Add raw constructors for TaskSpecification
* Define TaskArgument by reference and by value
* Flatbuffer serialization for TaskSpec
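The two TaskArgument flavors above mirror the inlining work at the top of this log: an argument is either a list of object IDs resolved at execution time or raw bytes carried inline in the task spec. A hedged sketch with simplified types (the real classes serialize to flatbuffers):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-ins for the two argument kinds.
class TaskArgument {
 public:
  virtual ~TaskArgument() {}
};

// Pass by reference: object IDs the worker fetches before execution.
class TaskArgumentByReference : public TaskArgument {
 public:
  explicit TaskArgumentByReference(std::vector<std::string> ids)
      : object_ids_(std::move(ids)) {}

 private:
  std::vector<std::string> object_ids_;
};

// Pass by value: raw bytes inlined directly in the task specification.
class TaskArgumentByValue : public TaskArgument {
 public:
  TaskArgumentByValue(const uint8_t *data, size_t size)
      : value_(data, data + size) {}

 private:
  std::vector<uint8_t> value_;
};
```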
* expand resource implementation
* Start integrating TaskExecutionSpecification into Task
* Separate WorkerPool from LsResources, give ownership to NodeServer
* checkpoint queue and resource code
* resolving merge conflicts
* lspolicy::schedule; adding lsqueue and lspolicy to the nodeserver
* Implement LsQueue RemoveTasks and QueueReadyTasks
* Fill in some LsQueue code for assigning a task
* added support for test_asio
* Implement LsQueue queue tasks methods, queue running tasks
* calling into policy from nodeserver; adding cluster resource map
* Feedback and Testing.
Incorporate Alexey's feedback. Actually test some code. Clean up callback implementation.
* end to end task assignment
* Decouple local scheduler from node server
* move TODO
* Move local scheduler to separate file
* Add scaffolding for reconstruction policy, task dependency manager, and object manager
* fix
* asio for store client notifications.
added asio for plasma store connection.
added tests for store notifications.
encapsulate store interaction under store_messenger.
* Move Worker inside of ClientConnection
* Set the assigned task ID in the worker
* Several changes toward object manager implementation.
Store client integration with asio.
Complete OM/OD scaffolding.
* simple simulator to estimate number of retry timeouts
* changing dbclientid --> clientid
* fix build (include sandbox after it's fixed).
* changes to object manager, adding lambdas to the interface
* changing void * callbacks to std::function typed callbacks
* remove use namespace std from .h files.
use ray:: for Status everywhere.
* minor
* lineage cache interfaces
* TODO for object IDs
* Interface for the GCS client table
* Revert "Set the assigned task ID in the worker"
This reverts commit a770dd31048a289ef431c56d64e491fa7f9b2737.
* Revert "Move Worker inside of ClientConnection"
This reverts commit dfaa0d662a76976c05be6d76b214b45d88482818.
* OD/OM: ray::Status
* mock gcs integration.
* gcs mock clientinfo assignment
* Allow lookup of a Worker in the WorkerPool
* Split out Worker and ClientConnection source files
* Allow assignment of a task ID to a worker, skeleton for finishing a task
* integrate mock gcs with om tests.
* added tcp connection acceptor
* integrated OM with NM.
integrated GcsClient with NM.
Added multi-node integration tests.
* OM to receive incoming tcp connections.
* implemented object manager connection protocol.
* Added todos.
* slight adjustment to add/remove handler invocation on object store client.
* Simplify Task interface for getting dependencies
* Remove unused object manager file
* TaskDependencyManager tracks missing task dependencies and processes object add notifications
* Local scheduler queues tasks according to argument availability
* Fill in TaskSpecification methods to get arguments
* Implemented push.
* Queue tasks that have been scheduled but that are waiting for a worker
* Pull + mock gcs cleanup.
* OD/OM/GCS mock code review, fixing unused-result issues, eliminating copy ctor
* Remove unique_ptr from object_store_client
* Fix object manager Push memory error
* Pull task arguments in task dependency manager
* Add a demo script for remote task dependencies
* Some comments for the TaskDependencyManager
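The dependency-manager commits above share one idea: a task waits until every argument is local, with object-added notifications flipping tasks to ready. A toy sketch of that index (illustrative names, string IDs standing in for the real ID types):

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Toy index from missing objects to the tasks waiting on them.
class TaskDependencyManager {
 public:
  // Record a task's missing arguments; returns true if it can run now.
  bool SubscribeDependencies(const std::string &task_id,
                             const std::vector<std::string> &missing) {
    for (const auto &object_id : missing) {
      waiting_[object_id].insert(task_id);
      remaining_[task_id].insert(object_id);
    }
    return missing.empty();
  }

  // Object-added notification: returns the tasks that became ready.
  std::vector<std::string> HandleObjectLocal(const std::string &object_id) {
    std::vector<std::string> ready;
    for (const auto &task_id : waiting_[object_id]) {
      auto &rem = remaining_[task_id];
      rem.erase(object_id);
      if (rem.empty()) {
        ready.push_back(task_id);
        remaining_.erase(task_id);
      }
    }
    waiting_.erase(object_id);
    return ready;
  }

 private:
  std::unordered_map<std::string, std::unordered_set<std::string>> waiting_;
  std::unordered_map<std::string, std::unordered_set<std::string>> remaining_;
};
```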
* code cleanup; builds on mac
* Make ClientConnection a templated type based on the connection protocol
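Templating the connection on its protocol, as described above, lets one read/write implementation serve both local (Unix domain socket) and remote (TCP) peers. A minimal sketch with boost::asio (simplified; the real class also owns message buffers and handlers):

```cpp
#include <boost/asio.hpp>
#include <memory>
#include <utility>

// One connection implementation, parameterized by protocol.
template <typename Protocol>
class ClientConnection
    : public std::enable_shared_from_this<ClientConnection<Protocol>> {
 public:
  explicit ClientConnection(typename Protocol::socket socket)
      : socket_(std::move(socket)) {}

 private:
  typename Protocol::socket socket_;
};

// Workers connect over a Unix domain socket; remote node managers over TCP.
using LocalConnection =
    ClientConnection<boost::asio::local::stream_protocol>;
using TcpConnection = ClientConnection<boost::asio::ip::tcp>;
```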
* Add gmock to build
* Add WorkerPool unit tests
* clean up.
* clean up connection code.
* instantiate a template instance in the module
* Virtual destructors
* Document public api.
* Separate read and write buffers in ClientConnection; documentation
* Remove ObjectDirectory from NodeServer constructor, make directory InitGcs call a separate constructor
* Convert NodeServer Terminate to a destructor
* NodeServer documentation
* WorkerPool documentation
* TaskDependencyManager doc
* unifying naming conventions
* unifying naming conventions
* Task cleanup and documentation
* unifying naming conventions
* unifying naming conventions
* code cleanup and naming conventions
* code cleanup
* Rename om --> object_manager
* Merge with master
* SchedulingQueue doc
* Docs and implementation skeleton for ClientTable
* Node manager documentation
* ReconstructionPolicy doc
* Replace std::bind with lambda in TaskDependencyManager
* lineage cache doc
* Use \param style for doc
* documentation for scheduling policy and resources
* minor code cleanup
* SchedulingResources class documentation + code cleanup
* referencing ray/raylet directory; doxygen documentation
* updating trivial policy
* Fix bug where event loop stops after task submission
* Define entry point for ClientManager for handling new connections
* Node manager to node manager protocol, heartbeat protocol
* Fix flatbuffer
* Fix GCS flatbuffer naming conflict
* client connection moved to common dir.
* rename based on feedback.
* Added google style and 90 char lines clang-format file under src/ray.
* const ref ClientID.
* Incorporated feedback from PR.
* raylet: includes and namespaces
* raylet/om/gcs logging/using
* doxygen style
* camel casing, comments, other style; DBClientID -> ClientID
* object_manager : naming, defines, style
* consistent caps and naming; misc style
* cleaning up client connection + other stylistic fixes
* cmath, std::nan
* more style polish: OM, Raylet, gcs tables
* removing sandbox (moved to ray-project/sandbox)
* raylet linting
* object manager linting
* gcs linting
* all other linting
Co-authored-by: Melih <elibol@gmail.com>
Co-authored-by: Stephanie <swang@cs.berkeley.edu>
* restructure how third-party libs are organized
* Minor whitespace changes.
* Fix compilation on Linux.
* Pass around Python executable so that the correct version of Python is used.