* Add raylet monitor script to timeout Raylet heartbeats
* Unit test for removing a different client from the client table
* Set node manager heartbeat according to global config
* Doc and fixes
* Add regression test for client table disconnect, refactor client table
* Convert 'Terminate' methods to destructors
* Destroy the Raylet on a SIGTERM
* Clean up workers on a SIGTERM
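The monitor commits above reduce to one pattern: record the last heartbeat time per Raylet and mark a client dead once it exceeds the configured timeout. A minimal sketch of that pattern, under invented names (HeartbeatMonitor, HandleHeartbeat), not the actual monitor script's API:

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <thread>
#include <unordered_map>

// Illustrative sketch: track last-heartbeat times per client and mark
// clients dead once they exceed a configured timeout. All names here
// are hypothetical, not the real monitor's interface.
class HeartbeatMonitor {
 public:
  explicit HeartbeatMonitor(std::chrono::milliseconds timeout)
      : timeout_(timeout) {}

  // Called whenever a heartbeat arrives from a client.
  void HandleHeartbeat(const std::string &client_id) {
    last_seen_[client_id] = std::chrono::steady_clock::now();
  }

  // Called periodically (e.g., from the event loop) to time out clients.
  void CheckTimeouts() {
    const auto now = std::chrono::steady_clock::now();
    for (auto it = last_seen_.begin(); it != last_seen_.end();) {
      if (now - it->second > timeout_) {
        std::cout << "Client " << it->first << " timed out; marking dead\n";
        // The real monitor would also remove the client from the GCS
        // client table so other nodes learn about the failure.
        it = last_seen_.erase(it);
      } else {
        ++it;
      }
    }
  }

 private:
  std::chrono::milliseconds timeout_;
  std::unordered_map<std::string, std::chrono::steady_clock::time_point>
      last_seen_;
};

int main() {
  HeartbeatMonitor monitor(std::chrono::milliseconds(100));
  monitor.HandleHeartbeat("raylet-1");
  std::this_thread::sleep_for(std::chrono::milliseconds(150));
  monitor.CheckTimeouts();  // raylet-1 missed its window: marked dead
}
```

As in the commits above, the timeout itself would come from the global config rather than being hard-coded.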
* Fix linting.
* Integrate worker with raylet.
* Begin allowing worker to attach to cluster.
* Fix linting and documentation.
* Fix linting.
* Comment tests back in.
* Fix type of worker command.
* Remove xray python files and tests.
* Fix from rebase.
* Add test.
* Copy over raylet executable.
* Small cleanup.
Summary:
Able to run 1000 tasks with object dependencies on a set of distributed Raylets.
Raylet Changes:
Finalized ClientConnection class.
Task forwarding.
NM-to-NM heartbeats.
NM resource accounting for tasks.
Simple scheduling policy with task forwarding.
Creating and maintaining long-lived NM-to-NM connections and reusing them for task forwarding.
LineageCache Changes:
LineageCache without cleanup of tasks committed by remote nodes.
Lineage cache writeback and cleanup implementation.
ObjectManager Changes:
Object manager event loop/ClientConnection refactor.
Multithreaded object manager (disabled in this PR).
Testing Changes:
Integration tests for task submission on multiple Raylets.
Stress tests for object manager (with GCS and object store integration).
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Alexey Tumanov <atumanov@gmail.com>
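A rough sketch of the "simple scheduling policy with task forwarding" summarized above: run a task locally when the local node's availability covers its resource demand, otherwise forward it to a remote node that can take it. All types and names (ResourceSet, NodeInfo, Schedule) are illustrative stand-ins, not the raylet's real classes:

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
using ResourceSet = std::map<std::string, double>;

struct NodeInfo {
  std::string node_id;
  ResourceSet available;
};

// True if `demand` fits within `available`.
bool Fits(const ResourceSet &demand, const ResourceSet &available) {
  for (const auto &pair : demand) {
    auto it = available.find(pair.first);
    if (it == available.end() || it->second < pair.second) return false;
  }
  return true;
}

// Schedule locally when possible; otherwise forward to the first remote
// node whose availability covers the demand. Returns the chosen node ID,
// or an empty string if no node can run the task right now (it would
// stay queued until resources free up).
std::string Schedule(const ResourceSet &demand, const NodeInfo &local,
                     const std::vector<NodeInfo> &remotes) {
  if (Fits(demand, local.available)) return local.node_id;
  for (const auto &node : remotes) {
    if (Fits(demand, node.available)) return node.node_id;  // forward
  }
  return "";
}

int main() {
  NodeInfo local{"local", {{"CPU", 0}}};
  std::vector<NodeInfo> remotes = {{"remote-1", {{"CPU", 4}}}};
  std::cout << Schedule({{"CPU", 1}}, local, remotes) << "\n";  // remote-1
}
```

Forwarding a task would then reuse one of the long-lived NM-to-NM connections mentioned above rather than opening a fresh connection per task.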
* TABLE_APPEND call
* Convert callbacks back to taking in a string...
* GCS returns flatbuffers, define Log class
* Cleanups
* Modify client table to use the Log interface
* Fix bug where we replied twice from redis
* Fixes
* lint
* Compile and test raylet TaskTable
* Modify GCS tables to handle unique_ptrs from nested flatbuffers
* Add raylet::TaskTable unit tests to replace ObjectTable ones
* Convert ObjectTable to a log
* Convert ObjectTable tests to the Log
* AppendAt Redis and gcs Log command
* unit test for AppendAt
* Add a Log for task reconstruction data
* Add check for unique entries in TABLE_APPEND
* Documentation
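The commits above describe an append-only Log per key, where TABLE_APPEND enforces unique entries and AppendAt succeeds only at an expected index. A toy in-memory sketch of those semantics; the real commands run inside Redis modules, and these names are illustrative:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Toy stand-in for the GCS Log interface described above.
class Log {
 public:
  // Mirrors the unique-entry check in TABLE_APPEND: appending an entry
  // that is already present in the log at `id` fails.
  bool Append(const std::string &id, const std::string &entry) {
    auto &log = logs_[id];
    for (const auto &existing : log) {
      if (existing == entry) return false;  // duplicate rejected
    }
    log.push_back(entry);
    return true;
  }

  // AppendAt: append only if the log currently has exactly `index`
  // entries, so a concurrent writer that lost a race can detect it.
  bool AppendAt(const std::string &id, const std::string &entry,
                size_t index) {
    auto &log = logs_[id];
    if (log.size() != index) return false;
    log.push_back(entry);
    return true;
  }

 private:
  std::unordered_map<std::string, std::vector<std::string>> logs_;
};

int main() {
  Log log;
  assert(log.Append("task-1", "entry-A"));
  assert(!log.Append("task-1", "entry-A"));       // duplicate rejected
  assert(log.AppendAt("task-1", "entry-B", 1));   // log length is 1: ok
  assert(!log.AppendAt("task-1", "entry-C", 1));  // index already taken
}
```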
* Add TableRequestNotifications and TableCancelNotifications to Redis modules
* Add RequestNotifications and CancelNotifications to generic GCS Table
* Add tests for subscribing to specific keys
* Remove TODO!
* Return the current value at the key directly from RequestNotifications instead of through publish
* Add unit test for Lookup failure callback
* Modify tests to account for empty subscription response
* Remove ObjectTable notification methods
* Clean up message parsing and doc in redis context
* Use vectors of DataT in all GCS callbacks
* Clean up SubscriptionCallback
* Move Table definitions into tables.cc
* Refactor and document redis modules
* doc
* Fix new GCS build
* Cleanups
* Revert "Fix new GCS build"
This reverts commit 6e3e69090c67ef60aaf22a9cf62be0290d989e96.
* Use vectors for the internal callback interface; the user-facing interface takes a reference to a single item
* Fix new GCS build
* Add unit test for Lookup failure callback
* Fix compiler errors
* Cleanup
* Publish the entry ID with the notification
* Check that the ID for a notification matches in client tests
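Two behaviors change in the notification commits above: RequestNotifications returns the current value at the key directly instead of routing it through a publish, and each subsequent notification carries the entry's ID. A toy sketch of that flow with invented names, not the actual Redis-module implementation:

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Toy model of per-key notifications. Illustrative only.
class NotifyingTable {
 public:
  using Callback = std::function<void(
      const std::string &id, const std::vector<std::string> &values)>;

  // RequestNotifications: fire the callback once with the current value
  // at the key (possibly empty), then subscribe it to future appends.
  void RequestNotifications(const std::string &id, Callback callback) {
    callback(id, data_[id]);
    subscribers_[id] = std::move(callback);
  }

  void CancelNotifications(const std::string &id) { subscribers_.erase(id); }

  // Appends publish the new entry, together with the key's ID, to the
  // subscriber if one is registered.
  void Append(const std::string &id, const std::string &value) {
    data_[id].push_back(value);
    auto it = subscribers_.find(id);
    if (it != subscribers_.end()) it->second(id, {value});
  }

 private:
  std::unordered_map<std::string, std::vector<std::string>> data_;
  std::unordered_map<std::string, Callback> subscribers_;
};

int main() {
  NotifyingTable table;
  table.Append("obj-1", "node-A");
  table.RequestNotifications(
      "obj-1",
      [](const std::string &id, const std::vector<std::string> &vals) {
        std::cout << id << ":";
        for (const auto &v : vals) std::cout << " " << v;
        std::cout << "\n";
      });                           // prints "obj-1: node-A" immediately
  table.Append("obj-1", "node-B");  // prints "obj-1: node-B" via publish
}
```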
* Print error when actor takes too long to start, and refactor error message pushing.
* Print warning every ten seconds.
* Fix linting and tests.
* Fix tests.
* Treat actor creation like a regular task.
* Small cleanups.
* Change semantics of actor resource handling.
* Bug fix.
* Minor linting
* Bug fix
* Fix jenkins test.
* Fix actor tests
* Some cleanups
* Bug fix
* Fix bug.
* Remove cached actor tasks when a driver is removed.
* Add more info to taskspec in global state API.
* Fix cyclic import bug in tune.
* Fix
* Fix linting.
* Fix linting.
* Don't schedule any tasks (especially actor creation tasks) on local schedulers with 0 CPUs.
* Bug fix.
* Add test for 0 CPU case
* Fix linting
* Address comments.
* Fix typos and add comment.
* Add assertion and fix test.
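The "print warning every ten seconds" commit above is a rate-limited repeated warning. One way it could look, as a sketch with invented names and a caller assumed to poll periodically:

```cpp
#include <chrono>
#include <iostream>

// Sketch: remember when the warning last fired and re-fire at most once
// per interval. Hypothetical names, not the actual Ray code.
class ActorStartWarner {
 public:
  // Call periodically while the actor has not yet started.
  void MaybeWarn(std::chrono::steady_clock::time_point now) {
    if (now - last_warned_ >= interval_) {
      std::cout << "Warning: actor is taking a long time to start. "
                   "It may be waiting for cluster resources.\n";
      last_warned_ = now;
    }
  }

 private:
  std::chrono::seconds interval_{10};
  // Default (clock epoch) makes the first call warn immediately.
  std::chrono::steady_clock::time_point last_warned_{};
};

int main() {
  ActorStartWarner warner;
  warner.MaybeWarn(std::chrono::steady_clock::now());  // fires
  warner.MaybeWarn(std::chrono::steady_clock::now());  // suppressed for ~10s
}
```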
* directory for raylet
* some initial class scaffolding -- in progress
* node_manager build code and test stub files.
* class scaffolding for resources, workers, and the worker pool
* Node manager server loop
* raylet policy and queue - wip checkpoint
* fix dependencies
* add gen_nm_fbs as target.
* object manager build, stub, and test code.
* Start integrating WorkerPool into node manager
* fix build on mac
* tmp
* adding LsResources boilerplate
* add/build Task spec boilerplate
* checkpoint ActorInformation and LsQueue
* Worker pool maintains started and removed workers
* todos for e2e task assignment
* fix build on mac
* build/add lsqueue interface
* channel resource config through from NodeServer to LsResources; prep LsResources to replace/provide worker_pool
* progress on LsResources class: resource availability check implementation
* Read task submission messages from a client
* Submit tasks from the client to the local scheduler
* Assign a task to a worker from the WorkerPool
* change the way node_manager is built to prevent build issues for object_manager.
* add namespaces. fix build.
* Move ClientConnection message handling into server, remove reference to WorkerPool
* Add raw constructors for TaskSpecification
* Define TaskArgument by reference and by value
* Flatbuffer serialization for TaskSpec
* expand resource implementation
* Start integrating TaskExecutionSpecification into Task
* Separate WorkerPool from LsResources, give ownership to NodeServer
* checkpoint queue and resource code
* resolving merge conflicts
* lspolicy::schedule; adding lsqueue and lspolicy to the nodeserver
* Implement LsQueue RemoveTasks and QueueReadyTasks
* Fill in some LsQueue code for assigning a task
* added support for test_asio
* Implement LsQueue queue tasks methods, queue running tasks
* calling into policy from nodeserver; adding cluster resource map
* Feedback and Testing.
Incorporate Alexey's feedback. Actually test some code. Clean up callback implementation.
* end to end task assignment
* Decouple local scheduler from node server
* move TODO
* Move local scheduler to separate file
* Add scaffolding for reconstruction policy, task dependency manager, and object manager
* fix
* asio for store client notifications.
added asio for plasma store connection.
added tests for store notifications.
encapsulate store interaction under store_messenger.
* Move Worker inside of ClientConnection
* Set the assigned task ID in the worker
* Several changes toward object manager implementation.
Store client integration with asio.
Complete OM/OD scaffolding.
* simple simulator to estimate number of retry timeouts
* changing dbclientid --> clientid
* fix build (include sandbox after it's fixed).
* changes to object manager, adding lambdas to the interface
* changing void * callbacks to std::function typed callbacks
* remove "using namespace std" from .h files.
use ray:: for Status everywhere.
* minor
* lineage cache interfaces
* TODO for object IDs
* Interface for the GCS client table
* Revert "Set the assigned task ID in the worker"
This reverts commit a770dd31048a289ef431c56d64e491fa7f9b2737.
* Revert "Move Worker inside of ClientConnection"
This reverts commit dfaa0d662a76976c05be6d76b214b45d88482818.
* OD/OM: ray::Status
* mock gcs integration.
* gcs mock clientinfo assignment
* Allow lookup of a Worker in the WorkerPool
* Split out Worker and ClientConnection source files
* Allow assignment of a task ID to a worker, skeleton for finishing a task
* integrate mock gcs with om tests.
* added tcp connection acceptor
* integrated OM with NM.
integrated GcsClient with NM.
Added multi-node integration tests.
* OM to receive incoming tcp connections.
* implemented object manager connection protocol.
* Added todos.
* slight adjustment to add/remove handler invocation on object store client.
* Simplify Task interface for getting dependencies
* Remove unused object manager file
* TaskDependencyManager tracks missing task dependencies and processes object-add notifications (see the sketch after this commit list)
* Local scheduler queues tasks according to argument availability
* Fill in TaskSpecification methods to get arguments
* Implemented push.
* Queue tasks that have been scheduled but that are waiting for a worker
* Pull + mock gcs cleanup.
* OD/OM/GCS mock code review, fixing unused-result issues, eliminating copy ctor
* Remove unique_ptr from object_store_client
* Fix object manager Push memory error
* Pull task arguments in task dependency manager
* Add a demo script for remote task dependencies
* Some comments for the TaskDependencyManager
* code cleanup; builds on mac
* Make ClientConnection a templated type based on the connection protocol
* Add gmock to build
* Add WorkerPool unit tests
* clean up.
* clean up connection code.
* instantiate a template instance in the module
* Virtual destructors
* Document public api.
* Separate read and write buffers in ClientConnection; documentation
* Remove ObjectDirectory from NodeServer constructor, make directory InitGcs call a separate constructor
* Convert NodeServer Terminate to a destructor
* NodeServer documentation
* WorkerPool documentation
* TaskDependencyManager doc
* unifying naming conventions
* unifying naming conventions
* Task cleanup and documentation
* unifying naming conventions
* unifying naming conventions
* code cleanup and naming conventions
* code cleanup
* Rename om --> object_manager
* Merge with master
* SchedulingQueue doc
* Docs and implementation skeleton for ClientTable
* Node manager documentation
* ReconstructionPolicy doc
* Replace std::bind with lambda in TaskDependencyManager
* lineage cache doc
* Use \param style for doc
* documentation for scheduling policy and resources
* minor code cleanup
* SchedulingResources class documentation + code cleanup
* referencing ray/raylet directory; doxygen documentation
* updating trivial policy
* Fix bug where event loop stops after task submission
* Define entry point for ClientManager for handling new connections
* Node manager to node manager protocol, heartbeat protocol
* Fix flatbuffer
* Fix GCS flatbuffer naming conflict
* client connection moved to common dir.
* rename based on feedback.
* Added google style and 90 char lines clang-format file under src/ray.
* const ref ClientID.
* Incorporated feedback from PR.
* raylet: includes and namespaces
* raylet/om/gcs logging/using
* doxygen style
* camel casing, comments, other style; DBClientID -> ClientID
* object_manager : naming, defines, style
* consistent caps and naming; misc style
* cleaning up client connection + other stylistic fixes
* cmath, std::nan
* more style polish: OM, Raylet, gcs tables
* removing sandbox (moved to ray-project/sandbox)
* raylet linting
* object manager linting
* gcs linting
* all other linting
Co-authored-by: Melih <elibol@gmail.com>
Co-authored-by: Stephanie <swang@cs.berkeley.edu>
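Among the pieces above, the TaskDependencyManager is the most algorithmic: it tracks which argument objects a task is missing, consumes object-add notifications, and tells the scheduler when a task's arguments are all available (the basis for "local scheduler queues tasks according to argument availability"). A simplified sketch under invented names, not the raylet's real interfaces:

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Simplified dependency tracking: a task is ready only when all of its
// argument objects are local. Assumes each task's argument list holds
// distinct object IDs. Illustrative stand-in types throughout.
class TaskDependencyManager {
 public:
  // Register a task and its argument object IDs. Returns true if the
  // task is immediately runnable (no missing arguments).
  bool SubscribeDependencies(const std::string &task_id,
                             const std::vector<std::string> &arguments) {
    size_t missing = 0;
    for (const auto &object_id : arguments) {
      if (!local_objects_.count(object_id)) {
        waiting_[object_id].insert(task_id);
        ++missing;
      }
    }
    if (missing == 0) return true;
    remaining_[task_id] = missing;
    return false;
  }

  // Called on an object-add notification from the store; returns the
  // tasks whose last missing argument just became local.
  std::vector<std::string> HandleObjectLocal(const std::string &object_id) {
    local_objects_.insert(object_id);
    std::vector<std::string> ready;
    auto it = waiting_.find(object_id);
    if (it == waiting_.end()) return ready;
    for (const auto &task_id : it->second) {
      if (--remaining_[task_id] == 0) {
        ready.push_back(task_id);
        remaining_.erase(task_id);
      }
    }
    waiting_.erase(it);
    return ready;
  }

 private:
  std::unordered_set<std::string> local_objects_;
  std::unordered_map<std::string, std::unordered_set<std::string>> waiting_;
  std::unordered_map<std::string, size_t> remaining_;
};

int main() {
  TaskDependencyManager manager;
  manager.SubscribeDependencies("task-1", {"obj-A", "obj-B"});
  manager.HandleObjectLocal("obj-A");
  for (const auto &task : manager.HandleObjectLocal("obj-B"))
    std::cout << task << " is ready\n";  // task-1 is ready
}
```

Missing objects would also be handed to the object manager to pull, and to the reconstruction policy if they never appear.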
* restructure the organization of 3rd party libs
* Minor whitespace changes.
* Fix compilation on Linux.
* Pass around Python executable so that the correct version of Python is used.
* Add shell script for building parquet
* Use parquet ci script; remove anaconda
* Remove gcc flag, use default
* add boost_root
* Fix $TP_DIR reference issue
* fix the PR
* check out specific parquet-cpp commit
* spillback policy implementation: global + local scheduler
* modernize global scheduler policy state; factor out random number engine and generator
* Minimal version.
* Fix test.
* Make load balancing test less strenuous.
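"Factor out random number engine and generator" suggests seeding one engine per policy object and reusing it across decisions instead of reconstructing it per call. A hedged sketch of a random spillback choice in that style; names are invented:

```cpp
#include <cstdint>
#include <iostream>
#include <random>
#include <string>
#include <vector>

// Sketch: the random engine lives in the policy, seeded once, and is
// reused for every spillback decision. Not the global scheduler's
// actual policy code.
class SpillbackPolicy {
 public:
  explicit SpillbackPolicy(uint32_t seed) : engine_(seed) {}

  // Pick uniformly at random among the nodes that can accept the task.
  // Precondition: feasible_nodes is non-empty.
  std::string Spillback(const std::vector<std::string> &feasible_nodes) {
    std::uniform_int_distribution<size_t> dist(0, feasible_nodes.size() - 1);
    return feasible_nodes[dist(engine_)];
  }

 private:
  std::mt19937 engine_;  // one engine per policy, not per call
};

int main() {
  SpillbackPolicy policy(/*seed=*/42);
  std::cout << policy.Spillback({"node-A", "node-B", "node-C"}) << "\n";
}
```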
* Expose calls to get and set the actor frontier
* Remove fields used for old checkpointing prototype, change actor_checkpoint_failed -> succeeded
* Prototype for actor checkpointing
* Filter out duplicate tasks on the local scheduler
* Clean up some of the Python checkpointing code
* More cleanups
* Documentation
* cleanup and fix unit test
* Allow remote checkpoint calls through actor handle
* Check whether object is local before reconstructing
* Enable checkpointing for distributed actor handles, refactor tests
* Fix local scheduler tests
* lint
* Address comments
* lint
* Skip tests that fail on new GCS
* style
* Don't put same object twice when setting the actor frontier
* Address Philipp's comments, cleaner fbs naming
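One concrete piece of the checkpointing work above is "filter out duplicate tasks on the local scheduler": remember which task IDs have already been seen and drop resubmissions, which can arrive while an actor replays from a checkpoint. A minimal sketch with invented names:

```cpp
#include <iostream>
#include <string>
#include <unordered_set>

// Sketch: drop tasks whose IDs were already observed. Illustrative
// only; the real filter lives in the local scheduler.
class DuplicateTaskFilter {
 public:
  // Returns true the first time a task ID is seen, false on duplicates.
  bool ShouldExecute(const std::string &task_id) {
    return seen_.insert(task_id).second;
  }

 private:
  std::unordered_set<std::string> seen_;
};

int main() {
  DuplicateTaskFilter filter;
  std::cout << filter.ShouldExecute("task-1") << "\n";  // 1: execute
  std::cout << filter.ShouldExecute("task-1") << "\n";  // 0: duplicate dropped
}
```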
* added deterministic check for objects received in fetch_timeout_handler.
* use receive time, in case something goes wrong after object is received.
* increase timeout for removal.
* indentation fix.
* downgrade info logs to debug; clean up debug logging.
* undo unnecessary changes.
* changed description var.
* shorten line 949.
* incorporate feedback.
* linting; make is_object_received function const.
* change semantics of received_objects to objects being received.
added checks to both points at which objects are re-requested.
updated object receive initialization accordingly.
* eliminate erase on receive init. check call to request_transfer_from instead of request_transfer.
* updated comments.
* added todo for multiple object transfers.
* linting.
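The fetch_timeout_handler changes above track which objects are currently being received, keyed by the time each transfer started, so the periodic handler does not re-request an object already in flight, while entries whose transfer stalls are evicted after a timeout and re-requested. A sketch of that bookkeeping with invented names, not the actual fields:

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <unordered_map>

// Sketch: objects being received are exempt from re-requests until
// their transfer looks stale. Illustrative names only.
class FetchState {
 public:
  explicit FetchState(std::chrono::milliseconds timeout)
      : timeout_(timeout) {}

  // Record that a transfer for this object has begun.
  void TransferStarted(const std::string &object_id) {
    being_received_[object_id] = std::chrono::steady_clock::now();
  }

  // Should the periodic fetch handler re-request this object?
  bool ShouldRerequest(const std::string &object_id) {
    auto it = being_received_.find(object_id);
    if (it == being_received_.end()) return true;  // nothing in flight
    if (std::chrono::steady_clock::now() - it->second > timeout_) {
      being_received_.erase(it);  // transfer presumed dead; try again
      return true;
    }
    return false;  // already being received: don't re-request
  }

 private:
  std::chrono::milliseconds timeout_;
  std::unordered_map<std::string, std::chrono::steady_clock::time_point>
      being_received_;
};

int main() {
  FetchState state(std::chrono::milliseconds(500));
  state.TransferStarted("obj-1");
  std::cout << state.ShouldRerequest("obj-1") << "\n";  // 0: in flight
  std::cout << state.ShouldRerequest("obj-2") << "\n";  // 1: re-request
}
```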