* directory for raylet
* some initial class scaffolding -- in progress
* node_manager build code and test stub files.
* class scaffolding for resources, workers, and the worker pool
* Node manager server loop
* raylet policy and queue - wip checkpoint
* fix dependencies
* add gen_nm_fbs as target.
* object manager build, stub, and test code.
* Start integrating WorkerPool into node manager
* fix build on mac
* tmp
* adding LsResources boilerplate
* add/build Task spec boilerplate
* checkpoint ActorInformation and LsQueue
* Worker pool maintains started and removed workers
* todos for e2e task assignment
* fix build on mac
* build/add lsqueue interface
* channel resource config through from NodeServer to LsResources; prep LsResources to replace/provide worker_pool
* progress on LsResources class: resource availability check implementation
* Read task submission messages from a client
* Submit tasks from the client to the local scheduler
* Assign a task to a worker from the WorkerPool
* change the way node_manager is built to prevent build issues for object_manager.
* add namespaces. fix build.
* Move ClientConnection message handling into server, remove reference to WorkerPool
* Add raw constructors for TaskSpecification
* Define TaskArgument by reference and by value
* Flatbuffer serialization for TaskSpec
* expand resource implementation
* Start integrating TaskExecutionSpecification into Task
* Separate WorkerPool from LsResources, give ownership to NodeServer
* checkpoint queue and resource code
* resolving merge conflicts
* lspolicy::schedule; adding lsqueue and lspolicy to the nodeserver
* Implement LsQueue RemoveTasks and QueueReadyTasks
* Fill in some LsQueue code for assigning a task
* added support for test_asio
* Implement LsQueue queue tasks methods, queue running tasks
* calling into policy from nodeserver; adding cluster resource map
* Feedback and Testing.
Incorporate Alexey's feedback. Actually test some code. Clean up callback implementation.
* end to end task assignment
* Decouple local scheduler from node server
* move TODO
* Move local scheduler to separate file
* Add scaffolding for reconstruction policy, task dependency manager, and object manager
* fix
* asio for store client notifications.
added asio for plasma store connection.
added tests for store notifications.
encapsulate store interaction under store_messenger.
* Move Worker inside of ClientConnection
* Set the assigned task ID in the worker
* Several changes toward object manager implementation.
Store client integration with asio.
Complete OM/OD scaffolding.
* simple simulator to estimate number of retry timeouts
* changing dbclientid --> clientid
* fix build (include sandbox after it's fixed).
* changes to object manager, adding lambdas to the interface
* changing void * callbacks to std::function typed callbacks
* remove use namespace std from .h files.
use ray:: for Status everywhere.
* minor
* lineage cache interfaces
* TODO for object IDs
* Interface for the GCS client table
* Revert "Set the assigned task ID in the worker"
This reverts commit a770dd31048a289ef431c56d64e491fa7f9b2737.
* Revert "Move Worker inside of ClientConnection"
This reverts commit dfaa0d662a76976c05be6d76b214b45d88482818.
* OD/OM: ray::Status
* mock gcs integration.
* gcs mock clientinfo assignment
* Allow lookup of a Worker in the WorkerPool
* Split out Worker and ClientConnection source files
* Allow assignment of a task ID to a worker, skeleton for finishing a task
* integrate mock gcs with om tests.
* added tcp connection acceptor
* integrated OM with NM.
integrated GcsClient with NM.
Added multi-node integration tests.
* OM to receive incoming tcp connections.
* implemented object manager connection protocol.
* Added todos.
* slight adjustment to add/remove handler invocation on object store client.
* Simplify Task interface for getting dependencies
* Remove unused object manager file
* TaskDependencyManager tracks missing task dependencies and processes object add notifications
* Local scheduler queues tasks according to argument availability
* Fill in TaskSpecification methods to get arguments
* Implemented push.
* Queue tasks that have been scheduled but that are waiting for a worker
* Pull + mock gcs cleanup.
* OD/OM/GCS mock code review, fixing unused-result issues, eliminating copy ctor
* Remove unique_ptr from object_store_client
* Fix object manager Push memory error
* Pull task arguments in task dependency manager
* Add a demo script for remote task dependencies
* Some comments for the TaskDependencyManager
* code cleanup; builds on mac
* Make ClientConnection a templated type based on the connection protocol
* Add gmock to build
* Add WorkerPool unit tests
* clean up.
* clean up connection code.
* instantiate a template instance in the module
* Virtual destructors
* Document public api.
* Separate read and write buffers in ClientConnection; documentation
* Remove ObjectDirectory from NodeServer constructor, make directory InitGcs call a separate constructor
* Convert NodeServer Terminate to a destructor
* NodeServer documentation
* WorkerPool documentation
* TaskDependencyManager doc
* unifying naming conventions
* unifying naming conventions
* Task cleanup and documentation
* unifying naming conventions
* unifying naming conventions
* code cleanup and naming conventions
* code cleanup
* Rename om --> object_manager
* Merge with master
* SchedulingQueue doc
* Docs and implementation skeleton for ClientTable
* Node manager documentation
* ReconstructionPolicy doc
* Replace std::bind with lambda in TaskDependencyManager
* lineage cache doc
* Use \param style for doc
* documentation for scheduling policy and resources
* minor code cleanup
* SchedulingResources class documentation + code cleanup
* referencing ray/raylet directory; doxygen documentation
* updating trivial policy
* Fix bug where event loop stops after task submission
* Define entry point for ClientManager for handling new connections
* Node manager to node manager protocol, heartbeat protocol
* Fix flatbuffer
* Fix GCS flatbuffer naming conflict
* client connection moved to common dir.
* rename based on feedback.
* Added google style and 90 char lines clang-format file under src/ray.
* const ref ClientID.
* Incorporated feedback from PR.
* raylet: includes and namespaces
* raylet/om/gcs logging/using
* doxygen style
* camel casing, comments, other style; DBClientID -> ClientID
* object_manager : naming, defines, style
* consistent caps and naming; misc style
* cleaning up client connection + other stylistic fixes
* cmath, std::nan
* more style polish: OM, Raylet, gcs tables
* removing sandbox (moved to ray-project/sandbox)
* raylet linting
* object manager linting
* gcs linting
* all other linting
Co-authored-by: Melih <elibol@gmail.com>
Co-authored-by: Stephanie <swang@cs.berkeley.edu>
* restructure how to organize 3rd party libs
* Minor whitespace changes.
* Fix compilation on Linux.
* Pass around Python executable so that the correct version of Python is used.
* Add shell script for building parquet
* Use parquet ci script; remove anaconda
* Remove gcc flag, use default
* add boost_root
* Fix $TP_DIR reference issue
* fix the PR
* check out specific parquet-cpp commit
* spillback policy implementation: global + local scheduler
* modernize global scheduler policy state; factor out random number engine and generator
* Minimal version.
* Fix test.
* Make load balancing test less strenuous.
* Expose calls to get and set the actor frontier
* Remove fields used for old checkpointing prototype, change actor_checkpoint_failed -> succeeded
* Prototype for actor checkpointing
* Filter out duplicate tasks on the local scheduler
* Clean up some of the Python checkpointing code
* More cleanups
* Documentation
* cleanup and fix unit test
* Allow remote checkpoint calls through actor handle
* Check whether object is local before reconstructing
* Enable checkpointing for distributed actor handles, refactor tests
* Fix local scheduler tests
* lint
* Address comments
* lint
* Skip tests that fail on new GCS
* style
* Don't put same object twice when setting the actor frontier
* Address Philipp's comments, cleaner fbs naming
* added deterministic check for objects received in fetch_timeout_handler.
* use receive time, in case something goes wrong after object is received.
* increase timeout for removal.
* indentation fix.
* make log info log debug. clean up debug log.
* undo unnecessary changes.
* changed description var.
* shorten line 949.
* incorporate feedback.
* linting; make is_object_received function const.
* change semantics of received_objects to objects being received.
added checks to both points at which objects are re-requested.
updated object receive initialization accordingly.
* eliminate erase on receive init. check call to request_transfer_from instead of request_transfer.
* updated comments.
* added todo for multiple object transfers.
* linting.
* Add failing unit test for nondeterministic reconstruction
* Retry scheduling actor tasks if reassigned to local scheduler
* Update execution edges asynchronously upon dispatch for nondeterministic reconstruction
* Fix bug for updating checkpoint task execution dependencies
* Update comments for deterministic reconstruction
* cleanup
* Add (and skip) failing test case for nondeterministic reconstruction
* Suppress test output
* Define execution dependencies flatbuffer and add to Redis commands
* Convert TaskSpec to TaskExecutionSpec
* Add execution dependencies to Python bindings
* Submitting actor tasks uses execution dependency API instead of dummy argument
* Fix dependency getters and some cleanup for fetching missing dependencies
* C++ convention
* Make TaskExecutionSpec a C++ class
* Convert local scheduler to use TaskExecutionSpec class
* Convert some pointers to references
* Finish conversion to TaskExecutionSpec class
* fix
* Fix
* Fix memory errors?
* Cast flatbuffers GetSize to size_t
* Fixes
* add more retries in global scheduler unit test
* fix linting and cast fbb.GetSize to size_t
* Style and doc
* Fix linting and simplify from_flatbuf.
* Enable scheduling with custom resource labels.
* Fix.
* Minor fixes and ref counting fix.
* Linting
* Use .data() instead of .c_str().
* Fix linting.
* Fix ResourcesTest.testGPUIDs test by waiting for workers to start up.
* Sleep in test so that all tasks are submitted before any completes.
* wip
* with test
* add timeout
* also add test for f
* remove on cleanup
* update
* wip
* fix tests
* mark actor removed in redis
* clang-format
* fix bug when there are no in-progress tasks
* try to set task status done
* Add comment.
* Convert to string using std::string
* Fix linting issue
* Fix linting
* Construct db_connect_args using vector
* Use vector size() instead of num_args
* Hopefully fix linting now
* Plasma client test for plasma abort
* Use ray-project/arrow:abort-objects branch
* Set plasma manager connection cursor to -1 when not in use
* Handle transfer errors between plasma managers, abort unsealed objects
* Add TODO for local scheduler exiting on plasma manager death
* Revert "Plasma client test for plasma abort"
This reverts commit e00fbd58dc4a632f58383549b19fb9057b305a14.
* Upgrade arrow to version with PlasmaClient::Abort
* Fix plasma manager test
* Fix plasma test
* Temporarily use arrow fork for testing
* fix and set arrow commit
* Fix plasma test
* Fix plasma manager test and make write_object_chunk consistent with read_object_chunk
* style
* upgrade arrow