* use cmake to build ray project, no need to appply build.sh before cmake, fix some abuse of cmake, improve the build performance
* support boost external project, avoid using the system or build.sh boost
* keep compatible with build.sh, remove boost and arrow build from it.
* bugfix: parquet bison version control, plasma_java lib install problem
* bugfix: cmake, do not compile plasma java client if no need
* bugfix: component failures test timeout machenism has problem for plasma manager failed case
* bugfix: arrow use lib64 in centos, travis check-git-clang-format-output.sh does not support other branches except master
* revert some fix
* set arrow python executable, fix format error in component_failures_test.py
* make clean arrow python build directory
* update cmake code style, back to support cmake minimum version 3.4
* Add signal handlers to improve debuggability.
* Fix Linux compiling
* Fix Lint
* Change SIGILL case that happens in both Linux and MaxOs
* Add signal handler to main functions.
* Change handler name.
* Address comment
* Address comment.
* Fix Linux building failure
* Introduce RAII mechanism to SignalHandlers.
* Add InitShutdownWrapper to handle all RAII requirements
* Change util_test to signal_test
* Make sure shutdown is not nullptr.
* Using google::InstallFailureSignalHandler() instead of our own signal handler
* Refine code addording to comment
* Fix valgrind test failure.
* remove Shutdown template
* consistency
* linting
Basically a re-implementation of #2281, with modifications of #2298 (A fix of #2334, for rebasing issues.).
[+] Implement sharding for gcs tables.
[+] Keep ClientTable and ErrorTable managed by the primary_shard. TaskTable is managed by the primary_shard for now, until a good hashing for tasks is implemented.
[+] Move AsyncGcsClient's initialization into Connect function.
[-] Move GetRedisShard and bool sharding from RedisContext's connect into AsyncGcsClient. This may make the interface cleaner.
* Enable --scoped-enums in flatbuffer compiler.
* Change enum to c++11 style (enum class).
* Resolve conflicts.
* Solve building failure when RAY_USE_NEW_GCS=on and remove ERROR_INDEX suffix.
* Merge with master and fix CI failure.
Summary:
Able to run 1000 tasks with object dependencies on a set of distributed Raylets.
Raylet Changes:
Finalized ClientConnection class.
Task forwarding.
NM-to-NM heartbeats.
NM resource accounting for tasks.
Simple scheduling policy with task forwarding.
Creating and maintaining NM 2 NM long-lived connections and reusing them for task forwarding.
LineageCache Changes:
LineageCache without cleanup of tasks committed by remote nodes.
Lineage cache writeback and cleanup implementation.
ObjectManager Changes:
Object manager event loop/ClientConnection refactor.
Multithreaded object manager (disabled in this PR).
Testing Changes:
Integration tests for task submission on multiple Raylets.
Stress tests for object manager (with GCS and object store integration).
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Alexey Tumanov <atumanov@gmail.com>
* Add TableRequestNotifications and TableCancelNotifications to Redis modules
* Add RequestNotifications and CancelNotifications to generic GCS Table
* Add tests for subscribing to specific keys
* Remove TODO!
* Return the current value at the key directly from RequestNotifications instead of through publish
* Add unit test for Lookup failure callback
* Modify tests to account for empty subscription response
* Remove ObjectTable notification methods
* Clean up message parsing and doc in redis context
* Use vectors of DataT in all GCS callbacks
* Clean up SubscriptionCallback
* Move Table definitions into tables.cc
* Refactor and document redis modules
* doc
* Fix new GCS build
* Cleanups
* Revert "Fix new GCS build"
This reverts commit 6e3e69090c67ef60aaf22a9cf62be0290d989e96.
* Use vectors for internal callback interface, user-facing interface takes a reference to a single item
* Fix new GCS build
* Add unit test for Lookup failure callback
* Fix compiler errors
* Cleanup
* Publish the entry ID with the notification
* Check that the ID for a notification matches in client tests
* Print error when actor takes too long to start, and refactor error message pushing.
* Print warning every ten seconds.
* Fix linting and tests.
* Fix tests.
* added deterministic check for objects received in fetch_timeout_handler.
* use receive time, in case something goes wrong after object is received.
* increase timeout for removal.
* indentation fix.
* make log info log debug. clean up debug log.
* undo unecessary changes.
* changed description var.
* shorten line 949.
* incorporate feedback.
* linting; make is_object_received function consts.
* change semantics of received_objects to objects being received.
added checks to both points at which objects are re-requested.
updated object receive initialization accordingly.
* eliminate erase on receive init. check call to request_transfer_from instead of request_transfer.
* updated comments.
* added todo for multiple object transfers.
* linting.
* Define execution dependencies flatbuffer and add to Redis commands
* Convert TaskSpec to TaskExecutionSpec
* Add execution dependencies to Python bindings
* Submitting actor tasks uses execution dependency API instead of dummy argument
* Fix dependency getters and some cleanup for fetching missing dependencies
* C++ convention
* Make TaskExecutionSpec a C++ class
* Convert local scheduler to use TaskExecutionSpec class
* Convert some pointers to references
* Finish conversion to TaskExecutionSpec class
* fix
* Fix
* Fix memory errors?
* Cast flatbuffers GetSize to size_t
* Fixes
* add more retries in global scheduler unit test
* fix linting and cast fbb.GetSize to size_t
* Style and doc
* Fix linting and simplify from_flatbuf.
* Enable scheduling with custom resource labels.
* Fix.
* Minor fixes and ref counting fix.
* Linting
* Use .data() instead of .c_str().
* Fix linting.
* Fix ResourcesTest.testGPUIDs test by waiting for workers to start up.
* Sleep in test so that all tasks are submitted before any completes.
* Plasma client test for plasma abort
* Use ray-project/arrow:abort-objects branch
* Set plasma manager connection cursor to -1 when not in use
* Handle transfer errors between plasma managers, abort unsealed objects
* Add TODO for local scheduler exiting on plasma manager death
* Revert "Plasma client test for plasma abort"
This reverts commit e00fbd58dc4a632f58383549b19fb9057b305a14.
* Upgrade arrow to version with PlasmaClient::Abort
* Fix plasma manager test
* Fix plasma test
* Temporarily use arrow fork for testing
* fix and set arrow commit
* Fix plasma test
* Fix plasma manager test and make write_object_chunk consistent with read_object_chunk
* style
* upgrade arrow
* Object table lookup returns vector of DBClientID instead of address strings
* Add node IP address to DBClient notification
* DB client cache stores entire DB client, convert addresses to std::string
* get cached db client returns the client
* Expose a call to initialize the redis cache
* Local scheduler filters out dead clients during reconstruction
* Remove node ip address from dbclient, use aux_address for plasma managers
* Get entire db client entry when not found in cache
* Fix common tests
* Fix address in tests
* Push error to driver if driver task did the put
* Address Robert's comments and cleanup
* Remove unused Redis command
* Fix db test
* Initial pass at factoring out C++ configuration into a single file.
* Expose config through Python.
* Forward declarations.
* Fixes with Python extensions
* Remove old code.
* Consistent naming for constants.
* Fixes
* Fix linting.
* More linting.
* Whitespace
* rename config -> _config.
* Move config inside a class.
* update naming convention
* Fix linting.
* More linting
* More linting.
* Add in some more constants.
* Fix linting
* Compile global scheduler with -Werror -Wall.
* Compile plasma manager with -Werror -Wall.
* Compile local scheduler with -Werror -Wall.
* Compile common code with -Werror -Wall.
* Signed/unsigned comparisons.
* More signed/unsigned fixes.
* More signed/unsigned fixes and added extern keyword.
* Fix linting.
* Don't check strict-aliasing because Python.h doesn't pass.
* Add timing statement to loop that calls redis_get_cached_db_client because it has been slow in the past.
* Fix linting.
* Refactoring to make manager vectors into std::vector.
* Fix linting.
* Fixes.
* Comment out local scheduler valgrind test.
* Fix free/delete error.
* More free -> delete errors
* One more free -> delete and also clean up callback state in plasma manager.
* Add set -x to run_valgrind scripts.
* Fix valgrind error in CreateLocalSchedulerInfoMessage.
* Replaced utstring with std::string
* Converted transfer_queue to a list
* Converted pending_object_transfers to unordered_map
* Fix free/delete bug and small modifications.
* Implemented local_available_objects as an unordered set
* Implemented fetch_requests as an unordered map
* Fixed bug and changed fetch_requests from pointer to object
* free(PlasmaManagerState *) -> delete PlasmaManagerState *
* removed unnecessary newline
* Make local_available_objects not a pointer.
* Attempt to safely iterate over unordered_map and remove elements.