* Add profile table and store profiling information there.
* Code for dumping timeline.
* Improve color scheme.
* Push timeline events on driver only for raylet.
* Improvements to profiling and timeline visualization
* Some linting
* Small fix.
* Linting
* Propagate node IP address through profiling events.
* Fix test.
* object_id.hex() should return byte string in python 2.
* Include gcs.fbs in node_manager.fbs.
* Remove flatbuffer definition duplication.
* Decode to unicode in Python 3 and bytes in Python 2.
* Minor
* Submit profile events in a batch. Revert some CMake changes.
* Fix
* Workaround test failure.
* Fix linting
* Linting
* Don't return anything from chrome_tracing_dump when filename is provided.
* Remove some redundancy from profile table.
* Linting
* Move TODOs out of docstring.
* Minor
* Fix documentation indentation.
* Add error table to GCS and push error messages through node manager.
* Add type to error data.
* Linting
* Fix failure_test bug.
* Linting.
* Enable one more test.
* Attempt to fix doc building.
* Restructuring
* Fixes
* More fixes.
* Move current_time_ms function into util.h.
* build_credis.sh: use an up-to-date credis commit.
* build_credis.sh: leveldb is updated, so update build cmds for it
* WIP: make monitor.py issue flush; switch gcs client to use credis
* Experimental: enable automatic GCS flushing with configurable policy.
* Fix linux compilation error
* Fix leveldb build
* Use optimized build for credis
* Address comments
* Attempt to fix tests
* Enable --scoped-enums in flatbuffer compiler.
* Change enum to c++11 style (enum class).
* Resolve conflicts.
* Fix build failure when RAY_USE_NEW_GCS=on and remove ERROR_INDEX suffix.
* Merge with master and fix CI failure.
* Implement global state API for xray.
* Fix object table.
* Fixes for log structure.
* Implement cluster_resources.
* Add driver task to task table.
* Remove python flatbuffers code
* Get some global state API tests running.
* Python linting.
* Fix linting.
* Fix mock modules for doc
* Copy over flatbuffer bindings.
* Fix for tests.
* Linting
* Fix monitor crash.
* Changes under src to support the Java worker
--------------------------
This commit includes the changes under src that are part of Ray's Java worker support.
It consists of the following changes:
- src/common/task.cc: fix a null pointer problem
- org_ray_spi_impl_DefaultLocalSchedulerClient.*: JNI support for the local scheduler client; note that org_ray_spi_impl_DefaultLocalSchedulerClient.cc is not autogenerated
* Changes to the build system to support the Java worker
--------------------------
This commit includes the build system changes that are part of Ray's Java worker support.
It consists of the following changes:
- changes to the CMakeLists.txt files
- changes to the Python setup.py and init files to adapt to the changed build system
- move local_scheduler_extension.cc to adapt to the changed build system, which may better support multi-language workers
* minor whitespace
* Linting
* Allow numpy arrays and larger objects to be passed by value in task specifications.
* Fix bug.
* Fix bug. Inline all but numpy object arrays.
* Increase size limit for inlining args in task spec.
* Give numpy init different signatures in Python 2 and Python 3.
* Simplify code.
* Fix test.
* Use import_array1 instead of import_array.
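The pass-by-value change in the commits above can be pictured with a small sketch. This is not Ray's actual implementation: `INLINE_SIZE_LIMIT`, `ObjectRef`, `FakeObjectStore`, and `prepare_arg` are hypothetical names, and the 512-byte limit is an assumed value chosen only for illustration. Arguments whose pickled size fits under the limit are embedded in the task spec, and larger ones are stored and passed by reference.

```python
import pickle
from dataclasses import dataclass, field

# Assumed threshold for illustration only; the real limit is not stated here.
INLINE_SIZE_LIMIT = 512


@dataclass(frozen=True)
class ObjectRef:
    """Hypothetical handle to a value kept in an object store."""
    object_id: int


@dataclass
class FakeObjectStore:
    """In-memory stand-in for an object store (hypothetical)."""
    objects: dict = field(default_factory=dict)

    def put(self, payload: bytes) -> ObjectRef:
        ref = ObjectRef(len(self.objects))
        self.objects[ref.object_id] = payload
        return ref


def prepare_arg(value, store, limit=INLINE_SIZE_LIMIT):
    """Return ("value", bytes) if small enough to inline, else ("ref", ObjectRef)."""
    payload = pickle.dumps(value)
    if len(payload) <= limit:
        return ("value", payload)
    return ("ref", store.put(payload))


store = FakeObjectStore()
small = prepare_arg([1, 2, 3], store)            # inlined in the task spec
large = prepare_arg(list(range(10000)), store)   # stored, passed by reference
```

Raising the limit trades larger task specs for fewer object store round trips, which is presumably the motivation for increasing it.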
Summary:
Able to run 1000 tasks with object dependencies on a set of distributed Raylets.
Raylet Changes:
- Finalized ClientConnection class.
- Task forwarding.
- NM-to-NM heartbeats.
- NM resource accounting for tasks.
- Simple scheduling policy with task forwarding.
- Creating and maintaining long-lived NM-to-NM connections and reusing them for task forwarding.
LineageCache Changes:
- LineageCache without cleanup of tasks committed by remote nodes.
- Lineage cache writeback and cleanup implementation.
ObjectManager Changes:
- Object manager event loop/ClientConnection refactor.
- Multithreaded object manager (disabled in this PR).
Testing Changes:
- Integration tests for task submission on multiple Raylets.
- Stress tests for object manager (with GCS and object store integration).
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
Co-authored-by: Alexey Tumanov <atumanov@gmail.com>
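As a rough illustration of the "simple scheduling policy with task forwarding" listed in the summary above, the sketch below places a task locally when the local node manager has enough free resources and otherwise forwards it to a remote node whose last heartbeat reported enough capacity. The names (`schedule`, `fits`, the node IDs) and the dictionary-based resource format are assumptions made for this example, not the Raylet's actual C++ interfaces.

```python
def fits(demand, available):
    """True if every requested resource is available in sufficient quantity."""
    return all(available.get(name, 0) >= amount for name, amount in demand.items())


def schedule(demand, local_id, local_available, remote_heartbeats):
    """Pick a node for a task.

    demand: resources the task needs, e.g. {"CPU": 1}.
    local_available: free resources on this node.
    remote_heartbeats: node ID -> free resources from the last heartbeat.
    Returns the chosen node ID, or None if no known node can run the task yet.
    """
    if fits(demand, local_available):
        return local_id
    # Iterate in sorted order so the choice is deterministic in this sketch.
    for node_id in sorted(remote_heartbeats):
        if fits(demand, remote_heartbeats[node_id]):
            return node_id  # forward the task to this node
    return None  # keep the task queued locally


heartbeats = {"node-b": {"CPU": 0}, "node-c": {"CPU": 4, "GPU": 1}}
local_choice = schedule({"CPU": 1}, "node-a", {"CPU": 2}, heartbeats)
forwarded = schedule({"CPU": 1, "GPU": 1}, "node-a", {"CPU": 2}, heartbeats)
queued = schedule({"CPU": 8}, "node-a", {"CPU": 2}, heartbeats)
```

Heartbeat data can be stale, so a forwarded task may still need to be re-forwarded or queued on arrival; the sketch ignores that case.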
* TABLE_APPEND call
* Convert callbacks back to taking in a string...
* GCS returns flatbuffers, define Log class
* Cleanups
* Modify client table to use the Log interface
* Fix bug where we replied twice from redis
* Fixes
* lint
* Compile and test raylet TaskTable
* Modify GCS tables to handle unique_ptrs from nested flatbuffers
* Add raylet::TaskTable unit tests to replace ObjectTable ones
* Convert ObjectTable to a log
* Convert ObjectTable tests to the Log
* AppendAt Redis and gcs Log command
* unit test for AppendAt
* Add a Log for task reconstruction data
* Add check for unique entries in TABLE_APPEND
* Documentation
* Add TableRequestNotifications and TableCancelNotifications to Redis modules
* Add RequestNotifications and CancelNotifications to generic GCS Table
* Add tests for subscribing to specific keys
* Remove TODO!
* Return the current value at the key directly from RequestNotifications instead of through publish
* Add unit test for Lookup failure callback
* Modify tests to account for empty subscription response
* Remove ObjectTable notification methods
* Clean up message parsing and doc in redis context
* Use vectors of DataT in all GCS callbacks
* Clean up SubscriptionCallback
* Move Table definitions into tables.cc
* Refactor and document redis modules
* doc
* Fix new GCS build
* Cleanups
* Revert "Fix new GCS build"
This reverts commit 6e3e69090c67ef60aaf22a9cf62be0290d989e96.
* Use vectors for internal callback interface, user-facing interface takes a reference to a single item
* Fix new GCS build
* Add unit test for Lookup failure callback
* Fix compiler errors
* Cleanup
* Publish the entry ID with the notification
* Check that the ID for a notification matches in client tests
* Print error when actor takes too long to start, and refactor error message pushing.
* Print warning every ten seconds.
* Fix linting and tests.
* Fix tests.
* Treat actor creation like a regular task.
* Small cleanups.
* Change semantics of actor resource handling.
* Bug fix.
* Minor linting
* Bug fix
* Fix jenkins test.
* Fix actor tests
* Some cleanups
* Bug fix
* Fix bug.
* Remove cached actor tasks when a driver is removed.
* Add more info to taskspec in global state API.
* Fix cyclic import bug in tune.
* Fix
* Fix linting.
* Fix linting.
* Don't schedule any tasks (especially actor creation tasks) on local schedulers with 0 CPUs.
* Bug fix.
* Add test for 0 CPU case
* Fix linting
* Address comments.
* Fix typos and add comment.
* Add assertion and fix test.
* Restructure how 3rd party libs are organized
* Minor whitespace changes.
* Fix compilation on Linux.
* Pass around Python executable so that the correct version of Python is used.
* spillback policy implementation: global + local scheduler
* modernize global scheduler policy state; factor out random number engine and generator
* Minimal version.
* Fix test.
* Make load balancing test less strenuous.
* Add failing unit test for nondeterministic reconstruction
* Retry scheduling actor tasks if reassigned to local scheduler
* Update execution edges asynchronously upon dispatch for nondeterministic reconstruction
* Fix bug for updating checkpoint task execution dependencies
* Update comments for deterministic reconstruction
* cleanup
* Add (and skip) failing test case for nondeterministic reconstruction
* Suppress test output