This removes the force_start argument from StartWorkerProcess in the worker pool so that no more than maximum_startup_concurrency are ever started concurrently. In particular, when the raylet starts up, it my start fewer than num_workers workers.
* Limit number of concurrent workers started by hardware concurrency.
* Check if std:🧵:hardware_concurrency() returns 0.
* Pass in max concurrency from Python.
* Fix Java call to startRaylet.
* Fix typo
* Remove unnecessary cast.
* Fix linting.
* Cleanups on Java side.
* Comment back in actor test.
* Require maximum_startup_concurrency to be at least 1.
* Fix linting and test.
* Improve documentation.
* Fix typo.
## What do these changes do?
* distribute load and resource information on a heartbeat
* for each raylet, maintain total and available resource capacity as well as measure of current load
* this PR introduces a new notion of load, defined as a sum of all resource demand induced by queued ready tasks on the local raylet. This provides a heterogeneity-aware measure of load that supersedes legacy Ray's task count as a proxy for load.
* modify the scheduling policy to perform *capacity-based*, *load-aware*, *optimistically concurrent* resource allocation
* perform task spillover to the heartbeating node in response to a heartbeat, implementing heterogeneity-aware late-binding/work-stealing.
## What do these changes do?
Because the logic of generating `TaskID` in java is different from python's, there are many tests fail when we change the `Ray Core` code.
In this change, I rewrote the logic of generating `TaskID` in java which is the same as the python's.
In java, we call the native method `_generateTaskId()` to generate a `TaskID` which is also used in python. We change `computePutId()`'s logic too.
## Related issue number
[#2608](https://github.com/ray-project/ray/issues/2608)
This PR enables multi-language support in the raylet backend.
- `Worker` class now has a `language` label;
- `WorkerPool`:
- It now maintains one set of states for each language.
- `PopWorker` function's parameter type is changed to `TaskSpecification`, and it will choose a worker to pop based on both task's language and actor id.
- `Size` and `StartWorkerProcess` functions now have an extra `language` parameter.
- `RegisterClientRequest` message now has an extra `language` field in raylet mode, which tells the node manager which language the worker is.
## What do these changes do?
#2362 left a bug where it assumed that the driver task ID was nil. This fixes the bug to check the `SchedulingQueue` for any driver task IDs instead.
* [WIP] Support different backend log lib
* Refine code, unify level, address comment
* Address comment and change formatter
* Fix linux building failure.
* Fix lint
* Remove log4cplus.
* Add log init to raylet main and add test to travis.
* Address comment and refine.
* Update logging_test.cc
* Cache a Task's object dependencies
* Cache the parent task IDs for lineage cache entries
* Cache the parent task IDs in lineage cache entries
* revert
* Fix test
* remove unused line
* Fix test
* Support building Java and Python version at the same time.
* Remove duplicated definition.
* Refine the building process of local_scheduler
* Refine
* Add comment for languages
* Modify instruction and add python,jave building to CI.
* change according to comment
* Move all ObjectManager members to bottom of class def
* Better Pull requests
- suppress duplicate Pulls
- retry the Pull at the next client after a timeout
- cancel a Pull if the object no longer appears on any clients
* increase object manager Pull timeout
* Make the component failure test harder.
* note
* Notify SubscribeObjectLocations caller of empty list
* Address melih's comments
* Fix wait...
* Make component failure test easier for legacy ray
* lint
* Log a warning on remote object manager failures
* Mark a task that was failed to be forwarded as pending
* Raylet component failure test and make it harder
* Turn on component failure test for xray
* Remove return status from ReleaseSender
* lint
* use different random number generator to be compatible with older valgrind versions
* seed from time
* style
* fix
* remove more random devices
* also remove random_device from global scheduler
* rename mutex
* linting
## What do these changes do?
This implements basic task reconstruction in raylet. There are two parts to this PR:
1. Reconstruction suppression through the `TaskReconstructionLog`. This prevents two raylets from reconstructing the same task if they decide simultaneously (via the logic in #2497) that reconstruction is necessary.
2. Task resubmission once a raylet becomes responsible for reconstructing a task.
Reconstruction is quite slow in this PR, especially for long chains of dependent tasks. This is mainly due to the lease table mechanism, where nodes may wait too long before trying to reconstruct a task. There are two ways to improve this:
1. Expire entries in the lease table using Redis `PEXPIRE`. This is a WIP and I may include it in this PR.
2. Introduce a "fast path" for reconstructing dependencies of a re-executed task. Normally, we wait for an initial timeout before checking whether a task requires reconstruction. However, if a task requires reconstruction, then it's likely that its dependencies also require reconstruction. In this case, we could skip the initial timeout before checking the GCS to see whether reconstruction is necessary (e.g., if the object has been evicted).
Since handling failures of other raylets is probably not yet complete in master, this only turns back on Python tests for reconstructing evicted objects.
This PR adds a driver table for the new GCS, which enables cleanup functionality associated with monitoring driver death.
Some testing in `monitor_test.py` is restored, but redis sharding for xray is needed to enable remaining tests.
* raylet memory corruption fixes
* add util function to translate boost error to ray status
* tcp client connection now using ray status utility function
* lint
* Add set to lineage cache entry to track nodes already forwarded to.
* Uncommitted lineage function naming, documentation.
* Simple test for uncommitted lineage with a marked task.
* Rebased, changed tests to use ClientID::nil.
* Bug fix, change MergeLineageHelper function type.
* Formatting.
* Checks and test changes based on PR comments.
* GetUncommittedLineage now always returns at least the requested task ID.
* Bug fix (return at least requested task ID)
* Formatting
* Raise application level exception for actor methods that can't be executed and failed tasks.
* Retry task forwarding for actor tasks.
* Small cleanups
* Move constant to ray_config.
* Create ForwardTaskOrResubmit method.
* Minor
* Clean up queued tasks for dead actors.
* Some cleanups.
* Linting
* Notify task_dependency_manager_ about failed tasks.
* Manage timer lifetime better.
* Use smart pointers to deallocate the timer.
* Fix
* add comment