Resubmit the PR https://github.com/ray-project/ray/pull/19936
I've figure out that the test case `//rllib:tests/test_gpus::test_gpus_in_local_mode` failed due to deadlock in local mode.
In local mode, if the user code submits another task during the executing of current task, the `CoreWorker::actor_task_mutex_` may cause deadlock.
The solution is quite simple, release the lock before executing task in local mode.
In the commit 7c2f61c76c:
1. Release the lock in local mode to fix the bug. @scv119
2. `test_local_mode_deadlock` added to cover the case. @rkooo567
3. Left a trivial change in `rllib/tests/test_gpus.py` to make the `RAY_CI_RLLIB_DIRECTLY_AFFECTED ` to take effect.
This PR adds per task/actor scheduling strategy and currently the only strategy are PlacementGroupSchedulingStrategy and DefaultSchedulingStrategy.
Going forward, people should use `scheduling_strategy=PlacementGroupSchedulingStrategy` to define placement group for actor/task. The old way will be deprecated.
* Update build rules and patches for darwin_arm64 platform.
Changes include:
Update nelhage/rules_boost package from current version (08/5/2020) to 5/27/2021 version.
Remove rules_boost-undefine-boost_fallthrough.patch, since BOOST_FALLTHROUGH seems to be defined now.
Minor changes to rules_boost-windows-linkopts.patch to use default condition to add -lpthread flag for all platforms.
Add darwin_arm64 config to BUILD files for lib civetweb pulled in via prometheu dependency.
* upgrade boost to 1.74.0 from 1.71.0 to match the udpated build file for windows.
* Fix ray_cpp_pkg
* Use boost/bind/bind.hpp
boost/bind.hpp and global namespace placeholders are deprecated.
* lint
* Use absl::bind_front when possible. Otherwise, NOLINT
* lint
* lint
* lint
* lint
* more lint
* final lint
* trigger build
* Convert worker pool to queue
* Start up to backlog size more workers
* fixes
* Prestart workers according to num available CPUs
* lint
* x
* Update src/ray/raylet/worker_pool.h
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* Update src/ray/raylet/worker_pool.h
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* dedicated workers
* Fix tests
* x
* fix
* asan
* asan
* Workers can only exec tasks with same job ID
* size_t for runtime env hash, fix unit tests
* include job ID in runtime env hash, remove from worker registration msg
* x
* conflict
* debug
* Schedule and dispatch periodically, skip if no new tasks
* Update src/ray/common/task/task_spec.h
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* Update src/ray/raylet/scheduling/cluster_task_manager.h
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* Update src/ray/raylet/worker_pool.h
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* Streaming support metric reporter
* fix lint
* fix bazel format lint
* fix lint
* metric deps lint
* lint
* and comments for runtime reporter
* unordered_map instead
* comments
* fix visibility flag
* deps local .so target
* make stats public visibility
* stats lib in public
* add antgroup team tag
* pass RuntimeEnv in task spec as opaque string
* lint
* set correct empty value for json: "{}" not ""
* add comment for field in proto
* fix worker pool test by checking both "" and "{}"
* add RAY_CHECK todo
* make dict empty if all values null
* remove unnecessary ser/de
* fix
* address comments
* add WorkerCacheKey with hash function
* clean up
* add naive impl., dedicated workers never killed
* put dedicated workers in idle_of_all_languages
* pipe env hash from worker.py -> Worker
* fully pipe through hash, basic cache test passing
* use int type for runtime env hash
* convert Worker env hash type from size_t to int
* fix
* add method to MockWorker to fix cpp tests
* make compatible with java streaming test
* restore old dynamic_options code to fix java test
* address comments
* add comment about sorting before hash
* add comments for private members of WorkerCacheKey
* formatting
* format util
* format release
* format rllib/agents
* format rllib/env
* format rllib/execution
* format rllib/evaluation
* format rllib/examples
* format rllib/policy
* format rllib utils and tests
* format streaming
* more formatting
* update requirements files
* fix rllib type checking
* updates
* update
* fix circular import
* Update python/ray/tests/test_runtime_env.py
* noqa