* Convert worker pool to queue
* Start up to backlog size more workers
* fixes
* Prestart workers according to num available CPUs
* lint
* x
* Update src/ray/raylet/worker_pool.h
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* Update src/ray/raylet/worker_pool.h
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* dedicated workers
* Fix tests
* x
* fix
* asan
* asan
* Workers can only exec tasks with same job ID
* size_t for runtime env hash, fix unit tests
* include job ID in runtime env hash, remove from worker registration msg
* x
* conflict
* debug
* Schedule and dispatch periodically, skip if no new tasks
* Update src/ray/common/task/task_spec.h
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* Update src/ray/raylet/scheduling/cluster_task_manager.h
Co-authored-by: Eric Liang <ekhliang@gmail.com>
* Update src/ray/raylet/worker_pool.h
Co-authored-by: Eric Liang <ekhliang@gmail.com>
When a workflow recovers, it tries to reconstruct the DAG. However, reconstruction is step-scoped, which means that if a workflow is passed to multiple steps, it is executed multiple times, which breaks the exactly-once semantics.
For an ObjectRef this is fine, since it is cached in the serialization context, but a similar mechanism is needed for Workflow inputs.
This logic is placed in the workflow layer rather than the serialization layer because the deduplication happens at the application layer.
Issue #18997 has race conditions and is also related to this one. The cause is that multiple steps try to issue writes to virtual actors at the same time, which is not currently allowed and can lead to a race condition.
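
The deduplication idea can be illustrated with a minimal sketch. This is not the Ray workflow implementation: the `recover` function, the dict-based DAG, and the `cache` and `run_log` arguments are all hypothetical names chosen for illustration. It only shows how a cache keyed by step ID, shared across the whole recovery rather than scoped to one step, keeps a workflow that feeds several steps from running more than once.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical DAG layout: step id -> (function, list of upstream step ids).
Dag = Dict[str, Tuple[Callable[..., int], List[str]]]


def recover(step_id: str, dag: Dag, cache: Dict[str, int],
            run_log: List[str]) -> int:
    """Rebuild the result of step_id, running each upstream step at most once."""
    # The cache is shared by every step in this recovery, so a workflow
    # passed to several steps is looked up instead of being re-executed.
    if step_id in cache:
        return cache[step_id]
    func, deps = dag[step_id]
    args = [recover(dep, dag, cache, run_log) for dep in deps]
    run_log.append(step_id)  # records every real execution
    cache[step_id] = func(*args)
    return cache[step_id]


# "shared" is an input to both "left" and "right".
dag: Dag = {
    "shared": (lambda: 1, []),
    "left": (lambda x: x + 1, ["shared"]),
    "right": (lambda x: x + 2, ["shared"]),
    "final": (lambda a, b: a + b, ["left", "right"]),
}

run_log: List[str] = []
result = recover("final", dag, {}, run_log)
```

With a cache created per step instead of one shared across the recovery, "shared" would appear twice in `run_log`, which is the exactly-once violation described above.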
* exp backoff
* up
* format
* up
* up
* up
* up
* up
* format
* fix
* up
* format
* adjust ordering
* up
* Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)"
This reverts commit 2e99fb215f.
* up
* update
* format
* up
* format
* fix
* Revert "Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)""
This reverts commit 93425fdb986059e53699623a0fc8590c062e139b.
* up
* format
* fix lint
* up
* up
* up
* up
* check
* add test1
* format
* up
* add test
* up
* up
* up
* fix
* up
* up
* up
* add test
* format
* up
* up
* fix lint
* format
* fix
* format
* fix
* up
* [ci/tune] Add Tune GPU pipeline step to CI
* cont.
* add sgd gpu tests
* format yaml, fix imports
* install horovod; fix line wrapping
* set GPU per worker to 0.5
* fix import
* move test to 4gpu machine
* fix lint
* lint
* set visible devices
* pull in tf gpu fix
* Fix Tune GPU pipeline step
* nit
* Disable GPU tests until we have some
* Re-add empty rllib tests
Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>