mirror of
https://github.com/vale981/ray
synced 2025-03-09 12:56:46 -04:00
This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
- Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.

## New behavior:
- RUNNING trials will be resumed on another node on a best effort basis (meaning they will run if resources available).
- If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
- During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.

Remaining questions:
- Should `last_result` be consistent during restore? Yes; but not for earlier trials (trials that are yet to be checkpointed).
- Waiting for some PRs to merge first (#3239)

Closes #2851.

||
---|---|---|
.. | ||
benchmarks | ||
ray | ||
asv.conf.json | ||
build-wheel-macos.sh | ||
build-wheel-manylinux1.sh | ||
README-benchmarks.rst | ||
README-building-wheels.md | ||
setup.py |
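
The recovery behavior described in the commit message above can be illustrated with a small, self-contained sketch. This is not Tune's implementation: the `Trial` dataclass, the `recover_from_node_failure` function, and the CPU-only resource model are hypothetical names and simplifications chosen for illustration. It only shows the rule that RUNNING trials from a failed node are resumed elsewhere when resources allow and are otherwise moved back to PENDING, with the affected trials reported so a scheduler or search algorithm could be told to stop waiting on them.

```python
from dataclasses import dataclass
from typing import Dict, List

PENDING, RUNNING = "PENDING", "RUNNING"


@dataclass
class Trial:
    # Hypothetical stand-in for a Tune trial; not the real class.
    name: str
    cpus: int
    status: str = PENDING
    node: str = ""


def recover_from_node_failure(
    trials: List[Trial], failed_node: str, free_cpus: Dict[str, int]
) -> List[str]:
    """Reassign or requeue trials that were running on failed_node.

    free_cpus maps each surviving node to its free CPU count and is
    updated in place. Returns the names of the interrupted trials, which
    is where a scheduler/search-algorithm notification would happen.
    """
    interrupted = []
    for trial in trials:
        if trial.status != RUNNING or trial.node != failed_node:
            continue
        interrupted.append(trial.name)
        # Best effort: pick the first surviving node with enough free CPUs.
        target = next(
            (node for node, free in free_cpus.items() if free >= trial.cpus),
            None,
        )
        if target is None:
            # Cluster is saturated: queue the trial instead of restarting it.
            trial.status, trial.node = PENDING, ""
        else:
            free_cpus[target] -= trial.cpus
            trial.node = target
    return interrupted
```

With two 2-CPU trials on a failed node and only 2 free CPUs elsewhere, the first trial is moved and the second is queued as PENDING; trials on healthy nodes are left alone.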