ray/python at 784a6399b08ec74549c09f01e3ad362a346d7b67 - hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-09 12:56:46 -04:00

Richard Liaw 784a6399b0 [tune] Node Fault Tolerance (#3238)	2018-11-21 12:38:16 -08:00

This PR introduces single-node fault tolerance for Tune.

## Previous behavior:
- Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.

## New behavior:
- RUNNING trials will be resumed on another node on a best effort basis (meaning they will run if resources available).
- If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
- During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.

Remaining questions:
- Should `last_result` be consistent during restore? Yes; but not for earlier trials (trials that are yet to be checkpointed).
- Waiting for some PRs to merge first (#3239)

Closes #2851.
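
The best-effort recovery rule described in the commit message can be illustrated with a small standalone sketch. This is not Tune's implementation: the `Trial` dataclass, the `recover_from_node_failure` helper, and the CPU-only resource model are all hypothetical names and simplifications chosen for illustration, and the scheduler/search-algorithm notification step is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

PENDING, RUNNING = "PENDING", "RUNNING"


@dataclass
class Trial:
    # Hypothetical stand-in for a Tune trial; only the fields the sketch needs.
    trial_id: str
    cpus: int
    node: Optional[str]
    status: str = RUNNING


def recover_from_node_failure(
    trials: List[Trial], free_cpus: Dict[str, int], failed_node: str
) -> List[str]:
    """Move RUNNING trials off `failed_node`, or queue them if nothing fits.

    `free_cpus` maps surviving node names to free CPU counts and is updated
    in place. Returns the ids of trials that were set back to PENDING.
    """
    free_cpus.pop(failed_node, None)  # the failed node offers no resources
    queued = []
    for trial in trials:
        if trial.status != RUNNING or trial.node != failed_node:
            continue  # unaffected by the failure
        # Best effort: take the first surviving node with enough free CPUs.
        target = next(
            (name for name, free in free_cpus.items() if free >= trial.cpus),
            None,
        )
        if target is None:
            # Cluster is saturated: queue the trial instead of restarting it.
            trial.status = PENDING
            trial.node = None
            queued.append(trial.trial_id)
        else:
            free_cpus[target] -= trial.cpus
            trial.node = target
    return queued
```

The point of the sketch is the resource check before resuming: a trial is only moved when a surviving node can hold it, which is the difference from the previous behavior of restarting actors unconditionally.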
benchmarks	Deprecate num_workers argument to ray.init and ray start. (#3114)	2018-10-28 20:12:49 -07:00
ray	[tune] Node Fault Tolerance (#3238)	2018-11-21 12:38:16 -08:00
asv.conf.json	[asv] Pushing to s3 (#2246)	2018-06-20 10:43:44 -07:00
build-wheel-macos.sh	Adding Python3.7 wheels support (#2546)	2018-10-18 17:58:39 -07:00
build-wheel-manylinux1.sh	Adding Python3.7 wheels support (#2546)	2018-10-18 17:58:39 -07:00
README-benchmarks.rst	[rllib][asv] Support ASV for RLlib (#2304)	2018-06-28 17:20:09 -07:00
README-building-wheels.md	[DataFrame] Add Parquet Support in Build Process (#1531)	2018-02-16 07:18:42 -08:00
setup.py	Update redis version in setup.py (#3333)	2018-11-15 10:40:08 -08:00