This PR introduces single-node fault tolerance for Tune.
## Previous behavior:
- Actors are restarted without checking whether resources are available. This can cause problems if the cluster loses resources.
## New behavior:
- RUNNING trials will be resumed on another node on a best-effort basis (meaning they will run only if resources are available).
- If the cluster is saturated, RUNNING trials on the failed node will become PENDING and be queued.
- During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.
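The recovery flow described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `Trial`, `Cluster`, `recover_trials`, and the `notify_stopped` callback are hypothetical names, and `notify_stopped` merely stands in for the notification that schedulers and search algorithms receive via `trial_runner.stop_trial`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

PENDING = "PENDING"
RUNNING = "RUNNING"


@dataclass
class Trial:
    """Hypothetical stand-in for a Tune trial (not the real class)."""
    name: str
    cpus: int
    node: Optional[str]
    status: str = RUNNING


@dataclass
class Cluster:
    """Hypothetical resource view: free CPUs per live node."""
    free_cpus: Dict[str, int] = field(default_factory=dict)

    def find_node(self, cpus: int) -> Optional[str]:
        # Return the first live node with enough free CPUs, if any.
        for node, free in self.free_cpus.items():
            if free >= cpus:
                return node
        return None


def recover_trials(
    trials: List[Trial],
    cluster: Cluster,
    failed_node: str,
    notify_stopped: Callable[[Trial], None],
) -> None:
    """Resume trials from a failed node where possible and queue the rest."""
    # The failed node no longer contributes any resources.
    cluster.free_cpus.pop(failed_node, None)
    for trial in trials:
        if trial.status != RUNNING or trial.node != failed_node:
            continue
        # Let schedulers/search algorithms know the trial is not running,
        # so they do not block waiting for its next result.
        notify_stopped(trial)
        target = cluster.find_node(trial.cpus)
        if target is None:
            # Cluster is saturated: queue the trial instead of restarting it.
            trial.status = PENDING
            trial.node = None
        else:
            # Best effort: resume on a node that still has capacity.
            cluster.free_cpus[target] -= trial.cpus
            trial.node = target
```

The key design point is that the resource check happens before any restart, which is what the previous behavior skipped.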
## Remaining questions:
- Should `last_result` be consistent during restore?
  Yes, but not for earlier trials (trials that have yet to be checkpointed).
- Waiting for some PRs to merge first (#3239)
Closes #2851.
* Broadcast actor death, clean up dummy objects
* Reduce logging and clean up state when failing a task
* lint
* Make actor failure test nicer, reduce node timeout
* Suppress duplicate pre-emptive object pushes.
* Add test.
* Fix linting
* Remove timer and inline recent_pushes_ into local_objects_.
* Improve test.
* Fix
* Fix linting
* Enable retrying pull from same object manager. Randomize object manager.
* Speed up test
* Linting
* Add test.
* Minor
* Lengthen pull timeout and reissue pull every time a new object becomes available.
* Increase pull timeout in test.
* Wait for nodes to start in object manager test.
* Wait longer for nodes to start up in test.
* Small fixes.
* _submit -> _remote
* Change assert to warning.
* Make scheduling queues RemoveTasks return task states as well.
* Add test
* Don't unsubscribe for infeasible tasks when spilling over.
* Linting
* Address comments.
IMPALA support for multi-agent was broken because IMPALA requires batches to be of a certain length, whereas multi-agent envs can produce variable-length batches.
Fix this by adding zero-padding as needed (similar to the RNN case).
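As a rough illustration of the zero-padding idea (not the actual RLlib change), the sketch below pads a batch with all-zero rows until its length is a multiple of a required length. The function name `pad_batch` and the parameter `target_len` are hypothetical, and treating the requirement as "a multiple of a fixed length" is an assumption.

```python
from typing import List, Sequence


def pad_batch(rows: Sequence[Sequence[float]], target_len: int) -> List[List[float]]:
    """Pad a batch with zero rows up to the next multiple of target_len.

    Hypothetical helper: assumes every row has the same width and that the
    consumer needs len(batch) % target_len == 0.
    """
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    if not rows:
        return []
    width = len(rows[0])
    remainder = len(rows) % target_len
    n_pad = 0 if remainder == 0 else target_len - remainder
    # Copy the real rows, then append the all-zero padding rows.
    padded = [list(row) for row in rows]
    padded.extend([0.0] * width for _ in range(n_pad))
    return padded
```

A real implementation would typically also track which rows are padding (for example with a mask) so that they can be excluded from the loss.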
* example
* add env
* test pg
* change to test
* add atexit test
* Update rllib-env.rst
* comment
* revert unnecessary file
* fix title when actor is idle
* Update python/ray/actor.py
Co-Authored-By: ericl <ekhliang@gmail.com>
* add test for adding node
* multinode test fixes
* First pass at allowing updatable values
* Fix compilation issues
* Add config file parsing
* Full initialization
* Wrote a good test
* configuration parsing and stuff
* docs
* write some tests, make it good
* fixed init
* Add all config options and bring back stress tests.
* Update python/ray/worker.py
* Update python/ray/worker.py
* Fix internalization
* some last changes
* Linting and Java fix
* add docstring
* Fix test, add assertions
* pytest ext
* lint
* lint
* speed up task dispatch
* minor changes
* improved comments
* improved comments
* change argument of DispatchTasks to list of tasks
* dispatch only tasks whose dependencies have been fulfilled
* some updated comments
* refactored DispatchQueue() and AssignTask() to avoid the copy of the ready list
* minor fixes
* some more minor fixes
* some more minor fixes
* added more comments
* better comments?
* fixed all feedback comments, minus making the argument of AssignTask() const
* AssignTask() now takes a const argument
* Do the task copy outside of the callback
* fix linting