* Add sample example
* Copy relevant lines of ask from inherited Optimizer
* Ignore strategy
* Additional changes
* Add DragonflySearch for tune connector for Dragonfly
* Add example and fix small errors
* lint
* Remove skopt references
* Update example based off of Dragonfly changes
* Edit example for final Dragonfly edits
* Formatting and documentation edits
* Add documentation and add to test pipeline
* Address PR comments
* Fix Jenkins test
* Adjust Dragonfly to PR#7366
* Lint
* fix_tests
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Add base for Soft Actor-Critic
* Pick changes from old SAC branch
* Update sac.py
* First implementation of sac model
* Remove unnecessary SAC imports
* Prune unnecessary noise and exploration code
* Implement SAC model and use that in SAC policy
* runs but doesn't learn
* clear state
* fix batch size
* Add missing alpha grads and vars
* -200 by 2k timesteps
* doc
* lazy squash
* one file
* ignore tfp
* revert done
Stress testing for Jenkins.
<!--
Thank you for your contribution!
Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request.
-->
<!-- Please give a short brief about these changes. -->
TODO:
- [x] Enable a common keypair for autoscaling
- [x] Add automatic timeouts?
- [x] Switch out key pair one last time before merge
This PR introduces single-node fault tolerance for Tune.
## Previous behavior:
- Actors will be restarted without checking if resources are available. This can lead to problems if we lose resources.
## New behavior:
- RUNNING trials will be resumed on another node on a best effort basis (meaning they will run if resources available).
- If the cluster is saturated, RUNNING trials on that failed node will become PENDING and queued.
- During recovery, TrialSchedulers and SearchAlgorithms should receive notification of this (via `trial_runner.stop_trial`) so that they don’t wait/block for a trial that isn’t running.
Remaining questions:
- Should `last_result` be consistent during restore?
Yes; but not for earlier trials (trials that are yet to be checkpointed).
- Waiting for some PRs to merge first (#3239)
Closes#2851.