Ujval Misra
|
2965dc1b72
|
[tune] Fault tolerance improvements (#5877)
* Precede ray.get with ray.wait.
* Trigger checkpoint deletes locally in Trainable
* Clean up code.
* Minor changes.
* Track best checkpoint so far again
* Pulled checkpoint GC out of Trainable.
* Added comments, error logging.
* Immediate pull after checkpoint taken; rsync source delete on pull
* Minor doc fixes
* Fix checkpoint manager bug
* Fix bugs, tests, formatting
* Fix bugs, feature flag for force sync.
* Fix test.
* Fix minor bugs: clear proc and less verbose sync_on_checkpoint warnings.
* Fix bug: update IP of last_result.
* Fixed message.
* Added a lot of logging.
* Changes to ray trial executor.
* More bug fixes (logging after failure), better logging.
* Fix Richard's bug and logging
* Add comments.
* try-except
* Fix heapq bug.
* Move handling of no available trials to ray_trial_executor (#1)
* Fix formatting bug, lint.
* Addressed Richard's comments
* Revert tests.
* fix rebase
* Fix trial location reporting.
* Fix test
* Fix lint
* Rebase, use ray.get w/ timeout, lint.
* lint
* fix rebase
* Address Richard's comments
|
2019-11-18 01:14:41 -08:00 |
|