7 commits

Author SHA1 Message Date
Ujval Misra
2965dc1b72 [tune] Fault tolerance improvements (#5877)
* Precede ray.get with ray.wait.

* Trigger checkpoint deletes locally in Trainable

* Clean-up code.

* Minor changes.

* Track best checkpoint so far again

* Pulled checkpoint GC out of Trainable.

* Added comments, error logging.

* Immediate pull after checkpoint taken; rsync source delete on pull

* Minor doc fixes

* Fix checkpoint manager bug

* Fix bugs, tests, formatting

* Fix bugs, feature flag for force sync.

* Fix test.

* Fix minor bugs: clear proc and less verbose sync_on_checkpoint warnings.

* Fix bug: update IP of last_result.

* Fixed message.

* Added a lot of logging.

* Changes to ray trial executor.

* More bug fixes (logging after failure), better logging.

* Fix richards bug and logging

* Add comments.

* try-except

* Fix heapq bug.

* .

* Move handling of no available trials to ray_trial_executor (#1)

* Fix formatting bug, lint.

* Addressed Richard's comments

* Revert tests.

* fix rebase

* Fix trial location reporting.

* Fix test

* Fix lint

* Rebase, use ray.get w/ timeout, lint.

* lint

* fix rebase

* Address richard's comments
2019-11-18 01:14:41 -08:00
Ujval Misra
e3e3ad4b25 Add timeout param to ray.get (#6107) 2019-11-14 00:50:04 -08:00
Eric Liang
1455a19c85 Consolidate and clean up documentation (#5645) 2019-09-07 11:50:18 -07:00
Eric Liang
a101812b9f Replace --redis-address with --address in test, docs, tune, rllib (#5602)
* wip

* add tests and tune

* add ci

* test fix

* lint

* fix tests

* wip

* sugar dep
2019-09-01 16:53:02 -07:00
Richard Liaw
411f30c125 [docs] Second push of changes (#5391) 2019-08-28 17:54:15 -07:00
Rehan Sohail Durrani
d2e8331d9a [docs] remove table from walkthrough (#5389) 2019-08-06 17:29:48 -07:00
Richard Liaw
a08ea09760 [docs] rewrite (#5175) 2019-08-05 23:33:14 -07:00