This PR introduces a TrialCheckpoint class which is returned e.g. by ExperimentAnalysis.best_checkpoint. The class enables easy access to cloud storage locations (rather than just local directories before). It also comes with utilities to download, upload, and save trial checkpoints to local and cloud targets.
Adds a new page and table to document current scalability thresholds in Ray Tune to the documentation.
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* [tune] make `tune.with_parameters()` work with the class API
* Update python/ray/tune/utils/trainable.py
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Refactor placement group factory object to accept placement_group arguments instead of callables
* Convert resources to pgf
* Enable placement groups per default
* Fix tests WIP
* Fix stop/resume with placement groups
* Fix progress reporter test
* Fix trial executor tests
* Check resource for trial, not resource object
* Move ENV vars into class
* Fix tests
* Sphinx
* Wait for trial start in PBT
* Revert merge errors
* Support trial reuse with placement groups
* Better check for just staged trials
* Fix trial queuing
* Wait for pg after trial termination
* Clean up PGs before tune run
* No PG settings in pbt scheduler
* Fix buffering tests
* Skip test if ray reports erroneous available resources
* Disable PG for cluster resource counting test
* Debug output for tests
* Output in-use resources for placement groups
* Don't start new trial on trial start failure
* Add docs
* Cleanup PGs once futures returned
* Fix placement group shutdown
* Use updated_queue flag
* Apply suggestions from code review
* Apply suggestions from code review
* Update docs
* Reuse placement groups independently from actors
* Do not remove placement groups for paused trials
* Only continue enqueueing trials if it didn't fail the first time
* Rename parameter
* Fix pause trial
* Code review + try_recover
* Update python/ray/tune/utils/placement_groups.py
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Move placement group lifecycle management
* Move total used resources to pg manager
* Update FAQ example
* Requeue trial if start was unsuccessful
* Do not cleanup pgs at start of run
* Revert "Do not cleanup pgs at start of run"
This reverts commit 933d9c4c
* Delayed PG removal
* Fix trial requeue test
* Trigger pg cleanup on status update
* Fix tests
* Fix docs
* fix-test
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Add DockerSyncer
* Add docs
* Update python/ray/tune/integration/docker.py
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Updated docs
* fix dir
* Added docker integration test
* added docker integration test to bazel build
* Use sdk.rsync API
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
* Support `yield` and `return` statements in Tune trainable functions
* Support anonymous metric with ``tune.report(value)``
* Raise on invalid return/yield value
* Fix end to end reporter test