Commit graph

5 commits

Author SHA1 Message Date
xwjiang2010
ac831fded4
[air] update documentation to use session.report (#26051)
Update documentation to use `session.report`.

Next steps:
1. Update our internal callers to use `session.report`, most importantly CheckpointManager and DataParallelTrainer.
2. Update `get_trial_resources` to use PGF notions to incorporate the requirement of ResourceChangingScheduler. @Yard1 
3. After 2 is done, change all `tune.get_trial_resources` to `session.get_trial_resources`
4. [internal implementation] remove special checkpoint handling logic from huggingface trainer. Optimize the flow for checkpoint conversion with `session.report`.

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-06-30 10:37:31 -07:00
Kai Fricke
bb341eb1e4
Revert "Revert "[tune] Also interrupt training when SIGUSR1 received"" (#24101)
* Revert "Revert "[tune] Also interrupt training when SIGUSR1 received" (#24085)"

This reverts commit 00595653ed.

The failure on Windows has been addressed by conditionally registering the signal handler if available.
2022-04-22 11:27:38 +01:00
xwjiang2010
00595653ed
Revert "[tune] Also interrupt training when SIGUSR1 received" (#24085)
2022-04-21 13:27:34 -07:00
Kai Fricke
f376dd8902
[tune] Also interrupt training when SIGUSR1 received (#24015)
Ray Tune currently gracefully stops training on SIGINT. However, the Ray core worker prevents SIGINT (and SIGTERM) from being processed by child tasks, which means that Ray Tune runs that are started in remote tasks (e.g. via Ray client) cannot be gracefully interrupted.

In k8s-based cloud tests that used the Ray client to kick off a Ray Tune run, this led to test flakiness, as the final experiment state could not be gracefully persisted to cloud storage.

This PR adds support for SIGUSR1 in addition to SIGINT to interrupt training gracefully.
2022-04-21 13:07:29 +01:00
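
The behavior described in this commit, together with the Windows fix mentioned in the later re-revert (#24101), can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not Ray Tune's actual code: the names `stop_requested`, `_request_stop`, and `install_interrupt_handlers` are hypothetical, and it assumes the handlers are installed from the main thread.

```python
import signal
import threading

# Hypothetical flag that a training loop could poll between iterations.
stop_requested = threading.Event()


def _request_stop(signum, frame):
    # Only record the request; the loop decides when to stop and persist state.
    stop_requested.set()


def install_interrupt_handlers():
    """Register the stop handler for SIGINT and, if the platform has it, SIGUSR1.

    Returns the list of signals that were registered.
    Must be called from the main thread, as required by signal.signal.
    """
    registered = [signal.SIGINT]
    signal.signal(signal.SIGINT, _request_stop)

    # SIGUSR1 is not defined on Windows, so register it conditionally.
    if hasattr(signal, "SIGUSR1"):
        signal.signal(signal.SIGUSR1, _request_stop)
        registered.append(signal.SIGUSR1)
    return registered
```

A loop using this sketch would check `stop_requested.is_set()` between iterations and save its state before exiting. On POSIX systems, an external process could then request a graceful stop with `kill -USR1 <pid>`.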
Max Pumperla
5cc9355303
[Docs] Tune docs overhaul (first part) (#22112)
Continuing docs overhaul, tune now has:

- [x] better landing page
- [x] a getting started guide
- [x] user guide was cut down, partially merged with FAQ, and partially integrated with tutorials
- [x] the new user guide contains guides to tune features and practical integrations
- [x] we rewrote some of the feature guides for clarity
- [x] we got rid of sphinx-gallery for this sub-project (only data and core left), as it looks bad and is unnecessarily complicated anyway (plus, it makes the build slower)
- [x] sphinx-gallery examples are now moved to Markdown notebooks, as started in #22030.
- [x] Examples are tested in the new framework, of course.

There's still a lot one can do, but this is already getting too large. Will follow up with more fine-tuning next week.

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-02-07 15:47:03 +00:00