* switch to ARM templates for config and VMs
* auto-formatting
* addressed Scott's comment
* added missing imports
* fixed gpu templates
* fixed wheel reference
* added missing reference
* clean up wording and YAMLs
* Update doc/source/autoscaling.rst
Co-Authored-By: Scott Graham <5720537+gramhagen@users.noreply.github.com>
Co-authored-by: Ubuntu <marcozo@marcozodev2.zqvgrdyupqrudayw1il1agipig.jx.internal.cloudapp.net>
Co-authored-by: Scott Graham <5720537+gramhagen@users.noreply.github.com>
Why are these changes needed?
Running a worker on the head node (locally, not as a Ray actor) allows for easier handling of stateful components such as logging, and makes debugging easier.
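As a rough illustration of the idea only (not the actual RaySGD code), the hypothetical `Worker` class and `make_workers` helper below sketch how keeping one worker in the head/driver process, while the remaining workers run as Ray actors, makes its state directly reachable for logging and debugging:

```python
# Hypothetical sketch of the local-worker idea; Worker and make_workers are
# illustrative names, not the real RaySGD classes.
import ray


class Worker:
    """Toy stand-in for a training worker that holds local state (e.g. logs)."""

    def __init__(self):
        self.logs = []

    def step(self, batch):
        self.logs.append(batch)  # state we want to inspect or debug easily
        return batch * 2


# The same class can still run as a Ray actor for the remaining workers.
RemoteWorker = ray.remote(Worker)


def make_workers(num_workers):
    # One worker lives in the head (driver) process; the rest are actors.
    local_worker = Worker()
    remote_workers = [RemoteWorker.remote() for _ in range(num_workers - 1)]
    return local_worker, remote_workers


if __name__ == "__main__":
    ray.init()
    local_worker, remote_workers = make_workers(3)
    # The local worker's state (logs, progress bars, breakpoints) is reachable
    # without a ray.get round trip; the actors' state is not.
    print(local_worker.step(1), ray.get([w.step.remote(1) for w in remote_workers]))
```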
* Update issue templates
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* more docs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* Checkpoint the basics
* End of day checkpoint
* Checkpoint log-to-head implementation
* Checkpoint
* Add actor-based batch log reporting, currently segfaults
* Work around progress segfault
* Fix some stuff in quicktorch
* Make things more customizable
* Quality of life fixes
* More quality of life
* Move tqdm logic to training_operator
* Update examples
* Fix some minor bugs
* Fix merge
* Fix small things, add pbar to dcgan
* Run format.sh
* Fix missing epoch number for batch pbar
* Address PR comments
* Fix "float is not subscriptable" error
* Add train_loss to pbar by default
* Isolate tqdm code into a handler system
* Format
* Remove the batch_logs_reporter from distributed runner as well
* Check if the train_loss is available before using it
* Enable tqdm in the dcgan example
* Fix a crash in no-handler trainers
* Fix
* Allow not calling set_reporters for tests
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>