Why are these changes needed?
Running a worker on the head node (locally, not as a Ray actor) makes it easier to handle stateful concerns such as logging, and makes debugging easier.
* Update issue templates
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* more docs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* Checkpoint the basics
* End of day checkpoint
* Checkpoint log-to-head implementation
* Checkpoint
* Add actor-based batch log reporting, currently segfaults
* Work around progress segfault
* Fix some stuff in quicktorch
* Make things more customizable
* Quality of life fixes
* More quality of life
* Move tqdm logic to training_operator
* Update examples
* Fix some minor bugs
* Fix merge
* Fix small things, add pbar to dcgan
* Run format.sh
* Fix missing epoch number for batch pbar
* Address PR comments
* Fix float is not subscriptable
* Add train_loss to pbar by default
* Isolate tqdm code into a handler system
* Format
* Remove the batch_logs_reporter from distributed runner as well
* Check if the train_loss is available before using it
* Enable tqdm in the dcgan example
* Fix a crash in no-handler trainers
* Fix
* Allow not calling set_reporters for tests
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
* Revert "fix (#7681)"
This reverts commit 6a12a31b2e.
* Revert "[core] Pin lineage of plasma objects that are still in scope (#7499)"
This reverts commit 014929e658.
* Add a lineage_ref_count to References
* Refactor TaskManager to store TaskEntry as a struct
* Refactor to fix deadlock between TaskManager and ReferenceCounter
Add references to task specs
* Pin TaskEntries and References in the lineage of any ObjectIDs in scope
* Fix deadlock, convert num_plasma_returns to a set of object IDs
* fix unit tests
* Feature flag
* Do not release lineage for objects that were promoted to plasma
* fix build
* fix build
* Remove num executions
* Simplify num return values
* Remove unused
* doc
* Set num returns
* Move lineage pinning flag to ReferenceCounter
* comments
* Fixes
* Remove irrelevant test (replaced by ref counting tests)
* Windows compatibility bug fixes
* Use WSASend/WSARecv as WSASendMsg/WSARecvMsg do not work with TCP sockets
* Clean up some TODOs
* Fix duplicate compilations
* RedisAsioClient boost::asio::error::connection_reset
Co-authored-by: Mehrdad <noreply@github.com>