Will Drevo
fa878e2d4d
Added example to user guide for cloud checkpointing ( #20045 )
...
Co-authored-by: will <will@anyscale.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-11-15 15:43:06 +00:00
matthewdeng
4674c78050
[Train] Rename Ray SGD v2 to Ray Train ( #19436 )
2021-10-18 22:27:46 -07:00
Amog Kamsetty
f6f2435b91
[SGD] Sgd v2 Dataset Integration ( #17626 )
...
* wip
* wip
* wip
* draft
* disable tf autosharding
* wip
* wip
* wip
* wip
* add example
* wip
* wip
* wip
* use dataset.split
* add unit tests
* add linear example
* concatenate tensors and fix example
* WIP tune example
* add tensorflow example
* wip
* random_shuffle_each_window
* fault tolerance test
* GPU, examples, CI
* formatting
* fix
* Update python/ray/util/sgd/v2/tests/test_trainer.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* wip
* type hints
* wip
* update user guide
* fix
* fix immediate issues
* update example
* update
* fix tune gpu test
* fix resources for smoke test - 1 CPU for dataset tasks
* update tests, docs, examples
* Apply suggestions from code review
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
* address comments
* add warning
* fix tests
* minor doc updates
* update example in doc
* configure tests
* Update doc/source/raysgd/v2/user_guide.rst
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
* Update python/ray/data/dataset.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* fix docstring
Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2021-10-12 14:03:10 -07:00
Amog Kamsetty
db0483a29a
[SGD] SGD Namespace Consistency ( #19048 )
...
* wip
* update
* add callbacks
* fix
* fix
* update
* add
* address comments
2021-10-05 15:56:42 -07:00
Eric Liang
032a420ee6
Rename Dataset.pipeline to Dataset.window ( #19050 )
2021-10-01 19:55:29 -07:00
Amog Kamsetty
98ac3f601c
[SGD] v1 to v2 Migration Guide ( #18887 )
...
* wip
* add guide
* fix test
* address comments
* add to docs
* fix
* remove markdown
* add warning to all pages
* formatting
* fix
* links
* Update doc/source/raysgd/v2/migration-guide.rst
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update doc/source/raysgd/v2/migration-guide.rst
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update doc/source/raysgd/v2/migration-guide.rst
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update doc/source/raysgd/v2/migration-guide.rst
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update doc/source/raysgd/v2/migration-guide.rst
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* address comments
* address comments
* fix
* address comments
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-09-30 09:15:21 -07:00
matthewdeng
d2caa00be8
[SGD] add SGDv2 survey link to docs ( #18934 )
2021-09-27 19:15:37 -07:00
Antoni Baum
72cc0c9bda
[SGDv2] Add Tune-Cifar-PyTorch-PBT example ( #18860 )
...
* [SGDv2] Add Tune-Cifar-PyTorch-PBT example
* Update python/ray/util/sgd/v2/BUILD
* Lint
* Update example
* Update docs
2021-09-27 09:22:40 -07:00
Amog Kamsetty
99b1d8c95f
[SGD] Update Docs ( #18839 )
2021-09-23 07:52:57 -07:00
Amog Kamsetty
d354161528
[SGD] Link ray.sgd namespace to ray.util.sgd.v2 ( #18732 )
...
* wip
* add symlink
* update
* remove from init
* no require tune
* try fix
* change
* * import
* fix docs
* address comment
2021-09-22 18:49:41 -07:00
Amog Kamsetty
00dd190df9
[SGD] Retry sgd.local_rank() ( #18824 )
...
* finish
* fix
* wip
* address comment
* update
* fix test
* fix failing test
* address comments
* fix test
* fix
2021-09-22 15:48:38 -07:00
Amog Kamsetty
d9b166252b
Revert "[SGD] sgd.local_rank" ( #18822 )
2021-09-22 13:50:00 -07:00
Amog Kamsetty
39bcbe03bc
[SGD] sgd.local_rank ( #18686 )
...
* finish
* fix
* wip
* address comment
* update
* fix test
* fix failing test
* address comments
* fix test
2021-09-22 08:10:49 -07:00
matthewdeng
380a653787
[SGD] update SGDv2 user guide docs ( #18270 )
...
* [SGD] update SGDv2 user guide docs
* Update doc/source/raysgd/v2/user_guide.rst
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
* add new line
* update docs
* fix header line length
* lint
* lint
* lint
* lint
* fix remaining lint issues
* Update doc/source/raysgd/v2/user_guide.rst
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
* Update doc/source/raysgd/v2/user_guide.rst
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
* address comments
* address comments
* add TODO for iterator API
* Update doc/source/raysgd/v2/user_guide.rst
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
* address comments
* address comments
* add tune doc
* restructure table of contents
* add examples; rename example files to include example suffix
* add quick start, porting code
* address comments
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2021-09-14 09:07:25 -07:00
Amog Kamsetty
3b77840c1b
PyTorch Lightning Updates ( #17876 )
2021-08-27 23:15:51 -07:00
Richard Liaw
ecc7cf4c5e
[sgd] v2 documentation draft ( #17253 )
...
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-08-02 01:47:14 -07:00
kimikuri
93172b535f
[doc][sgd] Broken Link in SGD's page. ( #17404 ) ( #17423 )
2021-07-29 01:13:23 -07:00
Eric Liang
38bddc3f2b
First cut at dataset documentation ( #16956 )
2021-07-14 23:27:13 -07:00
Antoni Baum
2fb10e6730
[SGD] Add support for native Torch AMP in SGD ( #16382 )
...
* SGD native AMP initial commit
* SGD native amp second pass
* Update docs
* Update TorchTrainer doc
* Temp fix release test
* Update release/sgd_tests/sgd_gpu/sgd_gpu_app_config.yaml
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2021-06-15 17:48:21 -07:00
YeahNew
9a93dd9682
Adding a RaySGD and DGL ( Deep Graph Library) integration example(gat… ( #15718 )
...
* Adding a RaySGD and DGL ( Deep Graph Library) integration example(gat_dgl.py)
* Update gat_dgl.py
* Update gat_dgl.py
* Update gat_dgl.py
* the gat_dgl.py has been formatted by the format.sh script
* delete useless code in the gat_dgl.py
* add 'import numpy as np', modified the output form of accuracy in the validate method
* Modified the code for better readability and added the README.md file
* Update README.md
* Update README.md
* Update README.md
* updates
* formatting
Co-authored-by: YeahNew <1650996069@qq.com>
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-05-20 08:47:19 -07:00
Richard Liaw
6c77aeb98a
[docs] ray slack remove banners ( #13898 )
...
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-04 01:14:34 -08:00
Simon Mo
8e0a2f669b
[Doc] Remove trailing whitespaces ( #13390 )
2021-01-12 20:35:38 -08:00
Amog Kamsetty
8a406e1f9a
[SGD] Add PTL Docs ( #12440 )
...
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-11-28 10:09:38 -08:00
Amog Kamsetty
92718de40c
[SGD] Better support for custom DDP ( #11771 )
2020-11-04 13:58:51 -08:00
Amog Kamsetty
d87c186721
[RaySGD] Docs for SGD+Tune usage ( #11479 )
2020-10-22 13:32:27 -07:00
Amog Kamsetty
d5a7c53908
[Ray SGD] use_local flag + Worker group abstraction ( #10539 )
...
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-09-15 11:58:57 -07:00
Amog Kamsetty
415be78cc0
[RaySGD] Simplify Builder Process ( #10321 )
...
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-09-08 15:19:40 -07:00
Richard Liaw
3f98a8bfcb
[docs] Fix warnings for sphinx 1.8 ( #10476 )
...
* fix-build-for-sphinx18
* jnilit
2020-09-01 13:37:35 -07:00
Amog Kamsetty
9ff687c093
[SGD][Docs] docs for training/ validation results ( #10181 )
2020-08-19 17:22:28 -07:00
Richard Liaw
0c3b9ebeef
[tune/sgd] Document func_trainable and add checkpoint context ( #9739 )
...
Co-authored-by: krfricke <krfricke@users.noreply.github.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2020-07-30 09:46:37 -07:00
Richard Liaw
56d934bc18
[docs] Revised Cluster documentation ( #9062 )
...
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-06-26 09:29:22 -07:00
Alex Wu
dcf58a43dc
[SGD] Dataset API ( #7839 )
2020-06-01 15:48:15 -07:00
Bill Chambers
b3d686b78f
[docs] Add Overview Section & Gentle Introduction ( #8517 )
2020-05-26 10:39:34 -05:00
Eric Liang
eabb801a40
less important ( #8439 )
2020-05-13 22:52:38 -07:00
Richard Liaw
857e4dba2f
[sgd] HuggingFace GLUE Fine-tuning Example ( #7792 )
...
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* moredocs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint'
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* save
* failures
* fixes
* trainer
* run test
* operator
* code
* op
* ok done
* operator
* sgd test fixes
* ok
* trainer
* format
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Update doc/source/raysgd/raysgd_pytorch.rst
* docstring
* dcgan
* doc
* commits
* nit
* testing
* revert
* Start renaming pytorch to torch
* Rename PyTorchTrainer to TorchTrainer
* Rename PyTorch runners to Torch runners
* Finish renaming API
* Rename to torch in tests
* Finish renaming docs + tests
* Run format + fix DeprecationWarning
* fix
* move tests up
* benchmarks
* rename
* remove some args
* better metrics output
* fix up the benchmark
* benchmark-yaml
* horovod-benchmark
* benchmarks
* Remove benchmark code for cleanups
* benchmark-code
* nits
* benchmark yamls
* benchmark yaml
* ok
* ok
* ok
* benchmark
* nit
* finish_bench
* makedatacreator
* relax
* metrics
* autosetsampler
* profile
* movements
* OK
* smoothen
* fix
* nitdocs
* loss
* envflag
* comments
* nit
* format
* visible
* images
* move_images
* fix
* rernder
* rrender
* rest
* multgpu
* fix
* nit
* finish
* extrra
* setup
* experimental
* as_trainable
* fix
* ok
* format
* create_torch_pbt
* setup_pbt
* ok
* format
* ok
* format
* docs
* ok
* Draft head-is-worker
* Fix missing concurrency between local and remote workers
* Fix tqdm to work with head-is-worker
* Cleanup
* Implement state_dict and load_state_dict
* Reserve resources on the head node for the local worker
* Update the development cluster setup
* Add spot block reservation to the development yaml
* ok
* Draft the fault tolerance fix
* Small fixes to local-remote concurrency
* Cleanup + fix typo
* fixes
* worker_counts
* some formatting and asha
* fix
* okme
* fixactorkill
* unify
* Revert the cluster mounts
* Cut the handler-reporter API
* Fix most tests
* Rm tqdm_handler.py
* Re-add tune test
* Automatically force-shutdown on actor errors on shutdown
* Formatting
* fix_tune_test
* Add timeout error verification
* Rename tqdm to use_tqdm
* fixtests
* ok
* remove_redundant
* deprecated
* deactivated
* ok_try_this
* lint
* nice
* done
* retries
* fixes
* kill
* retry
* init_transformer
* init
* deployit
* improve_example
* trans
* rename
* formats
* format-to-py37
* time_to_test
* more_changes
* ok
* update_args_and_script
* fp16_epoch
* huggingface
* training stats
* distributed
* Apply suggestions from code review
* transformer
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-04-17 15:17:30 -07:00
Maksim Smolin
d6f4e5b3e1
[SGD] Imagenet example (basic) ( #8020 )
...
* Checkpoint the image-models example
* Update cluster definition
* Fix copyright info
* Use original args
* Checkpoint fixes
* Add README
* Add some missing features
* Format
* Get rid of the unused Namespace class
* Address comments
* Link the imagenet example in docs
* Cleanup
* Fix lint
2020-04-17 13:33:55 -07:00
Richard Liaw
dd63178e91
[sgd] Semantic Segmentation Example ( #7825 )
...
* better_example
* test
* improve some usability things
* submit
* fix
* making a segmentation example
* segmentation_example
* segmentation
* device
* flake
* Update python/ray/util/sgd/torch/training_operator.py
* uti
* finished_example
* block
* format
* locationg
* fix
* ok
* revert
* segmentation
* lint_and_test
* address_comments
2020-04-10 20:35:45 -07:00
Richard Liaw
f63b4c1110
[sgd] make ddp optional ( #7875 )
...
* loosen
* devices
* tryitout
* fix
* fix
* fix
* easy
* test
* fix
* fix
* better visibility
* fix
2020-04-06 11:41:36 -07:00
Richard Liaw
314250d072
[docs] Make Ray slack more prominent ( #7870 )
2020-04-02 11:14:02 -07:00
Richard Liaw
24bf6ad607
[raysgd] Improve raysgd examples ( #7818 )
...
* better_example
* test
* improve some usability things
* submit
* fix
* flake
* Update python/ray/util/sgd/torch/training_operator.py
* trythis
* fix
* fix
* smoke
* fail
* fix
* fix
2020-04-01 08:58:39 -07:00
Richard Liaw
86cff17e7e
[tune/raysgd] Tune API for TorchTrainer + Fix State Restoration ( #7547 )
2020-03-30 12:58:49 -05:00
Richard Liaw
d046faeb9c
[sgd] Readme fix ( #7564 )
...
* readme fix
* replicas
2020-03-11 13:40:18 -07:00
Richard Liaw
b70f31339c
[sgd] Benchmark Fixes ( #7553 )
...
* fix
* fix
2020-03-11 13:08:27 -07:00
Richard Liaw
fbac256982
[sgd] Add benchmarks ( #7454 )
...
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* moredocs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint'
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* save
* failures
* fixes
* trainer
* run test
* operator
* code
* op
* ok done
* operator
* sgd test fixes
* ok
* trainer
* format
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Update doc/source/raysgd/raysgd_pytorch.rst
* docstring
* dcgan
* doc
* commits
* nit
* testing
* revert
* Start renaming pytorch to torch
* Rename PyTorchTrainer to TorchTrainer
* Rename PyTorch runners to Torch runners
* Finish renaming API
* Rename to torch in tests
* Finish renaming docs + tests
* Run format + fix DeprecationWarning
* fix
* move tests up
* benchmarks
* rename
* remove some args
* better metrics output
* fix up the benchmark
* benchmark-yaml
* horovod-benchmark
* benchmarks
* Remove benchmark code for cleanups
* benchmark-code
* nits
* benchmark yamls
* benchmark yaml
* ok
* ok
* ok
* benchmark
* nit
* finish_bench
* makedatacreator
* relax
* metrics
* autosetsampler
* profile
* movements
* OK
* smoothen
* fix
* nitdocs
* loss
* envflag
* comments
* nit
* format
* visible
* images
* move_images
* fix
* rernder
* rrender
* rest
* multgpu
* fix
* nit
* finish
* extrra
* setup
* revert
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-11 01:09:08 -07:00
Richard Liaw
d192ef0611
[raysgd] Cleanup User API ( #7384 )
...
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* moredocs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint'
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* save
* failures
* fixes
* trainer
* run test
* operator
* code
* op
* ok done
* operator
* sgd test fixes
* ok
* trainer
* format
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Update doc/source/raysgd/raysgd_pytorch.rst
* docstring
* dcgan
* doc
* commits
* nit
* testing
* revert
* Start renaming pytorch to torch
* Rename PyTorchTrainer to TorchTrainer
* Rename PyTorch runners to Torch runners
* Finish renaming API
* Rename to torch in tests
* Finish renaming docs + tests
* Run format + fix DeprecationWarning
* fix
* move tests up
* benchmarks
* rename
* remove some args
* better metrics output
* fix up the benchmark
* benchmark-yaml
* horovod-benchmark
* benchmarks
* Remove benchmark code for cleanups
* makedatacreator
* relax
* metrics
* autosetsampler
* profile
* movements
* OK
* smoothen
* fix
* nitdocs
* loss
* comments
* fix
* fix
* runner_tests
* codes
* example
* fix_test
* fix
* tests
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-10 08:41:42 -07:00
Maksim Smolin
3a134c7224
[RaySGD] Rename PyTorch API endpoints to start with Torch ( #7425 )
...
* Start renaming pytorch to torch
* Rename PyTorchTrainer to TorchTrainer
* Rename PyTorch runners to Torch runners
* Finish renaming API
* Rename to torch in tests
* Finish renaming docs + tests
* Run format + fix DeprecationWarning
* fix
* move tests up
* rename
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-03 16:44:42 -08:00
Richard Liaw
48cdca843f
[raysgd] Custom training operator ( #7211 )
2020-03-01 21:22:48 -08:00
Eric Liang
5df801605e
Add ray.util package and move libraries from experimental ( #7100 )
2020-02-18 13:43:19 -08:00
Richard Liaw
94e2fcea2e
[sgd] fp16 (apex) and scheduler support + move examples page ( #7061 )
...
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* moredocs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint'
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* fix tests'
* testmode
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-02-16 19:04:08 -08:00
Richard Liaw
037aa2b961
[sgd] Refactor PyTorch SGD Documentation. ( #6910 )
...
* Refactor documentation and directory structure
* update loss
* more examples
* fix comments
* more code
* svgs
* formatting
* more_docs
* more writing
* comments ready
* move
* whitespace
* examples
* fix
* bold
* pytorch
* batch
* fix
* fix test
* Apply suggestions from code review
* quarantinegp
* tests/
* fix missing
2020-01-29 08:51:01 -08:00