52 commits

Author SHA1 Message Date
Will Drevo
fa878e2d4d
Added example to user guide for cloud checkpointing (#20045)
Co-authored-by: will <will@anyscale.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-11-15 15:43:06 +00:00
matthewdeng
4674c78050
[Train] Rename Ray SGD v2 to Ray Train (#19436) 2021-10-18 22:27:46 -07:00
Amog Kamsetty
f6f2435b91
[SGD] Sgd v2 Dataset Integration (#17626)
* wip

* wip

* wip

* draft

* disable tf autosharding

* wip

* wip

* wip

* wip

* add example

* wip

* wip

* wip

* use dataset.split

* add unit tests

* add linear example

* concatenate tensors and fix example

* WIP tune example

* add tensorflow example

* wip

* random_shuffle_each_window

* fault tolerance test

* GPU, examples, CI

* formatting

* fix

* Update python/ray/util/sgd/v2/tests/test_trainer.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* wip

* type hints

* wip

* update user guide

* fix

* fix immediate issues

* update example

* update

* fix tune gpu test

* fix resources for smoke test - 1 CPU for dataset tasks

* update tests, docs, examples

* Apply suggestions from code review

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

* address comments

* add warning

* fix tests

* minor doc updates

* update example in doc

* configure tests

* Update doc/source/raysgd/v2/user_guide.rst

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

* Update python/ray/data/dataset.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* fix docstring

Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2021-10-12 14:03:10 -07:00
Amog Kamsetty
db0483a29a
[SGD] SGD Namespace Consistency (#19048)
* wip

* update

* add callbacks

* fix

* fix

* update

* add

* address comments
2021-10-05 15:56:42 -07:00
Eric Liang
032a420ee6
Rename Dataset.pipeline to Dataset.window (#19050) 2021-10-01 19:55:29 -07:00
Amog Kamsetty
98ac3f601c
[SGD] v1 to v2 Migration Guide (#18887)
* wip

* add guide

* fix test

* address comments

* add to docs

* fix

* remove markdown

* add warning to all pages

* formatting

* fix

* links

* Update doc/source/raysgd/v2/migration-guide.rst

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update doc/source/raysgd/v2/migration-guide.rst

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update doc/source/raysgd/v2/migration-guide.rst

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update doc/source/raysgd/v2/migration-guide.rst

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update doc/source/raysgd/v2/migration-guide.rst

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* address comments

* address comments

* fix

* address comments

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-09-30 09:15:21 -07:00
matthewdeng
d2caa00be8
[SGD] add SGDv2 survey link to docs (#18934) 2021-09-27 19:15:37 -07:00
Antoni Baum
72cc0c9bda
[SGDv2] Add Tune-Cifar-PyTorch-PBT example (#18860)
* [SGDv2] Add Tune-Cifar-PyTorch-PBT example

* Update python/ray/util/sgd/v2/BUILD

* Lint

* Update example

* Update docs
2021-09-27 09:22:40 -07:00
Amog Kamsetty
99b1d8c95f
[SGD] Update Docs (#18839) 2021-09-23 07:52:57 -07:00
Amog Kamsetty
d354161528
[SGD] Link ray.sgd namespace to ray.util.sgd.v2 (#18732)
* wip

* add symlink

* update

* remove from init

* no require tune

* try fix

* change

* * import

* fix docs

* address comment
2021-09-22 18:49:41 -07:00
Amog Kamsetty
00dd190df9
[SGD] Retry sgd.local_rank() (#18824)
* finish

* fix

* wip

* address comment

* update

* fix test

* fix failing test

* address comments

* fix test

* fix
2021-09-22 15:48:38 -07:00
Amog Kamsetty
d9b166252b
Revert "[SGD] sgd.local_rank" (#18822) 2021-09-22 13:50:00 -07:00
Amog Kamsetty
39bcbe03bc
[SGD] sgd.local_rank (#18686)
* finish

* fix

* wip

* address comment

* update

* fix test

* fix failing test

* address comments

* fix test
2021-09-22 08:10:49 -07:00
matthewdeng
380a653787
[SGD] update SGDv2 user guide docs (#18270)
* [SGD] update SGDv2 user guide docs

* Update doc/source/raysgd/v2/user_guide.rst

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

* add new line

* update docs

* fix header line length

* lint

* lint

* lint

* lint

* fix remaining lint issues

* Update doc/source/raysgd/v2/user_guide.rst

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

* Update doc/source/raysgd/v2/user_guide.rst

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

* address comments

* address comments

* add TODO for iterator API

* Update doc/source/raysgd/v2/user_guide.rst

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>

* address comments

* address comments

* add tune doc

* restructure table of contents

* add examples; rename example files to include example suffix

* add quick start, porting code

* address comments

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2021-09-14 09:07:25 -07:00
Amog Kamsetty
3b77840c1b
PyTorch Lightning Updates (#17876) 2021-08-27 23:15:51 -07:00
Richard Liaw
ecc7cf4c5e
[sgd] v2 documentation draft (#17253)
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-08-02 01:47:14 -07:00
kimikuri
93172b535f
[doc][sgd] Broken Link in SGD's page. (#17404) (#17423) 2021-07-29 01:13:23 -07:00
Eric Liang
38bddc3f2b
First cut at dataset documentation (#16956) 2021-07-14 23:27:13 -07:00
Antoni Baum
2fb10e6730
[SGD] Add support for native Torch AMP in SGD (#16382)
* SGD native AMP initial commit

* SGD native amp second pass

* Update docs

* Update TorchTrainer doc

* Temp fix release test

* Update release/sgd_tests/sgd_gpu/sgd_gpu_app_config.yaml

Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2021-06-15 17:48:21 -07:00
YeahNew
9a93dd9682
Adding a RaySGD and DGL ( Deep Graph Library) integration example(gat… (#15718)
* Adding a RaySGD and DGL ( Deep Graph Library) integration example(gat_dgl.py)

* Update gat_dgl.py

* Update gat_dgl.py

* Update gat_dgl.py

* the gat_dgl.py has been formated by the format.sh script

* delet useless code in the gat_dgl.py

* add 'import numpy as np', modified the output form of accuracy in the validate method

* Modified the code for better readability and added the README.md file

* Update README.md

* Update README.md

* Update README.md

* updates

* formatting

Co-authored-by: YeahNew <1650996069@qq.com>
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-05-20 08:47:19 -07:00
Richard Liaw
6c77aeb98a
[docs] ray slack remove banners (#13898)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-04 01:14:34 -08:00
Simon Mo
8e0a2f669b
[Doc] Remove trailing whitespaces (#13390) 2021-01-12 20:35:38 -08:00
Amog Kamsetty
8a406e1f9a
[SGD] Add PTL Docs (#12440)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-11-28 10:09:38 -08:00
Amog Kamsetty
92718de40c
[SGD] Better support for custom DDP (#11771) 2020-11-04 13:58:51 -08:00
Amog Kamsetty
d87c186721
[RaySGD] Docs for SGD+Tune usage (#11479) 2020-10-22 13:32:27 -07:00
Amog Kamsetty
d5a7c53908
[Ray SGD] use_local flag + Worker group abstraction (#10539)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-09-15 11:58:57 -07:00
Amog Kamsetty
415be78cc0
[RaySGD] Simplify Builder Process (#10321)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-09-08 15:19:40 -07:00
Richard Liaw
3f98a8bfcb
[docs] Fix warnings for sphinx 1.8 (#10476)
* fix-build-for-sphinx18

* jnilit
2020-09-01 13:37:35 -07:00
Amog Kamsetty
9ff687c093
[SGD][Docs] docs for training/ validation results (#10181) 2020-08-19 17:22:28 -07:00
Richard Liaw
0c3b9ebeef
[tune/sgd] Document func_trainable and add checkpoint context (#9739)
Co-authored-by: krfricke <krfricke@users.noreply.github.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2020-07-30 09:46:37 -07:00
Richard Liaw
56d934bc18
[docs] Revised Cluster documentation (#9062)
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-06-26 09:29:22 -07:00
Alex Wu
dcf58a43dc
[SGD] Dataset API (#7839) 2020-06-01 15:48:15 -07:00
Bill Chambers
b3d686b78f
[docs] Add Overview Section & Gentle Introduction (#8517) 2020-05-26 10:39:34 -05:00
Eric Liang
eabb801a40
less important (#8439) 2020-05-13 22:52:38 -07:00
Richard Liaw
857e4dba2f
[sgd] HuggingFace GLUE Fine-tuning Example (#7792)
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint'

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* save

* failures

* fixes

* trainer

* run test

* operator

* code

* op

* ok done

* operator

* sgd test fixes

* ok

* trainer

* format

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update doc/source/raysgd/raysgd_pytorch.rst

* docstring

* dcgan

* doc

* commits

* nit

* testing

* revert

* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* benchmarks

* rename

* remove some args

* better metrics output

* fix up the benchmark

* benchmark-yaml

* horovod-benchmark

* benchmarks

* Remove benchmark code for cleanups

* benchmark-code

* nits

* benchmark yamls

* benchmark yaml

* ok

* ok

* ok

* benchmark

* nit

* finish_bench

* makedatacreator

* relax

* metrics

* autosetsampler

* profile

* movements

* OK

* smoothen

* fix

* nitdocs

* loss

* envflag

* comments

* nit

* format

* visible

* images

* move_images

* fix

* rernder

* rrender

* rest

* multgpu

* fix

* nit

* finish

* extrra

* setup

* experimental

* as_trainable

* fix

* ok

* format

* create_torch_pbt

* setup_pbt

* ok

* format

* ok

* format

* docs

* ok

* Draft head-is-worker

* Fix missing concurrency between local and remote workers

* Fix tqdm to work with head-is-worker

* Cleanup

* Implement state_dict and load_state_dict

* Reserve resources on the head node for the local worker

* Update the development cluster setup

* Add spot block reservation to the development yaml

* ok

* Draft the fault tolerance fix

* Small fixes to local-remote concurrency

* Cleanup + fix typo

* fixes

* worker_counts

* some formatting and asha

* fix

* okme

* fixactorkill

* unify

* Revert the cluster mounts

* Cut the handler-reporter API

* Fix most tests

* Rm tqdm_handler.py

* Re-add tune test

* Automatically force-shutdown on actor errors on shutdown

* Formatting

* fix_tune_test

* Add timeout error verification

* Rename tqdm to use_tqdm

* fixtests

* ok

* remove_redundant

* deprecated

* deactivated

* ok_try_this

* lint

* nice

* done

* retries

* fixes

* kill

* retry

* init_transformer

* init

* deployit

* improve_example

* trans

* rename

* formats

* format-to-py37

* time_to_test

* more_changes

* ok

* update_args_and_script

* fp16_epoch

* huggingface

* training stats

* distributed

* Apply suggestions from code review

* transformer

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-04-17 15:17:30 -07:00
Maksim Smolin
d6f4e5b3e1
[SGD] Imagenet example (basic) (#8020)
* Checkpoint the image-models example

* Update cluster definition

* Fix copyright info

* Use original args

* Checkpoint fixes

* Add README

* Add some missing features

* Format

* Get rid of the unused Namespace class

* Address comments

* Link the imagenet example in docs

* Cleanup

* Fix lint
2020-04-17 13:33:55 -07:00
Richard Liaw
dd63178e91
[sgd] Semantic Segmentation Example (#7825)
* better_example

* test

* improve some usability things

* submit

* fix

* making a segmentation example

* segmentation_example

* segmentation

* device

* flake

* Update python/ray/util/sgd/torch/training_operator.py

* uti

* finished_example

* block

* format

* locationg

* fix

* ok

* revert

* segmentation

* lint_and_test

* address_comments
2020-04-10 20:35:45 -07:00
Richard Liaw
f63b4c1110
[sgd] make ddp optional (#7875)
* loosen

* devices

* tryitout

* fix

* fix

* fix

* easy

* test

* fix

* fix

* better visibility

* fix
2020-04-06 11:41:36 -07:00
Richard Liaw
314250d072
[docs] Make Ray slack more prominent (#7870) 2020-04-02 11:14:02 -07:00
Richard Liaw
24bf6ad607
[raysgd] Improve raysgd examples (#7818)
* better_example

* test

* improve some usability things

* submit

* fix

* flake

* Update python/ray/util/sgd/torch/training_operator.py

* trythis

* fix

* fix

* smoke

* fail

* fix

* fix
2020-04-01 08:58:39 -07:00
Richard Liaw
86cff17e7e
[tune/raysgd] Tune API for TorchTrainer + Fix State Restoration (#7547) 2020-03-30 12:58:49 -05:00
Richard Liaw
d046faeb9c
[sgd] Readme fix (#7564)
* readme fix

* replicas
2020-03-11 13:40:18 -07:00
Richard Liaw
b70f31339c
[sgd] Benchmark Fixes (#7553)
* fix

* fix
2020-03-11 13:08:27 -07:00
Richard Liaw
fbac256982
[sgd] Add benchmarks (#7454)
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint'

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* save

* failures

* fixes

* trainer

* run test

* operator

* code

* op

* ok done

* operator

* sgd test fixes

* ok

* trainer

* format

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update doc/source/raysgd/raysgd_pytorch.rst

* docstring

* dcgan

* doc

* commits

* nit

* testing

* revert

* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* benchmarks

* rename

* remove some args

* better metrics output

* fix up the benchmark

* benchmark-yaml

* horovod-benchmark

* benchmarks

* Remove benchmark code for cleanups

* benchmark-code

* nits

* benchmark yamls

* benchmark yaml

* ok

* ok

* ok

* benchmark

* nit

* finish_bench

* makedatacreator

* relax

* metrics

* autosetsampler

* profile

* movements

* OK

* smoothen

* fix

* nitdocs

* loss

* envflag

* comments

* nit

* format

* visible

* images

* move_images

* fix

* rernder

* rrender

* rest

* multgpu

* fix

* nit

* finish

* extrra

* setup

* revert

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-11 01:09:08 -07:00
Richard Liaw
d192ef0611
[raysgd] Cleanup User API (#7384)
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint'

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* save

* failures

* fixes

* trainer

* run test

* operator

* code

* op

* ok done

* operator

* sgd test fixes

* ok

* trainer

* format

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update doc/source/raysgd/raysgd_pytorch.rst

* docstring

* dcgan

* doc

* commits

* nit

* testing

* revert

* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* benchmarks

* rename

* remove some args

* better metrics output

* fix up the benchmark

* benchmark-yaml

* horovod-benchmark

* benchmarks

* Remove benchmark code for cleanups

* makedatacreator

* relax

* metrics

* autosetsampler

* profile

* movements

* OK

* smoothen

* fix

* nitdocs

* loss

* comments

* fix

* fix

* runner_tests

* codes

* example

* fix_test

* fix

* tests

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-10 08:41:42 -07:00
Maksim Smolin
3a134c7224
[RaySGD] Rename PyTorch API endpoints to start with Torch (#7425)
* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* rename

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-03 16:44:42 -08:00
Richard Liaw
48cdca843f
[raysgd] Custom training operator (#7211) 2020-03-01 21:22:48 -08:00
Eric Liang
5df801605e
Add ray.util package and move libraries from experimental (#7100) 2020-02-18 13:43:19 -08:00
Richard Liaw
94e2fcea2e
[sgd] fp16 (apex) and scheduler support + move examples page (#7061)
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint'

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* fix tests'

* testmode

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-02-16 19:04:08 -08:00
Richard Liaw
037aa2b961
[sgd] Refactor PyTorch SGD Documentation. (#6910)
* Refactor documentation and directory structurre

* update loss

* ,ore examples

* fix comments

* more code

* svgs

* formatting

* more_docs

* more writing

* comments ready

* move

* whitespace

* examples

* fix

* bold

* pytorch

* batch

* fix

* fix test

* Apply suggestions from code review

* quarantinegp

* tests/

* fix missing
2020-01-29 08:51:01 -08:00