Tomasz Wrona
b508166419
Copy initial state of an RNN to a CPU before converting it to a NumPy array ( #8097 )
2020-04-25 18:49:09 -07:00
Richard Liaw
b506f87117
[tune] New Doc edits, add Concepts page ( #8083 )
...
Co-Authored-By: Sven Mika <sven@anyscale.io>
2020-04-25 18:25:56 -07:00
ijrsvt
69ff7e3e35
TaskCancellation ( #7669 )
...
* Smol comment
* WIP, not passing ray.init
* Fixed small problem
* wip
* Pseudo interrupt things
* Basic prototype operational
* correct proc title
* Mostly done
* Cleanup
* cleaner raylet error
* Cleaning up a few loose ends
* Fixing Race Conds
* Prelim testing
* Fixing comments and adding second_check for kill
* Working_new_impl
* demo_ready
* Fixing my english
* Fixing a few problems
* Small problems
* Cleaning up
* Response to changes
* Fixing error passing
* Merged to master
* fixing lock
* Cleaning up print statements
* Format
* Fixing Unit test build failure
* mock_worker fix
* java_fix
* Cancel
* Switching to Cancel
* Responding to Review
* FixFormatting
* Lease cancellation
* Final comments?
* Moving exist check to CoreWorker
* Fix Actor Transport Test
* Fixing task manager test
* changing clock repr
* Fix build
* fix white space
* lint fix
* Updating to medium size
* Fixing Java test compilation issue
* lengthen bad timeouts
2020-04-25 16:04:52 -07:00
Richard Liaw
9dd3490c38
[tune] Safer try-catch for TensorboardX ( #8174 )
...
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
2020-04-25 13:08:37 -07:00
Simon Mo
13c14eac07
[Asyncio] Remove async init legacy code ( #8177 )
...
* [Asyncio] Remove async init legacy code
* Fix places that call async_init
2020-04-25 09:32:38 -07:00
Edward Oakes
9dc625318f
[serve] Add basic test for specifying the method in a serve call ( #8172 )
2020-04-24 20:15:27 -05:00
Scott Graham
0dc01d8c1e
[autoscaler] Azure versioning ( #8168 )
2020-04-24 17:03:55 -07:00
fangfengbin
38dfe5db86
remove store client template ( #8160 )
...
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-04-24 21:19:12 +08:00
fangfengbin
713e375d50
[GCS]GCS adapts to job table pub sub ( #8145 )
2020-04-24 16:33:25 +08:00
Eric Liang
2298f6fb40
[rllib] Port DQN/Ape-X to training workflow api ( #8077 )
2020-04-23 12:39:19 -07:00
Sven Mika
499ad5fbe4
[RLlib] PyTorch version of APPO. ( #8120 )
...
- Translate all vtrace functionality to torch and add torch to the framework_iterator-loop in all existing vtrace test cases.
- Add learning test cases for APPO torch (both w/ and w/o v-trace).
- Add quick compilation tests for APPO (tf and torch, v-trace and no v-trace).
2020-04-23 09:11:12 +02:00
Sven Mika
e9ee5c4e5f
[RLlib] Nested action space PR (minimally invasive; torch only + test). ( #8101 )
...
- Add TorchMultiActionDistribution class.
- Add framework-agnostic test cases for TorchMultiActionDistribution.
2020-04-23 09:09:22 +02:00
Nick Matthews
a9d8d16b6b
Change memory monitor warning to a logging call ( #8137 )
2020-04-22 21:29:18 -07:00
yncxcw
51559c08b9
Fix memory miscounting in memory monitor for container environment ( #8113 )
...
Co-authored-by: weich <weich@nvidia.com>
2020-04-22 14:32:35 -07:00
Edward Oakes
0bb918f2b1
Disable eager execution to fix test_tensorflow ( #8133 )
2020-04-22 15:54:42 -05:00
Edward Oakes
f9f41e5a1a
[serve] Fix nonblocking serve.init() ( #8068 )
2020-04-22 11:51:27 -05:00
Tianyi Chen
0204dff1e9
[streaming]Add master and scheduler. ( #8044 )
2020-04-22 14:43:56 +08:00
Max Fitton
c486b56c58
Improve Serve API Input Validations ( #8124 )
...
* Add validation to endpoint and backend creation to ensure that no duplicates of either are created. Also add validation to split_traffic to ensure that both the endpoint and the backends exist.
* Fix test to deal with removed serve.link
* Address PR feedback
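The validation described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory `ServeRegistry` class; the class, its method names other than `split_traffic`, and the error messages are illustrative assumptions, not Ray Serve's actual implementation.

```python
class ServeRegistry:
    """Hypothetical registry used only to illustrate the validation rules."""

    def __init__(self):
        self.endpoints = set()
        self.backends = set()
        self.traffic = {}  # endpoint name -> {backend tag: weight}

    def create_endpoint(self, name):
        # Reject a second endpoint with the same name.
        if name in self.endpoints:
            raise ValueError(f"endpoint {name!r} already exists")
        self.endpoints.add(name)

    def create_backend(self, tag):
        # Reject a second backend with the same tag.
        if tag in self.backends:
            raise ValueError(f"backend {tag!r} already exists")
        self.backends.add(tag)

    def split_traffic(self, endpoint, weights):
        # Both the endpoint and every referenced backend must already exist.
        if endpoint not in self.endpoints:
            raise ValueError(f"endpoint {endpoint!r} does not exist")
        missing = sorted(tag for tag in weights if tag not in self.backends)
        if missing:
            raise ValueError(f"backends {missing} do not exist")
        self.traffic[endpoint] = dict(weights)
```

In this sketch, creating a duplicate or referencing an unknown name raises `ValueError` before any state is changed.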
Co-authored-by: Max Fitton <max@semprehealth.com>
2020-04-21 19:45:29 -07:00
Simon Mo
95e8ec8c47
[CI] Dashboard+ Tensorboard Lint Hotfix ( #8125 )
2020-04-21 16:52:58 -07:00
Edward Oakes
505f3a8714
[serve] Remove serve.link(), rename serve.split() -> serve.set_traffic() ( #8072 )
2020-04-21 14:26:07 -05:00
Richard Liaw
6799fbbd5e
[dashboard] Temporarily disable tensorboard ( #8121 )
2020-04-21 10:40:46 -07:00
mehrdadn
0a54407961
[CI] Factor out more Travis code and update GitHub Actions ( #8085 )
2020-04-21 09:53:08 -07:00
Richard Liaw
fa7eecf48a
[sgd] Avoid parameter "gotcha" for learning rate scheduler ( #8107 )
...
* with-scheduler-creator
* none
* add_freq
* runner
* torch
2020-04-21 01:01:04 -07:00
Sven Mika
d15609ba2a
[RLlib] PyTorch version of ARS (Augmented Random Search). ( #8106 )
...
This PR implements a PyTorch version of RLlib's ARS algorithm using RLlib's functional algo builder API. It also adds a regression test for ARS (torch) on CartPole.
2020-04-21 09:47:52 +02:00
Qing Wang
d66d12661b
Improve the perf of constructing actor task specs. ( #8093 )
2020-04-21 11:54:09 +08:00
Stephanie Wang
eefea4e29c
[core] Post task submission to IO loop ( #8090 )
...
* Post to IO loop
* Unused
* Fix build
2020-04-20 19:13:50 -07:00
Ujval Misra
708dff6d8f
[tune] Stop-gap fix for PBT checkpointing ( #7794 )
...
* Fix PBT
* lint
* reset
* rm
* tests
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-04-20 15:10:36 -07:00
Edward Oakes
213d3894ca
Remove serve.route decorator ( #8108 )
2020-04-20 16:22:25 -05:00
Stephanie Wang
1323e1753d
[core] When reconstruction is enabled, pin objects created by ray.put() ( #8021 )
...
* Unit test and pin ray.put objects until they have no more lineage references
* c++ tests
* lint
* Mark ray.put objects as pinned
2020-04-20 13:09:54 -07:00
Eric Liang
17e3c545d9
[rllib] Fix truncate episodes mode in central critic example ( #8073 )
2020-04-20 12:58:01 -07:00
Sven Mika
3812bfedda
[RLlib] PyTorch version of ES (Evolution Strategies). ( #8104 )
...
PyTorch version of Evolution Strategies (ES) Algo.
2020-04-20 21:47:28 +02:00
Richard Liaw
9f3e9e7e9f
[tune] Add more intensive tests ( #7667 )
...
* make_heavier_tests
* help
2020-04-20 11:14:44 -07:00
Edward Oakes
793e616a2d
Fix job table parsing ( #8070 )
2020-04-20 12:56:43 -05:00
Bill Chambers
77655749fb
[RayServe] RayServe Introduction and Overview ( #8038 )
2020-04-20 12:05:59 -05:00
Sven Mika
d6cb7d865e
[RLlib] Torch DQN (APEX) TD-Error/prio. replay fixes. ( #8082 )
...
PyTorch APEX_DQN with Prioritized Replay enabled would not work properly due to the td_error not being retrievable by the AsyncReplayOptimizer.
2020-04-20 10:03:25 +02:00
mehrdadn
c8b9a357f2
Try to fix dependency issue ( #8065 )
...
Co-authored-by: Mehrdad <noreply@github.com>
2020-04-19 16:09:29 -07:00
ZhuSenlin
3f28a8a229
[GCS] reply to the owner only after the actor has been successfully created. ( #8079 )
...
* reply to the owner only after the actor is successfully created.
* reply immediately if the actor is already created
* fix comment
* add test_actor_creation_task provided by @Stephanie Wang
Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>
2020-04-19 09:53:02 -07:00
Edward Oakes
da296bf8c5
[serve] Router fault tolerance ( #8008 )
2020-04-19 11:04:06 -05:00
Sven Mika
165a86f1ab
[RLlib] SAC MuJoCo instability issues (tf and torch versions). ( #8063 )
...
SAC (both torch and tf versions) crashes due to numeric instabilities in the SquashedGaussian distribution (sampling + logp after extreme NN outputs).
This PR fixes these. Stable MuJoCo learning (HalfCheetah) has been confirmed on both tf and torch versions. A Distribution stability test (using extreme NN outputs) has been added for SquashedGaussian (can be used for any other type of distribution as well).
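For illustration, a minimal sketch of one common way to keep the log-probability of a tanh-squashed Gaussian finite under extreme inputs. The clipping bounds, the `EPS` constant, and the function name are assumptions for this example; this is not the code from the PR.

```python
import math

LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0  # assumed clipping bounds for log-std
EPS = 1e-6  # assumed small constant guarding atanh and log


def squashed_gaussian_logp(action, mean, log_std):
    """Log-density of a = tanh(u), u ~ N(mean, exp(log_std)^2), scalar case."""
    # Clip log-std so exp() can neither overflow nor collapse to zero.
    log_std = min(max(log_std, LOG_STD_MIN), LOG_STD_MAX)
    std = math.exp(log_std)
    # Keep the action strictly inside (-1, 1) so atanh stays finite.
    a = min(max(action, -1.0 + EPS), 1.0 - EPS)
    u = math.atanh(a)
    normal_logp = (
        -0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
    )
    # Change-of-variables correction for tanh: subtract log(1 - a^2).
    return normal_logp - math.log(1.0 - a * a + EPS)
```

Without the clipping, `math.atanh(1.0)` raises an error and very large log-std values overflow in `exp`. A stability test in the spirit of the one mentioned above would feed such extreme values and check that the result is finite.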
2020-04-19 10:20:23 +02:00
Sumanth Ratna
bdb03a0544
[tune] Update dragonfly installation instructions ( #8086 )
...
Closes #8084
2020-04-18 20:25:38 -07:00
Dean Wampler
5d2885c609
Minor Ray API doc refinements ( #8060 )
...
* Added small section on installation when using Anaconda. Also fixed an obsolete link to Anaconda.
* Delete more temporary directories when running the doc "make clean".
* Fine-tuning the core Ray API documentation
* Fix doc lines that were too long
Co-authored-by: Dean Wampler <dean@concurrentthought.com>
2020-04-18 15:19:35 -07:00
Eric Liang
d92c5f1a9e
[rllib] Add init file for exec module
2020-04-17 17:24:28 -07:00
Richard Liaw
857e4dba2f
[sgd] HuggingFace GLUE Fine-tuning Example ( #7792 )
...
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* moredocs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* save
* failures
* fixes
* trainer
* run test
* operator
* code
* op
* ok done
* operator
* sgd test fixes
* ok
* trainer
* format
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Update doc/source/raysgd/raysgd_pytorch.rst
* docstring
* dcgan
* doc
* commits
* nit
* testing
* revert
* Start renaming pytorch to torch
* Rename PyTorchTrainer to TorchTrainer
* Rename PyTorch runners to Torch runners
* Finish renaming API
* Rename to torch in tests
* Finish renaming docs + tests
* Run format + fix DeprecationWarning
* fix
* move tests up
* benchmarks
* rename
* remove some args
* better metrics output
* fix up the benchmark
* benchmark-yaml
* horovod-benchmark
* benchmarks
* Remove benchmark code for cleanups
* benchmark-code
* nits
* benchmark yamls
* benchmark yaml
* ok
* ok
* ok
* benchmark
* nit
* finish_bench
* makedatacreator
* relax
* metrics
* autosetsampler
* profile
* movements
* OK
* smoothen
* fix
* nitdocs
* loss
* envflag
* comments
* nit
* format
* visible
* images
* move_images
* fix
* rerender
* rerender
* rest
* multgpu
* fix
* nit
* finish
* extra
* setup
* experimental
* as_trainable
* fix
* ok
* format
* create_torch_pbt
* setup_pbt
* ok
* format
* ok
* format
* docs
* ok
* Draft head-is-worker
* Fix missing concurrency between local and remote workers
* Fix tqdm to work with head-is-worker
* Cleanup
* Implement state_dict and load_state_dict
* Reserve resources on the head node for the local worker
* Update the development cluster setup
* Add spot block reservation to the development yaml
* ok
* Draft the fault tolerance fix
* Small fixes to local-remote concurrency
* Cleanup + fix typo
* fixes
* worker_counts
* some formatting and asha
* fix
* okme
* fixactorkill
* unify
* Revert the cluster mounts
* Cut the handler-reporter API
* Fix most tests
* Rm tqdm_handler.py
* Re-add tune test
* Automatically force-shutdown on actor errors on shutdown
* Formatting
* fix_tune_test
* Add timeout error verification
* Rename tqdm to use_tqdm
* fixtests
* ok
* remove_redundant
* deprecated
* deactivated
* ok_try_this
* lint
* nice
* done
* retries
* fixes
* kill
* retry
* init_transformer
* init
* deployit
* improve_example
* trans
* rename
* formats
* format-to-py37
* time_to_test
* more_changes
* ok
* update_args_and_script
* fp16_epoch
* huggingface
* training stats
* distributed
* Apply suggestions from code review
* transformer
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-04-17 15:17:30 -07:00
Maksim Smolin
d6f4e5b3e1
[SGD] Imagenet example (basic) ( #8020 )
...
* Checkpoint the image-models example
* Update cluster definition
* Fix copyright info
* Use original args
* Checkpoint fixes
* Add README
* Add some missing features
* Format
* Get rid of the unused Namespace class
* Address comments
* Link the imagenet example in docs
* Cleanup
* Fix lint
2020-04-17 13:33:55 -07:00
Edward Oakes
90ef585fd5
Revert "Add ability to specify worker and driver ports ( #7833 )" ( #8069 )
...
This reverts commit 9f751ff8c4.
2020-04-17 12:32:22 -05:00
mehrdadn
8cf37726d2
Fix missing Java dependency ( #8067 )
...
Co-authored-by: Mehrdad <noreply@github.com>
2020-04-17 10:43:02 -05:00
mehrdadn
f15618033d
Remove --no-transfer-progress as it appears to be unsupported ( #8066 )
...
Co-authored-by: Mehrdad <noreply@github.com>
2020-04-17 16:30:37 +02:00
Sven Mika
f7e4dae852
[RLlib] DQN and SAC Atari benchmark fixes. ( #7962 )
...
* Add Atari SAC-discrete (learning MsPacman in 40k ts up to 780 rewards).
* SAC loss function test case fix.
2020-04-17 08:49:15 +02:00
Richard Liaw
a9ea139317
[sgd] Make serialization of data creation optional ( #8027 )
...
* pytest
* Update python/ray/util/sgd/torch/torch_trainer.py
Co-Authored-By: Ujval Misra <misraujval@gmail.com>
Co-authored-by: Ujval Misra <misraujval@gmail.com>
2020-04-16 20:27:51 -07:00
Richard Liaw
de1787e5e5
[tune] Check actor start -> test_cluster ( #8056 )
...
* test
* info
* ok
* hard_stop
* codefix
2020-04-16 20:00:45 -07:00