Edward Oakes
505f3a8714
[serve] Remove serve.link(), rename serve.split() -> serve.set_traffic() ( #8072 )
2020-04-21 14:26:07 -05:00
Richard Liaw
6799fbbd5e
[dashboard] Temporarily disable tensorboard ( #8121 )
2020-04-21 10:40:46 -07:00
mehrdadn
0a54407961
[CI] Factor out more Travis code and update GitHub Actions ( #8085 )
2020-04-21 09:53:08 -07:00
Richard Liaw
fa7eecf48a
[sgd] Avoid parameter "gotcha" for learning rate scheduler ( #8107 )
...
* with-scheduler-creator
* none
* add_freq
* runner
* torch
2020-04-21 01:01:04 -07:00
Sven Mika
d15609ba2a
[RLlib] PyTorch version of ARS (Augmented Random Search). ( #8106 )
...
This PR implements a PyTorch version of RLlib's ARS algorithm using RLlib's functional algo builder API. It also adds a regression test for ARS (torch) on CartPole.
2020-04-21 09:47:52 +02:00
Qing Wang
d66d12661b
Improve the perf of constructing actor task specs. ( #8093 )
2020-04-21 11:54:09 +08:00
Stephanie Wang
eefea4e29c
[core] Post task submission to IO loop ( #8090 )
...
* Post to IO loop
* Unused
* Fix build
2020-04-20 19:13:50 -07:00
Ujval Misra
708dff6d8f
[tune] Stop-gap fix for PBT checkpointing ( #7794 )
...
* Fix PBT
* lint
* reset
* rm
* tests
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-04-20 15:10:36 -07:00
Edward Oakes
213d3894ca
Remove serve.route decorator ( #8108 )
2020-04-20 16:22:25 -05:00
Stephanie Wang
1323e1753d
[core] When reconstruction is enabled, pin objects created by ray.put() ( #8021 )
...
* Unit test and pin ray.put objects until they have no more lineage references
* c++ tests
* lint
* Mark ray.put objects as pinned
2020-04-20 13:09:54 -07:00
Eric Liang
17e3c545d9
[rllib] Fix truncate episodes mode in central critic example ( #8073 )
2020-04-20 12:58:01 -07:00
Sven Mika
3812bfedda
[RLlib] PyTorch version of ES (Evolution Strategies). ( #8104 )
...
PyTorch version of Evolution Strategies (ES) Algo.
2020-04-20 21:47:28 +02:00
Richard Liaw
9f3e9e7e9f
[tune] Add more intensive tests ( #7667 )
...
* make_heavier_tests
* help
2020-04-20 11:14:44 -07:00
Edward Oakes
793e616a2d
Fix job table parsing ( #8070 )
2020-04-20 12:56:43 -05:00
Bill Chambers
77655749fb
[RayServe] RayServe Introduction and Overview ( #8038 )
2020-04-20 12:05:59 -05:00
Sven Mika
d6cb7d865e
[RLlib] Torch DQN (APEX) TD-Error/prio. replay fixes. ( #8082 )
...
PyTorch APEX_DQN with Prioritized Replay enabled would not work properly due to the td_error not being retrievable by the AsyncReplayOptimizer.
2020-04-20 10:03:25 +02:00
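For context on why the TD error has to reach the replay optimizer: in prioritized replay, each transition's sampling priority is derived from its TD error. The following is a minimal sketch of that standard relationship, not RLlib's code. The function names and the `alpha` and `eps` defaults are assumptions chosen for illustration.

```python
def priorities_from_td_errors(td_errors, alpha=0.6, eps=1e-6):
    """Map TD errors to replay priorities: p_i = (|td_i| + eps) ** alpha.

    alpha and eps are illustrative defaults, not values taken from RLlib.
    """
    return [(abs(e) + eps) ** alpha for e in td_errors]


def sampling_probabilities(priorities):
    """Normalize priorities into a sampling distribution over transitions."""
    total = sum(priorities)
    return [p / total for p in priorities]


# Transitions with a larger absolute TD error are sampled more often.
probs = sampling_probabilities(priorities_from_td_errors([0.1, -2.0, 0.5]))
```

If the TD error cannot be retrieved after a learner step, the priorities cannot be refreshed, which is the failure mode the commit message describes.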
mehrdadn
c8b9a357f2
Try to fix dependency issue ( #8065 )
...
Co-authored-by: Mehrdad <noreply@github.com>
2020-04-19 16:09:29 -07:00
ZhuSenlin
3f28a8a229
[GCS] reply to the owner only after the actor has been successfully created. ( #8079 )
...
* reply to the owner only after the actor is successfully created.
* reply immediately if the actor is already created
* fix comment
* add test_actor_creation_task provided by @Stephanie Wang
Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>
2020-04-19 09:53:02 -07:00
Edward Oakes
da296bf8c5
[serve] Router fault tolerance ( #8008 )
2020-04-19 11:04:06 -05:00
Sven Mika
165a86f1ab
[RLlib] SAC MuJoCo instability issues (tf and torch versions). ( #8063 )
...
SAC (both torch and tf versions) crashes due to numeric instabilities in the SquashedGaussian distribution (sampling and logp after extreme NN outputs).
This PR fixes these. Stable MuJoCo learning (HalfCheetah) has been confirmed on both tf and torch versions. A Distribution stability test (using extreme NN outputs) has been added for SquashedGaussian (can be used for any other type of distribution as well).
2020-04-19 10:20:23 +02:00
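The kind of instability described above can be illustrated with a tanh-squashed Gaussian log-probability. This is a sketch under stated assumptions, not the fix made in this PR: the clipping bounds and the function names are hypothetical, and the code uses plain Python floats rather than tf or torch tensors. A naive `log(1 - tanh(u)**2)` evaluates `log(0)` once `u` is large, so the sketch uses the equivalent identity `2 * (log 2 - u - softplus(-2u))`.

```python
import math

# Assumed clipping bounds for the log standard deviation (illustrative only).
LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def squashed_gaussian_logp(u, mean, log_std):
    """Log-density of a = tanh(u), where u ~ Normal(mean, exp(log_std)).

    u is the pre-squash sample. The change-of-variables term
    log(1 - tanh(u)^2) is computed as 2 * (log 2 - u - softplus(-2u)),
    which stays finite for extreme u.
    """
    log_std = min(max(log_std, LOG_STD_MIN), LOG_STD_MAX)
    std = math.exp(log_std)
    z = (u - mean) / std
    gauss_logp = -0.5 * z * z - log_std - 0.5 * math.log(2.0 * math.pi)
    log_det = 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
    return gauss_logp - log_det
```

Feeding extreme values of `u` through a function like this and asserting that the result is finite is one simple way to write the sort of stability test the commit mentions.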
Sumanth Ratna
bdb03a0544
[tune] Update dragonfly installation instructions ( #8086 )
...
Closes #8084
2020-04-18 20:25:38 -07:00
Dean Wampler
5d2885c609
Minor Ray API doc refinements ( #8060 )
...
* Added small section on installation when using Anaconda. Also fixed an obsolete link to Anaconda.
* Delete more temporary directories when running the doc "make clean".
* Fine-tuning the core Ray API documentation
* Fix doc lines that were too long
Co-authored-by: Dean Wampler <dean@concurrentthought.com>
2020-04-18 15:19:35 -07:00
Eric Liang
d92c5f1a9e
[rllib] Add init file for exec module
2020-04-17 17:24:28 -07:00
Richard Liaw
857e4dba2f
[sgd] HuggingFace GLUE Fine-tuning Example ( #7792 )
...
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* moredocs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* save
* failures
* fixes
* trainer
* run test
* operator
* code
* op
* ok done
* operator
* sgd test fixes
* ok
* trainer
* format
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Update doc/source/raysgd/raysgd_pytorch.rst
* docstring
* dcgan
* doc
* commits
* nit
* testing
* revert
* Start renaming pytorch to torch
* Rename PyTorchTrainer to TorchTrainer
* Rename PyTorch runners to Torch runners
* Finish renaming API
* Rename to torch in tests
* Finish renaming docs + tests
* Run format + fix DeprecationWarning
* fix
* move tests up
* benchmarks
* rename
* remove some args
* better metrics output
* fix up the benchmark
* benchmark-yaml
* horovod-benchmark
* benchmarks
* Remove benchmark code for cleanups
* benchmark-code
* nits
* benchmark yamls
* benchmark yaml
* ok
* ok
* ok
* benchmark
* nit
* finish_bench
* makedatacreator
* relax
* metrics
* autosetsampler
* profile
* movements
* OK
* smoothen
* fix
* nitdocs
* loss
* envflag
* comments
* nit
* format
* visible
* images
* move_images
* fix
* rerender
* rerender
* rest
* multgpu
* fix
* nit
* finish
* extra
* setup
* experimental
* as_trainable
* fix
* ok
* format
* create_torch_pbt
* setup_pbt
* ok
* format
* ok
* format
* docs
* ok
* Draft head-is-worker
* Fix missing concurrency between local and remote workers
* Fix tqdm to work with head-is-worker
* Cleanup
* Implement state_dict and load_state_dict
* Reserve resources on the head node for the local worker
* Update the development cluster setup
* Add spot block reservation to the development yaml
* ok
* Draft the fault tolerance fix
* Small fixes to local-remote concurrency
* Cleanup + fix typo
* fixes
* worker_counts
* some formatting and asha
* fix
* okme
* fixactorkill
* unify
* Revert the cluster mounts
* Cut the handler-reporter API
* Fix most tests
* Rm tqdm_handler.py
* Re-add tune test
* Automatically force-shutdown on actor errors on shutdown
* Formatting
* fix_tune_test
* Add timeout error verification
* Rename tqdm to use_tqdm
* fixtests
* ok
* remove_redundant
* deprecated
* deactivated
* ok_try_this
* lint
* nice
* done
* retries
* fixes
* kill
* retry
* init_transformer
* init
* deployit
* improve_example
* trans
* rename
* formats
* format-to-py37
* time_to_test
* more_changes
* ok
* update_args_and_script
* fp16_epoch
* huggingface
* training stats
* distributed
* Apply suggestions from code review
* transformer
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-04-17 15:17:30 -07:00
Maksim Smolin
d6f4e5b3e1
[SGD] Imagenet example (basic) ( #8020 )
...
* Checkpoint the image-models example
* Update cluster definition
* Fix copyright info
* Use original args
* Checkpoint fixes
* Add README
* Add some missing features
* Format
* Get rid of the unused Namespace class
* Address comments
* Link the imagenet example in docs
* Cleanup
* Fix lint
2020-04-17 13:33:55 -07:00
Edward Oakes
90ef585fd5
Revert "Add ability to specify worker and driver ports ( #7833 )" ( #8069 )
...
This reverts commit 9f751ff8c4.
2020-04-17 12:32:22 -05:00
mehrdadn
8cf37726d2
Fix missing Java dependency ( #8067 )
...
Co-authored-by: Mehrdad <noreply@github.com>
2020-04-17 10:43:02 -05:00
mehrdadn
f15618033d
Remove --no-transfer-progress as it appears to be unsupported ( #8066 )
...
Co-authored-by: Mehrdad <noreply@github.com>
2020-04-17 16:30:37 +02:00
Sven Mika
f7e4dae852
[RLlib] DQN and SAC Atari benchmark fixes. ( #7962 )
...
* Add Atari SAC-discrete (learning MsPacman in 40k ts up to 780 rewards).
* SAC loss function test case fix.
2020-04-17 08:49:15 +02:00
Richard Liaw
a9ea139317
[sgd] Make serialization of data creation optional ( #8027 )
...
* pytest
* Update python/ray/util/sgd/torch/torch_trainer.py
Co-Authored-By: Ujval Misra <misraujval@gmail.com>
Co-authored-by: Ujval Misra <misraujval@gmail.com>
2020-04-16 20:27:51 -07:00
Richard Liaw
de1787e5e5
[tune] Check actor start -> test_cluster ( #8056 )
...
* test
* info
* ok
* hard_stop
* codefix
2020-04-16 20:00:45 -07:00
Mitchell Stern
d0c6f013c3
Fix command config portion of project schema ( #8057 )
2020-04-16 18:08:17 -07:00
Richard Liaw
6545534805
[tune/sgd] DCGAN example self-contained, turn example into modu… ( #8012 )
...
* ok
* done
* run_benchmarks
* should_make_examples_usable
2020-04-16 17:55:27 -07:00
Eric Liang
0c80efa2a3
[rllib] Disable explicit free, which is no longer needed and causes memory leaks
2020-04-16 16:06:58 -07:00
roireshef
dbcad35022
[RLlib] Added DefaultCallbacks which replaces old callbacks dict interface ( #6972 )
2020-04-16 16:06:42 -07:00
mehrdadn
35ae7f0e68
[CI] Preload Test to Skip Env Var to All Travis Job ( #8061 )
...
Co-authored-by: Mehrdad <noreply@github.com>
2020-04-16 15:37:25 -07:00
Karthikeyan Singaravelan
f95e18dfeb
[tune/sgd] Import ABC from collections.abc instead of collectio… ( #7982 )
...
* Import ABC from collections.abc instead of collections for Python 3 compatibility.
* Fix linter errors.
2020-04-16 15:26:49 -07:00
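The import change in this commit follows a general Python 3 rule: the abstract base classes have lived in `collections.abc` since Python 3.3, and the old aliases in `collections` were removed in Python 3.10. A minimal sketch of the pattern follows. The `describe` helper is hypothetical and does not come from the Ray codebase.

```python
# Python 3 location of the ABCs; `from collections import Mapping`
# fails on Python 3.10 and later.
from collections.abc import Mapping, Sequence


def describe(value):
    """Classify a value by the abstract base class it implements."""
    if isinstance(value, Mapping):
        return "mapping"
    # str is also a Sequence, so handle it before the generic check.
    if isinstance(value, str):
        return "string"
    if isinstance(value, Sequence):
        return "sequence"
    return "other"
```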
mehrdadn
42f88ecf9d
Hotfix CI Export Tests to Skip ( #8058 )
...
Co-authored-by: Mehrdad <noreply@github.com>
2020-04-16 15:23:00 -07:00
Richard Liaw
118d960e1c
[hotfix] Java Lint Broken ( #8048 )
2020-04-16 13:58:33 -07:00
Richard Liaw
2cb3355495
[docs] Move css to right location ( #8053 )
2020-04-16 13:46:50 -07:00
Eric Liang
55ce2bba10
Record num plasma errs in map ( #8034 )
2020-04-16 13:16:40 -07:00
Edward Oakes
9f751ff8c4
Add ability to specify worker and driver ports ( #7833 )
2020-04-16 13:49:25 -05:00
Richard Liaw
d5f517b2f5
[docs] Hotfix for missing css files. ( #8051 )
2020-04-16 11:44:55 -07:00
Richard Liaw
4d8bf5635d
[hotfix] Lint formatting for new Tune optimizer ZOOpt ( #8040 )
...
* formatting
* removedill
* lint
2020-04-16 09:24:30 -07:00
Clark Zinzow
d4cae5f632
[Core] Added ability to specify different IP addresses for a core worker and its raylet. ( #7985 )
2020-04-16 10:32:24 -05:00
Sven Mika
d0fab84e4d
[RLlib] DDPG PyTorch version. ( #7953 )
...
The DDPG/TD3 algorithms currently do not have a PyTorch implementation. This PR adds PyTorch support for DDPG/TD3 to RLlib.
This PR:
- Depends on the re-factor PR for DDPG (Functional Algorithm API).
- Adds learning regression tests for the PyTorch version of DDPG and a DDPG (torch)
- Updates the documentation to reflect that DDPG and TD3 now support PyTorch.
* Learning Pendulum-v0 on torch version (same config as tf). Wall time a little slower (~20%) than tf.
* Fix GPU target model problem.
2020-04-16 10:20:01 +02:00
Xianyang Liu
e1d3f7eba6
[rllib]Add config for rllib to support set python environments ( #8026 )
...
* support set extra python environments
* wrap value with str
* Apply suggestions from code review
Co-Authored-By: Eric Liang <ekhliang@gmail.com>
* addresses comments
* fix lint errors
* remove unrelated changes due to format.sh
* remove unrelated changes due to format.sh
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-04-16 01:13:45 -07:00
wanxing
9345d03ffb
[Streaming] Streaming data transfer supports cross language. ( #7961 )
...
* add init parameters for java
* fix bug
* cython
* fix compile
* fix test_direct_tranfer
* comment
* ChannelCreationParameter
* fix comment
* builder
* lint and fix tests
* fix single process test
* fix checkstyle and lint
* checkstyle
* lint python
Co-authored-by: wanxing <wanxing@B-458DMD6M-1753.local>
2020-04-16 15:16:48 +08:00
fangfengbin
5a7882bb44
Fix gcs_server get invalid local address ( #7842 )
2020-04-16 14:58:19 +08:00
JianZhangYang
7b0518b993
[streaming] Async changes for resourcemanager part ( #7955 )
2020-04-16 14:15:45 +08:00