Commit graph

4701 commits

Author SHA1 Message Date
Stephanie Wang
1323e1753d
[core] When reconstruction is enabled, pin objects created by ray.put() (#8021)
* Unit test and pin ray.put objects until they have no more lineage references

* c++ tests

* lint

* Mark ray.put objects as pinned
2020-04-20 13:09:54 -07:00
Eric Liang
17e3c545d9
[rllib] Fix truncate episodes mode in central critic example (#8073) 2020-04-20 12:58:01 -07:00
Sven Mika
3812bfedda
[RLlib] PyTorch version of ES (Evolution Strategies). (#8104)
PyTorch version of Evolution Strategies (ES) Algo.
2020-04-20 21:47:28 +02:00
Richard Liaw
9f3e9e7e9f
[tune] Add more intensive tests (#7667)
* make_heavier_tests

* help
2020-04-20 11:14:44 -07:00
Edward Oakes
793e616a2d
Fix job table parsing (#8070) 2020-04-20 12:56:43 -05:00
Bill Chambers
77655749fb
[RayServe] RayServe Introduction and Overview (#8038) 2020-04-20 12:05:59 -05:00
Sven Mika
d6cb7d865e
[RLlib] Torch DQN (APEX) TD-Error/prio. replay fixes. (#8082)
PyTorch APEX_DQN with Prioritized Replay enabled would not work properly due to the td_error not being retrievable by the AsyncReplayOptimizer.
2020-04-20 10:03:25 +02:00
mehrdadn
c8b9a357f2
Try to fix dependency issue (#8065)
Co-authored-by: Mehrdad <noreply@github.com>
2020-04-19 16:09:29 -07:00
ZhuSenlin
3f28a8a229
[GCS] reply to the owner only after the actor has been successfully created. (#8079)
* reply to the owner only after the actor is successfully created.

* reply immediately if the actor is already created

* fix comment

* add test_actor_creation_task provided by @Stephanie Wang

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>
2020-04-19 09:53:02 -07:00
Edward Oakes
da296bf8c5
[serve] Router fault tolerance (#8008) 2020-04-19 11:04:06 -05:00
Sven Mika
165a86f1ab
[RLlib] SAC MuJoCo instability issues (tf and torch versions). (#8063)
SAC (both torch and tf versions) is showing issues (crashes) due to numeric instabilities in the SquashedGaussian distribution (sampling + logp after extreme NN outputs).
This PR fixes these. Stable MuJoCo learning (HalfCheetah) has been confirmed on both tf and torch versions. A Distribution stability test (using extreme NN outputs) has been added for SquashedGaussian (can be used for any other type of distribution as well).
2020-04-19 10:20:23 +02:00
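The commit message above does not say how the SquashedGaussian instability was fixed. As an illustration only, here is a minimal sketch of one common way to keep the log-probability of a tanh-squashed Gaussian finite for extreme pre-squash values. The function names are hypothetical and this is not RLlib's code; it assumes a scalar action a = tanh(u) with u drawn from a Normal(mean, exp(log_std)).

```python
import math


def softplus(x: float) -> float:
    # Stable log(1 + exp(x)): never exponentiates a large positive number.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def squashed_gaussian_logp(u: float, mean: float, log_std: float) -> float:
    """Log-density of a = tanh(u), where u ~ Normal(mean, exp(log_std)).

    Hypothetical helper for illustration; not taken from RLlib.
    """
    std = math.exp(log_std)
    normal_logp = (
        -0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * math.log(2.0 * math.pi)
    )
    # log(1 - tanh(u)^2) rewritten as 2 * (log 2 - u - softplus(-2u)).
    # The naive form evaluates log(0) once tanh(u) rounds to +/-1.
    log_det = 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
    return normal_logp - log_det
```

For moderate `u` this agrees with the naive `log(1 - tanh(u)**2)` correction. For large `|u|` the naive form fails because `tanh(u)` rounds to exactly 1.0 in floating point, while the rewritten form stays finite.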
Sumanth Ratna
bdb03a0544
[tune] Update dragonfly installation instructions (#8086)
Closes #8084
2020-04-18 20:25:38 -07:00
Dean Wampler
5d2885c609
Minor Ray API doc refinements (#8060)
* Added small section on installation when using Anaconda. Also fixed an obsolete link to Anaconda.

* Delete more temporary directories when running the doc "make clean".

* Fine-tuning the core Ray API documentation

* Fix doc lines that were too long

Co-authored-by: Dean Wampler <dean@concurrentthought.com>
2020-04-18 15:19:35 -07:00
Eric Liang
d92c5f1a9e
[rllib] Add init file for exec module 2020-04-17 17:24:28 -07:00
Richard Liaw
857e4dba2f
[sgd] HuggingFace GLUE Fine-tuning Example (#7792)
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint'

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* save

* failures

* fixes

* trainer

* run test

* operator

* code

* op

* ok done

* operator

* sgd test fixes

* ok

* trainer

* format

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update doc/source/raysgd/raysgd_pytorch.rst

* docstring

* dcgan

* doc

* commits

* nit

* testing

* revert

* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* benchmarks

* rename

* remove some args

* better metrics output

* fix up the benchmark

* benchmark-yaml

* horovod-benchmark

* benchmarks

* Remove benchmark code for cleanups

* benchmark-code

* nits

* benchmark yamls

* benchmark yaml

* ok

* ok

* ok

* benchmark

* nit

* finish_bench

* makedatacreator

* relax

* metrics

* autosetsampler

* profile

* movements

* OK

* smoothen

* fix

* nitdocs

* loss

* envflag

* comments

* nit

* format

* visible

* images

* move_images

* fix

* rernder

* rrender

* rest

* multgpu

* fix

* nit

* finish

* extrra

* setup

* experimental

* as_trainable

* fix

* ok

* format

* create_torch_pbt

* setup_pbt

* ok

* format

* ok

* format

* docs

* ok

* Draft head-is-worker

* Fix missing concurrency between local and remote workers

* Fix tqdm to work with head-is-worker

* Cleanup

* Implement state_dict and load_state_dict

* Reserve resources on the head node for the local worker

* Update the development cluster setup

* Add spot block reservation to the development yaml

* ok

* Draft the fault tolerance fix

* Small fixes to local-remote concurrency

* Cleanup + fix typo

* fixes

* worker_counts

* some formatting and asha

* fix

* okme

* fixactorkill

* unify

* Revert the cluster mounts

* Cut the handler-reporter API

* Fix most tests

* Rm tqdm_handler.py

* Re-add tune test

* Automatically force-shutdown on actor errors on shutdown

* Formatting

* fix_tune_test

* Add timeout error verification

* Rename tqdm to use_tqdm

* fixtests

* ok

* remove_redundant

* deprecated

* deactivated

* ok_try_this

* lint

* nice

* done

* retries

* fixes

* kill

* retry

* init_transformer

* init

* deployit

* improve_example

* trans

* rename

* formats

* format-to-py37

* time_to_test

* more_changes

* ok

* update_args_and_script

* fp16_epoch

* huggingface

* training stats

* distributed

* Apply suggestions from code review

* transformer

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-04-17 15:17:30 -07:00
Maksim Smolin
d6f4e5b3e1
[SGD] Imagenet example (basic) (#8020)
* Checkpoint the image-models example

* Update cluster definition

* Fix copyright info

* Use original args

* Checkpoint fixes

* Add README

* Add some missing features

* Format

* Get rid of the unused Namespace class

* Address comments

* Link the imagenet example in docs

* Cleanup

* Fix lint
2020-04-17 13:33:55 -07:00
Edward Oakes
90ef585fd5
Revert "Add ability to specify worker and driver ports (#7833)" (#8069)
This reverts commit 9f751ff8c4.
2020-04-17 12:32:22 -05:00
mehrdadn
8cf37726d2
Fix missing Java dependency (#8067)
Co-authored-by: Mehrdad <noreply@github.com>
2020-04-17 10:43:02 -05:00
mehrdadn
f15618033d
Remove --no-transfer-progress as it appears to be unsupported (#8066)
Co-authored-by: Mehrdad <noreply@github.com>
2020-04-17 16:30:37 +02:00
Sven Mika
f7e4dae852
[RLlib] DQN and SAC Atari benchmark fixes. (#7962)
* Add Atari SAC-discrete (learning MsPacman in 40k ts up to 780 rewards).
* SAC loss function test case fix.
2020-04-17 08:49:15 +02:00
Richard Liaw
a9ea139317
[sgd] Make serialization of data creation optional (#8027)
* pytest

* Update python/ray/util/sgd/torch/torch_trainer.py

Co-Authored-By: Ujval Misra <misraujval@gmail.com>

Co-authored-by: Ujval Misra <misraujval@gmail.com>
2020-04-16 20:27:51 -07:00
Richard Liaw
de1787e5e5
[tune] Check actor start -> test_cluster (#8056)
* test

* info

* ok

* hard_stop

* codefix
2020-04-16 20:00:45 -07:00
Mitchell Stern
d0c6f013c3
Fix command config portion of project schema (#8057) 2020-04-16 18:08:17 -07:00
Richard Liaw
6545534805
[tune/sgd] DCGAN example self-contained, turn example into modu… (#8012)
* ok

* done

* run_benchmarks

* should_make_examples_usable
2020-04-16 17:55:27 -07:00
Eric Liang
0c80efa2a3
[rllib] Disable explicit free, which is no longer needed and causes memory leaks 2020-04-16 16:06:58 -07:00
roireshef
dbcad35022
[RLlib] Added DefaultCallbacks which replaces old callbacks dict interface (#6972) 2020-04-16 16:06:42 -07:00
mehrdadn
35ae7f0e68
[CI] Preload Test to Skip Env Var to All Travis Job (#8061)
Co-authored-by: Mehrdad <noreply@github.com>
2020-04-16 15:37:25 -07:00
Karthikeyan Singaravelan
f95e18dfeb
[tune/sgd] Import ABC from collections.abc instead of collectio… (#7982)
* Import ABC from collections.abc instead of collections for Python 3 compatibility.

* Fix linter errors.
2020-04-16 15:26:49 -07:00
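For context on the commit above: the abstract base classes have lived in `collections.abc` since Python 3.3, and the aliases in `collections` were removed in Python 3.10. A minimal sketch of the import style the commit message describes follows; the `describe` helper is hypothetical and not part of Ray.

```python
from collections.abc import Mapping, Sequence  # not `from collections import ...`


def describe(value) -> str:
    """Classify a value using the ABCs from collections.abc."""
    if isinstance(value, Mapping):
        return "mapping"
    # str and bytes are Sequences too, so exclude them explicitly.
    if isinstance(value, Sequence) and not isinstance(value, (str, bytes)):
        return "sequence"
    return "other"
```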
mehrdadn
42f88ecf9d
Hotfix CI Export Tests to Skip (#8058)
Co-authored-by: Mehrdad <noreply@github.com>
2020-04-16 15:23:00 -07:00
Richard Liaw
118d960e1c
[hotfix] Java Lint Broken (#8048) 2020-04-16 13:58:33 -07:00
Richard Liaw
2cb3355495
[docs] Move css to right location (#8053) 2020-04-16 13:46:50 -07:00
Eric Liang
55ce2bba10
Record num plasma errs in map (#8034) 2020-04-16 13:16:40 -07:00
Edward Oakes
9f751ff8c4
Add ability to specify worker and driver ports (#7833) 2020-04-16 13:49:25 -05:00
Richard Liaw
d5f517b2f5
[docs] Hotfix for missing css files. (#8051) 2020-04-16 11:44:55 -07:00
Richard Liaw
4d8bf5635d
[hotfix] Lint formatting for new Tune optimizer ZOOpt (#8040)
* formatting

* removedill

* lint
2020-04-16 09:24:30 -07:00
Clark Zinzow
d4cae5f632
[Core] Added ability to specify different IP addresses for a core worker and its raylet. (#7985) 2020-04-16 10:32:24 -05:00
Sven Mika
d0fab84e4d
[RLlib] DDPG PyTorch version. (#7953)
The DDPG/TD3 algorithms currently do not have a PyTorch implementation. This PR adds PyTorch support for DDPG/TD3 to RLlib.
This PR:
- Depends on the re-factor PR for DDPG (Functional Algorithm API).
- Adds learning regression tests for the PyTorch version of DDPG and a DDPG (torch)
- Updates the documentation to reflect that DDPG and TD3 now support PyTorch.

* Learning Pendulum-v0 on torch version (same config as tf). Wall time is a little slower than tf (~20%).
* Fix GPU target model problem.
2020-04-16 10:20:01 +02:00
Xianyang Liu
e1d3f7eba6
[rllib] Add config for rllib to support set python environments (#8026)
* support set extra python environments

* wrap value with str

* Apply suggestions from code review

Co-Authored-By: Eric Liang <ekhliang@gmail.com>

* addresses comments

* fix lint errors

* remove unrelated changes due to format.sh

* remove unrelated changes due to format.sh

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-04-16 01:13:45 -07:00
wanxing
9345d03ffb
[Streaming] Streaming data transfer supports cross language. (#7961)
* add init parameters for java

* fix bug

* cython

* fix compile

* fix test_direct_tranfer

* comment

* ChannelCreationParameter

* fix comment

* builder

* lint and fix tests

* fix single process test

* fix checkstyle and lint

* checkstyle

* lint python

Co-authored-by: wanxing <wanxing@B-458DMD6M-1753.local>
2020-04-16 15:16:48 +08:00
fangfengbin
5a7882bb44
Fix gcs_server get invalid local address (#7842) 2020-04-16 14:58:19 +08:00
JianZhangYang
7b0518b993
[streaming] Async changes for resourcemanager part (#7955) 2020-04-16 14:15:45 +08:00
Servon
5c274fe631
[Tune] Add ZOOpt search algorithm (#7960)
* add zoopt

* add zoopt search algo

* add zoopt

* fix zoopt

* add zoopt requirements

* fix zoopt

* remove generated guides

* Apply suggestions from code review

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-04-15 21:13:29 -07:00
mehrdadn
956ea7c944
Hotfix CI determine_tests_to_run (#8039) 2020-04-15 17:00:38 -07:00
Simon Mo
7455610d5a
Serve Doc: Quickstart (#7940) 2020-04-15 12:25:37 -07:00
mehrdadn
ba00c29b67
Factor out Travis 'install' sections for use with GitHub Actions (#7988) 2020-04-15 08:10:22 -07:00
Sven Mika
428516056a
[RLlib] SAC Torch (incl. Atari learning) (#7984)
* Policy-classes cleanup and torch/tf unification.
- Make Policy abstract.
- Add `action_dist` to call to `extra_action_out_fn` (necessary for PPO torch).
- Move some methods and vars to base Policy
  (from TFPolicy): num_state_tensors, ACTION_PROB, ACTION_LOGP and some more.

* Fix `clip_action` import from Policy (should probably be moved into utils altogether).

* - Move `is_recurrent()` and `num_state_tensors()` into TFPolicy (from DynamicTFPolicy).
- Add config to all Policy c'tor calls (as 3rd arg after obs and action spaces).

* Add `config` to c'tor call to TFPolicy.

* Add missing `config` to c'tor call to TFPolicy in marvil_policy.py.

* Fix test_rollout_worker.py::MockPolicy and BadPolicy classes (Policy base class is now abstract).

* Fix LINT errors in Policy classes.

* Implement StatefulPolicy abstract methods in test cases: test_multi_agent_env.py.

* policy.py LINT errors.

* Create a simple TestPolicy to sub-class from when testing Policies (reduces code in some test cases).

* policy.py
- Remove abstractmethod from `apply_gradients` and `compute_gradients` (these are not required iff `learn_on_batch` implemented).
- Fix docstring of `num_state_tensors`.

* Make QMIX torch Policy a child of TorchPolicy (instead of Policy).

* QMixPolicy add empty implementations of abstract Policy methods.

* Store Policy's config in self.config in base Policy c'tor.

* - Make only compute_actions in the base Policy an abstractmethod and provide pass
implementations for all other methods if not defined.
- Fix state_batches=None (most Policies don't have internal states).

* Cartpole tf learning.

* Cartpole tf AND torch learning (in ~ same ts).

* Cartpole tf AND torch learning (in ~ same ts). 2

* Cartpole tf (torch syntax-broken) learning (in ~ same ts). 3

* Cartpole tf AND torch learning (in ~ same ts). 4

* Cartpole tf AND torch learning (in ~ same ts). 5

* Cartpole tf AND torch learning (in ~ same ts). 6

* Cartpole tf AND torch learning (in ~ same ts). Pendulum tf learning.

* WIP.

* WIP.

* SAC torch learning Pendulum.

* WIP.

* SAC torch and tf learning Pendulum and Cartpole after cleanup.

* WIP.

* LINT.

* LINT.

* SAC: Move policy.target_model to policy.device as well.

* Fixes and cleanup.

* Fix data-format of tf keras Conv2d layers (broken for some tf-versions which have data_format="channels_first" as default).

* Fixes and LINT.

* Fixes and LINT.

* Fix and LINT.

* WIP.

* Test fixes and LINT.

* Fixes and LINT.

Co-authored-by: Sven Mika <sven@Svens-MacBook-Pro.local>
2020-04-15 13:25:16 +02:00
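The Policy cleanup described in the commit above (one required abstract method, defaults for the rest) can be sketched roughly as follows. This is a hypothetical illustration, not RLlib's Policy class. The class names, the constructor arguments, and the choice to make the gradient methods raise `NotImplementedError` by default are all assumptions; only the method names come from the commit message.

```python
from abc import ABC, abstractmethod


class SketchPolicy(ABC):
    """Hypothetical base class: only compute_actions must be overridden."""

    def __init__(self, observation_space, action_space, config):
        self.observation_space = observation_space
        self.action_space = action_space
        self.config = config  # stored once in the base constructor

    @abstractmethod
    def compute_actions(self, obs_batch, state_batches=None):
        """Return one action per observation in obs_batch."""

    def learn_on_batch(self, samples):
        # Default path (an assumption): compute gradients, then apply them.
        self.apply_gradients(self.compute_gradients(samples))
        return {}

    def compute_gradients(self, samples):
        # Optional: only needed if learn_on_batch is not overridden.
        raise NotImplementedError

    def apply_gradients(self, gradients):
        # Optional: only needed if learn_on_batch is not overridden.
        raise NotImplementedError


class ConstantPolicy(SketchPolicy):
    """Overrides learn_on_batch, so it needs no gradient methods."""

    def compute_actions(self, obs_batch, state_batches=None):
        return [self.config["action"] for _ in obs_batch]

    def learn_on_batch(self, samples):
        return {"num_samples": len(samples)}
```

A subclass that implements `learn_on_batch` directly, like `ConstantPolicy` here, never touches the gradient methods, which matches the commit's note that they are not required in that case.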
fangfengbin
efbaf155b2
[GCS] Add publish and subscribe function of gcs table (#7909) 2020-04-15 04:24:52 -07:00
Qing Wang
dfb0ad0d3e
[Java] Fix Java CI exit code issue (#8028) 2020-04-15 15:28:52 +08:00
Jan Blumenkamp
8e439688fc
Torch sequence_mask now works for tensors on different devices (#7980) 2020-04-15 07:21:51 +02:00
fangfengbin
c17404918c
[GCS] Add gcs table storage interface (#7949) 2020-04-15 10:48:12 +08:00