Commit graph

4393 commits

Author SHA1 Message Date
Sven Mika
80d314ae5e
[RLlib] Add all agents to rllib rollout tests. (#7534) 2020-03-12 11:02:51 -07:00
ZhuSenlin
b663bc6d67
Use gcs server to replace raylet monitor when RAY_GCS_SERVICE_ENABLED=true (#7166) 2020-03-12 22:13:56 +08:00
fangfengbin
428fb79b27
Fix streaming compile bug (#7577)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-03-12 17:26:45 +08:00
Eric Liang
f5d12a958b
[rllib] Port Ape-X to distributed execution API (#7497) 2020-03-12 00:54:08 -07:00
fangfengbin
4c834b9d68
Fix the issue that gcs service client ignores error status code (#7539)
* add gcs reply status

* rebase master

* use macro to simplify

* convert status in gcs rpc client

* define a Status message in protobuf

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-03-12 15:08:29 +08:00
Sven Mika
20ef4a8603
[RLlib] Cleanup/unify all test cases. (#7533) 2020-03-11 20:39:47 -07:00
Sven Mika
dded5b6d22
[RLlib] ES env_config is not a EnvContext object (e.g. does not contain worker_index). (#7560) 2020-03-11 20:33:20 -07:00
Sven Mika
bc120730e5
[RLlib] PPO(torch) on CartPole not tuned well enough for consistent learning (#7556) 2020-03-11 20:31:27 -07:00
Kai Yang
932a749fa9
Fix the java_worker_options parameter (#7537)
* fix Java CI

* Minor fix

* move json.loads out of build_java_worker_command

* lint

* fix cross language test
2020-03-12 10:44:23 +08:00
Markus Cozowicz
ba1b081477
Azure Portal cluster deployment | Support spot instances (#7558)
* added priority option

* added head node priority

* upgrade api version
2020-03-11 18:46:11 -07:00
Simon Mo
31d63d3ca7
Fix global state actors() call (#7567) 2020-03-11 16:59:50 -07:00
Richard Liaw
b38ed4be71
[raysgd] Fix More Docs (#7565) 2020-03-11 14:17:47 -07:00
Richard Liaw
d046faeb9c
[sgd] Readme fix (#7564)
* readme fix

* replicas
2020-03-11 13:40:18 -07:00
Richard Liaw
b70f31339c
[sgd] Benchmark Fixes (#7553)
* fix

* fix
2020-03-11 13:08:27 -07:00
Markus Cozowicz
ea99063c10
added json schema to setup.py (#7554) 2020-03-11 09:53:21 -07:00
mehrdadn
3b9caa98ba
Fix fate-sharing warning (#7545)
* Fix kernel_fate_sharing being None instead of False

* Remove fate-sharing warning

Co-authored-by: Mehrdad <noreply@github.com>
2020-03-11 08:27:54 -07:00
Richard Liaw
fbac256982
[sgd] Add benchmarks (#7454)
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* save

* failures

* fixes

* trainer

* run test

* operator

* code

* op

* ok done

* operator

* sgd test fixes

* ok

* trainer

* format

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update doc/source/raysgd/raysgd_pytorch.rst

* docstring

* dcgan

* doc

* commits

* nit

* testing

* revert

* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* benchmarks

* rename

* remove some args

* better metrics output

* fix up the benchmark

* benchmark-yaml

* horovod-benchmark

* benchmarks

* Remove benchmark code for cleanups

* benchmark-code

* nits

* benchmark yamls

* benchmark yaml

* ok

* ok

* ok

* benchmark

* nit

* finish_bench

* makedatacreator

* relax

* metrics

* autosetsampler

* profile

* movements

* OK

* smoothen

* fix

* nitdocs

* loss

* envflag

* comments

* nit

* format

* visible

* images

* move_images

* fix

* rerender

* rerender

* rest

* multgpu

* fix

* nit

* finish

* extra

* setup

* revert

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-11 01:09:08 -07:00
Markus Cozowicz
49439611f1
[autoscaler] Replace cluster yaml validation with json schema v… (#7261)
* replace manual cluster yaml validation with json schema
- improved error message
- support for intellisense in VSCode (or other IDEs)
- run linting
- moved schema to ray/autoscaler
- fixed typo
- remove importlib dependency

* Update python/ray/autoscaler/autoscaler.py

* read

* restrict allowed properties

* added unit test for invalid yaml
added ray[test] package (remove pytest from default dependencies)

* updated autoscaler test to use ValidationError exception

* add missing dependency

* added pytest

* removed parameterized dependency
reverted ray[test] intro

* removed parameterized

* fix_tests

* format

Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-10 18:58:55 -07:00
Richard Liaw
6163b21458
[raysgd] Better user errors! (#7546)
* format

* callable

* Update python/ray/util/sgd/torch/torch_trainer.py

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update python/ray/util/sgd/torch/torch_trainer.py

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* data

* torchtrainer

* num_rep

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-03-10 18:58:19 -07:00
Edward Oakes
7b609ca211
Remove instances of 'raise Exception' (#7523) 2020-03-10 17:51:22 -07:00
Stephanie Wang
fdb528514b
[core] Ref counting for actor handles (#7434)
* tmp

* Move Exit handler into CoreWorker, exit once owner's ref count goes to 0

* fix build

* Remove __ray_terminate__ and add test case for distributed ref counting

* lint

* Remove unused

* Fixes for detached actor, duplicate actor handles

* Remove unused

* Remove creation return ID

* Remove ObjectIDs from python, set references in CoreWorker

* Fix crash

* Fix memory crash

* Fix tests

* fix

* fixes

* fix tests

* fix java build

* fix build

* fix

* check status

* check status
2020-03-10 17:45:07 -07:00
Edward Oakes
119a303ea0
Remove static concurrency limit from gRPC server (#7544) 2020-03-10 16:27:02 -07:00
Edward Oakes
dbbf0c0e70
Add Apache 2 license to C++ files (#7520) 2020-03-10 16:07:17 -07:00
Eric Liang
be48e1964b
[rllib] Fix per-worker exploration in Ape-X; make more kwargs required for future safety (#7504)
* fix sched

* lint

* lint

* fix

* add unit test

* fix

* format

* fix test

* fix test
2020-03-10 11:14:14 -07:00
Richard Liaw
d192ef0611
[raysgd] Cleanup User API (#7384)
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* save

* failures

* fixes

* trainer

* run test

* operator

* code

* op

* ok done

* operator

* sgd test fixes

* ok

* trainer

* format

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update doc/source/raysgd/raysgd_pytorch.rst

* docstring

* dcgan

* doc

* commits

* nit

* testing

* revert

* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* benchmarks

* rename

* remove some args

* better metrics output

* fix up the benchmark

* benchmark-yaml

* horovod-benchmark

* benchmarks

* Remove benchmark code for cleanups

* makedatacreator

* relax

* metrics

* autosetsampler

* profile

* movements

* OK

* smoothen

* fix

* nitdocs

* loss

* comments

* fix

* fix

* runner_tests

* codes

* example

* fix_test

* fix

* tests

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-10 08:41:42 -07:00
Anthony Yu
89ec4adb72
[tune] Dragonfly Optimizer (#5955)
* Add sample example

* Copy relevant lines of ask from inherited Optimizer

* Ignore strategy

* Additional changes

* Add DragonflySearch for tune connector for Dragonfly

* Add example and fix small errors

* lint

* Remove skopt references

* Update example based off of Dragonfly changes

* Edit example for final Dragonfly edits

* Formatting and documentation edits

* Add documentation and add to test pipeline

* Address PR comments

* Fix Jenkins test

* Adjust Dragonfly to PR#7366

* Lint

* fix_tests

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-10 08:40:36 -07:00
fangfengbin
fa785a2ad2
ServiceBasedGcsClient support detect gcs server availability and retry (#7292) 2020-03-10 21:01:07 +08:00
mehrdadn
fc76586518
Redis on Windows (#7509)
* Switch hiredis on Windows to that of the Windows port of Redis

* Use boost::asio::ip::tcp::socket::native_handle_type

* Use normal hiredis instead of Windows-specific one

* Finish up using normal hiredis

Co-authored-by: Mehrdad <noreply@github.com>
2020-03-09 18:49:54 -07:00
Eric Liang
90e23a5c43
[iterators] Add duplicate() call and fix broken test case (#7510) 2020-03-09 17:18:52 -07:00
Edward Oakes
883ee4912d
Return reconcile.Result{}, not nil (#7521) 2020-03-09 16:27:15 -07:00
Edward Oakes
4ab80eafb9
Deprecate use_pickle flag (#7474) 2020-03-09 16:03:56 -07:00
Edward Oakes
0c254295b0
Remove experimental.signal API (#7477)
* Remove experimental.signal API

* fix test
2020-03-09 16:03:36 -07:00
Ujval Misra
023d4c02a9
[tune] Prevent deletion of checkpoint from user-initiated resto… (#7501)
* Fix restore bug

* Add test

* Lint

* Indent
2020-03-09 15:53:10 -07:00
Edward Oakes
08d4cb3822
[operator] Minor cleanup (#7498) 2020-03-09 11:23:46 -07:00
Edward Oakes
b4e2d5317e
Remove experimental.NoReturn (#7475) 2020-03-09 11:09:36 -07:00
Edward Oakes
27b4ffa98e
Improve k8s operator documentation (#7496) 2020-03-09 11:09:06 -07:00
Stephanie Wang
95bb0c5357
Upgrade plasma to latest version, use synchronous Seal (#7470)
* Upgrade arrow to master

* fix build

* todo

* lint

* Fix hanging test
2020-03-09 10:30:44 -07:00
Markus Cozowicz
e03259455f
[autoscaler] azure init script path (#7515) 2020-03-09 09:49:07 -07:00
Markus Cozowicz
145ebe14c7
added Azure Resource Manager (ARM) template (#7494)
* added Azure Resource Manager (ARM) template

* removed Azure doc (moved to separate PR)

* nit

* fixpaths

* nit

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-08 22:29:10 -07:00
Eric Liang
e7bc5c612d
Add testing strategy to PR template (#7505) 2020-03-08 15:16:49 -07:00
Sven Mika
f08687f550
[RLlib] rllib train crashes when using torch PPO/PG/A2C. (#7508)
* Fix.

* Rollback.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.

* TEST.
2020-03-08 13:03:18 -07:00
Sven Mika
bc637a2546
[Tune Jenkins tests] Add dm_tree to docker. (#7500)
* Fix.

* Rollback.

* Add dm_tree to docker examples and tune_test containers.
2020-03-07 23:16:00 -08:00
Eric Liang
a644060daa
[rllib] First pass at pipeline implementation of DQN (#7433)
* wip iters

* add test

* speed up

* update docs

* document it

* support serial sampling

* add test

* spacing

* annotate it

* update

* rename to pipeline

* comment

* iter2 wip

* update

* update

* context test

* update

* fix

* fix

* a3c pipeline

* doc

* update

* move timer

* comment

* add pipeline test

* fix

* clean up

* document

* iter s

* wip dqn

* wip

* wip

* metrics

* metrics rename

* metrics ctx

* wip

* constants

* add todo

* support .union

* wip

* support union

* remove prints

* add todo

* remove auto timer

* fix up

* fix pipeline test

* typing

* fix breakage

* remove bad assert

* wip

* fix multiagent example

* fixapply

* update a3c

* remove a2c pl

* 0 workers

* wip

* wip

* share metrics

* wip

* wip

* doc

* fix weight sync and global var updates

* mode

* fix

* fix

* doc

* fix
2020-03-07 14:47:58 -08:00
Landcold7
beb9b02dbd
Add numba test (#7298) (#7487) 2020-03-07 11:12:25 -08:00
Richard Liaw
115468de2c
[tune] Repeated evals (#7366)
* easyrepeat

* done

* suggest

* doc

* ok

* commit

* Apply suggestions from code review

Co-Authored-By: Ujval Misra <misraujval@gmail.com>

* Apply suggestions from code review

Co-Authored-By: Ujval Misra <misraujval@gmail.com>

* Apply suggestions from code review

* ok

* docs

Co-authored-by: Ujval Misra <misraujval@gmail.com>
2020-03-07 11:08:23 -08:00
mehrdadn
a8bda9b551
Fix incorrect handling of command-lines (#7439) 2020-03-06 15:51:49 -08:00
Sven Mika
876a1ba5bd
[RLlib] Issue 7421: can't convert cuda tensor to numpy in torch ppo. (#7445) 2020-03-06 12:45:30 -08:00
Sven Mika
510c850651
[RLlib] SAC add discrete action support. (#7320)
* Exploration API (+EpsilonGreedy sub-class).

* Exploration API (+EpsilonGreedy sub-class).

* Cleanup/LINT.

* Add `deterministic` to generic Trainer config (NOTE: this is still ignored by most Agents).

* Add `error` option to deprecation_warning().

* WIP.

* Bug fix: Get exploration-info for tf framework.
Bug fix: Properly deprecate some DQN config keys.

* WIP.

* LINT.

* WIP.

* Split PerWorkerEpsilonGreedy out of EpsilonGreedy.
Docstrings.

* Fix bug in sampler.py in case Policy has self.exploration = None

* Update rllib/agents/dqn/dqn.py

Co-Authored-By: Eric Liang <ekhliang@gmail.com>

* WIP.

* Update rllib/agents/trainer.py

Co-Authored-By: Eric Liang <ekhliang@gmail.com>

* WIP.

* Change requests.

* LINT

* In tune/utils/util.py::deep_update(), only keep deep-updating if both original and value are dicts. If value is not a dict, set

* Completely obsolete sync_replay_optimizer.py's parameters schedule_max_timesteps AND beta_annealing_fraction (replaced with prioritized_replay_beta_annealing_timesteps).

* Update rllib/evaluation/worker_set.py

Co-Authored-By: Eric Liang <ekhliang@gmail.com>

* Review fixes.

* Fix default value for DQN's exploration spec.

* LINT

* Fix recursion bug (wrong parent c'tor).

* Do not pass timestep to get_exploration_info.

* Update tf_policy.py

* Fix some remaining issues with test cases and remove more deprecated DQN/APEX exploration configs.

* Bug fix tf-action-dist

* DDPG incompatibility bug fix with new DQN exploration handling (which is imported by DDPG).

* Switch off exploration when getting action probs from off-policy-estimator's policy.

* LINT

* Fix test_checkpoint_restore.py.

* Deprecate all SAC exploration (unused) configs.

* Properly use `model.last_output()` everywhere. Instead of `model._last_output`.

* WIP.

* Take out set_epsilon from multi-agent-env test (not needed, decays anyway).

* WIP.

* Trigger re-test (flaky checkpoint-restore test).

* WIP.

* WIP.

* Add test case for deterministic action sampling in PPO.

* bug fix.

* Added deterministic test cases for different Agents.

* Fix problem with TupleActions in dynamic-tf-policy.

* Separate supported_spaces tests so they can be run separately for easier debugging.

* LINT.

* Fix autoregressive_action_dist.py test case.

* Re-test.

* Fix.

* Remove duplicate py_test rule from bazel.

* LINT.

* WIP.

* WIP.

* SAC fix.

* SAC fix.

* WIP.

* WIP.

* WIP.

* FIX 2 examples tests.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* Fix.

* LINT.

* Renamed test file.

* WIP.

* Add unittest.main.

* Make action_dist_class mandatory.

* fix

* FIX.

* WIP.

* WIP.

* Fix.

* Fix.

* Fix explorations test case (contextlib cannot find its own nullcontext??).

* Force torch to be installed for QMIX.

* LINT.

* Fix determine_tests_to_run.py.

* Fix determine_tests_to_run.py.

* WIP

* Add Random exploration component to tests (fixed issue with "static-graph randomness" via py_function).

* Add Random exploration component to tests (fixed issue with "static-graph randomness" via py_function).

* Rename some stuff.

* Rename some stuff.

* WIP.

* update.

* WIP.

* Gumbel Softmax Dist.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP

* WIP.

* WIP.

* Hypertune.

* Hypertune.

* Hypertune.

* Lock-in.

* Cleanup.

* LINT.

* Fix.

* Update rllib/policy/eager_tf_policy.py

Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>

* Update rllib/agents/sac/sac_policy.py

Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>

* Update rllib/agents/sac/sac_policy.py

Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>

* Update rllib/models/tf/tf_action_dist.py

Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>

* Update rllib/models/tf/tf_action_dist.py

Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>

* Fix items from review comments.

* Add dm_tree to RLlib dependencies.

* Add dm_tree to RLlib dependencies.

* Fix DQN test cases ((Torch)Categorical).

* Fix wrong pip install.

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Kristian Hartikainen <kristian.hartikainen@gmail.com>
2020-03-06 10:37:12 -08:00
Qing Wang
7a33a6ea3c
[Java] Enable skipped direct call cases (#7363)
* Comment out

* Refine

* Revert
2020-03-06 16:22:08 +08:00
Stephanie Wang
7c174d0ffe
Make the ref counting test more stressful (#7473) 2020-03-05 20:51:24 -08:00