Sven Mika
80d314ae5e
[RLlib] Add all agents to rllib rollout tests. ( #7534 )
2020-03-12 11:02:51 -07:00
ZhuSenlin
b663bc6d67
Use gcs server to replace raylet monitor when RAY_GCS_SERVICE_ENABLED=true ( #7166 )
2020-03-12 22:13:56 +08:00
fangfengbin
428fb79b27
Fix streaming compile bug ( #7577 )
...
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-03-12 17:26:45 +08:00
Eric Liang
f5d12a958b
[rllib] Port Ape-X to distributed execution API ( #7497 )
2020-03-12 00:54:08 -07:00
fangfengbin
4c834b9d68
Fix the issue that gcs service client ignores error status code ( #7539 )
...
* add gcs reply status
* rebase master
* use macro to simplify
* convert status in gcs rpc client
* define a Status message in protobuf
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-03-12 15:08:29 +08:00
Sven Mika
20ef4a8603
[RLlib] Cleanup/unify all test cases. ( #7533 )
2020-03-11 20:39:47 -07:00
Sven Mika
dded5b6d22
[RLlib] ES env_config is not an EnvContext object (e.g. does not contain worker_index). ( #7560 )
2020-03-11 20:33:20 -07:00
Sven Mika
bc120730e5
[RLlib] PPO(torch) on CartPole not tuned well enough for consistent learning ( #7556 )
2020-03-11 20:31:27 -07:00
Kai Yang
932a749fa9
Fix the java_worker_options parameter ( #7537 )
...
* fix Java CI
* Minor fix
* move json.loads out of build_java_worker_command
* lint
* fix cross language test
2020-03-12 10:44:23 +08:00
Markus Cozowicz
ba1b081477
Azure Portal cluster deployment | Support spot instances ( #7558 )
...
* added priority option
* added head node priority
* upgrade api version
2020-03-11 18:46:11 -07:00
Simon Mo
31d63d3ca7
Fix global state actors() call ( #7567 )
2020-03-11 16:59:50 -07:00
Richard Liaw
b38ed4be71
[raysgd] Fix More Docs ( #7565 )
2020-03-11 14:17:47 -07:00
Richard Liaw
d046faeb9c
[sgd] Readme fix ( #7564 )
...
* readme fix
* replicas
2020-03-11 13:40:18 -07:00
Richard Liaw
b70f31339c
[sgd] Benchmark Fixes ( #7553 )
...
* fix
* fix
2020-03-11 13:08:27 -07:00
Markus Cozowicz
ea99063c10
added json schema to setup.py ( #7554 )
2020-03-11 09:53:21 -07:00
mehrdadn
3b9caa98ba
Fix fate-sharing warning ( #7545 )
...
* Fix kernel_fate_sharing being None instead of False
* Remove fate-sharing warning
Co-authored-by: Mehrdad <noreply@github.com>
2020-03-11 08:27:54 -07:00
Richard Liaw
fbac256982
[sgd] Add benchmarks ( #7454 )
...
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* moredocs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint'
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* save
* failures
* fixes
* trainer
* run test
* operator
* code
* op
* ok done
* operator
* sgd test fixes
* ok
* trainer
* format
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Update doc/source/raysgd/raysgd_pytorch.rst
* docstring
* dcgan
* doc
* commits
* nit
* testing
* revert
* Start renaming pytorch to torch
* Rename PyTorchTrainer to TorchTrainer
* Rename PyTorch runners to Torch runners
* Finish renaming API
* Rename to torch in tests
* Finish renaming docs + tests
* Run format + fix DeprecationWarning
* fix
* move tests up
* benchmarks
* rename
* remove some args
* better metrics output
* fix up the benchmark
* benchmark-yaml
* horovod-benchmark
* benchmarks
* Remove benchmark code for cleanups
* benchmark-code
* nits
* benchmark yamls
* benchmark yaml
* ok
* ok
* ok
* benchmark
* nit
* finish_bench
* makedatacreator
* relax
* metrics
* autosetsampler
* profile
* movements
* OK
* smoothen
* fix
* nitdocs
* loss
* envflag
* comments
* nit
* format
* visible
* images
* move_images
* fix
* rernder
* rrender
* rest
* multgpu
* fix
* nit
* finish
* extrra
* setup
* revert
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-11 01:09:08 -07:00
Markus Cozowicz
49439611f1
[autoscaler] Replace cluster yaml validation with json schema v… ( #7261 )
...
* replace manual cluster yaml validation with json schema
- improved error message
- support for intellisense in VSCode (or other IDEs)
- run linting
- moved schema to ray/autoscaler
- fixed typo
- remove importlib dependency
* Update python/ray/autoscaler/autoscaler.py
* read
* restrict allowed properties
* added unit test for invalid yaml
added ray[test] package (remove pytest from default dependencies)
* updated autoscaler test to use ValidationError exception
* add missing dependency
* added pytest
* removed parameterized dependency
reverted ray[test] intro
* removed parameterized
* fix_tests
* format
Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-10 18:58:55 -07:00
Richard Liaw
6163b21458
[raysgd] Better user errors! ( #7546 )
...
* format
* callable
* Update python/ray/util/sgd/torch/torch_trainer.py
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Update python/ray/util/sgd/torch/torch_trainer.py
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* data
* torchtrainer
* num_rep
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-03-10 18:58:19 -07:00
Edward Oakes
7b609ca211
Remove instances of 'raise Exception' ( #7523 )
2020-03-10 17:51:22 -07:00
Stephanie Wang
fdb528514b
[core] Ref counting for actor handles ( #7434 )
...
* tmp
* Move Exit handler into CoreWorker, exit once owner's ref count goes to 0
* fix build
* Remove __ray_terminate__ and add test case for distributed ref counting
* lint
* Remove unused
* Fixes for detached actor, duplicate actor handles
* Remove unused
* Remove creation return ID
* Remove ObjectIDs from python, set references in CoreWorker
* Fix crash
* Fix memory crash
* Fix tests
* fix
* fixes
* fix tests
* fix java build
* fix build
* fix
* check status
* check status
2020-03-10 17:45:07 -07:00
Edward Oakes
119a303ea0
Remove static concurrency limit from gRPC server ( #7544 )
2020-03-10 16:27:02 -07:00
Edward Oakes
dbbf0c0e70
Add Apache 2 license to C++ files ( #7520 )
2020-03-10 16:07:17 -07:00
Eric Liang
be48e1964b
[rllib] Fix per-worker exploration in Ape-X; make more kwargs required for future safety ( #7504 )
...
* fix sched
* lintc
* lint
* fix
* add unit test
* fix
* format
* fix test
* fix test
2020-03-10 11:14:14 -07:00
Richard Liaw
d192ef0611
[raysgd] Cleanup User API ( #7384 )
...
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* moredocs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint'
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* save
* failures
* fixes
* trainer
* run test
* operator
* code
* op
* ok done
* operator
* sgd test fixes
* ok
* trainer
* format
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Update doc/source/raysgd/raysgd_pytorch.rst
* docstring
* dcgan
* doc
* commits
* nit
* testing
* revert
* Start renaming pytorch to torch
* Rename PyTorchTrainer to TorchTrainer
* Rename PyTorch runners to Torch runners
* Finish renaming API
* Rename to torch in tests
* Finish renaming docs + tests
* Run format + fix DeprecationWarning
* fix
* move tests up
* benchmarks
* rename
* remove some args
* better metrics output
* fix up the benchmark
* benchmark-yaml
* horovod-benchmark
* benchmarks
* Remove benchmark code for cleanups
* makedatacreator
* relax
* metrics
* autosetsampler
* profile
* movements
* OK
* smoothen
* fix
* nitdocs
* loss
* comments
* fix
* fix
* runner_tests
* codes
* example
* fix_test
* fix
* tests
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-10 08:41:42 -07:00
Anthony Yu
89ec4adb72
[tune] Dragonfly Optimizer ( #5955 )
...
* Add sample example
* Copy relevant lines of ask from inherited Optimizer
* Ignore strategy
* Additional changes
* Add DragonflySearch for tune connector for Dragonfly
* Add example and fix small errors
* lint
* Remove skopt references
* Update example based off of Dragonfly changes
* Edit example for final Dragonfly edits
* Formatting and documentation edits
* Add documentation and add to test pipeline
* Address PR comments
* Fix Jenkins test
* Adjust Dragonfly to PR#7366
* Lint
* fix_tests
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-10 08:40:36 -07:00
fangfengbin
fa785a2ad2
ServiceBasedGcsClient: support detecting gcs server availability and retrying ( #7292 )
2020-03-10 21:01:07 +08:00
mehrdadn
fc76586518
Redis on Windows ( #7509 )
...
* Switch hiredis on Windows to that of the Windows port of Redis
* Use boost::asio::ip::tcp::socket::native_handle_type
* Use normal hiredis instead of Windows-specific one
* Finish up using normal hiredis
Co-authored-by: Mehrdad <noreply@github.com>
2020-03-09 18:49:54 -07:00
Eric Liang
90e23a5c43
[iterators] Add duplicate() call and fix broken test case ( #7510 )
2020-03-09 17:18:52 -07:00
Edward Oakes
883ee4912d
Return reconcile.Result{}, not nil ( #7521 )
2020-03-09 16:27:15 -07:00
Edward Oakes
4ab80eafb9
Deprecate use_pickle flag ( #7474 )
2020-03-09 16:03:56 -07:00
Edward Oakes
0c254295b0
Remove experimental.signal API ( #7477 )
...
* Remove experimental.signal API
* fix test
2020-03-09 16:03:36 -07:00
Ujval Misra
023d4c02a9
[tune] Prevent deletion of checkpoint from user-initiated resto… ( #7501 )
...
* Fix restore bug
* Add test
* Lint
* Indent
2020-03-09 15:53:10 -07:00
Edward Oakes
08d4cb3822
[operator] Minor cleanup ( #7498 )
2020-03-09 11:23:46 -07:00
Edward Oakes
b4e2d5317e
Remove experimental.NoReturn ( #7475 )
2020-03-09 11:09:36 -07:00
Edward Oakes
27b4ffa98e
Improve k8s operator documentation ( #7496 )
2020-03-09 11:09:06 -07:00
Stephanie Wang
95bb0c5357
Upgrade plasma to latest version, use synchronous Seal ( #7470 )
...
* Upgrade arrow to master
* fix build
* todo
* lint
* Fix hanging test
2020-03-09 10:30:44 -07:00
Markus Cozowicz
e03259455f
[autoscaler] azure init script path ( #7515 )
2020-03-09 09:49:07 -07:00
Markus Cozowicz
145ebe14c7
added Azure Resource Manager (ARM) template ( #7494 )
...
* added Azure Resource Manager (ARM) template
* removed Azure doc (moved to separate PR)
* nit
* fixpaths
* nit
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-08 22:29:10 -07:00
Eric Liang
e7bc5c612d
Add testing strategy to PR template ( #7505 )
2020-03-08 15:16:49 -07:00
Sven Mika
f08687f550
[RLlib] rllib train crashes when using torch PPO/PG/A2C. ( #7508 )
...
* Fix.
* Rollback.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
* TEST.
2020-03-08 13:03:18 -07:00
Sven Mika
bc637a2546
[Tune Jenkins tests] Add dm_tree to docker. ( #7500 )
...
* Fix.
* Rollback.
* Add dm_tree to docker examples and tune_test containers.
2020-03-07 23:16:00 -08:00
Eric Liang
a644060daa
[rllib] First pass at pipeline implementation of DQN ( #7433 )
...
* wip iters
* add test
* speed up
* update docs
* document it
* support serial sampling
* add test
* spacing
* annotate it
* update
* rename to pipeline
* comment
* iter2 wip
* update
* update
* context test
* update
* fix
* fix
* a3c pipeline
* doc
* update
* move timer
* comment
* add pipeline test
* fix
* clean up
* document
* iter s
* wip dqn
* wip
* wip
* metrics
* metrics rename
* metrics ctx
* wip
* constants
* add todo
* support .union
* wip
* support union
* remove prints
* add todo
* remove auto timer
* fix up
* fix pipeline test
* typing
* fix breakage
* remove bad assert
* wip
* fix multiagent example
* fixapply
* update a3c
* remove a2c pl
* 0 workers
* wip
* wip
* share metrics
* wip
* wip
* doc
* fix weight sync and global var updates
* mode
* fix
* fix
* doc
* fix
2020-03-07 14:47:58 -08:00
Landcold7
beb9b02dbd
Add numba test ( #7298 ) ( #7487 )
2020-03-07 11:12:25 -08:00
Richard Liaw
115468de2c
[tune] Repeated evals ( #7366 )
...
* easyrepeat
* done
* suggest
* doc
* ok
* commit
* Apply suggestions from code review
Co-Authored-By: Ujval Misra <misraujval@gmail.com>
* Apply suggestions from code review
Co-Authored-By: Ujval Misra <misraujval@gmail.com>
* Apply suggestions from code review
* ok
* docs
Co-authored-by: Ujval Misra <misraujval@gmail.com>
2020-03-07 11:08:23 -08:00
mehrdadn
a8bda9b551
Fix incorrect handling of command-lines ( #7439 )
2020-03-06 15:51:49 -08:00
Sven Mika
876a1ba5bd
[RLlib] Issue 7421: can't convert cuda tensor to numpy in torch ppo. ( #7445 )
2020-03-06 12:45:30 -08:00
Sven Mika
510c850651
[RLlib] SAC add discrete action support. ( #7320 )
...
* Exploration API (+EpsilonGreedy sub-class).
* Exploration API (+EpsilonGreedy sub-class).
* Cleanup/LINT.
* Add `deterministic` to generic Trainer config (NOTE: this is still ignored by most Agents).
* Add `error` option to deprecation_warning().
* WIP.
* Bug fix: Get exploration-info for tf framework.
Bug fix: Properly deprecate some DQN config keys.
* WIP.
* LINT.
* WIP.
* Split PerWorkerEpsilonGreedy out of EpsilonGreedy.
Docstrings.
* Fix bug in sampler.py in case Policy has self.exploration = None
* Update rllib/agents/dqn/dqn.py
Co-Authored-By: Eric Liang <ekhliang@gmail.com>
* WIP.
* Update rllib/agents/trainer.py
Co-Authored-By: Eric Liang <ekhliang@gmail.com>
* WIP.
* Change requests.
* LINT
* In tune/utils/util.py::deep_update() Only keep deep_updat'ing if both original and value are dicts. If value is not a dict, set
* Completely obsolete syn_replay_optimizer.py's parameters schedule_max_timesteps AND beta_annealing_fraction (replaced with prioritized_replay_beta_annealing_timesteps).
* Update rllib/evaluation/worker_set.py
Co-Authored-By: Eric Liang <ekhliang@gmail.com>
* Review fixes.
* Fix default value for DQN's exploration spec.
* LINT
* Fix recursion bug (wrong parent c'tor).
* Do not pass timestep to get_exploration_info.
* Update tf_policy.py
* Fix some remaining issues with test cases and remove more deprecated DQN/APEX exploration configs.
* Bug fix tf-action-dist
* DDPG incompatibility bug fix with new DQN exploration handling (which is imported by DDPG).
* Switch off exploration when getting action probs from off-policy-estimator's policy.
* LINT
* Fix test_checkpoint_restore.py.
* Deprecate all SAC exploration (unused) configs.
* Properly use `model.last_output()` everywhere instead of `model._last_output`.
* WIP.
* Take out set_epsilon from multi-agent-env test (not needed, decays anyway).
* WIP.
* Trigger re-test (flaky checkpoint-restore test).
* WIP.
* WIP.
* Add test case for deterministic action sampling in PPO.
* bug fix.
* Added deterministic test cases for different Agents.
* Fix problem with TupleActions in dynamic-tf-policy.
* Separate supported_spaces tests so they can be run separately for easier debugging.
* LINT.
* Fix autoregressive_action_dist.py test case.
* Re-test.
* Fix.
* Remove duplicate py_test rule from bazel.
* LINT.
* WIP.
* WIP.
* SAC fix.
* SAC fix.
* WIP.
* WIP.
* WIP.
* FIX 2 examples tests.
* WIP.
* WIP.
* WIP.
* WIP.
* WIP.
* Fix.
* LINT.
* Renamed test file.
* WIP.
* Add unittest.main.
* Make action_dist_class mandatory.
* fix
* FIX.
* WIP.
* WIP.
* Fix.
* Fix.
* Fix explorations test case (contextlib cannot find its own nullcontext??).
* Force torch to be installed for QMIX.
* LINT.
* Fix determine_tests_to_run.py.
* Fix determine_tests_to_run.py.
* WIP
* Add Random exploration component to tests (fixed issue with "static-graph randomness" via py_function).
* Add Random exploration component to tests (fixed issue with "static-graph randomness" via py_function).
* Rename some stuff.
* Rename some stuff.
* WIP.
* update.
* WIP.
* Gumbel Softmax Dist.
* WIP.
* WIP.
* WIP.
* WIP.
* WIP.
* WIP.
* WIP
* WIP.
* WIP.
* Hypertune.
* Hypertune.
* Hypertune.
* Lock-in.
* Cleanup.
* LINT.
* Fix.
* Update rllib/policy/eager_tf_policy.py
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
* Update rllib/agents/sac/sac_policy.py
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
* Update rllib/agents/sac/sac_policy.py
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
* Update rllib/models/tf/tf_action_dist.py
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
* Update rllib/models/tf/tf_action_dist.py
Co-Authored-By: Kristian Hartikainen <kristian.hartikainen@gmail.com>
* Fix items from review comments.
* Add dm_tree to RLlib dependencies.
* Add dm_tree to RLlib dependencies.
* Fix DQN test cases ((Torch)Categorical).
* Fix wrong pip install.
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Kristian Hartikainen <kristian.hartikainen@gmail.com>
2020-03-06 10:37:12 -08:00
Qing Wang
7a33a6ea3c
[Java] Enable skipped direct call cases ( #7363 )
...
* Comment out
* Refine
* Revert
2020-03-06 16:22:08 +08:00
Stephanie Wang
7c174d0ffe
Make the ref counting test more stressful ( #7473 )
2020-03-05 20:51:24 -08:00