Simon Mo
ce0885a897
[Serve] UI Improvements ( #7569 )
2020-03-16 22:23:16 -07:00
mehrdadn
a0700e2f86
Change /tmp to platform-specific temporary directory ( #7529 )
2020-03-16 18:10:14 -07:00
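A minimal Python sketch of the idea behind #7529, resolving the temporary directory per platform instead of hard-coding /tmp. The helper name `session_dir` and the default sub-directory name are hypothetical and not taken from the commit; `tempfile.gettempdir()` is the standard-library call.

```python
import os
import tempfile


def session_dir(name="ray"):
    """Return a path under the platform's temp directory (hypothetical helper)."""
    # gettempdir() consults TMPDIR/TEMP/TMP and then a platform default,
    # so the result is not tied to /tmp.
    return os.path.join(tempfile.gettempdir(), name)


if __name__ == "__main__":
    print(session_dir())
```

The printed path differs by platform and environment, which is the point of the change.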
Eric Liang
797e6cfc2a
[rllib][tune] fix some nans ( #7611 )
2020-03-16 11:19:58 -07:00
ijrsvt
46953c53b1
Cleanup Plasma Async Callback ( #7452 )
2020-03-16 10:12:44 -07:00
Simon Mo
45ce40e5d4
Disable Travis Disk Cache ( #7612 )
...
There are some file-size and memory issues with the Bazel disk cache;
we will disable the cache and use the remote cache exclusively for now.
2020-03-16 00:18:01 -07:00
Scott Graham
37e4d29f87
[autoscaler] Adding Azure Support ( #7080 )
...
* adding directory and node_provider entry for azure autoscaler
* adding initial cut at azure autoscaler functionality, needs testing and node_provider methods need updating
* adding todos and switching to auth file for service principal authentication
* adding role / scope to service principal
* resolving issues with app credentials
* adding retry for setting service principal role
* typo and adding retry to nic creation
* adding nsg to config, moving nic/public ip to node provider, cleanup node_provider, leaving in NodeProvider stub for testing
* linting
* updating cleanup and fixing bugs
* minor fixes
* first working version :)
* added tag support
* added msi identity intermediate
* enable MSI through user managed identity
* updated schema
* extend yaml schema
remove service principal code
add re-use of managed user identity
* fix rg_id
* fix logging
* replace manual cluster yaml validation with json schema
- improved error message
- support for intellisense in VSCode (or other IDEs)
* run linting
* updating yaml configs and formatting
* updating yaml configs and formatting
* typo in example config
* pulling default config from example-full
* resetting min, init worker prop
* adding docs for azure autoscaler and fixing status
* add azure to docs, fix config for spot instances, update azure provider to avoid caching issues during deployment
* fix for default subscription in azure node provider
* vm dev image build
* minor change
* keeping example-full.yaml in autoscaler/azure, updating azure example config
* linting azure config
* extending retries on azure config
* lint
* support for internal ips, fix to azure docs, and new azure gpu example config
* linting
* Update python/ray/autoscaler/azure/node_provider.py
Co-Authored-By: Richard Liaw <rliaw@berkeley.edu>
* revert_this
* remove_schema
* updating configs and removing ssh keygen, tweak azure node provider terminate
* minor tweaks
Co-authored-by: Markus Cozowicz <marcozo@microsoft.com>
Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-15 14:48:27 -07:00
Simon Mo
3f1fcaa024
Blocking ray.get/wait inside async context will warn instead of error ( #7262 )
2020-03-14 22:02:30 -07:00
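The behavior named in #7262 (warn rather than raise when a blocking call runs inside an async context) could be sketched roughly as follows. This is not Ray's implementation: `in_async_context` and `blocking_get` are hypothetical names, and `fetch` stands in for whatever blocking operation would be performed. Only `asyncio.get_running_loop()` and `warnings.warn()` are real standard-library calls.

```python
import asyncio
import warnings


def in_async_context():
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
        return True
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False


def blocking_get(fetch):
    """Hypothetical stand-in for a blocking call such as ray.get."""
    if in_async_context():
        # Warn instead of raising: the call still completes, but it blocks
        # the event loop while it runs.
        warnings.warn(
            "blocking call inside async context; consider awaiting instead",
            RuntimeWarning,
            stacklevel=2,
        )
    return fetch()
```

Called from ordinary synchronous code, `blocking_get` emits no warning; called from a coroutine, it warns and still returns the value.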
fangfengbin
6b37be9677
[GCS] Add job id when operating gcs table ( #7592 )
2020-03-15 12:04:04 +08:00
Kai Yang
630e48967d
[Java] Allow passing internal config from raylet to Java worker ( #7532 )
2020-03-15 12:03:38 +08:00
mehrdadn
a87199d240
Fix cyclic dependency between ray/util and ray/common ( #7581 )
...
* Fix cyclic dependency
Headers in ray/util should not depend on those in ray/common
* Move random generations to ray/common/test_util.h
* Add license header
Co-authored-by: Mehrdad <noreply@github.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2020-03-14 12:44:53 -07:00
tison
ffeab5d2bf
Support configurable python executable in format.sh ( #7513 )
2020-03-14 12:27:41 -07:00
Eric Liang
dd70720578
[rllib] Rename sample_batch_size => rollout_fragment_length ( #7503 )
...
* bulk rename
* deprecation warn
* update doc
* update fig
* line length
* rename
* make pytest compatible
* fix test
* fi sys
* rename
* wip
* fix more
* lint
* update svg
* comments
* lint
* fix use of batch steps
2020-03-14 12:05:04 -07:00
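The "bulk rename" plus "deprecation warn" steps listed for #7503 suggest a compatibility shim along these lines. This is a sketch under assumptions, not RLlib's actual code: `RENAMED_KEYS` and `translate_deprecated_keys` are hypothetical names, and only the two config key names come from the commit title.

```python
import warnings

# Old key -> new key, as named in the commit title.
RENAMED_KEYS = {"sample_batch_size": "rollout_fragment_length"}


def translate_deprecated_keys(config):
    """Return a copy of config with deprecated keys renamed (hypothetical helper)."""
    result = dict(config)
    for old, new in RENAMED_KEYS.items():
        if old in result:
            warnings.warn(
                f"`{old}` is deprecated; use `{new}` instead",
                DeprecationWarning,
                stacklevel=2,
            )
            value = result.pop(old)
            # If the new key was set explicitly, it wins over the old one.
            result.setdefault(new, value)
    return result
```

Letting an explicitly set new key take precedence is one possible design choice; raising an error when both keys are present would be an equally reasonable alternative.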
Stephanie Wang
53549314c5
[core] Option to fallback to LRU on OutOfMemory ( #7410 )
...
* Add a test for LRU fallback
* Update error message
* Upgrade arrow to master
* Integrate with arrow
* Revert "Bazel mirrors (#7385)"
This reverts commit 44aded5272.
* Don't LRU evict
* Revert "Revert "Bazel mirrors (#7385)""
This reverts commit b6359fea78d1bd3925452ca88ac71e0c9e5c7dd3.
* Add lru_evict flag
* fix internal config
* Fix
* upgrade arrow
* debug
* Set free period in config for lru_evict, override max retries to fix test
* Fix test?
* fix test
* Revert "debug"
This reverts commit 98f01c63a267f38218f5047b1866e4c1c8280017.
* fix exception str
* Fix ref count test
* Shorten travis test?
2020-03-14 11:28:43 -07:00
Eric Liang
52cf77f5a9
[rllib] SAC no_done_at_end should default to False ( #7594 )
...
* update
* update doc
* stochastic
* cleanup
2020-03-14 11:16:54 -07:00
Eric Liang
c3a8ba399f
[rllib] Enable distributed exec api for A2C, A3C, PG by default ( #7580 )
2020-03-13 18:48:41 -07:00
Anthony Yu
094125cf03
[tune] Dragonfly integration ask tell nit ( #7593 )
...
* Add sample example
* Copy relevant lines of ask from inherited Optimizer
* Ignore strategy
* Additional changes
* Add DragonflySearch for tune connector for Dragonfly
* Add example and fix small errors
* lint
* Remove skopt references
* Update example based off of Dragonfly changes
* Edit example for final Dragonfly edits
* Formatting and documentation edits
* Add documentation and add to test pipeline
* Address PR comments
* Fix Jenkins test
* Adjust Dragonfly to PR#7366
* Lint
* fix_tests
* Minor changes to ordering
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-13 15:27:03 -07:00
Qing Wang
d6365c2586
[Java] Enable stress test. ( #7596 )
2020-03-13 21:02:13 +08:00
Kai Yang
d6e8f47065
Add a flag to disable reconstruction for a killed actor ( #7346 )
2020-03-13 19:10:21 +08:00
Qing Wang
575c89cf47
[Java] Pass large object by reference ( #7595 )
2020-03-13 18:38:03 +08:00
Sven Mika
552cfb37ea
[RLlib] Fix bugs and speed up SegmentTree
2020-03-13 01:03:07 -07:00
Ujval Misra
6022eb53c4
[tune] Use newest checkpoint in normal operation ( #7563 )
...
* Use persistent checkpoint for failures
* Fix test
* Add unpause test
* move test
* Fix tests
* remove debug statement
* Mark test as flaky
2020-03-12 22:21:42 -07:00
Qing Wang
f4656d8cc3
[Java] Enable direct call by default. ( #7408 )
...
* WIP
* Address comments.
* Linting
* Fix
* Fix
* Fix test
* Fix
* Fix single process ci
* Fix ut
* Update java/test/src/main/java/org/ray/api/test/PlasmaFreeTest.java
* Address comments
* Fix linting
* Minor update comments.
* Fix streaming CI
2020-03-13 12:25:30 +08:00
Tianyi Chen
6993a471f1
[Streaming] Move resource-manager and scheduler to master package. ( #7582 )
2020-03-13 12:24:37 +08:00
micafan
cc91ed57dc
[core] Fix losing task state when giving up forward task. ( #7525 )
...
* fix NodeManager::Forward task bug on error
* fix lint
* revert spillback task forward
2020-03-13 11:49:44 +08:00
Edward Oakes
768d0b3b3f
Allocate a buffer of 100 calls for each RPC handler ( #7573 )
2020-03-12 12:05:30 -07:00
Sven Mika
f165766813
[RLlib] Bug: If trainer config horizon is provided, should try to increase env steps to that value. ( #7531 )
2020-03-12 11:03:37 -07:00
Sven Mika
80d314ae5e
[RLlib] Add all agents to rllib rollout tests. ( #7534 )
2020-03-12 11:02:51 -07:00
ZhuSenlin
b663bc6d67
Use gcs server to replace raylet monitor when RAY_GCS_SERVICE_ENABLED=true ( #7166 )
2020-03-12 22:13:56 +08:00
fangfengbin
428fb79b27
Fix streaming compile bug ( #7577 )
...
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-03-12 17:26:45 +08:00
Eric Liang
f5d12a958b
[rllib] Port Ape-X to distributed execution API ( #7497 )
2020-03-12 00:54:08 -07:00
fangfengbin
4c834b9d68
Fix the issue that gcs service client ignores error status code ( #7539 )
...
* add gcs reply status
* rebase master
* use macro to simplify
* convert status in gcs rpc client
* define a Status message in protobuf
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-03-12 15:08:29 +08:00
Sven Mika
20ef4a8603
[RLlib] Cleanup/unify all test cases. ( #7533 )
2020-03-11 20:39:47 -07:00
Sven Mika
dded5b6d22
[RLlib] ES env_config is not an EnvContext object (e.g. does not contain worker_index). ( #7560 )
2020-03-11 20:33:20 -07:00
Sven Mika
bc120730e5
[RLlib] PPO(torch) on CartPole not tuned well enough for consistent learning ( #7556 )
2020-03-11 20:31:27 -07:00
Kai Yang
932a749fa9
Fix the java_worker_options parameter ( #7537 )
...
* fix Java CI
* Minor fix
* move json.loads out of build_java_worker_command
* lint
* fix cross language test
2020-03-12 10:44:23 +08:00
Markus Cozowicz
ba1b081477
Azure Portal cluster deployment | Support spot instances ( #7558 )
...
* added priority option
* added head node priority
* upgrade api version
2020-03-11 18:46:11 -07:00
Simon Mo
31d63d3ca7
Fix global state actors() call ( #7567 )
2020-03-11 16:59:50 -07:00
Richard Liaw
b38ed4be71
[raysgd] Fix More Docs ( #7565 )
2020-03-11 14:17:47 -07:00
Richard Liaw
d046faeb9c
[sgd] Readme fix ( #7564 )
...
* readme fix
* replicas
2020-03-11 13:40:18 -07:00
Richard Liaw
b70f31339c
[sgd] Benchmark Fixes ( #7553 )
...
* fix
* fix
2020-03-11 13:08:27 -07:00
Markus Cozowicz
ea99063c10
added json schema to setup.py ( #7554 )
2020-03-11 09:53:21 -07:00
mehrdadn
3b9caa98ba
Fix fate-sharing warning ( #7545 )
...
* Fix kernel_fate_sharing being None instead of False
* Remove fate-sharing warning
Co-authored-by: Mehrdad <noreply@github.com>
2020-03-11 08:27:54 -07:00
Richard Liaw
fbac256982
[sgd] Add benchmarks ( #7454 )
...
* Init fp16
* fp16 and schedulers
* scheduler linking and fp16
* to fp16
* loss scaling and documentation
* more documentation
* add tests, refactor config
* moredocs
* more docs
* fix logo, add test mode, add fp16 flag
* fix tests
* fix scheduler
* fix apex
* improve safety
* fix tests
* fix tests
* remove pin memory default
* rm
* fix
* Update doc/examples/doc_code/raysgd_torch_signatures.py
* fix
* migrate changes from other PR
* ok thanks
* pass
* signatures
* lint'
* Update python/ray/experimental/sgd/pytorch/utils.py
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* should address most comments
* comments
* fix this ci
* first_pass
* add overrides
* override
* fixing up operators
* format
* sgd
* constants
* rm
* revert
* save
* failures
* fixes
* trainer
* run test
* operator
* code
* op
* ok done
* operator
* sgd test fixes
* ok
* trainer
* format
* Apply suggestions from code review
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Update doc/source/raysgd/raysgd_pytorch.rst
* docstring
* dcgan
* doc
* commits
* nit
* testing
* revert
* Start renaming pytorch to torch
* Rename PyTorchTrainer to TorchTrainer
* Rename PyTorch runners to Torch runners
* Finish renaming API
* Rename to torch in tests
* Finish renaming docs + tests
* Run format + fix DeprecationWarning
* fix
* move tests up
* benchmarks
* rename
* remove some args
* better metrics output
* fix up the benchmark
* benchmark-yaml
* horovod-benchmark
* benchmarks
* Remove benchmark code for cleanups
* benchmark-code
* nits
* benchmark yamls
* benchmark yaml
* ok
* ok
* ok
* benchmark
* nit
* finish_bench
* makedatacreator
* relax
* metrics
* autosetsampler
* profile
* movements
* OK
* smoothen
* fix
* nitdocs
* loss
* envflag
* comments
* nit
* format
* visible
* images
* move_images
* fix
* rernder
* rrender
* rest
* multgpu
* fix
* nit
* finish
* extra
* setup
* revert
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-11 01:09:08 -07:00
Markus Cozowicz
49439611f1
[autoscaler] Replace cluster yaml validation with json schema v… ( #7261 )
...
* replace manual cluster yaml validation with json schema
- improved error message
- support for intellisense in VSCode (or other IDEs)
- run linting
- moved schema to ray/autoscaler
- fixed typo
- remove importlib dependency
* Update python/ray/autoscaler/autoscaler.py
* read
* restrict allowed properties
* added unit test for invalid yaml
added ray[test] package (remove pytest from default dependencies)
* updated autoscaler test to use ValidationError exception
* add missing dependency
* added pytest
* removed parameterized dependency
reverted ray[test] intro
* removed parameterized
* fix_tests
* format
Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-10 18:58:55 -07:00
Richard Liaw
6163b21458
[raysgd] Better user errors! ( #7546 )
...
* format
* callable
* Update python/ray/util/sgd/torch/torch_trainer.py
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* Update python/ray/util/sgd/torch/torch_trainer.py
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
* data
* torchtrainer
* num_rep
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-03-10 18:58:19 -07:00
Edward Oakes
7b609ca211
Remove instances of 'raise Exception' ( #7523 )
2020-03-10 17:51:22 -07:00
Stephanie Wang
fdb528514b
[core] Ref counting for actor handles ( #7434 )
...
* tmp
* Move Exit handler into CoreWorker, exit once owner's ref count goes to 0
* fix build
* Remove __ray_terminate__ and add test case for distributed ref counting
* lint
* Remove unused
* Fixes for detached actor, duplicate actor handles
* Remove unused
* Remove creation return ID
* Remove ObjectIDs from python, set references in CoreWorker
* Fix crash
* Fix memory crash
* Fix tests
* fix
* fixes
* fix tests
* fix java build
* fix build
* fix
* check status
* check status
2020-03-10 17:45:07 -07:00
Edward Oakes
119a303ea0
Remove static concurrency limit from gRPC server ( #7544 )
2020-03-10 16:27:02 -07:00
Edward Oakes
dbbf0c0e70
Add Apache 2 license to C++ files ( #7520 )
2020-03-10 16:07:17 -07:00
Eric Liang
be48e1964b
[rllib] Fix per-worker exploration in Ape-X; make more kwargs required for future safety ( #7504 )
...
* fix sched
* lintc
* lint
* fix
* add unit test
* fix
* format
* fix test
* fix test
2020-03-10 11:14:14 -07:00