Commit graph

4269 commits

Author SHA1 Message Date
Simon Mo
ce0885a897
[Serve] UI Improvements (#7569) 2020-03-16 22:23:16 -07:00
mehrdadn
a0700e2f86
Change /tmp to platform-specific temporary directory (#7529) 2020-03-16 18:10:14 -07:00
Eric Liang
797e6cfc2a
[rllib][tune] fix some nans (#7611) 2020-03-16 11:19:58 -07:00
ijrsvt
46953c53b1
Cleanup Plasma Async Callback (#7452) 2020-03-16 10:12:44 -07:00
Simon Mo
45ce40e5d4
Disable Travis Disk Cache (#7612)
There are some file sizes and memory issue with bazel disk cache
we will disable the cache and use remote cache exclusively for now
2020-03-16 00:18:01 -07:00
Scott Graham
37e4d29f87
[autoscaler] Adding Azure Support (#7080)
* adding directory and node_provider entry for azure autoscaler

* adding initial cut at azure autoscaler functionality, needs testing and node_provider methods need updating

* adding todos and switching to auth file for service principal authentication

* adding role / scope to service principal

* resolving issues with app credentials

* adding retry for setting service principal role

* typo and adding retry to nic creation

* adding nsg to config, moving nic/public ip to node provider, cleanup node_provider, leaving in NodeProvider stub for testing

* linting

* updating cleanup and fixing bugs

* adding directory and node_provider entry for azure autoscaler

* adding initial cut at azure autoscaler functionality, needs testing and node_provider methods need updating

* adding todos and switching to auth file for service principal authentication

* adding role / scope to service principal

* resolving issues with app credentials

* adding retry for setting service principal role

* typo and adding retry to nic creation

* adding nsg to config, moving nic/public ip to node provider, cleanup node_provider, leaving in NodeProvider stub for testing

* linting

* updating cleanup and fixing bugs

* minor fixes

* first working version :)

* added tag support

* added msi identity intermediate

* enable MSI through user managed identity

* updated schema

* extend yaml schema
remove service principal code
add re-use of managed user identity

* fix rg_id

* fix logging

* replace manual cluster yaml validation with json schema
- improved error message
- support for intellisense in VSCode (or other IDEs)

* run linting

* updating yaml configs and formatting

* updating yaml configs and formatting

* typo in example config

* pulling default config from example-full

* resetting min, init worker prop

* adding docs for azure autoscaler and fixing status

* add azure to docs, fix config for spot instances, update azure provider to avoid caching issues during deployment

* fix for default subscription in azure node provider

* vm dev image build

* minor change

* keeping example-full.yaml in autoscaler/azure, updating azure example config

* linting azure config

* extending retries on azure config

* lint

* support for internal ips, fix to azure docs, and new azure gpu example config

* linting

* Update python/ray/autoscaler/azure/node_provider.py

Co-Authored-By: Richard Liaw <rliaw@berkeley.edu>

* revert_this

* remove_schema

* updating configs and removing ssh keygen, tweak azure node provider terminate

* minor tweaks

Co-authored-by: Markus Cozowicz <marcozo@microsoft.com>
Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-15 14:48:27 -07:00
Simon Mo
3f1fcaa024
Blocking ray.get/wait inside async context will warn instead of error (#7262) 2020-03-14 22:02:30 -07:00
fangfengbin
6b37be9677
[GCS]Add job id when operating gcs table (#7592) 2020-03-15 12:04:04 +08:00
Kai Yang
630e48967d
[Java] Allow passing internal config from raylet to Java worker (#7532) 2020-03-15 12:03:38 +08:00
mehrdadn
a87199d240
Fix cyclic dependency between ray/util and ray/common (#7581)
* Fix cyclic dependency

Headers in ray/util should not depend on those in ray/common

* Move random generations to ray/common/test_util.h

* Add license header

Co-authored-by: Mehrdad <noreply@github.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2020-03-14 12:44:53 -07:00
tison
ffeab5d2bf
Support configurable python executable in format.sh (#7513) 2020-03-14 12:27:41 -07:00
Eric Liang
dd70720578
[rllib] Rename sample_batch_size => rollout_fragment_length (#7503)
* bulk rename

* deprecation warn

* update doc

* update fig

* line length

* rename

* make pytest comptaible

* fix test

* fi sys

* rename

* wip

* fix more

* lint

* update svg

* comments

* lint

* fix use of batch steps
2020-03-14 12:05:04 -07:00
Stephanie Wang
53549314c5
[core] Option to fallback to LRU on OutOfMemory (#7410)
* Add a test for LRU fallback

* Update error message

* Upgrade arrow to master

* Integrate with arrow

* Revert "Bazel mirrors (#7385)"

This reverts commit 44aded5272.

* Don't LRU evict

* Revert "Revert "Bazel mirrors (#7385)""

This reverts commit b6359fea78d1bd3925452ca88ac71e0c9e5c7dd3.

* Add lru_evict flag

* fix internal config

* Fix

* upgrade arrow

* debug

* Set free period in config for lru_evict, override max retries to fix
test

* Fix test?

* fix test

* Revert "debug"

This reverts commit 98f01c63a267f38218f5047b1866e4c1c8280017.

* fix exception str

* Fix ref count test

* Shorten travis test?
2020-03-14 11:28:43 -07:00
Eric Liang
52cf77f5a9
[rllib] SAC no_done_at_end should default to False (#7594)
* update

* update doc

* stochastic

* cleanu
2020-03-14 11:16:54 -07:00
Eric Liang
c3a8ba399f
[rllib] Enable distributed exec api for A2C, A3C, PG by default (#7580) 2020-03-13 18:48:41 -07:00
Anthony Yu
094125cf03
[tune] Dragonfly integration ask tell nit (#7593)
* Add sample example

* Copy relevant lines of ask from inherited Optimizer

* Ignore strategy

* Additional changes

* Add DragonflySearch for tune connector for Dragonfly

* Add example and fix small errors

* lint

* Remove skopt references

* Update example based off of Dragonfly changes

* Edit example for final Dragonfly edits

* Formatting and documentation edits

* Add documentation and add to test pipeline

* Address PR comments

* Fix Jenkins test

* Adjust Dragonfly to PR#7366

* Lint

* fix_tests

* Minor changes to ordering

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-13 15:27:03 -07:00
Qing Wang
d6365c2586
[Java] Enable stress test. (#7596) 2020-03-13 21:02:13 +08:00
Kai Yang
d6e8f47065
Add a flag to disable reconstruction for a killed actor (#7346) 2020-03-13 19:10:21 +08:00
Qing Wang
575c89cf47
[Java] Pass large object by reference (#7595) 2020-03-13 18:38:03 +08:00
Sven Mika
552cfb37ea
[RLlib] Fix bugs and speed up SegmentTree 2020-03-13 01:03:07 -07:00
Ujval Misra
6022eb53c4
[tune] Use newest checkpoint in normal operation (#7563)
* Use persistent checkpoint for failures

* Fix test

* Add unpause test

* move test

* Fix tests

* remove debug statement

* Mark test as flaky
2020-03-12 22:21:42 -07:00
Qing Wang
f4656d8cc3
[Java] Enable direct call by default. (#7408)
* WIP

* Address comments.

* Linting

* Fix

* Fix

* Fix test

* Fix

* Fix single process ci

* Fix ut

* Update java/test/src/main/java/org/ray/api/test/PlasmaFreeTest.java

* Address comments

* Fix linting

* Minor update comments.

* Fix streaming CI
2020-03-13 12:25:30 +08:00
Tianyi Chen
6993a471f1
[Streaming] Move resource-manager and scheduler to master package. (#7582) 2020-03-13 12:24:37 +08:00
micafan
cc91ed57dc
[core] Fix losing task state when giving up forward task. (#7525)
* fix NodeManager::Forward task bug on error

* fix lint

* revert spillback task forward
2020-03-13 11:49:44 +08:00
Edward Oakes
768d0b3b3f
Allocate a buffer of 100 calls for each RPC handler (#7573) 2020-03-12 12:05:30 -07:00
Sven Mika
f165766813
[RLlib] Bug: If trainer config horizon is provided, should try to increase env steps to that value. (#7531) 2020-03-12 11:03:37 -07:00
Sven Mika
80d314ae5e
[RLlib] Add all agents to rllib rollout tests. (#7534) 2020-03-12 11:02:51 -07:00
ZhuSenlin
b663bc6d67
Use gcs server to replace raylet monitor when RAY_GCS_SERVICE_ENABLED=true (#7166) 2020-03-12 22:13:56 +08:00
fangfengbin
428fb79b27
Fix streaming compile bug (#7577)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-03-12 17:26:45 +08:00
Eric Liang
f5d12a958b
[rllib] Port Ape-X to distributed execution API (#7497) 2020-03-12 00:54:08 -07:00
fangfengbin
4c834b9d68
Fix the issue that gcs service client ignores error status code (#7539)
* add gcs reply status

* rebase master

* use macro to simplify

* convert status in gcs rpc client

* define a Status message in probobuf

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-03-12 15:08:29 +08:00
Sven Mika
20ef4a8603
[RLlib] Cleanup/unify all test cases. (#7533) 2020-03-11 20:39:47 -07:00
Sven Mika
dded5b6d22
[RLlib] ES env_config is not a EnvContext object (e.g. does not contain worker_index). (#7560) 2020-03-11 20:33:20 -07:00
Sven Mika
bc120730e5
[RLlib] PPO(torch) on CartPole not tuned well enough for consistent learning (#7556) 2020-03-11 20:31:27 -07:00
Kai Yang
932a749fa9
Fix the java_worker_options parameter (#7537)
* fix Java CI

* Minor fix

* move json.loads out of build_java_worker_command

* lint

* fix cross language test
2020-03-12 10:44:23 +08:00
Markus Cozowicz
ba1b081477
Azure Portal cluster deployment | Support spot instances (#7558)
* added priority option

* added head node priority

* upgrade api version
2020-03-11 18:46:11 -07:00
Simon Mo
31d63d3ca7
Fix global state actors() call (#7567) 2020-03-11 16:59:50 -07:00
Richard Liaw
b38ed4be71
[raysgd] Fix More Docs (#7565) 2020-03-11 14:17:47 -07:00
Richard Liaw
d046faeb9c
[sgd] Readme fix (#7564)
* readme fix

* replicas
2020-03-11 13:40:18 -07:00
Richard Liaw
b70f31339c
[sgd] Benchmark Fixes (#7553)
* fix

* fix
2020-03-11 13:08:27 -07:00
Markus Cozowicz
ea99063c10
added json schema to setup.py (#7554) 2020-03-11 09:53:21 -07:00
mehrdadn
3b9caa98ba
Fix fate-sharing warning (#7545)
* Fix kernel_fate_sharing being None instead of False

* Remove fate-sharing warning

Co-authored-by: Mehrdad <noreply@github.com>
2020-03-11 08:27:54 -07:00
Richard Liaw
fbac256982
[sgd] Add benchmarks (#7454)
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint'

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* save

* failures

* fixes

* trainer

* run test

* operator

* code

* op

* ok done

* operator

* sgd test fixes

* ok

* trainer

* format

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update doc/source/raysgd/raysgd_pytorch.rst

* docstring

* dcgan

* doc

* commits

* nit

* testing

* revert

* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* benchmarks

* rename

* remove some args

* better metrics output

* fix up the benchmark

* benchmark-yaml

* horovod-benchmark

* benchmarks

* Remove benchmark code for cleanups

* benchmark-code

* nits

* benchmark yamls

* benchmark yaml

* ok

* ok

* ok

* benchmark

* nit

* finish_bench

* makedatacreator

* relax

* metrics

* autosetsampler

* profile

* movements

* OK

* smoothen

* fix

* nitdocs

* loss

* envflag

* comments

* nit

* format

* visible

* images

* move_images

* fix

* rernder

* rrender

* rest

* multgpu

* fix

* nit

* finish

* extrra

* setup

* revert

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-11 01:09:08 -07:00
Markus Cozowicz
49439611f1
[autoscaler] Replace cluster yaml validation with json schema v… (#7261)
* replace manual cluster yaml validation with json schema
- improved error message
- support for intellisense in VSCode (or other IDEs)
- run linting
- moved schema to ray/autoscaler
- fixed typo
- remove importlib dependency

* Update python/ray/autoscaler/autoscaler.py

* read

* restrict allowed properties

* added unit test for invalid yaml
added ray[test] package (remove pytest from default dependencies)

* updated autoscaler test to use ValidationError exception

* add missing dependency

* added pytest

* replace manual cluster yaml validation with json schema
- improved error message
- support for intellisense in VSCode (or other IDEs)
- run linting
- moved schema to ray/autoscaler
- fixed typo
- remove importlib dependency

* Update python/ray/autoscaler/autoscaler.py

* read

* restrict allowed properties

* added unit test for invalid yaml
added ray[test] package (remove pytest from default dependencies)

* updated autoscaler test to use ValidationError exception

* add missing dependency

* added pytest

* removed parameterized dependency
reverted ray[test] intro

* removed parameterized

* fix_tests

* format

Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-10 18:58:55 -07:00
Richard Liaw
6163b21458
[raysgd] Better user errors! (#7546)
* format

* callable

* Update python/ray/util/sgd/torch/torch_trainer.py

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update python/ray/util/sgd/torch/torch_trainer.py

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* data

* torchtrainer

* num_rep

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-03-10 18:58:19 -07:00
Edward Oakes
7b609ca211
Remove instances of 'raise Exception' (#7523) 2020-03-10 17:51:22 -07:00
Stephanie Wang
fdb528514b
[core] Ref counting for actor handles (#7434)
* tmp

* Move Exit handler into CoreWorker, exit once owner's ref count goes to 0

* fix build

* Remove __ray_terminate__ and add test case for distributed ref counting

* lint

* Remove unused

* Fixes for detached actor, duplicate actor handles

* Remove unused

* Remove creation return ID

* Remove ObjectIDs from python, set references in CoreWorker

* Fix crash

* Fix memory crash

* Fix tests

* fix

* fixes

* fix tests

* fix java build

* fix build

* fix

* check status

* check status
2020-03-10 17:45:07 -07:00
Edward Oakes
119a303ea0
Remove static concurrency limit from gRPC server (#7544) 2020-03-10 16:27:02 -07:00
Edward Oakes
dbbf0c0e70
Add Apache 2 license to C++ files (#7520) 2020-03-10 16:07:17 -07:00
Eric Liang
be48e1964b
[rllib] Fix per-worker exploration in Ape-X; make more kwargs required for future safety (#7504)
* fix sched

* lintc

* lint

* fix

* add unit test

* fix

* format

* fix test

* fix test
2020-03-10 11:14:14 -07:00