mirror of https://github.com/vale981/ray synced 2025-03-19 09:36:40 -04:00
Commit graph

5517 commits

Author SHA1 Message Date
Eric Liang
797e6cfc2a
[rllib][tune] fix some nans () 2020-03-16 11:19:58 -07:00
ijrsvt
46953c53b1
Cleanup Plasma Async Callback () 2020-03-16 10:12:44 -07:00
Simon Mo
45ce40e5d4
Disable Travis Disk Cache ()
There are some file size and memory issues with the Bazel disk cache,
so we will disable the cache and use the remote cache exclusively for now.
2020-03-16 00:18:01 -07:00
Scott Graham
37e4d29f87
[autoscaler] Adding Azure Support ()
* adding directory and node_provider entry for azure autoscaler

* adding initial cut at azure autoscaler functionality, needs testing and node_provider methods need updating

* adding todos and switching to auth file for service principal authentication

* adding role / scope to service principal

* resolving issues with app credentials

* adding retry for setting service principal role

* typo and adding retry to nic creation

* adding nsg to config, moving nic/public ip to node provider, cleanup node_provider, leaving in NodeProvider stub for testing

* linting

* updating cleanup and fixing bugs

* minor fixes

* first working version :)

* added tag support

* added msi identity intermediate

* enable MSI through user managed identity

* updated schema

* extend yaml schema
remove service principal code
add re-use of managed user identity

* fix rg_id

* fix logging

* replace manual cluster yaml validation with json schema
- improved error message
- support for intellisense in VSCode (or other IDEs)

* run linting

* updating yaml configs and formatting

* updating yaml configs and formatting

* typo in example config

* pulling default config from example-full

* resetting min, init worker prop

* adding docs for azure autoscaler and fixing status

* add azure to docs, fix config for spot instances, update azure provider to avoid caching issues during deployment

* fix for default subscription in azure node provider

* vm dev image build

* minor change

* keeping example-full.yaml in autoscaler/azure, updating azure example config

* linting azure config

* extending retries on azure config

* lint

* support for internal ips, fix to azure docs, and new azure gpu example config

* linting

* Update python/ray/autoscaler/azure/node_provider.py

Co-Authored-By: Richard Liaw <rliaw@berkeley.edu>

* revert_this

* remove_schema

* updating configs and removing ssh keygen, tweak azure node provider terminate

* minor tweaks

Co-authored-by: Markus Cozowicz <marcozo@microsoft.com>
Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-15 14:48:27 -07:00
Simon Mo
3f1fcaa024
Blocking ray.get/wait inside async context will warn instead of error () 2020-03-14 22:02:30 -07:00
fangfengbin
6b37be9677
[GCS]Add job id when operating gcs table () 2020-03-15 12:04:04 +08:00
Kai Yang
630e48967d
[Java] Allow passing internal config from raylet to Java worker () 2020-03-15 12:03:38 +08:00
mehrdadn
a87199d240
Fix cyclic dependency between ray/util and ray/common ()
* Fix cyclic dependency

Headers in ray/util should not depend on those in ray/common

* Move random generations to ray/common/test_util.h

* Add license header

Co-authored-by: Mehrdad <noreply@github.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2020-03-14 12:44:53 -07:00
tison
ffeab5d2bf
Support configurable python executable in format.sh () 2020-03-14 12:27:41 -07:00
Eric Liang
dd70720578
[rllib] Rename sample_batch_size => rollout_fragment_length ()
* bulk rename

* deprecation warn

* update doc

* update fig

* line length

* rename

* make pytest compatible

* fix test

* fi sys

* rename

* wip

* fix more

* lint

* update svg

* comments

* lint

* fix use of batch steps
2020-03-14 12:05:04 -07:00
Stephanie Wang
53549314c5
[core] Option to fallback to LRU on OutOfMemory ()
* Add a test for LRU fallback

* Update error message

* Upgrade arrow to master

* Integrate with arrow

* Revert "Bazel mirrors ()"

This reverts commit 44aded5272.

* Don't LRU evict

* Revert "Revert "Bazel mirrors ()""

This reverts commit b6359fea78d1bd3925452ca88ac71e0c9e5c7dd3.

* Add lru_evict flag

* fix internal config

* Fix

* upgrade arrow

* debug

* Set free period in config for lru_evict, override max retries to fix test

* Fix test?

* fix test

* Revert "debug"

This reverts commit 98f01c63a267f38218f5047b1866e4c1c8280017.

* fix exception str

* Fix ref count test

* Shorten travis test?
2020-03-14 11:28:43 -07:00
Eric Liang
52cf77f5a9
[rllib] SAC no_done_at_end should default to False ()
* update

* update doc

* stochastic

* cleanup
2020-03-14 11:16:54 -07:00
Eric Liang
c3a8ba399f
[rllib] Enable distributed exec api for A2C, A3C, PG by default () 2020-03-13 18:48:41 -07:00
Anthony Yu
094125cf03
[tune] Dragonfly integration ask tell nit ()
* Add sample example

* Copy relevant lines of ask from inherited Optimizer

* Ignore strategy

* Additional changes

* Add DragonflySearch for tune connector for Dragonfly

* Add example and fix small errors

* lint

* Remove skopt references

* Update example based off of Dragonfly changes

* Edit example for final Dragonfly edits

* Formatting and documentation edits

* Add documentation and add to test pipeline

* Address PR comments

* Fix Jenkins test

* Adjust Dragonfly to PR#7366

* Lint

* fix_tests

* Minor changes to ordering

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-13 15:27:03 -07:00
Qing Wang
d6365c2586
[Java] Enable stress test. () 2020-03-13 21:02:13 +08:00
Kai Yang
d6e8f47065
Add a flag to disable reconstruction for a killed actor () 2020-03-13 19:10:21 +08:00
Qing Wang
575c89cf47
[Java] Pass large object by reference () 2020-03-13 18:38:03 +08:00
Sven Mika
552cfb37ea
[RLlib] Fix bugs and speed up SegmentTree 2020-03-13 01:03:07 -07:00
Ujval Misra
6022eb53c4
[tune] Use newest checkpoint in normal operation ()
* Use persistent checkpoint for failures

* Fix test

* Add unpause test

* move test

* Fix tests

* remove debug statement

* Mark test as flaky
2020-03-12 22:21:42 -07:00
Qing Wang
f4656d8cc3
[Java] Enable direct call by default. ()
* WIP

* Address comments.

* Linting

* Fix

* Fix

* Fix test

* Fix

* Fix single process ci

* Fix ut

* Update java/test/src/main/java/org/ray/api/test/PlasmaFreeTest.java

* Address comments

* Fix linting

* Minor update comments.

* Fix streaming CI
2020-03-13 12:25:30 +08:00
Tianyi Chen
6993a471f1
[Streaming] Move resource-manager and scheduler to master package. () 2020-03-13 12:24:37 +08:00
micafan
cc91ed57dc
[core] Fix losing task state when giving up forward task. ()
* fix NodeManager::Forward task bug on error

* fix lint

* revert spillback task forward
2020-03-13 11:49:44 +08:00
Edward Oakes
768d0b3b3f
Allocate a buffer of 100 calls for each RPC handler () 2020-03-12 12:05:30 -07:00
Sven Mika
f165766813
[RLlib] Bug: If trainer config horizon is provided, should try to increase env steps to that value. () 2020-03-12 11:03:37 -07:00
Sven Mika
80d314ae5e
[RLlib] Add all agents to rllib rollout tests. () 2020-03-12 11:02:51 -07:00
ZhuSenlin
b663bc6d67
Use gcs server to replace raylet monitor when RAY_GCS_SERVICE_ENABLED=true () 2020-03-12 22:13:56 +08:00
fangfengbin
428fb79b27
Fix streaming compile bug ()
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-03-12 17:26:45 +08:00
Eric Liang
f5d12a958b
[rllib] Port Ape-X to distributed execution API () 2020-03-12 00:54:08 -07:00
fangfengbin
4c834b9d68
Fix the issue that gcs service client ignores error status code ()
* add gcs reply status

* rebase master

* use macro to simplify

* convert status in gcs rpc client

* define a Status message in protobuf

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-03-12 15:08:29 +08:00
Sven Mika
20ef4a8603
[RLlib] Cleanup/unify all test cases. () 2020-03-11 20:39:47 -07:00
Sven Mika
dded5b6d22
[RLlib] ES env_config is not a EnvContext object (e.g. does not contain worker_index). () 2020-03-11 20:33:20 -07:00
Sven Mika
bc120730e5
[RLlib] PPO(torch) on CartPole not tuned well enough for consistent learning () 2020-03-11 20:31:27 -07:00
Kai Yang
932a749fa9
Fix the java_worker_options parameter ()
* fix Java CI

* Minor fix

* move json.loads out of build_java_worker_command

* lint

* fix cross language test
2020-03-12 10:44:23 +08:00
Markus Cozowicz
ba1b081477
Azure Portal cluster deployment | Support spot instances ()
* added priority option

* added head node priority

* upgrade api version
2020-03-11 18:46:11 -07:00
Simon Mo
31d63d3ca7
Fix global state actors() call () 2020-03-11 16:59:50 -07:00
Richard Liaw
b38ed4be71
[raysgd] Fix More Docs () 2020-03-11 14:17:47 -07:00
Richard Liaw
d046faeb9c
[sgd] Readme fix ()
* readme fix

* replicas
2020-03-11 13:40:18 -07:00
Richard Liaw
b70f31339c
[sgd] Benchmark Fixes ()
* fix

* fix
2020-03-11 13:08:27 -07:00
Markus Cozowicz
ea99063c10
added json schema to setup.py () 2020-03-11 09:53:21 -07:00
mehrdadn
3b9caa98ba
Fix fate-sharing warning ()
* Fix kernel_fate_sharing being None instead of False

* Remove fate-sharing warning

Co-authored-by: Mehrdad <noreply@github.com>
2020-03-11 08:27:54 -07:00
Richard Liaw
fbac256982
[sgd] Add benchmarks ()
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* save

* failures

* fixes

* trainer

* run test

* operator

* code

* op

* ok done

* operator

* sgd test fixes

* ok

* trainer

* format

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update doc/source/raysgd/raysgd_pytorch.rst

* docstring

* dcgan

* doc

* commits

* nit

* testing

* revert

* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* benchmarks

* rename

* remove some args

* better metrics output

* fix up the benchmark

* benchmark-yaml

* horovod-benchmark

* benchmarks

* Remove benchmark code for cleanups

* benchmark-code

* nits

* benchmark yamls

* benchmark yaml

* ok

* ok

* ok

* benchmark

* nit

* finish_bench

* makedatacreator

* relax

* metrics

* autosetsampler

* profile

* movements

* OK

* smoothen

* fix

* nitdocs

* loss

* envflag

* comments

* nit

* format

* visible

* images

* move_images

* fix

* rerender

* rerender

* rest

* multgpu

* fix

* nit

* finish

* extra

* setup

* revert

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-11 01:09:08 -07:00
Markus Cozowicz
49439611f1
[autoscaler] Replace cluster yaml validation with json schema v… ()
* replace manual cluster yaml validation with json schema
- improved error message
- support for intellisense in VSCode (or other IDEs)
- run linting
- moved schema to ray/autoscaler
- fixed typo
- remove importlib dependency

* Update python/ray/autoscaler/autoscaler.py

* read

* restrict allowed properties

* added unit test for invalid yaml
added ray[test] package (remove pytest from default dependencies)

* updated autoscaler test to use ValidationError exception

* add missing dependency

* added pytest

* removed parameterized dependency
reverted ray[test] intro

* removed parameterized

* fix_tests

* format

Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-10 18:58:55 -07:00
Richard Liaw
6163b21458
[raysgd] Better user errors! ()
* format

* callable

* Update python/ray/util/sgd/torch/torch_trainer.py

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update python/ray/util/sgd/torch/torch_trainer.py

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* data

* torchtrainer

* num_rep

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-03-10 18:58:19 -07:00
Edward Oakes
7b609ca211
Remove instances of 'raise Exception' () 2020-03-10 17:51:22 -07:00
Stephanie Wang
fdb528514b
[core] Ref counting for actor handles ()
* tmp

* Move Exit handler into CoreWorker, exit once owner's ref count goes to 0

* fix build

* Remove __ray_terminate__ and add test case for distributed ref counting

* lint

* Remove unused

* Fixes for detached actor, duplicate actor handles

* Remove unused

* Remove creation return ID

* Remove ObjectIDs from python, set references in CoreWorker

* Fix crash

* Fix memory crash

* Fix tests

* fix

* fixes

* fix tests

* fix java build

* fix build

* fix

* check status

* check status
2020-03-10 17:45:07 -07:00
Edward Oakes
119a303ea0
Remove static concurrency limit from gRPC server () 2020-03-10 16:27:02 -07:00
Edward Oakes
dbbf0c0e70
Add Apache 2 license to C++ files () 2020-03-10 16:07:17 -07:00
Eric Liang
be48e1964b
[rllib] Fix per-worker exploration in Ape-X; make more kwargs required for future safety ()
* fix sched

* lint

* lint

* fix

* add unit test

* fix

* format

* fix test

* fix test
2020-03-10 11:14:14 -07:00
Richard Liaw
d192ef0611
[raysgd] Cleanup User API ()
* Init fp16

* fp16 and schedulers

* scheduler linking and fp16

* to fp16

* loss scaling and documentation

* more documentation

* add tests, refactor config

* moredocs

* more docs

* fix logo, add test mode, add fp16 flag

* fix tests

* fix scheduler

* fix apex

* improve safety

* fix tests

* fix tests

* remove pin memory default

* rm

* fix

* Update doc/examples/doc_code/raysgd_torch_signatures.py

* fix

* migrate changes from other PR

* ok thanks

* pass

* signatures

* lint

* Update python/ray/experimental/sgd/pytorch/utils.py

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* should address most comments

* comments

* fix this ci

* first_pass

* add overrides

* override

* fixing up operators

* format

* sgd

* constants

* rm

* revert

* save

* failures

* fixes

* trainer

* run test

* operator

* code

* op

* ok done

* operator

* sgd test fixes

* ok

* trainer

* format

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update doc/source/raysgd/raysgd_pytorch.rst

* docstring

* dcgan

* doc

* commits

* nit

* testing

* revert

* Start renaming pytorch to torch

* Rename PyTorchTrainer to TorchTrainer

* Rename PyTorch runners to Torch runners

* Finish renaming API

* Rename to torch in tests

* Finish renaming docs + tests

* Run format + fix DeprecationWarning

* fix

* move tests up

* benchmarks

* rename

* remove some args

* better metrics output

* fix up the benchmark

* benchmark-yaml

* horovod-benchmark

* benchmarks

* Remove benchmark code for cleanups

* makedatacreator

* relax

* metrics

* autosetsampler

* profile

* movements

* OK

* smoothen

* fix

* nitdocs

* loss

* comments

* fix

* fix

* runner_tests

* codes

* example

* fix_test

* fix

* tests

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Maksim Smolin <maximsmol@gmail.com>
2020-03-10 08:41:42 -07:00
Anthony Yu
89ec4adb72
[tune] Dragonfly Optimizer ()
* Add sample example

* Copy relevant lines of ask from inherited Optimizer

* Ignore strategy

* Additional changes

* Add DragonflySearch for tune connector for Dragonfly

* Add example and fix small errors

* lint

* Remove skopt references

* Update example based off of Dragonfly changes

* Edit example for final Dragonfly edits

* Formatting and documentation edits

* Add documentation and add to test pipeline

* Address PR comments

* Fix Jenkins test

* Adjust Dragonfly to PR#7366

* Lint

* fix_tests

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-10 08:40:36 -07:00