mirror of https://github.com/vale981/ray synced 2025-04-05 09:49:11 -04:00
Commit graph

13756 commits

Author SHA1 Message Date
Richard Liaw
81d311031b
[tune] Update API Reference Page ()
* widerdocs

* init

* docs

* fix

* moveit

* mix

* better_docs

* remove

* Apply suggestions from code review

Co-Authored-By: Sven Mika <sven@anyscale.io>

Co-authored-by: Sven Mika <sven@anyscale.io>
2020-03-22 16:42:20 -07:00
Eric Liang
288933ec6b
[rllib] Fix shared metrics context in parallel iterators ()
* debug

* build

* update

* wip

* wpi

* update

* recursive sync

* comment

* stream

* fix

* Update .travis.yml
2020-03-22 14:15:01 -07:00
Sven Mika
2fb219a658
[Ray RLlib] Fix tree import ()
* Rollback.

* Fix import tree error by adding meaningful error and replacing by tf.nest wherever possible.

* LINT.

* LINT.

* Fix.

* Fix log-likelihood test case failing on travis.
2020-03-22 13:51:24 -07:00
Eric Liang
86f89fc3b3
[tune] Higher timeout for progress reporter test ()
* wip

* medium size
2020-03-22 13:47:08 -07:00
Stephanie Wang
ba86a02b37
[core] Revert lineage pinning () ()
* Revert "fix ()"

This reverts commit 6a12a31b2e.

* Revert "[core] Pin lineage of plasma objects that are still in scope ()"

This reverts commit 014929e658.
2020-03-21 18:35:43 -07:00
Simon Mo
89d959fd6a
Stop gap solution for cython functions breaking in memory monitor () 2020-03-21 15:16:12 -07:00
Zhijun Fu
a7a5d172b1
[core] fix bug that actor tasks from reconstructed actor is ignored by scheduling queue () 2020-03-21 13:05:24 +08:00
SangBin Cho
1b90196bef
[doc] Dashboard documentation ()
* Completed the first half of dashboard documentation.

* Dashboard document initial versions.

* Formatting.

* Fixed tune note is not visible.

* Half of comments from code review are handled.

* Fixed based on code review.

* Improved memory usage page.

* Addressed code review.

* Fixed image not found issue.

* Add gitkeep again.

* Refactored document.

* Addressed Robert's feedback.

* Addressed code reviews.

* Addressed last comments.

* Update doc/source/ray-dashboard.rst

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-20 22:00:33 -07:00
Stephanie Wang
6a12a31b2e
fix () 2020-03-20 18:53:28 -07:00
Edward Oakes
ec50037ee1
Use go1.12 in lint build () 2020-03-20 14:52:41 -07:00
Edward Oakes
31845f17a5
[docs] Add documentation for reference counting and 'ray memory' () 2020-03-20 15:47:00 -05:00
Edward Oakes
58dc70f90e
[minor] Remove get_global_worker(), RuntimeContext () 2020-03-20 15:45:29 -05:00
Eric Liang
7ebc6783e4
[rllib] Add back get_policy_output method for SAC model () 2020-03-20 12:44:04 -07:00
Eric Liang
9392cdbf74
[rllib] Add high-performance external application connector () 2020-03-20 12:43:57 -07:00
Stephanie Wang
014929e658
[core] Pin lineage of plasma objects that are still in scope ()
* Add a lineage_ref_count to References

* Refactor TaskManager to store TaskEntry as a struct

* Refactor to fix deadlock between TaskManager and ReferenceCounter
Add references to task specs

* Pin TaskEntries and References in the lineage of any ObjectIDs in scope

* Fix deadlock, convert num_plasma_returns to a set of object IDs

* fix unit tests

* Feature flag

* Do not release lineage for objects that were promoted to plasma

* fix build

* fix build

* Remove num executions

* Simplify num return values

* Remove unused

* doc

* Set num returns

* Move lineage pinning flag to ReferenceCounter

* comments

* Fixes

* Remove irrelevant test (replaced by ref counting tests)
2020-03-20 10:56:43 -07:00
fyrestone
a1ae935839
Java call Python use structured function descriptors () 2020-03-20 17:29:45 +08:00
ZhuSenlin
7d08b418fc
fix test_worker_stats ()
* fix test_worker_stats

* fix lint error

* fix lint error

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>
2020-03-20 14:53:40 +08:00
mehrdadn
e69664b74b
Miscellaneous Windows compatibility bugfixes ()
* Windows compatibility bug fixes

* Use WSASend/WSARecv as WSASendMsg/WSARecvMsg do not work with TCP sockets

* Clean up some TODOs

* Fix duplicate compilations

* RedisAsioClient boost::asio::error::connection_reset

Co-authored-by: Mehrdad <noreply@github.com>
2020-03-19 19:32:53 -07:00
Stephanie Wang
c7cae036c3
[core] Only drain references for non-actor workers on shutdown ()
* Only drain ref counter for non-actor tasks

* Don't force kill actors that have gone out of scope
2020-03-19 18:46:16 -07:00
Eric Liang
5a112ab212
Remove object store memory cap () 2020-03-19 16:00:30 -07:00
Clark Zinzow
c37f6e745a
Remove duplicate jsonschema from setup.py () 2020-03-19 13:12:47 -07:00
Edward Oakes
90b553ed05
[operator] Use headless service for head node () 2020-03-19 10:31:56 -05:00
Edward Oakes
c78b52b5b2
Set RayCluster as service owner () 2020-03-19 10:30:44 -05:00
fangfengbin
0d0a41f598
[GCS]Tie lifecycle of gcs service and redis together () 2020-03-19 19:52:35 +08:00
Stephanie Wang
b499100a88
Enable distributed ref counting by default ()
* enable

* Turn on eager eviction

* Shorten tests and drain ReferenceCounter

* Don't force kill actor handles that have gone out of scope, lint

* Fix locks

* Cleanup Plasma Async Callback ()

* [rllib][tune] fix some nans ()

* Change /tmp to platform-specific temporary directory ()

* [Serve] UI Improvements ()

* bugfix about test_dynres.py ()

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>

* Java call Python actor method use actor.call ()

* bug fix about usage of absl::flat_hash_map::erase and absl::flat_hash_set::erase ()

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>

* [Java] Make both `RayActor` and `RayPyActor` inheriting from `BaseActor` ()

* [Java] Fix the issue that the cached value in `RayObject` is serialized ()

* Add failure tests to test_reference_counting ()

* Fix typo in asyncio documentation ()

* Fix segfault

* debug

* Force kill actor

* Fix test
2020-03-18 22:39:21 -07:00
fangfengbin
fca9dc73e1
Fix test_raylet_pending_tasks test case failed () 2020-03-19 11:09:38 +08:00
Seung Hyeon, Kim
ee49f4a875
[tune] Fix an example for _Brackets of async hyperband scheduler () 2020-03-18 19:06:32 -07:00
Stephanie Wang
35a4bfc885
[core] Fix leak for subscribing to object dependencies in NodeManager ()
* Fix GetDependencies

* lint
2020-03-18 11:01:29 -07:00
Richard Liaw
ea10cd212c
[tune] add accessible trial_info ()
* add accessible trial_info

* trial name and info

* doc

* fix
gp

* Update doc/source/tune-package-ref.rst

* Apply suggestions from code review

* fix

* trial

* fixtest

* testfix
2020-03-17 23:44:18 -07:00
Eric Liang
745b9d643d
First pass at ray memory command for memory debugging () 2020-03-17 20:45:07 -07:00
Landcold7
e6a045df48
Fix typo in asyncio documentation () 2020-03-17 10:37:37 -05:00
Edward Oakes
c1b0f9ccdf
Add failure tests to test_reference_counting () 2020-03-17 10:30:21 -05:00
Hao Chen
7678418210
[Java] Fix the issue that the cached value in RayObject is serialized () 2020-03-17 22:07:41 +08:00
Kai Yang
6b888b0247
[Java] Make both RayActor and RayPyActor inheriting from BaseActor () 2020-03-17 21:45:56 +08:00
ZhuSenlin
dfa5d9b8e9
bug fix about usage of absl::flat_hash_map::erase and absl::flat_hash_set::erase ()
Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>
2020-03-17 19:39:56 +08:00
fyrestone
7697ea2be2
Java call Python actor method use actor.call () 2020-03-17 14:52:43 +08:00
ZhuSenlin
ffa9df4683
bugfix about test_dynres.py ()
Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>
2020-03-17 13:58:44 +08:00
Simon Mo
ce0885a897
[Serve] UI Improvements () 2020-03-16 22:23:16 -07:00
mehrdadn
a0700e2f86
Change /tmp to platform-specific temporary directory () 2020-03-16 18:10:14 -07:00
Eric Liang
797e6cfc2a
[rllib][tune] fix some nans () 2020-03-16 11:19:58 -07:00
ijrsvt
46953c53b1
Cleanup Plasma Async Callback () 2020-03-16 10:12:44 -07:00
Simon Mo
45ce40e5d4
Disable Travis Disk Cache ()
There are some file sizes and memory issue with bazel disk cache
we will disable the cache and use remote cache exclusively for now
2020-03-16 00:18:01 -07:00
Scott Graham
37e4d29f87
[autoscaler] Adding Azure Support ()
* adding directory and node_provider entry for azure autoscaler

* adding initial cut at azure autoscaler functionality, needs testing and node_provider methods need updating

* adding todos and switching to auth file for service principal authentication

* adding role / scope to service principal

* resolving issues with app credentials

* adding retry for setting service principal role

* typo and adding retry to nic creation

* adding nsg to config, moving nic/public ip to node provider, cleanup node_provider, leaving in NodeProvider stub for testing

* linting

* updating cleanup and fixing bugs

* adding directory and node_provider entry for azure autoscaler

* adding initial cut at azure autoscaler functionality, needs testing and node_provider methods need updating

* adding todos and switching to auth file for service principal authentication

* adding role / scope to service principal

* resolving issues with app credentials

* adding retry for setting service principal role

* typo and adding retry to nic creation

* adding nsg to config, moving nic/public ip to node provider, cleanup node_provider, leaving in NodeProvider stub for testing

* linting

* updating cleanup and fixing bugs

* minor fixes

* first working version :)

* added tag support

* added msi identity intermediate

* enable MSI through user managed identity

* updated schema

* extend yaml schema
remove service principal code
add re-use of managed user identity

* fix rg_id

* fix logging

* replace manual cluster yaml validation with json schema
- improved error message
- support for intellisense in VSCode (or other IDEs)

* run linting

* updating yaml configs and formatting

* updating yaml configs and formatting

* typo in example config

* pulling default config from example-full

* resetting min, init worker prop

* adding docs for azure autoscaler and fixing status

* add azure to docs, fix config for spot instances, update azure provider to avoid caching issues during deployment

* fix for default subscription in azure node provider

* vm dev image build

* minor change

* keeping example-full.yaml in autoscaler/azure, updating azure example config

* linting azure config

* extending retries on azure config

* lint

* support for internal ips, fix to azure docs, and new azure gpu example config

* linting

* Update python/ray/autoscaler/azure/node_provider.py

Co-Authored-By: Richard Liaw <rliaw@berkeley.edu>

* revert_this

* remove_schema

* updating configs and removing ssh keygen, tweak azure node provider terminate

* minor tweaks

Co-authored-by: Markus Cozowicz <marcozo@microsoft.com>
Co-authored-by: Ubuntu <marcozo@mc-ray-jumpbox.chcbtljllnieveqhw3e4c1ducc.xx.internal.cloudapp.net>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-03-15 14:48:27 -07:00
Simon Mo
3f1fcaa024
Blocking ray.get/wait inside async context will warn instead of error () 2020-03-14 22:02:30 -07:00
fangfengbin
6b37be9677
[GCS]Add job id when operating gcs table () 2020-03-15 12:04:04 +08:00
Kai Yang
630e48967d
[Java] Allow passing internal config from raylet to Java worker () 2020-03-15 12:03:38 +08:00
mehrdadn
a87199d240
Fix cyclic dependency between ray/util and ray/common ()
* Fix cyclic dependency

Headers in ray/util should not depend on those in ray/common

* Move random generations to ray/common/test_util.h

* Add license header

Co-authored-by: Mehrdad <noreply@github.com>
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2020-03-14 12:44:53 -07:00
tison
ffeab5d2bf
Support configurable python executable in format.sh () 2020-03-14 12:27:41 -07:00
Eric Liang
dd70720578
[rllib] Rename sample_batch_size => rollout_fragment_length ()
* bulk rename

* deprecation warn

* update doc

* update fig

* line length

* rename

* make pytest compatible

* fix test

* fi sys

* rename

* wip

* fix more

* lint

* update svg

* comments

* lint

* fix use of batch steps
2020-03-14 12:05:04 -07:00
Stephanie Wang
53549314c5
[core] Option to fallback to LRU on OutOfMemory ()
* Add a test for LRU fallback

* Update error message

* Upgrade arrow to master

* Integrate with arrow

* Revert "Bazel mirrors ()"

This reverts commit 44aded5272.

* Don't LRU evict

* Revert "Revert "Bazel mirrors ()""

This reverts commit b6359fea78d1bd3925452ca88ac71e0c9e5c7dd3.

* Add lru_evict flag

* fix internal config

* Fix

* upgrade arrow

* debug

* Set free period in config for lru_evict, override max retries to fix
test

* Fix test?

* fix test

* Revert "debug"

This reverts commit 98f01c63a267f38218f5047b1866e4c1c8280017.

* fix exception str

* Fix ref count test

* Shorten travis test?
2020-03-14 11:28:43 -07:00