Commit graph

6594 commits

Author SHA1 Message Date
Kai Fricke
2b60c5774b
[tune] cache checkpoint serialization (#12064) 2020-11-18 09:03:53 -08:00
Sven Mika
6da4342822
[RLlib] Add on_learn_on_batch (Policy) callback to DefaultCallbacks. (#12070) 2020-11-18 15:39:23 +01:00
dHannasch
b41f4fdec2
Extract the connection logic to reduce duplication. (#12016) 2020-11-18 00:12:58 -08:00
Ian Rodney
d23d326560
[CLI] Fix ray commands when RAY_ADDRESS used (#11989)
* [CLI] Fix ray commands when RAY_ADDRESS used

* Eric's suggestion
2020-11-17 23:44:59 -08:00
fangfengbin
d87af0da88
[PlacementGroup]Add gcs placement group manager debug info (#12061)
Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-11-18 11:15:38 +08:00
Philipp Moritz
b96516e9d3
[core] Remove google dependency (#12085) 2020-11-17 19:01:00 -08:00
fangfengbin
f400333841
[Placement Group]Placement Group supports gcs failover(Part2) (#12003)
* add testcase

* fix ut

* fix review comment

* fix review comment

* fix review comments

* fix ut bug

* add part code

* add part code

* add part code

* add testcase

* add part code

* fix ut bug

* fix ut timeout bug

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-11-18 10:59:26 +08:00
Simon Mo
c476037c97
[Core] Async API should raise on all RayError (#12043)
Before this PR we raised only RayTaskError, which means errors
like RayActorError (Actor Died) were not propagated and thrown at
`await object_ref`. This PR fixes that.
2020-11-17 17:20:30 -08:00
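The behavior change described in the commit above can be illustrated with a small, self-contained sketch. It does not use Ray itself: the exception classes below are local stand-ins that only borrow the names from the commit message, and `complete_future`, `await_result`, `outcome`, and the `narrow` flag are hypothetical helpers invented for illustration. With `narrow=True` only `RayTaskError` is raised at the `await`; with the default, any `RayError` subclass is raised.

```python
import asyncio


# Local stand-ins for the exception names in the commit message (not Ray's classes).
class RayError(Exception):
    pass


class RayTaskError(RayError):
    pass


class RayActorError(RayError):
    pass


def complete_future(future, result, narrow=False):
    """Resolve `future`; error objects become exceptions raised at `await`."""
    # narrow=True models the old check (task errors only),
    # narrow=False models the new check (every RayError subclass).
    error_type = RayTaskError if narrow else RayError
    if isinstance(result, error_type):
        future.set_exception(result)
    else:
        future.set_result(result)


async def await_result(result, narrow=False):
    future = asyncio.get_running_loop().create_future()
    complete_future(future, result, narrow)
    return await future


def outcome(result, narrow=False):
    """Report whether awaiting `result` returned a value or raised."""
    try:
        return ("returned", asyncio.run(await_result(result, narrow)))
    except RayError as exc:
        return ("raised", exc)
```

Under the narrow check an actor error comes back as an ordinary return value, which is easy to miss in calling code; under the broad check it surfaces as an exception at the `await`.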
Ameer Haj Ali
e8c018e8fc
[C++ API] tests for the C++ API. (#12076) 2020-11-17 17:07:52 -08:00
Stephanie Wang
f6bdd5ab17
[New Scheduler] Spillback from the queue of tasks assigned to the local node (#12084) 2020-11-17 16:13:59 -08:00
dHannasch
b5dfdb2a21
Log the Redis shard addresses as originally received from the head GCS. (#12011) 2020-11-17 13:11:17 -08:00
dHannasch
010e6cef3f
Allow setting the RAY_BACKEND_LOG_LEVEL to trace. (#12012) 2020-11-17 13:10:23 -08:00
dHannasch
f0dcf01807
Clarify that Ray is not yet retrying to connect. (#12013) 2020-11-17 13:01:42 -08:00
Richard Liaw
ca44222e03
[minor] log info instead of error upon ray.init rerun (#12025) 2020-11-17 12:59:24 -08:00
fangfengbin
7f050c706b
[PlacementGroup]Skip flaky testcase (#12065)
Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-11-17 12:21:34 -08:00
Simon Mo
d7c95a4a90
[Serve] Rewrite Router to be Embeddable (#12019) 2020-11-17 08:28:18 -08:00
Maksim Smolin
23926f3e6e
[CLI] Docker Support (#11761)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-11-17 00:04:39 -08:00
chaokunyang
bea0031491
fix linux wheel build (#9896) 2020-11-17 15:49:42 +08:00
Eric Liang
09d6ea5784
Clarify official releases vs nightly wheels 2020-11-16 23:30:40 -08:00
Amog Kamsetty
f10cef93c7
[sgd] support operator.device (#12056) 2020-11-16 21:44:27 -08:00
Eric Liang
380df89069
Lazily initialize the global state accessor in Python workers (#12054)
* wip

* fix

* fix
2020-11-16 21:35:12 -08:00
DK.Pino
0f9e2fec12
[Placement Group] Add get / get all / remove interface for Placement Group Java api. (#11821)
* add placement group java get/get all interface

* add remove placement group api

* fix some issues like: Placement Group -> placement group

* extract duplicate code to placement group utils

* specify running mode for placement group ut

* update checkGlobalStateAccessorPointerValid -> validateGlobalStateAccessorPointer

* use THROW_EXCEPTION_AND_RETURN_IF_NOT_OK

* update pg log print
2020-11-17 12:32:39 +08:00
Max Fitton
90574b66cc
pin aiohttp to the 3.x.x version (#12051) 2020-11-16 21:54:16 -05:00
Richard Liaw
51d277f2e4
[tests] fix mock for test_cli (#12055)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2020-11-16 18:44:15 -08:00
Tao Wang
d525e61288
[GCS]Open light heartbeat by default (#11968)
* [GCS]Open light heartbeat by default (#11689)

* Add some unit tests
2020-11-16 18:21:47 -08:00
Stephanie Wang
c49554fb7a
Abstract plasma store creation request queue (#12039) 2020-11-16 17:09:15 -08:00
Kai Fricke
9f5986ee58
[tune] logger migration to ExperimentLogger classes (#11984) 2020-11-16 15:08:37 -08:00
Alan Guo
3dc68533a9
make some private rsync, and exec_cluster arguments public (#11958)
* make some private rsync, and exec_cluster arguments public

* fix format issue

* undo make all_nodes public
2020-11-16 14:31:41 -08:00
SongGuyang
df2c2a7ce5
[cpp worker] support pass by reference on cluster mode (#11753) 2020-11-16 14:30:35 -08:00
Ameer Haj Ali
8d599bb3f5
[autoscaler] Move fill out resources to bootstrap config to cache the resources and avoid expensive boto3 calls (#12028) 2020-11-16 13:28:57 -08:00
fyrestone
0c6bb745cd
Fix dashboard agent use incorrect ip (#12038) 2020-11-16 14:02:20 -06:00
SangBin Cho
f56d7c1a76
[Logging] Remove per worker job log file / support worker log rotation (#11927)
* In progress.

* MVP done.

* In Progress.

* Remove unnecessary code.

* Fix some issues.

* Fix test failures.

* Addressed code review + fix object spilling test failure.
2020-11-16 11:29:43 -08:00
Sven Mika
b6b54f1c81
[RLlib] Trajectory view API: enable by default for SAC, DDPG, DQN, SimpleQ (#11827) 2020-11-16 10:54:35 -08:00
Kai Fricke
8609e2dd90
[tune] refactor verbosity levels (#11767)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-11-16 10:32:53 -08:00
Keqiu Hu
a50128079d
[tune/placement group] dist. training placement group support (#11934)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-11-16 01:11:39 -08:00
fangfengbin
8fb926565c
[Placement Group]Placement Group supports gcs failover (Part1) (#11933) 2020-11-16 14:42:56 +08:00
dHannasch
d35de2272d
[Core] Allow redis.ResponseError instead of redis.AuthenticationError (#12024)
* redis.ResponseError

* there really is no way to make this look good, is there
2020-11-15 15:04:56 -08:00
Simon Mo
ac9610b19d
[Autoscaler] Precisely match docker HOME (#12020)
* [Autoscaler] Precisely match docker HOME

The current grep matches any environment variable whose key contains
HOME. This includes unwanted variables such as PYTHONHOME and
PROJECT_HOME. Depending on the order of the environment variables, the
subsequent docker setup command might fail.

* fstring
2020-11-15 11:49:50 -08:00
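The matching problem described in the commit above can be sketched as follows. This is not the autoscaler's code: `loose_matches` and `precise_home` are hypothetical helpers, and the sample environment lines are made up. The sketch only contrasts a substring match on `HOME` with a match anchored at the start of the line.

```python
import re


def loose_matches(env_lines):
    """Substring match: keeps every line that mentions HOME anywhere."""
    return [line for line in env_lines if "HOME" in line]


def precise_home(env_lines):
    """Anchored match: returns the value of the variable named exactly HOME."""
    for line in env_lines:
        # re.match anchors at the start of the line, so PYTHONHOME= and
        # PROJECT_HOME= are rejected.
        found = re.match(r"HOME=(.*)", line)
        if found:
            return found.group(1)
    return None


# Made-up environment listing, ordered so that an unwanted key comes first.
env = ["PYTHONHOME=/opt/python", "HOME=/root", "PROJECT_HOME=/srv/app"]
print(loose_matches(env))
print(precise_home(env))
```

With the loose match, which line comes first depends on the order of the environment, which is why a later command that assumes a single `HOME` value could fail.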
Richard Liaw
8b3f79f307
[tune] refactor and add examples (#11931) 2020-11-14 20:43:28 -08:00
dHannasch
5891759a3e
Clarify get_node_ip_address docstring (#11881) 2020-11-14 15:20:58 -08:00
dHannasch
9fbeefd604
Distinguish a bad --redis-password from any other Redis error (#11893) 2020-11-13 17:39:44 -06:00
Eric Liang
4f5d6274af
[docs] Add links to Ray design patterns whitepaper (#12014)
* update

* update
2020-11-13 14:16:51 -08:00
Edward Oakes
8bcb0bddc9
[serve] Fix API calls in global README (#12015) 2020-11-13 16:05:00 -06:00
dHannasch
effa553077
[Doc] Explain how to know whether RAY_BACKEND_LOG_LEVEL worked (#12010)
* Fix broken link to nonexistent Temporary Files page.

* How to know that RAY_BACKEND_LOG_LEVEL worked.

* Reference the definition of DEBUG in case it changes.
2020-11-13 14:02:57 -08:00
Simon Mo
277558895d
[Serve] Introduce Long Polling (#11905) 2020-11-13 13:17:20 -08:00
Eric Liang
00ef1179c0
[object spilling] Autocreate dir if not exists (#11999) 2020-11-13 12:13:06 -08:00
Ian Rodney
f936ea35fe
[hotfix] Fix ResourceDemandScheduler (#11996)
* [hotfix] Fix ResourceDemandScheduler

* fix test_autoscaler
2020-11-13 00:42:16 -08:00
SangBin Cho
f6f9b15299
. (#11998) 2020-11-12 21:33:00 -08:00
Ian Rodney
3b56a1a522
[docker] auto-populate shared memory size (#11953) 2020-11-12 17:22:42 -08:00
Michael Luo
59bc1e6c09
[RLLib] MAML extension for all models except RNNs (#11337) 2020-11-12 16:51:40 -08:00