Sven Mika
9a8ca6a69d
[RLlib] Fix Atari learning test regressions (2 bugs) and 1 minor attention net bug. ( #18306 )
2021-09-03 13:29:57 +02:00
Kai Fricke
fb38d06cfb
Move RLLib GPU release test dependencies to ml docker ( #18208 )
2021-09-03 09:35:18 +01:00
gjoliver
336e79956a
[RLlib] Make MultiAgentEnv inherit gym.Env to avoid direct class type manipulation ( #18156 )
2021-09-03 08:02:05 +02:00
qicosmos
72739462a9
[C++ Worker]Add some api of placement group part1. ( #17925 )
...
* linkopts shared
* add some pg api
* add Wait for PlacementGroup
2021-09-03 13:32:28 +08:00
Alex Wu
fa961032e1
[workflow] object ref integration ( #18128 )
...
* notes
* notes
* .
* seems to work?
* .
* seems to work
* needs tests
* needs tests
* parallelize uploads
* fixed
* fixed
* .
* dumb test
* .
* .
* fix festsg
* .
* works
* .:
* .
* .
* Update common.py
Co-authored-by: Alex Wu <alex@anyscale.com>
2021-09-02 19:59:45 -07:00
SangBin Cho
814095add6
Revert "Change instance type for some tests ( #18248 )" ( #18320 )
...
This reverts commit 34026a7bd5.
2021-09-02 17:45:02 -07:00
Amog Kamsetty
40b6d765df
[SGD] v2 tune checkpointing ( #18179 )
...
* wip
* wip
* wip
* wip
* fix test
* finish
* fix failing tests
* address comments
* wip
* address comments
* update
* fix
* fix fault tolerance checkpoint id
* lint
* updates
* updates
* add test
* updates
* update
* Update python/ray/util/sgd/v2/trainer.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update python/ray/util/sgd/v2/trainer.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update python/ray/util/sgd/v2/trainer.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update python/ray/util/sgd/v2/backends/backend.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update python/ray/util/sgd/v2/backends/backend.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update python/ray/util/sgd/v2/backends/backend.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* lint
* fix
* fix test
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-09-02 17:44:37 -07:00
Jiajun Yao
d9538a958b
Avoid duplicate exports of functions ( #18284 )
2021-09-02 17:36:52 -07:00
Eric Liang
7dcae690b9
Mark datasets as still in alpha for now ( #18321 )
2021-09-02 17:07:33 -07:00
SangBin Cho
9b9eae1e86
Change misleading documentation from the placement group ( #18257 )
...
* Modify a doc
* completed
2021-09-02 16:40:48 -07:00
wanxing
60f84fa051
Abstract plasma store get request queue ( #18064 )
...
* begin
* build
* add test
* add first test
* add test
* fix build
* lint bazel
* fix build
* fix build
* fix crash
* fix some comment
* revert shared_ptr ObjectLifecycleManager
* fix RemoveGetRequest lost
* no defer
* fix lots of comments
* fix build
* fix data race
* fix comments
* Revert "fix data race"
This reverts commit 8f58e3d70b73af864566e056211ff1b70cab870c.
* refine
* fix mac build
* fix unit test
* fix unit test
2021-09-02 14:16:50 -07:00
Edward Oakes
549a8fa948
[runtime_env] [ray_client] Remove PrepRuntimeEnv RPC, upload working_dir before calling ray.init in server ( #18240 )
2021-09-02 14:02:39 -05:00
Antoni Baum
4c95ea6d0a
[client] Improve Ray Client connection timeout information ( #18281 )
...
* Improve Ray Client connection timeout information
* fix lint issue.
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
2021-09-02 16:34:11 +03:00
Sven Mika
2357bbc0c8
[RLlib] Issue 18231: Better (earlier) env validation and error message improvement. ( #18249 )
2021-09-02 09:28:16 +02:00
gjoliver
6621bb5611
[RLlib] Minor renaming and cleanups related to last rollout worker seed fix. ( #18155 )
2021-09-02 06:57:46 +02:00
xwjiang2010
9fa7951171
[core] Log once when get_gpu_ids is called on driver. ( #18282 )
2021-09-01 16:47:00 -07:00
Stephanie Wang
d43d297d9a
[core] Attach call site to ObjectRefs, print on error ( #17971 )
...
* Attach call site to ObjectRef
* flag
* Fix build
* build
* build
* build
* x
* x
* skip on windows
* lint
2021-09-01 15:29:05 -07:00
Yi Cheng
d470e679df
[core] Add some mock headers for ray core ( #18265 )
...
* up
* up
* up
* format
* up
* up
* format
2021-09-01 13:04:35 -07:00
Chris K. W
1a10108765
[core] Release function actor lock while waiting for actor class to be loaded by import thread ( #18175 )
2021-09-01 12:59:48 -07:00
Sven Mika
a7670d9fab
[RLlib; Testing] Fix smoke-test settings for nightly learning_tests and stress_test; Add pybullet_envs to app-config. ( #18274 )
2021-09-01 21:46:06 +02:00
Amog Kamsetty
9c2e7ffd97
[SGD] v2 Fault Tolerance ( #18090 )
...
* wip
* wip
* wip
* wip
* update
* finish
* remove
* fix
* update
* update
* update comment
* handle backend failures
* bump test timeout
* address comments
* fix
* fix
* address comments
* formatting
* add comment
* address comment
* fix failing test
* update error message
* Update python/ray/util/sgd/v2/trainer.py
* wip
* fix failing test
* formatting
* fix
2021-09-01 12:43:10 -07:00
Edward Oakes
0326bbb30a
[serve] Skip test_standalone namespace test on windows ( #18277 )
2021-09-01 12:58:59 -05:00
Jiajun Yao
fbb3ac6a86
Retry application-level errors ( #18176 )
...
* Retry application-level errors
* Retry application-level errors
* Push retry message to the driver
2021-09-01 10:53:06 -07:00
Edward Oakes
673bf35c1f
Refactor BackendState to be per-backend instead of global ( #18255 )
2021-09-01 09:46:22 -05:00
mwtian
be50c13251
[Client] Use a single RPC to fetch ClientObjectRefs passed in a list ( #16944 )
2021-08-31 16:31:13 -07:00
Edward Oakes
5d122cf7b7
[runtime_env] Move working dir setup to the agent ( #18170 )
2021-08-31 17:22:49 -05:00
Guyang Song
be772df4dc
[Event] Add some error level events ( #18118 )
...
* add event 'RAY_WORKER_FAILURE' and 'RAY_DRIVER_FAILURE'
* add some events
* move event 'EL_RAY_NODE_REMOVED' to 'RemoveNode()'
2021-08-31 14:15:13 -07:00
Sven Mika
82465f9342
[RLlib] Better PolicyServer example (w/ or w/o tune) and add printing out actual listen port address in log-level=INFO. ( #18254 )
2021-08-31 22:03:23 +02:00
matthewdeng
a3123b6860
[SGD] v2 Horovod backend ( #18047 )
...
* [SGD] add Horovod backend
* address comments: set CUDA_VISIBLE_DEVICES, refactor code
* fix gpu test
* fix lint/test import
* address comments, add example cluster config
* delay horovod imports
2021-08-31 12:54:59 -07:00
Wesley Gifford
6133a561e9
Dataset from modin ( #18122 )
2021-08-31 11:19:35 -07:00
Nikita Vemuri
c5b99ab590
[serve] Start RayInternalKVStore in controller namespace ( #18164 )
2021-08-31 13:09:33 -05:00
Edward Oakes
17dded543c
Support passing gcs_client to internal_kv ( #18235 )
2021-08-31 12:46:41 -05:00
xwjiang2010
63f00843f3
[Tune] Inform users of the setup needed for uploading results to cloud. ( #18220 )
2021-08-31 10:27:50 -07:00
mwtian
134ac0ef55
[CI] Fix clang-format to always compare against master ( #18140 )
2021-08-31 10:16:33 -07:00
SangBin Cho
34026a7bd5
Change instance type for some tests ( #18248 )
2021-08-31 10:10:46 -07:00
SangBin Cho
d240d26525
[Object Spilling] Fix a bug where object url is empty. ( #18193 )
...
* Fix a bug
* Addressed code review.
* Fix a test
2021-08-31 10:10:28 -07:00
Antoni Baum
2c0dcec18f
[test] Fix golden notebook tests always failing ( #17873 )
2021-08-31 17:07:47 +02:00
Ryan L. Melvin
c081c68de7
[tune] Conditional search space example using hyperopt ( #18130 )
...
Co-authored-by: Ryan Melvin <rmelvin@uabmc.edu>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2021-08-31 17:06:22 +02:00
Kai Fricke
a8dbc44f9a
[ci] minimal dependency install test ( #18071 )
2021-08-31 15:26:25 +02:00
Sven Mika
599e589481
[RLlib] Move existing fake multi-GPU learning tests into separate buildkite job. ( #18065 )
2021-08-31 14:56:53 +02:00
Sven Mika
4888d7c9af
[RLlib] Replay buffers: Add config option to store contents in checkpoints. ( #17999 )
2021-08-31 12:21:49 +02:00
Kai Fricke
012f9eb687
[buildkite] Fix jar upload directory ( #18253 )
2021-08-31 11:18:34 +02:00
Simon Mo
2e0b816d64
[Buildkite] Upload jars to os specific dir ( #18229 )
2021-08-31 09:32:01 +02:00
SangBin Cho
eab506cc37
[Test] Disable non streaming shuffle 5000 partitions ( #18224 )
...
* Disable non streaming shuffle 5000 partitions
* increase timeout for 5000 partition shuffle
2021-08-31 00:28:15 -07:00
Chen Shen
5f3ec7634b
Fix off by one test bug ( #18239 )
2021-08-31 00:07:03 -07:00
Clark Zinzow
e154f87cab
Added split_at_indices to DatasetPipeline. ( #18243 )
2021-08-31 00:06:35 -07:00
Eric Liang
db9b5f142d
Disable worker logs temporarily during driver breakpoints ( #18192 )
2021-08-30 20:26:16 -07:00
Stephanie Wang
8e06db7280
Revert "[Core] revert: revert Unified worker starter ( #18008 )" ( #18228 )
...
This reverts commit b9978dd02b.
2021-08-30 17:28:41 -07:00
Tim Hopper
fd2a8a6b9c
[docs] Fix broken urls ( #18206 )
2021-08-30 17:24:06 -07:00
Yi Cheng
7a65815108
[workflow] Defer input preparation until run ( #18225 )
2021-08-30 16:37:34 -07:00