Commit graph

9434 commits

Author SHA1 Message Date
Sven Mika
a772c775cd
[RLlib] Set random seed (if provided) to Trainer process as well. (#18307) 2021-09-04 11:02:30 +02:00
Eric Liang
c4199a8054
Add more workflow comparisons (#18347) 2021-09-03 19:26:33 -07:00
Alex Wu
7912a8554c
[code owners] Add Hao to autoscaler compatibility (#18218) 2021-09-03 18:55:09 -07:00
Yi Cheng
23e9af0601
[test] Add x nodes y actors test to nightly tests (#18291) 2021-09-03 18:54:23 -07:00
Chen Shen
cf4fb4edb3
[Core][plasma] fix the data race issue (#18312) 2021-09-03 18:51:27 -07:00
Simon Mo
e61160d514
[Dashboard] Move gcs health check to a separate thread to avoid crashing due to excessive CPU usage. (#18236) 2021-09-03 14:23:56 -07:00
Jiajun Yao
e049d52d29
Retry application-level error by default for datasets (#18296) 2021-09-03 14:21:38 -07:00
ellimac54
772d25cc38
Add Initial Windows Dockerfile (#17474) 2021-09-03 11:41:06 -07:00
matthewdeng
26f73ebb0b
[sgd] Implement resources_per_worker (#18327)
* [sgd] add support for additional resources per worker

* [sgd] add support for additional resources per worker

* update test

* lint

* update comments for case-sensitivity
2021-09-03 11:10:46 -07:00
xwjiang2010
01adf030ec
[Tune] Raise Error when there are insufficient resources. (#17957) 2021-09-03 10:49:54 -07:00
Kai Fricke
ac5d255c9c
[rllib/docker] silent unzip of atari roms (#18340) 2021-09-03 17:55:03 +01:00
Edward Oakes
a11978ea42
[runtime_env] Remove unused serialized-runtime-env from worker args (#18295) 2021-09-03 10:57:01 -05:00
Edward Oakes
1f6705d35d
[runtime_env] Centralize runtime_env logic into ray._private.runtime_env submodule (#18310) 2021-09-03 10:19:00 -05:00
Kai Fricke
6aa8a4eddc
[release] prettier output of release test results and artifacts (#18337) 2021-09-03 14:00:55 +01:00
Sven Mika
9a8ca6a69d
[RLlib] Fix Atari learning test regressions (2 bugs) and 1 minor attention net bug. (#18306) 2021-09-03 13:29:57 +02:00
Kai Fricke
fb38d06cfb
Move RLLib GPU release test dependencies to ml docker (#18208) 2021-09-03 09:35:18 +01:00
gjoliver
336e79956a
[RLlib] Make MultiAgentEnv inherit gym.Env to avoid direct class type manipulation (#18156) 2021-09-03 08:02:05 +02:00
qicosmos
72739462a9
[C++ Worker] Add some API of placement group, part 1. (#17925)
* linkopts shared

* add some pg api

* add Wait for PlacementGroup
2021-09-03 13:32:28 +08:00
Alex Wu
fa961032e1
[workflow] object ref integration (#18128)
* notes

* notes

* .

* seems to work?

* .

* seems to work

* needs tests

* needs tests

* parallelize uploads

* fixed

* fixed

* .

* dumb test

* .

* .

* fix festsg

* .

* works

* .:

* .

* .

* Update common.py

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-09-02 19:59:45 -07:00
SangBin Cho
814095add6
Revert "Change instance type for some tests (#18248)" (#18320)
This reverts commit 34026a7bd5.
2021-09-02 17:45:02 -07:00
Amog Kamsetty
40b6d765df
[SGD] v2 tune checkpointing (#18179)
* wip

* wip

* wip

* wip

* fix test

* finish

* fix failing tests

* address comments

* wip

* address comments

* update

* fix

* fix fault tolerance checkpoint id

* lint

* updates

* updates

* add test

* updates

* update

* Update python/ray/util/sgd/v2/trainer.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update python/ray/util/sgd/v2/trainer.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update python/ray/util/sgd/v2/trainer.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update python/ray/util/sgd/v2/backends/backend.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update python/ray/util/sgd/v2/backends/backend.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update python/ray/util/sgd/v2/backends/backend.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* lint

* fix

* fix test

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-09-02 17:44:37 -07:00
Jiajun Yao
d9538a958b
Avoid duplicate exports of functions (#18284) 2021-09-02 17:36:52 -07:00
Eric Liang
7dcae690b9
Mark datasets as still in alpha for now (#18321) 2021-09-02 17:07:33 -07:00
SangBin Cho
9b9eae1e86
Change misleading documentation from the placement group (#18257)
* Modify a doc

* completed
2021-09-02 16:40:48 -07:00
wanxing
60f84fa051
Abstract plasma store get request queue (#18064)
* begin

* build

* add test

* add first test

* add test

* fix build

* lint bazel

* fix build

* fix build

* fix crash

* fix some comment

* revert shared_ptr ObjectLifecycleManager

* fix RemoveGetRequest lost

* no defer

* fix lots of comments

* fix build

* fix data race

* fix comments

* Revert "fix data race"

This reverts commit 8f58e3d70b73af864566e056211ff1b70cab870c.

* refine

* fix mac build

* fix unit test

* fix unit test
2021-09-02 14:16:50 -07:00
Edward Oakes
549a8fa948
[runtime_env] [ray_client] Remove PrepRuntimeEnv RPC, upload working_dir before calling ray.init in server (#18240) 2021-09-02 14:02:39 -05:00
Antoni Baum
4c95ea6d0a
[client] Improve Ray Client connection timeout information (#18281)
* Improve Ray Client connection timeout information

* fix lint issue.

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
2021-09-02 16:34:11 +03:00
Sven Mika
2357bbc0c8
[RLlib] Issue 18231: Better (earlier) env validation and error message improvement. (#18249) 2021-09-02 09:28:16 +02:00
gjoliver
6621bb5611
[RLlib] Minor renaming and cleanups related to last rollout worker seed fix. (#18155) 2021-09-02 06:57:46 +02:00
xwjiang2010
9fa7951171
[core] Log once when get_gpu_ids is called on driver. (#18282) 2021-09-01 16:47:00 -07:00
Stephanie Wang
d43d297d9a
[core] Attach call site to ObjectRefs, print on error (#17971)
* Attach call site to ObjectRef

* flag

* Fix build

* build

* build

* build

* x

* x

* skip on windows

* lint
2021-09-01 15:29:05 -07:00
Yi Cheng
d470e679df
[core] Add some mock headers for ray core (#18265)
* up

* up

* up

* format

* up

* up

* format
2021-09-01 13:04:35 -07:00
Chris K. W
1a10108765
[core] Release function actor lock while waiting for actor class to be loaded by import thread (#18175) 2021-09-01 12:59:48 -07:00
Sven Mika
a7670d9fab
[RLlib; Testing] Fix smoke-test settings for nightly learning_tests and stress_test; Add pybullet_envs to app-config. (#18274) 2021-09-01 21:46:06 +02:00
Amog Kamsetty
9c2e7ffd97
[SGD] v2 Fault Tolerance (#18090)
* wip

* wip

* wip

* wip

* update

* finish

* remove

* fix

* update

* update

* update comment

* handle backend failures

* bump test timeout

* address comments

* fix

* fix

* address comments

* formatting

* add comment

* address comment

* fix failing test

* update error message

* Update python/ray/util/sgd/v2/trainer.py

* wip

* fix failing test

* formatting

* fix
2021-09-01 12:43:10 -07:00
Edward Oakes
0326bbb30a
[serve] Skip test_standalone namespace test on windows (#18277) 2021-09-01 12:58:59 -05:00
Jiajun Yao
fbb3ac6a86
Retry application-level errors (#18176)
* Retry application-level errors

* Retry application-level errors

* Push retry message to the driver
2021-09-01 10:53:06 -07:00
Edward Oakes
673bf35c1f
Refactor BackendState to be per-backend instead of global (#18255) 2021-09-01 09:46:22 -05:00
mwtian
be50c13251
[Client] Use a single RPC to fetch ClientObjectRefs passed in a list (#16944) 2021-08-31 16:31:13 -07:00
Edward Oakes
5d122cf7b7
[runtime_env] Move working dir setup to the agent (#18170) 2021-08-31 17:22:49 -05:00
Guyang Song
be772df4dc
[Event] Add some error level events (#18118)
* add event 'RAY_WORKER_FAILURE' and 'RAY_DRIVER_FAILURE'

* add some events

* move event 'EL_RAY_NODE_REMOVED' to 'RemoveNode()'
2021-08-31 14:15:13 -07:00
Sven Mika
82465f9342
[RLlib] Better PolicyServer example (w/ or w/o tune) and add printing out actual listen port address in log-level=INFO. (#18254) 2021-08-31 22:03:23 +02:00
matthewdeng
a3123b6860
[SGD] v2 Horovod backend (#18047)
* [SGD] add Horovod backend

* address comments: set CUDA_VISIBLE_DEVICES, refactor code

* fix gpu test

* fix lint/test import

* address comments, add example cluster config

* delay horovod imports
2021-08-31 12:54:59 -07:00
Wesley Gifford
6133a561e9
Dataset from modin (#18122) 2021-08-31 11:19:35 -07:00
Nikita Vemuri
c5b99ab590
[serve] Start RayInternalKVStore in controller namespace (#18164) 2021-08-31 13:09:33 -05:00
Edward Oakes
17dded543c
Support passing gcs_client to internal_kv (#18235) 2021-08-31 12:46:41 -05:00
xwjiang2010
63f00843f3
[Tune] Inform users of the setup needed for uploading results to cloud. (#18220) 2021-08-31 10:27:50 -07:00
mwtian
134ac0ef55
[CI] Fix clang-format to always compare against master (#18140) 2021-08-31 10:16:33 -07:00
SangBin Cho
34026a7bd5
Change instance type for some tests (#18248) 2021-08-31 10:10:46 -07:00
SangBin Cho
d240d26525
[Object Spilling] Fix a bug where object url is empty. (#18193)
* Fix a bug

* Addressed code review.

* Fix a test
2021-08-31 10:10:28 -07:00