Eric Liang
c4199a8054
Add more workflow comparisons ( #18347 )
2021-09-03 19:26:33 -07:00
Simon Mo
e61160d514
[Dashboard] Move gcs health check to a separate thread to avoid crashing due to excessive CPU usage. ( #18236 )
2021-09-03 14:23:56 -07:00
Jiajun Yao
e049d52d29
Retry application-level error by default for datasets ( #18296 )
2021-09-03 14:21:38 -07:00
matthewdeng
26f73ebb0b
[sgd] Implement resources_per_worker ( #18327 )
...
* [sgd] add support for additional resources per worker
* [sgd] add support for additional resources per worker
* update test
* lint
* update comments for case-sensitivity
2021-09-03 11:10:46 -07:00
xwjiang2010
01adf030ec
[Tune] Raise Error when there are insufficient resources. ( #17957 )
2021-09-03 10:49:54 -07:00
Edward Oakes
a11978ea42
[runtime_env] Remove unused serialized-runtime-env from worker args ( #18295 )
2021-09-03 10:57:01 -05:00
Edward Oakes
1f6705d35d
[runtime_env] Centralize runtime_env logic into ray._private.runtime_env submodule ( #18310 )
2021-09-03 10:19:00 -05:00
Kai Fricke
fb38d06cfb
Move RLLib GPU release test dependencies to ml docker ( #18208 )
2021-09-03 09:35:18 +01:00
Alex Wu
fa961032e1
[workflow] object ref integration ( #18128 )
...
* notes
* notes
* .
* seems to work?
* .
* seems to work
* needs tests
* needs tests
* parallelize uploads
* fixed
* fixed
* .
* dumb test
* .
* .
* fix festsg
* .
* works
* .:
* .
* .
* Update common.py
Co-authored-by: Alex Wu <alex@anyscale.com>
2021-09-02 19:59:45 -07:00
Amog Kamsetty
40b6d765df
[SGD] v2 tune checkpointing ( #18179 )
...
* wip
* wip
* wip
* wip
* fix test
* finish
* fix failing tests
* address comments
* wip
* address comments
* update
* fix
* fix fault tolerance checkpoint id
* lint
* updates
* updates
* add test
* updates
* update
* Update python/ray/util/sgd/v2/trainer.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update python/ray/util/sgd/v2/trainer.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update python/ray/util/sgd/v2/trainer.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update python/ray/util/sgd/v2/backends/backend.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update python/ray/util/sgd/v2/backends/backend.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* Update python/ray/util/sgd/v2/backends/backend.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* lint
* fix
* fix test
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-09-02 17:44:37 -07:00
Jiajun Yao
d9538a958b
Avoid duplicate exports of functions ( #18284 )
2021-09-02 17:36:52 -07:00
Edward Oakes
549a8fa948
[runtime_env] [ray_client] Remove PrepRuntimeEnv RPC, upload working_dir before calling ray.init in server ( #18240 )
2021-09-02 14:02:39 -05:00
Antoni Baum
4c95ea6d0a
[client] Improve Ray Client connection timeout information ( #18281 )
...
* Improve Ray Client connection timeout information
* fix lint issue.
Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
2021-09-02 16:34:11 +03:00
xwjiang2010
9fa7951171
[core] Log once when get_gpu_ids is called on driver. ( #18282 )
2021-09-01 16:47:00 -07:00
Stephanie Wang
d43d297d9a
[core] Attach call site to ObjectRefs, print on error ( #17971 )
...
* Attach call site to ObjectRef
* flag
* Fix build
* build
* build
* build
* x
* x
* skip on windows
* lint
2021-09-01 15:29:05 -07:00
Chris K. W
1a10108765
[core] Release function actor lock while waiting for actor class to be loaded by import thread ( #18175 )
2021-09-01 12:59:48 -07:00
Amog Kamsetty
9c2e7ffd97
[SGD] v2 Fault Tolerance ( #18090 )
...
* wip
* wip
* wip
* wip
* update
* finish
* remove
* fix
* update
* update
* update comment
* handle backend failures
* bump test timeout
* address comments
* fix
* fix
* address comments
* formatting
* add comment
* address comment
* fix failing test
* update error message
* Update python/ray/util/sgd/v2/trainer.py
* wip
* fix failing test
* formatting
* fix
2021-09-01 12:43:10 -07:00
Edward Oakes
0326bbb30a
[serve] Skip test_standalone namespace test on windows ( #18277 )
2021-09-01 12:58:59 -05:00
Jiajun Yao
fbb3ac6a86
Retry application-level errors ( #18176 )
...
* Retry application-level errors
* Retry application-level errors
* Push retry message to the driver
2021-09-01 10:53:06 -07:00
Edward Oakes
673bf35c1f
Refactor BackendState to be per-backend instead of global ( #18255 )
2021-09-01 09:46:22 -05:00
mwtian
be50c13251
[Client] Use a single RPC to fetch ClientObjectRefs passed in a list ( #16944 )
2021-08-31 16:31:13 -07:00
Edward Oakes
5d122cf7b7
[runtime_env] Move working dir setup to the agent ( #18170 )
2021-08-31 17:22:49 -05:00
matthewdeng
a3123b6860
[SGD] v2 Horovod backend ( #18047 )
...
* [SGD] add Horovod backend
* address comments: set CUDA_VISIBLE_DEVICES, refactor code
* fix gpu test
* fix lint/test import
* address comments, add example cluster config
* delay horovod imports
2021-08-31 12:54:59 -07:00
Wesley Gifford
6133a561e9
Dataset from modin ( #18122 )
2021-08-31 11:19:35 -07:00
Nikita Vemuri
c5b99ab590
[serve] Start RayInternalKVStore in controller namespace ( #18164 )
2021-08-31 13:09:33 -05:00
Edward Oakes
17dded543c
Support passing gcs_client to internal_kv ( #18235 )
2021-08-31 12:46:41 -05:00
Ryan L. Melvin
c081c68de7
[tune] Conditional search space example using hyperopt ( #18130 )
...
Co-authored-by: Ryan Melvin <rmelvin@uabmc.edu>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2021-08-31 17:06:22 +02:00
Kai Fricke
a8dbc44f9a
[ci] minimal dependency install test ( #18071 )
2021-08-31 15:26:25 +02:00
Chen Shen
5f3ec7634b
Fix off by one test bug ( #18239 )
2021-08-31 00:07:03 -07:00
Clark Zinzow
e154f87cab
Added split_at_indices to DatasetPipeline. ( #18243 )
2021-08-31 00:06:35 -07:00
Eric Liang
db9b5f142d
Disable worker logs temporarily during driver breakpoints ( #18192 )
2021-08-30 20:26:16 -07:00
Stephanie Wang
8e06db7280
Revert "[Core] revert: revert Unified worker starter ( #18008 )" ( #18228 )
...
This reverts commit b9978dd02b.
2021-08-30 17:28:41 -07:00
Yi Cheng
7a65815108
[workflow] Defer input preparation until run ( #18225 )
2021-08-30 16:37:34 -07:00
Antoni Baum
5be6bda4cf
[tests] Add Ludwig CI test ( #18126 )
2021-08-30 12:27:39 -07:00
Eric Liang
1adce7da4e
Revert "Auto discover dashboard agent port ( #17855 )" ( #18217 )
...
This reverts commit 53ddb551d5.
2021-08-30 10:46:37 -07:00
Yi Cheng
f579822790
[workflow] Workflow inside virtual actor ( #18066 )
2021-08-30 10:40:22 -07:00
Chen Shen
7631d042bb
[Test] increase timeout for object spilling test caused by EBS cold storage issue ( #18200 )
2021-08-30 00:28:26 -07:00
SangBin Cho
0e968c1e82
[Core] Reduce spilling threshold ( #17910 )
...
* Lower the threshold
* ip
* Handle test failure
* lint
* last fix
* .
* Retry
2021-08-30 00:09:35 -07:00
fyrestone
53ddb551d5
Auto discover dashboard agent port ( #17855 )
2021-08-30 12:06:28 +08:00
Stephanie Wang
7bc1ef0dd9
[core] Prestart workers up to available CPU limit ( #18166 )
...
* Prestart workers according to num available CPUs
* lint
* Prestart min(available CPU, backlog)
* Fix test, adjust policy
* debug
* retry
* lint
2021-08-29 14:11:53 -07:00
Yi Cheng
d5cd95364b
[workflow] Some usability issues fixing ( #18133 )
2021-08-28 16:51:00 -07:00
Amog Kamsetty
3b77840c1b
PyTorch Lightning Updates ( #17876 )
2021-08-27 23:15:51 -07:00
Antoni Baum
e7bbadb920
[tune] Extend Tune Callback API ( #17794 )
...
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2021-08-27 18:05:12 -07:00
Antoni Baum
714193ce6f
[SGDv2] Tensorboard Callback ( #17824 )
...
* [SGD] save checkpoints to disk
* fix test; add logs
* Extend SGDv2 callback API
* Move json file creation to JsonLoggerCallback
* TBXLoggerCallback
* Simplify, fix linear example
* rename log_dir to logdir for consistency with tune
* Add test
* Fix
* Break up logging classes
* Fix error
* Update type hint for results
* Refactor
Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
2021-08-27 17:50:26 -07:00
Eric Liang
95b5ad12ba
Initial version of workflow documentation ( #18138 )
2021-08-27 16:20:48 -07:00
Jiao
c7e38ceb10
[serve] Better constructor failure handling ( #16922 )
2021-08-27 18:05:22 -05:00
mwtian
26679d62c5
[Core][ObjectRef] Change default to not record call stack during ObjectRef creation ( #18078 )
2021-08-27 15:45:34 -07:00
Clark Zinzow
c0598de82a
[Datasets] Port write APIs to use file-based datasources. ( #18135 )
2021-08-27 15:24:54 -07:00
Chen Shen
28e6ae5ce0
[Test] fix object spilling 2 ( #18141 )
2021-08-27 13:52:42 -07:00
Clark Zinzow
aee7ba2510
[Datasets] Add from_numpy() and to_numpy() APIs ( #18146 )
2021-08-27 13:33:11 -07:00