Commit graph

5056 commits

Author SHA1 Message Date
Edward Oakes
f0555f88d6
[runtime_env] Move worker process startup logic to context (#18341) 2021-09-08 17:08:27 -05:00
Antoni Baum
dd6abed6ce
[tune] Fix an edge case where DurableTrainable would not delete checkpoints in remote storage (#18318)
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2021-09-08 15:00:09 -07:00
Ian Rodney
c91e0eb065
[Dashboard] Increase Actor Snapshot Size (#18433) 2021-09-08 12:06:33 -07:00
Sasha Sobol
f76f14fedf
[client] pass _credentials down from init (#18425) 2021-09-08 10:30:26 -07:00
Clark Zinzow
b30c41759d
[Datasets] Adds tensor column support (tensors-in-tables) via Pandas/Arrow extension types/arrays. (#18301) 2021-09-08 10:09:01 -07:00
mwtian
e427e4a467
Fix flakiness in test_proxy_manager_internal_kv (#18416) 2021-09-08 15:46:45 +03:00
Kai Fricke
dac3a8bc8e
[setup] Upstream conda patches (#17575)
Co-authored-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>
2021-09-08 10:37:17 +01:00
Edward Oakes
56adaa32f1
[serve] Better logging for exceptions in backend_state.update() (#18402) 2021-09-07 21:40:41 -05:00
Simon Mo
a29da81cfc
Revert "Revert "Fix tracing bug when actors are defined before connecting to …" (#16122) 2021-09-07 16:19:49 -07:00
Edward Oakes
f2afb08125
[runtime_env] Don't modify passed runtime_env dictionary when validating (#18404) 2021-09-07 16:14:28 -07:00
Lada Kunc
1a72c49009
[serve] Fix get_handle execution from threads (#18198) 2021-09-07 14:49:36 -07:00
Guyang Song
f104a5aad7
[docs] Fix cpp wheel description (#18386) 2021-09-07 15:45:04 -05:00
xwjiang2010
64c2f86a22
[Tune] Respect default_resources during Trial.reset(). (#18209) 2021-09-07 19:14:44 +01:00
Clark Zinzow
26b2720915
Add test coverage for writing to fsspec filesystems. (#18394) 2021-09-07 10:16:59 -07:00
Jiajun Yao
2740d28fad
[client] Increase timeout for ProxyManager.get_channel (#18350) 2021-09-07 11:06:17 -05:00
Sven Mika
cabaa3b3c6
[RLlib Testing] Add A3C/APPO/BC/DDPPO/MARWIL/CQL/ES/ARS/TD3 to weekly learning tests. (#18381) 2021-09-07 11:48:41 +02:00
Jiajun Yao
64040a90a5
Datasets schema should match the columns selection for Parquet (#18361) 2021-09-07 00:41:26 -07:00
Sasha Sobol
f24ccf475e
[client] Add a grpc.ChannelCredentials argument to ray.init (#18365)
Co-authored-by: Thomas Desrosiers <thomas@anyscale.com>
2021-09-07 00:17:13 -07:00
Kai Fricke
f3a3a4bc92
[tune] Queue more than more actor/placement group (#18338) 2021-09-06 09:41:08 -07:00
Eric Liang
cbdafa0b63
[doc] Fix various workflow doc bugs (#18357) 2021-09-06 01:39:08 -07:00
Richard Liaw
0594deafdf
[tune] allow users to configure bootstrap for docker syncer (#17786) 2021-09-05 22:04:31 -07:00
Richard Liaw
93f7976215
[docs/deps] Clean up dependency ux/docs #18360
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-09-05 22:03:32 -07:00
Eric Liang
c4199a8054
Add more workflow comparisons (#18347) 2021-09-03 19:26:33 -07:00
Simon Mo
e61160d514
[Dashboard] Move gcs health check to a separate thread to avoid crashing due to excessive CPU usage. (#18236) 2021-09-03 14:23:56 -07:00
Jiajun Yao
e049d52d29
Retry application-level error by default for datasets (#18296) 2021-09-03 14:21:38 -07:00
matthewdeng
26f73ebb0b
[sgd] Implement resources_per_worker (#18327)
* [sgd] add support for additional resources per worker

* [sgd] add support for additional resources per worker

* update test

* lint

* update comments for case-sensitivity
2021-09-03 11:10:46 -07:00
xwjiang2010
01adf030ec
[Tune] Raise Error when there are insufficient resources. (#17957) 2021-09-03 10:49:54 -07:00
Edward Oakes
a11978ea42
[runtime_env] Remove unused serialized-runtime-env from worker args (#18295) 2021-09-03 10:57:01 -05:00
Edward Oakes
1f6705d35d
[runtime_env] Centralize runtime_env logic into ray._private.runtime_env submodule (#18310) 2021-09-03 10:19:00 -05:00
Kai Fricke
fb38d06cfb
Move RLLib GPU release test dependencies to ml docker (#18208) 2021-09-03 09:35:18 +01:00
Alex Wu
fa961032e1
[workflow] object ref integration (#18128)
* notes

* notes

* .

* seems to work?

* .

* seems to work

* needs tests

* needs tests

* parallelize uploads

* fixed

* fixed

* .

* dumb test

* .

* .

* fix festsg

* .

* works

* .:

* .

* .

* Update common.py

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-09-02 19:59:45 -07:00
Amog Kamsetty
40b6d765df
[SGD] v2 tune checkpointing (#18179)
* wip

* wip

* wip

* wip

* fix test

* finish

* fix failing tests

* address comments

* wip

* address comments

* update

* fix

* fix fault tolerance checkpoint id

* lint

* updates

* updates

* add test

* updates

* update

* Update python/ray/util/sgd/v2/trainer.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update python/ray/util/sgd/v2/trainer.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update python/ray/util/sgd/v2/trainer.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update python/ray/util/sgd/v2/backends/backend.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update python/ray/util/sgd/v2/backends/backend.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update python/ray/util/sgd/v2/backends/backend.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* lint

* fix

* fix test

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-09-02 17:44:37 -07:00
Jiajun Yao
d9538a958b
Avoid duplicate exports of functions (#18284) 2021-09-02 17:36:52 -07:00
Edward Oakes
549a8fa948
[runtime_env] [ray_client] Remove PrepRuntimeEnv RPC, upload working_dir before calling ray.init in server (#18240) 2021-09-02 14:02:39 -05:00
Antoni Baum
4c95ea6d0a
[client] Improve Ray Client connection timeout information (#18281)
* Improve Ray Client connection timeout information

* fix lint issue.

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
2021-09-02 16:34:11 +03:00
xwjiang2010
9fa7951171
[core] Log once when get_gpu_ids is called on driver. (#18282) 2021-09-01 16:47:00 -07:00
Stephanie Wang
d43d297d9a
[core] Attach call site to ObjectRefs, print on error (#17971)
* Attach call site to ObjectRef

* flag

* Fix build

* build

* build

* build

* x

* x

* skip on windows

* lint
2021-09-01 15:29:05 -07:00
Chris K. W
1a10108765
[core] Release function actor lock while waiting for actor class to be loaded by import thread (#18175) 2021-09-01 12:59:48 -07:00
Amog Kamsetty
9c2e7ffd97
[SGD] v2 Fault Tolerance (#18090)
* wip

* wip

* wip

* wip

* update

* finish

* remove

* fix

* update

* update

* update comment

* handle backend failures

* bump test timeout

* address comments

* fix

* fix

* address comments

* formatting

* add comment

* address comment

* fix failing test

* update error message

* Update python/ray/util/sgd/v2/trainer.py

* wip

* fix failing test

* formatting

* fix
2021-09-01 12:43:10 -07:00
Edward Oakes
0326bbb30a
[serve] Skip test_standalone namespace test on windows (#18277) 2021-09-01 12:58:59 -05:00
Jiajun Yao
fbb3ac6a86
Retry application-level errors (#18176)
* Retry application-level errors

* Retry application-level errors

* Push retry message to the driver
2021-09-01 10:53:06 -07:00
Edward Oakes
673bf35c1f
Refactor BackendState to be per-backend instead of global (#18255) 2021-09-01 09:46:22 -05:00
mwtian
be50c13251
[Client] Use a single RPC to fetch ClientObjectRefs passed in a list (#16944) 2021-08-31 16:31:13 -07:00
Edward Oakes
5d122cf7b7
[runtime_env] Move working dir setup to the agent (#18170) 2021-08-31 17:22:49 -05:00
matthewdeng
a3123b6860
[SGD] v2 Horovod backend (#18047)
* [SGD] add Horovod backend

* address comments: set CUDA_VISIBLE_DEVICES, refactor code

* fix gpu test

* fix lint/test import

* address comments, add example cluster config

* delay horovod imports
2021-08-31 12:54:59 -07:00
Wesley Gifford
6133a561e9
Dataset from modin (#18122) 2021-08-31 11:19:35 -07:00
Nikita Vemuri
c5b99ab590
[serve] Start RayInternalKVStore in controller namespace (#18164) 2021-08-31 13:09:33 -05:00
Edward Oakes
17dded543c
Support passing gcs_client to internal_kv (#18235) 2021-08-31 12:46:41 -05:00
Ryan L. Melvin
c081c68de7
[tune] Conditional search space example using hyperopt (#18130)
Co-authored-by: Ryan Melvin <rmelvin@uabmc.edu>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2021-08-31 17:06:22 +02:00
Kai Fricke
a8dbc44f9a
[ci] minimal dependency install test (#18071) 2021-08-31 15:26:25 +02:00