Commit graph

9564 commits

Author SHA1 Message Date
Simon Mo
2367a2cb90 Fix windows build environment breakage (#19019) 2021-10-01 16:44:55 -07:00
gjoliver
95bd355350 Upgrade bazel version to 4.2.1 (#18996) 2021-10-01 16:43:38 -07:00
Stephanie Wang
2da4ad8784 [core] Remove "plasma promotion" for serialized ObjectRefs 2021-10-01 11:19:34 -07:00
architkulkarni
7fa75fec14 [Doc] [runtime env] Remove delta caching remark and state Client+@remote limitation (#19010) 2021-09-30 20:39:00 -07:00
Amog Kamsetty
366e7ddb29 [SGD] v1 to v2 Migration Guide (#18887)
* wip

* add guide

* fix test

* address comments

* add to docs

* fix

* remove markdown

* add warning to all pages

* formatting

* fix

* links

* Update doc/source/raysgd/v2/migration-guide.rst

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update doc/source/raysgd/v2/migration-guide.rst

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update doc/source/raysgd/v2/migration-guide.rst

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update doc/source/raysgd/v2/migration-guide.rst

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* Update doc/source/raysgd/v2/migration-guide.rst

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* address comments

* address comments

* fix

* address comments

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-09-30 20:37:15 -07:00
architkulkarni
15638d0edd [runtime env] Parse local pip/conda requirements files locally upon task/actor definition (#18988) 2021-09-30 20:30:13 -07:00
Dmitri Gekhtman
7b99b12294 Revert "[nightly] Deflaky nightly test many_nodes_actor_test (#18582)" (#18954)
* Revert "[nightly] Deflaky nightly test many_nodes_actor_test (#18582)"

This reverts commit fc6a739e4b.

* move to large test

Co-authored-by: Yi Cheng <chengyidna@gmail.com>
2021-09-29 08:09:45 -07:00
matthewdeng
20ec029255 [SGD] add share_cuda_visible_devices config flag (#18958) 2021-09-29 08:08:44 -07:00
Chris K. W
08a7bef9e0 [client] remove ray_trace_ctx from kwargs if tracing disabled (#18926) 2021-09-29 08:07:25 -07:00
matthewdeng
4da70fbf06 [SGD] add SGDv2 survey link to docs (#18934) 2021-09-29 08:05:11 -07:00
Jiajun Yao
0b071427e6 Fix nil redis array element (#18813) 2021-09-24 23:00:57 -07:00
architkulkarni
ce455be708 [runtime env] [Serve] Fix error when uris field is None (#18874) 2021-09-24 23:00:07 -07:00
Guyang Song
77713a913e [C++ API][hotfix] fix C++ worker dynamic library loading issue on macOS (#18877)
* fix C++ worker in macOS

* fix
2021-09-24 22:58:28 -07:00
Guyang Song
80990d18c7 [C++ API] support head_args config in C++ API (#18709) 2021-09-23 18:55:14 -07:00
Simon Mo
9ec62ee865 [Serve] Exit run_forever when actor shutdown (#18820) 2021-09-23 15:51:04 -07:00
Guyang Song
dab5b70e49 [wheel][cpp] recover cpp extra (#18597) 2021-09-23 14:16:28 -07:00
Sven Mika
3a44a3cbb3 [RLlib] POC: Separate losses for APPO/IMPALA. Enable TFPolicy to handle multiple optimizers/losses (like TorchPolicy). (#18669) 2021-09-23 14:15:51 -07:00
Yi Cheng
672d3f2736 [nightly] Deflaky nightly test many_nodes_actor_test (#18582) 2021-09-23 14:14:21 -07:00
Kai Fricke
e39b0f785e [tune/rllib] Only disable ipython in remote actors (#18789) 2021-09-23 14:13:37 -07:00
Amog Kamsetty
f71cfca439 [SGD] Retry sgd.local_rank() (#18824)
* finish

* fix

* wip

* address comment

* update

* fix test

* fix failing test

* address comments

* fix test

* fix
2021-09-23 14:12:15 -07:00
Simon Mo
5e9cb232c7 [Serve] Doc: Mock ray.serve.generated package for doc building (#18767) 2021-09-20 15:42:53 -07:00
Jiajun Yao
5b16fa1e9f Bump ray version to 1.7.0 2021-09-20 14:06:18 -07:00
Kai Fricke
2e99fb215f [tune] Cache unstaged placement groups for potential re-use (#18706) 2021-09-20 20:23:35 +01:00
Sven Mika
e6aae61487 [RLlib; testing] Fix bug in stress tests not handling >1 trials per experiment (due to grid-search in IMPALA stress tests). (#18705) 2021-09-20 15:31:57 +02:00
Ian Rodney
8d6ddcee53 [GCP] Add conda to the path when possible. (#18653) 2021-09-19 23:06:48 -07:00
Eric Liang
85aaca8d45 Update the contribution guide / style guide (#18753) 2021-09-19 20:14:51 -07:00
Chen Shen
b321abc560 [Core] fix another thread safety issue in instrumented_io_context 2021-09-19 17:44:31 -07:00
Eric Liang
2fa9648ef0 Revert "add integration test for gpu scheduling/avoidance (#18729)" (#18754)
This reverts commit 57edc0c607.
2021-09-19 17:05:05 -07:00
Dmitri Gekhtman
ffe533b297 [autoscaler] Log ips and ids when terminating nodes, code structure (#18180)
* recovery failure uses same termination function

* More cleanup

* More cleanup

* ips

* wip

* wip

* wip

* Fix tests

* tweak
2021-09-19 18:44:38 -04:00
Chen Shen
35aa944ef4 Fix thread-safety in global state accessor (#18746) 2021-09-19 12:01:31 -07:00
xwjiang2010
5551cdac19 [Tune] Break from loop after warning msg is logged. (#18720) 2021-09-18 16:33:44 -07:00
mwtian
32f71765e9 [Client] Allow Client{Object,Actor}Ref to accept a future. (#18677)
* Allow Client{Object,Actor}Ref to accept a future. Check number of args and returns synchronously.

* rename callback, fix
2021-09-18 16:32:02 -07:00
Eric Liang
d6ff390858 Task failure should not log error (#18742) 2021-09-18 13:26:32 -07:00
Sasha Sobol
57edc0c607 add integration test for gpu scheduling/avoidance (#18729) 2021-09-18 01:32:18 -07:00
qicosmos
64c25987f3 [C++ Worker]Simple kv store example (#18613) 2021-09-18 16:02:44 +08:00
Chen Shen
eab1d28fd3 fix test (#18737) 2021-09-18 00:57:34 -07:00
Qing Wang
6f1d3f94db Publish actor state PENDING_CREATION for dashboard showing. (#18666) 2021-09-18 15:44:58 +08:00
Jiao
948508efb8 [Serve] Add checkpoint options and custom storage option (#18657) 2021-09-18 00:04:29 -07:00
DK.Pino
4ef8fd6942 remove the legacy retry mechanism (#18589) 2021-09-18 11:11:19 +08:00
mwtian
efdbfcfdfb [Build] Generate Bazel config for compiling with clang and libc++ in CI (#18622)
* Add Bazel config for building with llvm. Upgrade C++ std to 17.

* Fix redis. Try fixing asan and tsan

* Fix asan and format

* Update comments.

Co-authored-by: Chen Shen <scv119@gmail.com>
2021-09-17 19:01:07 -07:00
Amog Kamsetty
0211101e6f [SGD] Redo Class API (#18728)
* wip

* wip

* add horovod example

* add example

* lint

* fix

* address comments

* updates

* lint

* update example

* address comment

* address comment

* update

* fix

* Update python/ray/util/sgd/v2/examples/horovod/horovod_stateful_example.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* address comments

* add back name mangling

* fix tests

* Update python/ray/util/sgd/v2/trainer.py

* fix

* lint

* fix

* fix docstring

* Update python/ray/util/sgd/v2/tests/test_trainer.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* update

* fix failing test

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-09-17 18:17:51 -07:00
Clark Zinzow
aaa097c293 [Datasets] Shuffled data loading support (#18678) 2021-09-17 16:08:53 -07:00
Simon Mo
f2ea6c4e68 [Serve] Call Callable.__del__ explicit during graceful shutdown (#18446) 2021-09-17 15:19:57 -07:00
Chris K. W
8858489e2f [client] let ray client reconnect on grpc failures (#18329)
* wip

* client tests working again

* extra prints

* start reconnect logic for proxier

* local proxy more wip

* delay cleanup logic working on proxy

* Fix up dataservicer logic

* lint + fix proxy data servicer exit logic

* hmmm

* delay cleanup always in dataservicer

* fix last_seen check

* cancel channel on error

* explicitly request cleanup

* cleanup request fixes

* fix dataclient proxy

* start idempotence logic

* change default channel state

* add backoff logic

* move connection logic back into worker.__init__

* add logic for replay cache case where request was received but response hasn't been fully resolved

* new proto entries for data stream caching

* start replay_cache logic, increase cleanup delay

* hardcode retries

* Let data channel attempt reconnects

* manually reset queue, remove replay_cache logic

* reduce cleanup delay to 5 minutes

* fix local tests

* Remove async cache logic

* retry async requests

* simplify backoff logic

* Fix ray client proto

* Configurable reconnect grace period

* Basic logsclient fix?

* Configure grace through environment variable

* Use stopped event to force faster datapath cleanup

* Better connect+reconnect logic

* fix reconnect_grace_period default

* init fixes for reconnect_grace_period

* cleanup

* fix _get_client_id_from_context call

* add logic for pathological cache cases

* less intrusive data channel error message

* fix tests

* Make stuff less painful to read

* add ordered replay cache for dataservicer, replay cache tests

* fix ordering import, start_reconnect test

* add middleman testing logic

* enforce ordering of dataclient requests

* retry wheels

* grace period through env only, restore test_dataclient_disconnect

* minor fixes

* force rerun

* less intrusive error msgs

* address review

* replay->response cache

* remove unneeded sleep

* typing

* extra response cache test

* fix error msg

* remove TODO

* add _reconnect_channel

* add grace period test

* store thread_id and req_id in metadata

* Revert "store thread_id and req_id in metadata"

This reverts commit 12bc05cc0ceb0b764e2279353ba003fca16c3181.

* Revert "Revert "store thread_id and req_id in metadata""

This reverts commit 67874cf3a207fed49e6070c7e955a640f0094d19.

* fix metadata check

* remove comment

* removed unused cv

* cast back to int

* refactor Datapath for readability

* Revert refactor

This reverts commit f789bad473c953eebabefe7eb6aa891e5b8a8f13.

* fix comment

* merge fixes

* refactor _shutdown

* address reviews

* log errors in both cases

* add comments

* address reviews

* move reconnect test to medium

* Always propagate error to callbacks

* readability

* formatting

* Faster cleanup on uncaught dataservicer errors

* delete tmp file

* offset commit

* refactor

* propagate data servicer error message

* Stricter handling/propagation of errors

* remove tmp file

* better docs

* forward reconnecting metadata

* add annotation

* fix invalidate + add test

* fix docstrings and types

* disable retries and caching if reconnect grace period is set to 0

* update comments

* address review, increase ack batch size and skip ack's if reconnect isn't enabled

* Don't terminate data stream on missing reconnecting metadata
2021-09-18 01:11:00 +03:00
Jiajun Yao
ffe7108eae Fix cpp api doc (#18671) 2021-09-17 14:01:23 -07:00
Yi Cheng
cf64ab5b90 Revert "[core] Async submitting actor registerring (#18009)" (#18719)
This reverts commit 8ce01ea2cc.
2021-09-17 13:34:12 -07:00
xwjiang2010
09e760a1fd [Release] Change all cpus_per_actor in xgboost test. (#18717) 2021-09-17 12:57:21 -07:00
xwjiang2010
2c92f737f9 Fix dask_xgboost_test (#18713) 2021-09-17 11:25:54 -07:00
xwjiang2010
9c8c6c09cb Revert "[SGD] v2 Class API (#18571)" (#18715)
This reverts commit de050e8187.
2021-09-17 10:34:36 -07:00
Yi Cheng
8ce01ea2cc [core] Async submitting actor registerring (#18009) 2021-09-17 10:03:35 -07:00