Commit graph

2363 commits

Author SHA1 Message Date
Stephanie Wang
c052395f4e
[core] Remove "plasma promotion" for serialized ObjectRefs 2021-10-01 10:39:55 -07:00
architkulkarni
b0a5564f4e
[Serve] Integrate metrics with minimal autoscaling algorithm and add e2e test (#18793) 2021-10-01 10:21:12 -07:00
Tom Birch
aa0cab5cae
Don't export absl symbols as they collide with tensorflow (#18870)
Co-authored-by: Tom Birch <tom@powerlinespro.com>
2021-10-01 13:20:59 +08:00
mwtian
49a57aa477
[Scheduling] Report resource demand for infeasible 1-CPU tasks (#19000) 2021-09-30 22:03:02 -07:00
Edward Oakes
8e5d48d668
[runtime_env] Remove deprecated override_environment_variables and worker_env fields (#18213) 2021-09-30 18:55:24 -05:00
mwtian
d12e35ce53
[Object manager] don't abort entire pull request on race condition in concurrent chunk receive (#18955) 2021-09-30 10:19:54 -07:00
Simon Mo
910553c3bb
[Core] Add private method to retrieve current task queue length (#18964) 2021-09-30 09:20:04 -07:00
Stephanie Wang
5eddaabd11
[core] Fix bug in dependency resolution for actor handles (#18862)
* x

* lint
2021-09-29 13:25:31 -07:00
Jiajun Yao
ed9118393c
Listen to 127.0.0.1 by default on mac osx (#18904) 2021-09-29 11:40:19 -07:00
Dmitri Gekhtman
944309c017
Revert "[nightly] Deflaky nightly test many_nodes_actor_test (#18582)" (#18954)
* Revert "[nightly] Deflaky nightly test many_nodes_actor_test (#18582)"

This reverts commit fc6a739e4b.

* move to large test

Co-authored-by: Yi Cheng <chengyidna@gmail.com>
2021-09-29 11:02:14 -04:00
Chong-Li
42744f29ee
[GCS] Make Gcs-based actor scheduler's bookkeeping consistent (#18546)
* Make Gcs-based scheduler's bookkeeping consistent

* Remove this from lambda function

* Fix lambda function

* Trigger SchedulePendingActors

* Test for acquiring/releasing resources

* Reorganize structure

* Avoid overloading post

* Fix gcs_actor_manager_test

* Fix post counter and rename some func

* Fix unique_ptr

* Fix unique_ptr

* Fix book lint error

* Lint

Co-authored-by: Chong-Li <lc300133@antgroup.com>
2021-09-29 05:53:34 -07:00
Lixin Wei
a6a02779fe
[Core] remove verbose log from task execution (#18736) 2021-09-29 00:31:33 -07:00
Yi Cheng
96dff6e46d
[core] fix implicit merge conflict (#18961) 2021-09-28 19:18:54 -07:00
Edward Oakes
73b8936aa8
[runtime_env] Unify rpc::RuntimeEnv with serialized_runtime_env field (#18641) 2021-09-28 15:13:15 -05:00
Yi Cheng
4af07a8917
[rpc] cpu improvement of protobuf in gcs (#17933) 2021-09-28 11:47:19 -07:00
SangBin Cho
a0a02f4982
[Placement Group] Fix placement group high cpu usage part 1 (#18652) 2021-09-28 11:14:59 -07:00
Yi Cheng
e3dd1e3751
Revert "Revert "[test] add unit test for PR #17634 (#18585)" (#18830)" (#18871)
* Revert "Revert "[test] add unit test for PR #17634 (#18585)" (#18830)"

This reverts commit 8dd3057644.

* up
2021-09-28 05:53:52 -07:00
Chen Shen
057c425122
[Core][CoreWorker] call shutdown in the correct thread (#18910) 2021-09-28 01:29:47 -07:00
Chen Shen
25d14cb4de
Ensure task_execution_service_ is destructed first (#18913) 2021-09-27 18:31:56 -07:00
Chen Shen
cbd7dc749c
[Core][CoreWorker] fix data race of exiting_ 2021-09-27 10:55:03 -07:00
mwtian
66aac2e219
[C++] Use RayConfig to read internal environment variables only once (#18869)
* store environ on first access

* fix

* Use RayConfig

* fix

* fix

* Revert removal of headers. They are actually used.

* rename

* fix lint

* format

* use std::getenv()

* fix
2021-09-25 12:27:42 -07:00
Jiajun Yao
e79f271b05
Fix nil redis array element (#18813) 2021-09-24 20:11:43 -07:00
Eric Liang
11a2dfcaab
Improve unschedulable task warning messages by integrating with the autoscaler (#18724) 2021-09-24 12:19:58 -07:00
Stephanie Wang
7b1e594412
[core] Fix bug in ref counting protocol for nested objects (#18821)
* Fix assertion crash

* test, lint

* todo

* tests

* protocol

* test

* fix

* lint

* header

* recursive

* note

* forward test

* lock

* lint

* unneeded check
2021-09-23 09:45:12 -07:00
mwtian
e41109a5e7
[Client] Use async rpc for remote call and actor creation (#18298)
* Use async rpc for remote calls, task and actor creations.

* fix

* check placement

* check placement group. wait for id in destructor

* fix

* fix exception in destructor

* Add test

* revert change

* Fix comment

* fix
2021-09-22 18:30:50 -07:00
Yi Cheng
8dd3057644
Revert "[test] add unit test for PR #17634 (#18585)" (#18830)
This reverts commit 73c3cff18b.
2021-09-22 16:51:02 -07:00
Yi Cheng
73c3cff18b
[test] add unit test for PR #17634 (#18585) 2021-09-22 14:39:30 -07:00
Yi Cheng
fc6a739e4b
[nightly] Deflaky nightly test many_nodes_actor_test (#18582) 2021-09-20 22:43:48 -07:00
DK.Pino
d329101469
Revert Revert "[Placement Group] Support infeasible placement groups for Placement Group." (#18735)
* fix conflict

* cxx lint
2021-09-20 20:18:12 -07:00
Yi Cheng
07babd807c
Revert "Revert "[core] Async submitting actor registerring (#18009)" (#18719)" (#18722) 2021-09-20 19:17:00 -07:00
Ameer Haj Ali
9efbd80733
[core] avoid scheduling on gpu nodes by default (#18743)
* [core] avoid scheduling on gpu nodes by default

* Fix cluster_task_manager_test tests.

Made most tests in cluster_task_manager_test not use GPU on the head node.

Also added another test to scheduling_policy_test.

Co-authored-by: Sasha Sobol <sasha@asobol.com>
2021-09-20 17:38:40 -07:00
Stephanie Wang
eafe6d5c79
Fix ref counting assertion check (#18752)
* Fix assertion crash

* test, lint

* todo

* x
2021-09-20 15:16:19 -07:00
Chen Shen
b321abc560
[Core] fix another thread safety issue in instrumented_io_context 2021-09-19 17:44:31 -07:00
Chen Shen
35aa944ef4
Fix thread-safety in global state accessor (#18746) 2021-09-19 12:01:31 -07:00
Eric Liang
d6ff390858
Task failure should not log error (#18742) 2021-09-18 13:26:32 -07:00
Qing Wang
6f1d3f94db
Publish actor state PENDING_CREATION for dashboard showing. (#18666) 2021-09-18 15:44:58 +08:00
mwtian
efdbfcfdfb
[Build] Generate Bazel config for compiling with clang and libc++ in CI (#18622)
* Add Bazel config for building with llvm. Upgrade C++ std to 17.

* Fix redis. Try fixing asan and tsan

* Fix asan and format

* Update comments.

Co-authored-by: Chen Shen <scv119@gmail.com>
2021-09-17 19:01:07 -07:00
Chris K. W
8858489e2f
[client] let ray client reconnect on grpc failures (#18329)
* wip

* client tests working again

* extra prints

* start reconnect logic for proxier

* local proxy more wip

* delay cleanup logic working on proxy

* Fix up dataservicer logic

* lint + fix proxy data servicer exit logic

* hmmm

* delay cleanup always in dataservicer

* fix last_seen check

* cancel channel on error

* explicitly request cleanup

* cleanup request fixes

* fix dataclient proxy

* start idempotence logic

* change default channel state

* add backoff logic

* move connection logic back into worker.__init__

* add logic for replay cache case where request was received but response hasn't been fully resolved

* new proto entries for data stream caching

* start replay_cache logic, increase cleanup delay

* hardcode retries

* Let data channel attempt reconnects

* manually reset queue, remove replay_cache logic

* reduce cleanup delay to 5 minutes

* fix local tests

* Remove async cache logic

* retry async requests

* simplify backoff logic

* Fix ray client proto

* Configurable reconnect grace period

* Basic logsclient fix?

* Configure grace through environment variable

* Use stopped event to force faster datapath cleanup

* Better connect+reconnect logic

* fix reconnect_grace_period default

* init fixes for reconnect_grace_period

* cleanup

* fix _get_client_id_from_context call

* add logic for pathological cache cases

* less intrusive data channel error message

* fix tests

* Make stuff less painful to read

* add ordered replay cache for dataservicer, replay cache tests

* fix ordering import, start_reconnect test

* add middleman testing logic

* enforce ordering of dataclient requests

* retry wheels

* grace period through env only, restore test_dataclient_disconnect

* minor fixes

* force rerun

* less intrusive error msgs

* address review

* replay->response cache

* remove unneeded sleep

* typing

* extra response cache test

* fix error msg

* remove TODO

* add _reconnect_channel

* add grace period test

* store thread_id and req_id in metadata

* Revert "store thread_id and req_id in metadata"

This reverts commit 12bc05cc0ceb0b764e2279353ba003fca16c3181.

* Revert "Revert "store thread_id and req_id in metadata""

This reverts commit 67874cf3a207fed49e6070c7e955a640f0094d19.

* fix metadata check

* remove comment

* removed unused cv

* cast back to int

* refactor Datapath for readability

* Revert refactor

This reverts commit f789bad473c953eebabefe7eb6aa891e5b8a8f13.

* fix comment

* merge fixes

* refactor _shutdown

* address reviews

* log errors in both cases

* add comments

* address reviews

* move reconnect test to medium

* Always propagate error to callbacks

* readability

* formatting

* Faster cleanup on uncaught dataservicer errors

* delete tmp file

* offset commit

* refactor

* propagate data servicer error message

* Stricter handling/propagation of errors

* remove tmp file

* better docs

* forward reconnecting metadata

* add annotation

* fix invalidate + add test

* fix docstrings and types

* disable retries and caching if reconnect grace period is set to 0

* update comments

* address review, increase ack batch size and skip ack's if reconnect isn't enabled

* Don't terminate data stream on missing reconnecting metadata
2021-09-18 01:11:00 +03:00
Yi Cheng
cf64ab5b90
Revert "[core] Async submitting actor registerring (#18009)" (#18719)
This reverts commit 8ce01ea2cc.
2021-09-17 13:34:12 -07:00
Yi Cheng
8ce01ea2cc
[core] Async submitting actor registerring (#18009) 2021-09-17 10:03:35 -07:00
architkulkarni
a9cce8a34b
[serve] Add basic calculate_desired_num_replicas function for autoscaling (#18658) 2021-09-17 00:18:51 -07:00
Simon Mo
3029812b8b
[Serve] Autoscaling metric store take 2 (#18683) 2021-09-16 22:28:13 -07:00
DK.Pino
12b3b1f723
[core] Log resource name not id (#18598) 2021-09-16 16:28:09 -07:00
Simon Mo
317a34c523
[Serve] Use BackendConfig Protobuf (#17835) 2021-09-16 11:08:23 -07:00
Guyang Song
187e4a86ca
[C++ API] expose C++ task failure event (#18596) 2021-09-16 19:20:16 +08:00
Sasha Sobol
2f0e22aa4e
prioritize non-gpu nodes when scheduling CPU-only requests (#18615) 2021-09-16 09:57:24 +01:00
Stephanie Wang
be7cb70c30
[core] Fix ref counting during actor construction (#18646)
* test

* fix

* cpp

* skip windows

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-09-15 22:16:53 -07:00
liuyang-my
ed04ab7140
Define protobuf for RequestMetadata and HTTPRequestWrapper (#18203) 2021-09-15 14:39:27 -07:00
Edward Oakes
7d0a2b39e3
[runtime_env] Remove dynamically imported setup_hook (#18601) 2021-09-15 10:19:55 -05:00
Eric Liang
15512c27c2
Revert "Revert "Route core worker ERROR/FATAL logs to driver logs (#1… (#18604) 2021-09-14 13:32:07 -07:00