Stephanie Wang
c052395f4e
[core] Remove "plasma promotion" for serialized ObjectRefs
2021-10-01 10:39:55 -07:00
architkulkarni
b0a5564f4e
[Serve] Integrate metrics with minimal autoscaling algorithm and add e2e test ( #18793 )
2021-10-01 10:21:12 -07:00
Tom Birch
aa0cab5cae
Don't export absl symbols as they collide with tensorflow ( #18870 )
...
Co-authored-by: Tom Birch <tom@powerlinespro.com>
2021-10-01 13:20:59 +08:00
mwtian
49a57aa477
[Scheduling] Report resource demand for infeasible 1-CPU tasks ( #19000 )
2021-09-30 22:03:02 -07:00
Edward Oakes
8e5d48d668
[runtime_env] Remove deprecated override_environment_variables and worker_env fields ( #18213 )
2021-09-30 18:55:24 -05:00
mwtian
d12e35ce53
[Object manager] don't abort entire pull request on race condition in concurrent chunk receive ( #18955 )
2021-09-30 10:19:54 -07:00
Simon Mo
910553c3bb
[Core] Add private method to retrieve current task queue length ( #18964 )
2021-09-30 09:20:04 -07:00
Stephanie Wang
5eddaabd11
[core] Fix bug in dependency resolution for actor handles ( #18862 )
...
* x
* lint
2021-09-29 13:25:31 -07:00
Jiajun Yao
ed9118393c
Listen to 127.0.0.1 by default on mac osx ( #18904 )
2021-09-29 11:40:19 -07:00
Dmitri Gekhtman
944309c017
Revert "[nightly] Deflaky nightly test many_nodes_actor_test ( #18582 )" ( #18954 )
...
* Revert "[nightly] Deflaky nightly test many_nodes_actor_test (#18582 )"
This reverts commit fc6a739e4b.
* move to large test
Co-authored-by: Yi Cheng <chengyidna@gmail.com>
2021-09-29 11:02:14 -04:00
Chong-Li
42744f29ee
[GCS] Make Gcs-based actor scheduler's bookkeeping consistent ( #18546 )
...
* Make Gcs-based scheduler's bookkeeping consistent
* Remove this from lambda function
* Fix lambda function
* Trigger SchedulePendingActors
* Test for acquiring/releasing resources
* Reorganize structure
* Avoid overloading post
* Fix gcs_actor_manager_test
* Fix post counter and rename some func
* Fix unique_ptr
* Fix unique_ptr
* Fix book lint error
* Lint
Co-authored-by: Chong-Li <lc300133@antgroup.com>
2021-09-29 05:53:34 -07:00
Lixin Wei
a6a02779fe
[Core] remove verbose log from task execution ( #18736 )
2021-09-29 00:31:33 -07:00
Yi Cheng
96dff6e46d
[core] fix implicit merge conflict ( #18961 )
2021-09-28 19:18:54 -07:00
Edward Oakes
73b8936aa8
[runtime_env] Unify rpc::RuntimeEnv with serialized_runtime_env field ( #18641 )
2021-09-28 15:13:15 -05:00
Yi Cheng
4af07a8917
[rpc] cpu improvement of protobuf in gcs ( #17933 )
2021-09-28 11:47:19 -07:00
SangBin Cho
a0a02f4982
[Placement Group] Fix placement group high cpu usage part 1 ( #18652 )
2021-09-28 11:14:59 -07:00
Yi Cheng
e3dd1e3751
Revert "Revert "[test] add unit test for PR #17634 ( #18585 )" ( #18830 )" ( #18871 )
...
* Revert "Revert "[test] add unit test for PR #17634 (#18585 )" (#18830 )"
This reverts commit 8dd3057644.
* up
2021-09-28 05:53:52 -07:00
Chen Shen
057c425122
[Core][CoreWorker] call shutdown in the correct thread ( #18910 )
2021-09-28 01:29:47 -07:00
Chen Shen
25d14cb4de
Ensure task_execution_service_ is destructed first ( #18913 )
2021-09-27 18:31:56 -07:00
Chen Shen
cbd7dc749c
[Core][CoreWorker] fix data race of exiting_
2021-09-27 10:55:03 -07:00
mwtian
66aac2e219
[C++] Use RayConfig to read internal environment variables only once ( #18869 )
...
* store environ on first access
* fix
* Use RayConfig
* fix
* fix
* Revert removal of headers. They are actually used.
* rename
* fix lint
* format
* use std::getenv()
* fix
2021-09-25 12:27:42 -07:00
Jiajun Yao
e79f271b05
Fix nil redis array element ( #18813 )
2021-09-24 20:11:43 -07:00
Eric Liang
11a2dfcaab
Improve unschedulable task warning messages by integrating with the autoscaler ( #18724 )
2021-09-24 12:19:58 -07:00
Stephanie Wang
7b1e594412
[core] Fix bug in ref counting protocol for nested objects ( #18821 )
...
* Fix assertion crash
* test, lint
* todo
* tests
* protocol
* test
* fix
* lint
* header
* recursive
* note
* forward test
* lock
* lint
* unneeded check
2021-09-23 09:45:12 -07:00
mwtian
e41109a5e7
[Client] Use async rpc for remote call and actor creation ( #18298 )
...
* Use async rpc for remote calls, task and actor creations.
* fix
* check placement
* check placement group. wait for id in destructor
* fix
* fix exception in destructor
* Add test
* revert change
* Fix comment
* fix
2021-09-22 18:30:50 -07:00
Yi Cheng
8dd3057644
Revert "[test] add unit test for PR #17634 ( #18585 )" ( #18830 )
...
This reverts commit 73c3cff18b.
2021-09-22 16:51:02 -07:00
Yi Cheng
73c3cff18b
[test] add unit test for PR #17634 ( #18585 )
2021-09-22 14:39:30 -07:00
Yi Cheng
fc6a739e4b
[nightly] Deflaky nightly test many_nodes_actor_test ( #18582 )
2021-09-20 22:43:48 -07:00
DK.Pino
d329101469
Revert Revert "[Placement Group] Support infeasible placement groups for Placement Group." ( #18735 )
...
* fix conflict
* cxx lint
2021-09-20 20:18:12 -07:00
Yi Cheng
07babd807c
Revert "Revert "[core] Async submitting actor registerring ( #18009 )" ( #18719 )" ( #18722 )
2021-09-20 19:17:00 -07:00
Ameer Haj Ali
9efbd80733
[core] avoid scheduling on gpu nodes by default ( #18743 )
...
* [core] avoid scheduling on gpu nodes by default
* Fix cluster_task_manager_test tests.
Made most tests in cluster_task_manager_test not use GPU on the head
node.
Also added another test to scheduling_policy_test.
Co-authored-by: Sasha Sobol <sasha@asobol.com>
2021-09-20 17:38:40 -07:00
Stephanie Wang
eafe6d5c79
Fix ref counting assertion check ( #18752 )
...
* Fix assertion crash
* test, lint
* todo
* x
2021-09-20 15:16:19 -07:00
Chen Shen
b321abc560
[Core] fix another thread safety issue in instrumented_io_context
2021-09-19 17:44:31 -07:00
Chen Shen
35aa944ef4
Fix thread-safety in global state accessor ( #18746 )
2021-09-19 12:01:31 -07:00
Eric Liang
d6ff390858
Task failure should not log error ( #18742 )
2021-09-18 13:26:32 -07:00
Qing Wang
6f1d3f94db
Publish actor state PENDING_CREATION for dashboard showing. ( #18666 )
2021-09-18 15:44:58 +08:00
mwtian
efdbfcfdfb
[Build] Generate Bazel config for compiling with clang and libc++ in CI ( #18622 )
...
* Add Bazel config for building with llvm. Upgrade C++ std to 17.
* Fix redis. Try fixing asan and tsan
* Fix asan and format
* Update comments.
Co-authored-by: Chen Shen <scv119@gmail.com>
2021-09-17 19:01:07 -07:00
Chris K. W
8858489e2f
[client] let ray client reconnect on grpc failures ( #18329 )
...
* wip
* client tests working again
* extra prints
* start reconnect logic for proxier
* local proxy more wip
* delay cleanup logic working on proxy
* Fix up dataservicer logic
* lint + fix proxy data servicer exit logic
* hmmm
* delay cleanup always in dataservicer
* fix last_seen check
* cancel channel on error
* explicitly request cleanup
* cleanup request fixes
* fix dataclient proxy
* start idempotence logic
* change default channel state
* add backoff logic
* move connection logic back into worker.__init__
* add logic for replay cache case where request was received but response hasn't been fully resolved
* new proto entries for data stream caching
* start replay_cache logic, increase cleanup delay
* hardcode retries
* Let data channel attempt reconnects
* manually reset queue, remove replay_cache logic
* reduce cleanup delay to 5 minutes
* fix local tests
* Remove async cache logic
* retry async requests
* simplify backoff logic
* Fix ray client proto
* Configurable reconnect grace period
* Basic logsclient fix?
* Configure grace through environment variable
* Use stopped event to force faster datapath cleanup
* Better connect+reconnect logic
* fix reconnect_grace_period default
* init fixes for reconnect_grace_period
* cleanup
* fix _get_client_id_from_context call
* add logic for pathological cache cases
* less intrusive data channel error message
* fix tests
* Make stuff less painful to read
* add ordered replay cache for dataservicer, replay cache tests
* fix ordering import, start_reconnect test
* add middleman testing logic
* enforce ordering of dataclient requests
* retry wheels
* grace period through env only, restore test_dataclient_disconnect
* minor fixes
* force rerun
* less intrusive error msgs
* address review
* replay->response cache
* remove unneeded sleep
* typing
* extra response cache test
* fix error msg
* remove TODO
* add _reconnect_channel
* add grace period test
* store thread_id and req_id in metadata
* Revert "store thread_id and req_id in metadata"
This reverts commit 12bc05cc0ceb0b764e2279353ba003fca16c3181.
* Revert "Revert "store thread_id and req_id in metadata""
This reverts commit 67874cf3a207fed49e6070c7e955a640f0094d19.
* fix metadata check
* remove comment
* removed unused cv
* cast back to int
* refactor Datapath for readability
* Revert refactor
This reverts commit f789bad473c953eebabefe7eb6aa891e5b8a8f13.
* fix comment
* merge fixes
* refactor _shutdown
* address reviews
* log errors in both cases
* add comments
* address reviews
* move reconnect test to medium
* Always propogate error to callbacks
* readability
* formatting
* Faster cleanup on uncaught dataservicer errors
* delete tmp file
* offset commit
* rrefactor
* propagate data servicer error message
* Stricter handling/propagation of errors
* remove tmp file
* better docs
* forward reconnecting metadata
* add annotation
* fix invalidate + add test
* fix docstrings and types
* disable retries and caching if reconnect grace period is set to 0
* update comments
* address review, increase ack batch size and skip ack's if reconnect isn't enabled
* Don't terminate data stream on missing reconnecting metadata
2021-09-18 01:11:00 +03:00
Yi Cheng
cf64ab5b90
Revert "[core] Async submitting actor registerring ( #18009 )" ( #18719 )
...
This reverts commit 8ce01ea2cc.
2021-09-17 13:34:12 -07:00
Yi Cheng
8ce01ea2cc
[core] Async submitting actor registerring ( #18009 )
2021-09-17 10:03:35 -07:00
architkulkarni
a9cce8a34b
[serve] Add basic calculate_desired_num_replicas function for autoscaling ( #18658 )
2021-09-17 00:18:51 -07:00
Simon Mo
3029812b8b
[Serve] Autoscaling metric store take 2 ( #18683 )
2021-09-16 22:28:13 -07:00
DK.Pino
12b3b1f723
[core] Log resource name not id ( #18598 )
2021-09-16 16:28:09 -07:00
Simon Mo
317a34c523
[Serve] Use BackendConfig Protobuf ( #17835 )
2021-09-16 11:08:23 -07:00
Guyang Song
187e4a86ca
[C++ API] expose C++ task failure event ( #18596 )
2021-09-16 19:20:16 +08:00
Sasha Sobol
2f0e22aa4e
prioritize non-gpu nodes when scheduling CPU-only requests ( #18615 )
2021-09-16 09:57:24 +01:00
Stephanie Wang
be7cb70c30
[core] Fix ref counting during actor construction ( #18646 )
...
* test
* fix
* cpp
* skip windows
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-09-15 22:16:53 -07:00
liuyang-my
ed04ab7140
Define protobuf for RequestMetadata and HTTPRequestWrapper ( #18203 )
2021-09-15 14:39:27 -07:00
Edward Oakes
7d0a2b39e3
[runtime_env] Remove dynamically imported setup_hook ( #18601 )
2021-09-15 10:19:55 -05:00
Eric Liang
15512c27c2
Revert "Revert "Route core worker ERROR/FATAL logs to driver logs (#1… ( #18604 )
2021-09-14 13:32:07 -07:00