2363 commits

Author SHA1 Message Date
Stephanie Wang
344f2d9073
[core] Fix race condition in distributed ref counting (#18584) 2021-09-14 11:02:59 -07:00
SangBin Cho
51d94ebee0
[Tests] Make nightly test work + Remove work stealing logs (#18300)
* make tests work

* .
2021-09-14 09:52:58 -07:00
Edward Oakes
7f8cdce67d
Revert "Route core worker ERROR/FATAL logs to driver logs (#18577)" (#18602)
This reverts commit 3e0ae38e11.
2021-09-14 10:41:10 -05:00
Eric Liang
3e0ae38e11
Route core worker ERROR/FATAL logs to driver logs (#18577) 2021-09-13 23:07:14 -07:00
Guyang Song
dee12be253
[Event] print event message to general log (#18376) 2021-09-14 12:24:49 +08:00
Stephanie Wang
284dee493e
[core][usability] Disambiguate ObjectLostErrors for better understandability (#18292)
* Define error types, throw error for ObjectReleased

* x

* Disambiguate OBJECT_UNRECONSTRUCTABLE and OBJECT_LOST

* OwnerDiedError

* fix test

* x

* ObjectReconstructionFailed

* ObjectReconstructionFailed

* x

* x

* print owner addr

* str

* doc

* rename

* x
2021-09-13 16:16:17 -07:00
Jiajun Yao
f8ae2b2b62
Don't pass in TaskID to TaskManager::MarkPendingTaskFailed since it can be got from TaskSpecification (#18532)
2021-09-13 11:27:42 -07:00
Lingxuan Zuo
a67b9ee8d7
Remove custom resource from streaming (#18490) 2021-09-12 12:20:59 -07:00
Jiajun Yao
ae10a80d5e
Fix async actor worker process leak after calling ray.actor.exit_actor() (#18526) 2021-09-12 11:09:12 -07:00
Qing Wang
371f03fa48
Remove dynamic resource from client side. (#18514) 2021-09-11 10:39:59 -07:00
Chong-Li
d314d0c10e
[GCS] Fix the Windows build of GCS actor scheduling (#18012) 2021-09-10 17:17:25 -07:00
Lixin Wei
7e37d6e348
[Core] Add gRPC Server Backpressure Tests (#18500) 2021-09-10 17:17:09 -07:00
Edward Oakes
2fcfea10b3
[runtime_env] Move URI deletion logic to the agent, remove util worker code (#18471) 2021-09-10 00:13:32 -07:00
SangBin Cho
7b2ed4c1f8
[Placement group] Placement group scheduling hangs due to creation/removal race condition (#18419) 2021-09-09 20:39:01 -07:00
Chen Shen
5f57079041
use clang for C++ debug testing (#18343) 2021-09-09 15:48:36 -07:00
Lixin Wei
df803cee98
Revert "Revert "[Core] Fix ServerCall Leaking (#17863)" (#18410)" (#18424) 2021-09-08 19:55:06 -07:00
Edward Oakes
f0555f88d6
[runtime_env] Move worker process startup logic to context (#18341) 2021-09-08 17:08:27 -05:00
Lixin Wei
052ed115e7
[Core] Make It Easier to Grep Debug State Dump (#18382)
* add keyword to debug dump

* fix
2021-09-08 12:03:54 -07:00
Yi Cheng
7126d01c91
[core] upgrade gtest (#18288)
* up

* up

* format

* up

* flaky fix

* format

* up

* up

* format

* add debug

* up

* up

* up

* up

* up

* format

* fix

* format

* up

* up

* format
2021-09-08 11:15:34 -07:00
Kai Fricke
dac3a8bc8e
[setup] Upstream conda patches (#17575)
Co-authored-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>
2021-09-08 10:37:17 +01:00
Lingxuan Zuo
46b941b702
[Streaming] Support streaming metric reporter (#17981)
* Streaming support metric reporter

* fix lint

* fix bazel format lint

* fix lint

* metric deps lint

* lint

* and comments for runtime reporter

* unordered_map instead

* comments

* fix visibility flag

* deps local .so target

* make stats public visibility

* stats lib in public

* add antgroup team tag
2021-09-08 14:36:00 +08:00
Chen Shen
df9c6aa863
[plasma] Check if the get request is removed (#18401) 2021-09-07 21:01:08 -07:00
Chen Shen
d65d291579
Revert "[Core] Fix ServerCall Leaking (#17863)" (#18410)
This reverts commit 4f6b50dc46.
2021-09-07 15:47:58 -07:00
Lixin Wei
4f6b50dc46
[Core] Fix ServerCall Leaking (#17863)
* fix backpressure bug

* update comments

* stash

* add test

* add basic tests

* add fixture

* stash

* fix

* draft

* fix

* test added

* fixed

* fixed

* lint

* Update src/ray/rpc/test/grpc_server_test.cc

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* add copyright

* move test service to separate file

* add ClientCallManager timeout tests

* fix

* lint

* lint

* lint

* test windows CI

* fix

* lint

* lint

* retry windows

* retry windows

* fix mac

* lint

* lint

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2021-09-07 12:15:43 -07:00
Guyang Song
5a89b47f56
[Event] support set event level (#18275)
Co-authored-by: Hao Chen <chenh1024@gmail.com>
2021-09-06 16:41:49 +08:00
Chen Shen
7c9d261dce
[Core][plasma] consolidate stats calculation for plasma store 2021-09-05 22:24:21 -07:00
Chen Shen
cf4fb4edb3
[Core][plasma] fix the data race issue (#18312) 2021-09-03 18:51:27 -07:00
Simon Mo
e61160d514
[Dashboard] Move gcs health check to a separate thread to avoid crashing due to excessive CPU usage. (#18236) 2021-09-03 14:23:56 -07:00
Jiajun Yao
e049d52d29
Retry application-level error by default for datasets (#18296) 2021-09-03 14:21:38 -07:00
wanxing
60f84fa051
Abstract plasma store get request queue (#18064)
* begin

* build

* add test

* add first test

* add test

* fix build

* lint bazel

* fix build

* fix build

* fix crash

* fix some comment

* revert shared_ptr ObjectLifecycleManager

* fix RemoveGetRequest lost

* no defer

* fix lots of comments

* fix build

* fix data race

* fix comments

* Revert "fix data race"

This reverts commit 8f58e3d70b73af864566e056211ff1b70cab870c.

* refine

* fix mac build

* fix unit test

* fix unit test
2021-09-02 14:16:50 -07:00
Stephanie Wang
d43d297d9a
[core] Attach call site to ObjectRefs, print on error (#17971)
* Attach call site to ObjectRef

* flag

* Fix build

* build

* build

* build

* x

* x

* skip on windows

* lint
2021-09-01 15:29:05 -07:00
Yi Cheng
d470e679df
[core] Add some mock headers for ray core (#18265)
* up

* up

* up

* format

* up

* up

* format
2021-09-01 13:04:35 -07:00
Jiajun Yao
fbb3ac6a86
Retry application-level errors (#18176)
* Retry application-level errors

* Retry application-level errors

* Push retry message to the driver
2021-09-01 10:53:06 -07:00
mwtian
be50c13251
[Client] Use a single RPC to fetch ClientObjectRefs passed in a list (#16944) 2021-08-31 16:31:13 -07:00
Guyang Song
be772df4dc
[Event] Add some error level events (#18118)
* add event 'RAY_WORKER_FAILURE' and 'RAY_DRIVER_FAILURE'

* add some events

* move event 'EL_RAY_NODE_REMOVED' to 'RemoveNode()'
2021-08-31 14:15:13 -07:00
SangBin Cho
d240d26525
[Object Spilling] Fix a bug where object url is empty. (#18193)
* Fix a bug

* Addressed code review.

* Fix a test
2021-08-31 10:10:28 -07:00
Stephanie Wang
8e06db7280
Revert "[Core] revert: revert Unified worker starter (#18008)" (#18228)
This reverts commit b9978dd02b.
2021-08-30 17:28:41 -07:00
SangBin Cho
2ee1b90c17
[Core] Batch obod location updates (#18016)
* Batch impl

* done

* Remove a client pool

* in progress

* Added unit tests.

* Handle owner failure case.

* Fix unit tests

* Addressed code review.
2021-08-30 11:04:08 -07:00
Eric Liang
1adce7da4e
Revert "Auto discover dashboard agent port (#17855)" (#18217)
This reverts commit 53ddb551d5.
2021-08-30 10:46:37 -07:00
SangBin Cho
0e968c1e82
[Core] Reduce spilling threshold (#17910)
* Lower the threshold

* ip

* Handle test failure

* lint

* last fix

* .

* Retry
2021-08-30 00:09:35 -07:00
fyrestone
53ddb551d5
Auto discover dashboard agent port (#17855) 2021-08-30 12:06:28 +08:00
Stephanie Wang
7bc1ef0dd9
[core] Prestart workers up to available CPU limit (#18166)
* Prestart workers according to num available CPUs

* lint

* Prestart min(available CPU, backlog)

* Fix test, adjust policy

* debug

* retry

* lint
2021-08-29 14:11:53 -07:00
mwtian
26679d62c5
[Core][ObjectRef] Change default to not record call stack during ObjectRef creation (#18078) 2021-08-27 15:45:34 -07:00
SangBin Cho
a25cc47399
[Core] Set keepalive only at gcs (#18086) 2021-08-27 01:26:51 -07:00
Edward Oakes
5c4c735119
[runtime_env] Make log message when deleting runtime_env INFO instead of ERROR (#18083) 2021-08-26 15:21:59 -05:00
SangBin Cho
405418f8e8
[Object Spilling] Unpin before updating URL (#17994)
* Unpin before updating URL

* Remove unnecessary logs.

* update compiling issue

* Check the consistent local state instead of stale information from obod.

* Fix the test

* Addressed code review.
2021-08-26 10:23:53 -07:00
Chen Shen
a29b157e2e
[core] better error message for lost objects (#18068) 2021-08-26 00:03:29 -07:00
Tao Wang
15a7514cf6
[Core] Some request counts are missing in debug info (#18069) 2021-08-25 14:02:03 -07:00
Guyang Song
16502cc438
[Event] support multi-thread context copy (#17919) 2021-08-25 14:03:20 +08:00
Tao Wang
0b5f5890f7
[Named Actor] Throw RayException when getting named actor timed out (#17998)
* [Named Actor]throw RayException when getting named actor timed out

* lint

* correct the message

* lint

* nice catch
2021-08-25 13:50:53 +08:00