Stephanie Wang
344f2d9073
[core] Fix race condition in distributed ref counting ( #18584 )
2021-09-14 11:02:59 -07:00
SangBin Cho
51d94ebee0
[Tests] Make nightly test work + Remove work stealing logs ( #18300 )
...
* make tests work
* .
2021-09-14 09:52:58 -07:00
Edward Oakes
7f8cdce67d
Revert "Route core worker ERROR/FATAL logs to driver logs ( #18577 )" ( #18602 )
...
This reverts commit 3e0ae38e11.
2021-09-14 10:41:10 -05:00
Eric Liang
3e0ae38e11
Route core worker ERROR/FATAL logs to driver logs ( #18577 )
2021-09-13 23:07:14 -07:00
Guyang Song
dee12be253
[Event] print event message to general log ( #18376 )
2021-09-14 12:24:49 +08:00
Stephanie Wang
284dee493e
[core][usability] Disambiguate ObjectLostErrors for better understandability ( #18292 )
...
* Define error types, throw error for ObjectReleased
* x
* Disambiguate OBJECT_UNRECONSTRUCTABLE and OBJECT_LOST
* OwnerDiedError
* fix test
* x
* ObjectReconstructionFailed
* ObjectReconstructionFailed
* x
* x
* print owner addr
* str
* doc
* rename
* x
2021-09-13 16:16:17 -07:00
Jiajun Yao
f8ae2b2b62
Don't pass in TaskID to TaskManager::MarkPendingTaskFailed since it can ( #18532 )
...
be got from TaskSpecification
2021-09-13 11:27:42 -07:00
Lingxuan Zuo
a67b9ee8d7
Remove custom resource from streaming ( #18490 )
2021-09-12 12:20:59 -07:00
Jiajun Yao
ae10a80d5e
Fix async actor worker process leak after calling ray.actor.exit_actor() ( #18526 )
2021-09-12 11:09:12 -07:00
Qing Wang
371f03fa48
Remove dynamic resource from client side. ( #18514 )
2021-09-11 10:39:59 -07:00
Chong-Li
d314d0c10e
[GCS] Fix the Windows build of GCS actor scheduling ( #18012 )
2021-09-10 17:17:25 -07:00
Lixin Wei
7e37d6e348
[Core] Add gRPC Server Backpressure Tests ( #18500 )
2021-09-10 17:17:09 -07:00
Edward Oakes
2fcfea10b3
[runtime_env] Move URI deletion logic to the agent, remove util worker code ( #18471 )
2021-09-10 00:13:32 -07:00
SangBin Cho
7b2ed4c1f8
[Placement group] Placement group scheduling hangs due to creation/removal race condition ( #18419 )
2021-09-09 20:39:01 -07:00
Chen Shen
5f57079041
use clang for C++ debug testing ( #18343 )
2021-09-09 15:48:36 -07:00
Lixin Wei
df803cee98
Revert "Revert "[Core] Fix ServerCall Leaking ( #17863 )" ( #18410 )" ( #18424 )
2021-09-08 19:55:06 -07:00
Edward Oakes
f0555f88d6
[runtime_env] Move worker process startup logic to context ( #18341 )
2021-09-08 17:08:27 -05:00
Lixin Wei
052ed115e7
[Core] Make It Easier to Grep Debug State Dump ( #18382 )
...
* add keyword to debug dump
* fix
2021-09-08 12:03:54 -07:00
Yi Cheng
7126d01c91
[core] upgrade gtest ( #18288 )
...
* up
* up
* format
* up
* flaky fix
* format
* up
* up
* format
* add debug
* up
* up
* up
* up
* up
* format
* fix
* format
* up
* up
* format
2021-09-08 11:15:34 -07:00
Kai Fricke
dac3a8bc8e
[setup] Upstream conda patches ( #17575 )
...
Co-authored-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>
2021-09-08 10:37:17 +01:00
Lingxuan Zuo
46b941b702
[Streaming] Support streaming metric reporter ( #17981 )
...
* Streaming support metric reporter
* fix lint
* fix bazel format lint
* fix lint
* metric deps lint
* lint
* and comments for runtime reporter
* unordered_map instead
* comments
* fix visibility flag
* deps local .so target
* make stats public visibility
* stats lib in public
* add antgroup team tag
2021-09-08 14:36:00 +08:00
Chen Shen
df9c6aa863
[plasma] Check if the get request is removed ( #18401 )
2021-09-07 21:01:08 -07:00
Chen Shen
d65d291579
Revert "[Core] Fix ServerCall Leaking ( #17863 )" ( #18410 )
...
This reverts commit 4f6b50dc46.
2021-09-07 15:47:58 -07:00
Lixin Wei
4f6b50dc46
[Core] Fix ServerCall Leaking ( #17863 )
...
* fix backpressure bug
* update comments
* stash
* add test
* add basic tests
* add fixture
* stash
* fix
* draft
* fix
* test added
* fixed
* fixed
* lint
* Update src/ray/rpc/test/grpc_server_test.cc
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
* add copyright
* move test service to separate file
* add ClientCallManager timeout tests
* fix
* lint
* lint
* lint
* test windows CI
* fix
* lint
* lint
* retry windows
* retry windows
* fix mac
* lint
* lint
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2021-09-07 12:15:43 -07:00
Guyang Song
5a89b47f56
[Event] support set event level ( #18275 )
...
Co-authored-by: Hao Chen <chenh1024@gmail.com>
2021-09-06 16:41:49 +08:00
Chen Shen
7c9d261dce
[Core][plasma] consolidate stats calculation for plasma store
2021-09-05 22:24:21 -07:00
Chen Shen
cf4fb4edb3
[Core][plasma] fix the data race issue ( #18312 )
2021-09-03 18:51:27 -07:00
Simon Mo
e61160d514
[Dashboard] Move gcs health check to a separate thread to avoid crashing due to excessive CPU usage. ( #18236 )
2021-09-03 14:23:56 -07:00
Jiajun Yao
e049d52d29
Retry application-level error by default for datasets ( #18296 )
2021-09-03 14:21:38 -07:00
wanxing
60f84fa051
Abstract plasma store get request queue ( #18064 )
...
* begin
* build
* add test
* add first test
* add test
* fix build
* lint bazel
* fix build
* fix build
* fix crash
* fix some comment
* revert shared_ptr ObjectLifecycleManager
* fix RemoveGetRequest lost
* no defer
* fix lots of comments
* fix build
* fix data race
* fix comments
* Revert "fix data race"
This reverts commit 8f58e3d70b73af864566e056211ff1b70cab870c.
* refine
* fix mac build
* fix unit test
* fix unit test
2021-09-02 14:16:50 -07:00
Stephanie Wang
d43d297d9a
[core] Attach call site to ObjectRefs, print on error ( #17971 )
...
* Attach call site to ObjectRef
* flag
* Fix build
* build
* build
* build
* x
* x
* skip on windows
* lint
2021-09-01 15:29:05 -07:00
Yi Cheng
d470e679df
[core] Add some mock headers for ray core ( #18265 )
...
* up
* up
* up
* format
* up
* up
* format
2021-09-01 13:04:35 -07:00
Jiajun Yao
fbb3ac6a86
Retry application-level errors ( #18176 )
...
* Retry application-level errors
* Retry application-level errors
* Push retry message to the driver
2021-09-01 10:53:06 -07:00
mwtian
be50c13251
[Client] Use a single RPC to fetch ClientObjectRefs passed in a list ( #16944 )
2021-08-31 16:31:13 -07:00
Guyang Song
be772df4dc
[Event] Add some error level events ( #18118 )
...
* add event 'RAY_WORKER_FAILURE' and 'RAY_DRIVER_FAILURE'
* add some events
* move event 'EL_RAY_NODE_REMOVED' to 'RemoveNode()'
2021-08-31 14:15:13 -07:00
SangBin Cho
d240d26525
[Object Spilling] Fix a bug where object url is empty. ( #18193 )
...
* Fix a bug
* Addressed code review.
* Fix a test
2021-08-31 10:10:28 -07:00
Stephanie Wang
8e06db7280
Revert "[Core] revert: revert Unified worker starter ( #18008 )" ( #18228 )
...
This reverts commit b9978dd02b.
2021-08-30 17:28:41 -07:00
SangBin Cho
2ee1b90c17
[Core] Batch obod location updates ( #18016 )
...
* Batch impl
* done
* Remove a client pool
* in progress
* Added unit tests.
* Handle owner failure case.
* Fix unit tests
* Addressed code review.
2021-08-30 11:04:08 -07:00
Eric Liang
1adce7da4e
Revert "Auto discover dashboard agent port ( #17855 )" ( #18217 )
...
This reverts commit 53ddb551d5.
2021-08-30 10:46:37 -07:00
SangBin Cho
0e968c1e82
[Core] Reduce spilling threshold ( #17910 )
...
* Lower the threshold
* ip
* Handle test failure
* lint
* last fix
* .
* Retry
2021-08-30 00:09:35 -07:00
fyrestone
53ddb551d5
Auto discover dashboard agent port ( #17855 )
2021-08-30 12:06:28 +08:00
Stephanie Wang
7bc1ef0dd9
[core] Prestart workers up to available CPU limit ( #18166 )
...
* Prestart workers according to num available CPUs
* lint
* Prestart min(available CPU, backlog)
* Fix test, adjust policy
* debug
* retry
* lint
2021-08-29 14:11:53 -07:00
mwtian
26679d62c5
[Core][ObjectRef] Change default to not record call stack during ObjectRef creation ( #18078 )
2021-08-27 15:45:34 -07:00
SangBin Cho
a25cc47399
[Core] Set keepalive only at gcs ( #18086 )
2021-08-27 01:26:51 -07:00
Edward Oakes
5c4c735119
[runtime_env] Make log message when deleting runtime_env INFO instead of ERROR ( #18083 )
2021-08-26 15:21:59 -05:00
SangBin Cho
405418f8e8
[Object Spilling] Unpin before updating URL ( #17994 )
...
* Unpin before updating URL
* Remove unnecessary logs.
* update compiling issue
* Check the consistent local state instead of stale information from obod.
* Fix the test
* Addressed code review.
2021-08-26 10:23:53 -07:00
Chen Shen
a29b157e2e
[core] better error message for lost objects ( #18068 )
2021-08-26 00:03:29 -07:00
Tao Wang
15a7514cf6
[Core] Some request counts are missing in debug info ( #18069 )
2021-08-25 14:02:03 -07:00
Guyang Song
16502cc438
[Event] support multi-thread context copy ( #17919 )
2021-08-25 14:03:20 +08:00
Tao Wang
0b5f5890f7
[Named Actor] Throw RayException when getting named actor timed out ( #17998 )
...
* [Named Actor]throw RayException when getting named actor timed out
* lint
* correct the message
* lint
* nice catch
2021-08-25 13:50:53 +08:00