Commit graph

297 commits

Author SHA1 Message Date
SangBin Cho
99b5932d06
Add a simple node failure integration test + clean up spammy logs upon node failures (#19695)
* .

* Done

* clean up

* lint

* fix a bug

* lint

* fix issue

* Remove no-op from StartRayLog

* Addressed code review.
2021-10-29 18:42:35 -04:00
Lixin Wei
56301e34b2
[Refactor] Remove ServiceBased Abstraction (#19694)
## Why are these changes needed?

Prior to this PR, we had:
```cpp
class XxxAccessor {};
class ServiceBasedXxxAccessor : public XxxAccessor {};

class GcsClient {};
class ServiceBasedGcsClient : public GcsClient {};
```

However, XxxAccessor has only one implementation: ServiceBasedXxxAccessor. And GcsClient has only one implementation: ServiceBasedGcsClient.

I think this abstraction is unnecessary and makes development harder (I have to modify two files every time).

This PR removes all ServiceBasedXxx and moves its implementations to the base class.

Now we only have:
```cpp
class XxxAccessor {};
class GcsClient {};
```
2021-10-29 10:16:14 -07:00
mwtian
b238297bfb
[Core][Pubsub] Support subscribing to GCS via Ray pubsub (#19687)
This PR adds more infrastructure for subscribing to GCS via ray::pubsub instead of Redis.

The most important logic added is:
- GCS subscriber RPC interface in src/ray/protobuf/gcs_service.proto
- GCS subscriber handler in src/ray/gcs/gcs_server/pubsub_handler.{h,cc}
- GCS wrapper for ray::pubsub subscriber in src/ray/gcs/pubsub/gcs_pub_sub.{h,cc}

Other files are modified to add boilerplate and plumbing, remove dead code, and clean up.

This PR can also be reviewed commit by commit: 418f065 and 3279430 are cleanups. 028939c is a pure refactoring of how GCS clients subscribe to GCS updates and should not change behavior yet, similar to #19600 ([Pubsub] Wrap Redis-based publisher in GCS to allow incrementally switching to the GCS-based publisher). 286161f parameterizes gcs_server_test to test GCS pubsub. The remaining commits add new logic.

All new logic is behind the gcs_grpc_based_pubsub flag, so this PR should not affect Ray's default behavior.

The added subscriber logic was tested by enabling gcs_grpc_based_pubsub in service_based_gcs_client_test.cc and adding basic handling logic for TaskLease. Since TaskLease pubsub will be removed, the change will not be checked in.

The next step is to support SubscribeAll entities for a channel in ray::pubsub, and to test migrating more channels.
2021-10-28 01:18:54 +08:00
Yi Cheng
48fb86a978
[core] Fix the spilling back failure in case of node missing (#19564)
## Why are these changes needed?
When Ray spills back, it checks through the GCS whether the node exists, so there is a race condition that sometimes crashes the raylet.

This PR filters out nodes that are not available when selecting the node.
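
The filtering idea can be sketched roughly as follows. This is a minimal illustration and not the raylet's actual code: `NodeInfo`, its `alive` and `available_cpus` fields, and `SelectSpillbackNode` are hypothetical names, and the "most available CPUs" policy is only an assumed placeholder for whatever policy the scheduler really applies.

```cpp
#include <string>
#include <vector>

// Hypothetical view of a candidate node; not a Ray type.
struct NodeInfo {
  std::string node_id;
  bool alive;             // false once the node is known to be unavailable
  double available_cpus;  // illustrative resource figure
};

// Returns the available node with the most free CPUs, or nullptr when no
// available node remains. Unavailable nodes are skipped during selection,
// so the caller never has to look up a node that may already be gone.
// The returned pointer refers to an element of `candidates`.
const NodeInfo *SelectSpillbackNode(const std::vector<NodeInfo> &candidates) {
  const NodeInfo *best = nullptr;
  for (const auto &node : candidates) {
    if (!node.alive) {
      continue;  // filter out nodes that are not available
    }
    if (best == nullptr || node.available_cpus > best->available_cpus) {
      best = &node;
    }
  }
  return best;
}
```

Returning `nullptr` when every candidate is filtered out lets the caller handle the "no node" case explicitly instead of failing on a missing node.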

## Related issue number
#19438
2021-10-22 11:22:07 -07:00
mwtian
530f2d7c5e
[Pubsub] Wrap Redis-based publisher in GCS to allow incrementally switching to the GCS-based publisher (#19600)
## Why are these changes needed?
The most significant change of the PR is the `GcsPublisher` wrapper added to `src/ray/gcs/pubsub/gcs_pub_sub.h`. It forwards publishing to the underlying `GcsPubSub` (Redis-based) or `pubsub::Publisher` (GCS-based) depending on the migration status, so it allows incremental migration by channel.
- Since it was decided that we want to use typed IDs and messages for GCS-based publishing, each member function of `GcsPublisher` accepts a typed message.

Most of the modified files are from migrating publishing logic in GCS to use `GcsPublisher` instead of `GcsPubSub`.

Later on, `GcsPublisher` member functions will be migrated to use GCS-based publishing.
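
The forwarding pattern described above might look roughly like the sketch below. It is an assumption-laden illustration, not the code in `gcs_pub_sub.h`: `PublisherWrapper`, `PublishFn`, and `MarkMigrated` are hypothetical names, the two backends are modeled as plain callbacks, and messages are plain strings rather than the typed messages the real `GcsPublisher` accepts.

```cpp
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-in for a publishing backend; not a Ray type.
using PublishFn =
    std::function<void(const std::string &channel, const std::string &payload)>;

// Forwards each publish call to the legacy backend or the new backend,
// depending on whether the channel has been marked as migrated.
class PublisherWrapper {
 public:
  PublisherWrapper(PublishFn legacy, PublishFn migrated)
      : legacy_(std::move(legacy)), migrated_(std::move(migrated)) {}

  // Marks a channel as migrated; later publishes on it use the new backend.
  void MarkMigrated(const std::string &channel) {
    migrated_channels_.insert(channel);
  }

  void Publish(const std::string &channel, const std::string &payload) {
    if (migrated_channels_.count(channel) > 0) {
      migrated_(channel, payload);
    } else {
      legacy_(channel, payload);
    }
  }

 private:
  PublishFn legacy_;
  PublishFn migrated_;
  std::set<std::string> migrated_channels_;
};
```

Because callers only see `Publish`, channels can be switched one at a time without touching the call sites, which is the incremental migration the wrapper is meant to allow.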

This change should make no functional difference. If this looks OK, a similar change will be made for subscribers in the GCS client.
2021-10-22 10:52:36 -07:00
Yi Cheng
a3dc07b1ee
[core] Fix some legacy issues (#19392)
## Why are these changes needed?
There are some issues left from previous PRs.

- Put the gcs_actor_scheduler_mock_test back
- Add comment for named actor creation behavior
- Fix the comment for some flags
2021-10-15 18:06:01 -07:00
Matti Picus
9ca34c7192
add dependencies to BUILD.bazel and update windows bazel to 4.2.1 (#19132)
* add dependencies to BUILD.bazel and update windows bazel to 4.2.1

* fixes from review
2021-10-11 10:25:19 -07:00
Yi Cheng
056c3af699
[core] Update placement group retry implementation (#18842)
* exp backoff

* up

* format

* up

* up

* up

* up

* up

* format

* fix

* up

* format

* adjust ordering

* up

* Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)"

This reverts commit 2e99fb215f.

* up

* update

* format

* up

* format

* fix

* Revert "Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)""

This reverts commit 93425fdb986059e53699623a0fc8590c062e139b.

* up

* format

* fix lint

* up

* up

* up

* up

* check

* add test1

* format

* up

* add test

* up

* up

* up

* fix

* up

* up

* up

* add test

* format

* up

* up

* fix lint

* format

* fix

* format

* fix

* up
2021-10-04 21:31:56 -07:00
Yi Cheng
16cf719aff
[core] hot fix of build failure (#18963)
2021-09-28 20:29:28 -07:00
Yi Cheng
e3dd1e3751
Revert "Revert "[test] add unit test for PR #17634 (#18585)" (#18830)" (#18871)
* Revert "Revert "[test] add unit test for PR #17634 (#18585)" (#18830)"

This reverts commit 8dd3057644.

* up
2021-09-28 05:53:52 -07:00
Jiajun Yao
e79f271b05
Fix nil redis array element (#18813)
2021-09-24 20:11:43 -07:00
Yi Cheng
8dd3057644
Revert "[test] add unit test for PR #17634 (#18585)" (#18830)
This reverts commit 73c3cff18b.
2021-09-22 16:51:02 -07:00
Yi Cheng
73c3cff18b
[test] add unit test for PR #17634 (#18585)
2021-09-22 14:39:30 -07:00
Yi Cheng
07babd807c
Revert "Revert "[core] Async submitting actor registerring (#18009)" (#18719)" (#18722)
2021-09-20 19:17:00 -07:00
Yi Cheng
cf64ab5b90
Revert "[core] Async submitting actor registerring (#18009)" (#18719)
This reverts commit 8ce01ea2cc.
2021-09-17 13:34:12 -07:00
Yi Cheng
8ce01ea2cc
[core] Async submitting actor registerring (#18009)
2021-09-17 10:03:35 -07:00
Simon Mo
317a34c523
[Serve] Use BackendConfig Protobuf (#17835)
2021-09-16 11:08:23 -07:00
Stephanie Wang
be7cb70c30
[core] Fix ref counting during actor construction (#18646)
* test

* fix

* cpp

* skip windows

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-09-15 22:16:53 -07:00
Edward Oakes
7736cdd91d
[dashboard] Rename "new_dashboard" -> "dashboard" (#18214)
2021-09-15 11:17:15 -05:00
Chong-Li
d314d0c10e
[GCS] Fix the Windows build of GCS actor scheduling (#18012)
2021-09-10 17:17:25 -07:00
mwtian
26fd10c9e8
[CI] Add clang-tidy to lint (#18124)
* clang-tidy

* fix

* fix script

* test clang compiler

* fix clang-tidy rules

* Fix windows and other issues.

* Fix

* Improve information when running check-git-clang-tidy-output.sh on different OS
2021-09-09 00:41:53 -07:00
Lixin Wei
df803cee98
Revert "Revert "[Core] Fix ServerCall Leaking (#17863)" (#18410)" (#18424)
2021-09-08 19:55:06 -07:00
Yi Cheng
7126d01c91
[core] upgrade gtest (#18288)
* up

* up

* format

* up

* flaky fix

* format

* up

* up

* format

* add debug

* up

* up

* up

* up

* up

* format

* fix

* format

* up

* up

* format
2021-09-08 11:15:34 -07:00
Lingxuan Zuo
46b941b702
[Streaming] Support streaming metric reporter (#17981)
* Streaming support metric reporter

* fix lint

* fix bazel format lint

* fix lint

* metric deps lint

* lint

* and comments for runtime reporter

* unordered_map instead

* comments

* fix visibility flag

* deps local .so target

* make stats public visibility

* stats lib in public

* add antgroup team tag
2021-09-08 14:36:00 +08:00
Chen Shen
d65d291579
Revert "[Core] Fix ServerCall Leaking (#17863)" (#18410)
This reverts commit 4f6b50dc46.
2021-09-07 15:47:58 -07:00
Lixin Wei
4f6b50dc46
[Core] Fix ServerCall Leaking (#17863)
* fix backpressure bug

* update comments

* stash

* add test

* add basic tests

* add fixture

* stash

* fix

* draft

* fix

* test added

* fixed

* fixed

* lint

* Update src/ray/rpc/test/grpc_server_test.cc

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* add copyright

* move test service to separate file

* add ClientCallManager timeout tests

* fix

* lint

* lint

* lint

* test windows CI

* fix

* lint

* lint

* retry windows

* retry windows

* fix mac

* lint

* lint

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2021-09-07 12:15:43 -07:00
wanxing
60f84fa051
Abstract plasma store get request queue (#18064)
* begin

* build

* add test

* add first test

* add test

* fix build

* lint bazel

* fix build

* fix build

* fix crash

* fix some comment

* revert shared_ptr ObjectLifecycleManager

* fix RemoveGetRequest lost

* no defer

* fix lots of comments

* fix build

* fix data race

* fix comments

* Revert "fix data race"

This reverts commit 8f58e3d70b73af864566e056211ff1b70cab870c.

* refine

* fix mac build

* fix unit test

* fix unit test
2021-09-02 14:16:50 -07:00
Yi Cheng
d470e679df
[core] Add some mock headers for ray core (#18265)
* up

* up

* up

* format

* up

* up

* format
2021-09-01 13:04:35 -07:00
SangBin Cho
2ee1b90c17
[Core] Batch obod location updates (#18016)
* Batch impl

* done

* Remove a client pool

* in progress

* Added unit tests.

* Handle owner failure case.

* Fix unit tests

* Addressed code review.
2021-08-30 11:04:08 -07:00
Eric Liang
1adce7da4e
Revert "Auto discover dashboard agent port (#17855)" (#18217)
This reverts commit 53ddb551d5.
2021-08-30 10:46:37 -07:00
fyrestone
53ddb551d5
Auto discover dashboard agent port (#17855)
2021-08-30 12:06:28 +08:00
wanxing
abb46de4dc
[object store refactor 5/n] Add eviction policy tests (#17984)
* add eviction policy tests

* fix object_lifecycle_manager_test build

* make IsObjectExists private
2021-08-24 00:50:28 -07:00
Eric Liang
236b772465
Revert "[GCS] GCS Based Actor Scheduler (#16580)" (#17941)
This reverts commit a9b4545502.
2021-08-19 21:46:52 -07:00
Chong-Li
5e22257cec
[GCS] Fix: GCS Based Actor Scheduler (#17944)
2021-08-18 23:40:35 -07:00
Simon Mo
b573864928
[CI] Add test owners (#17893)
2021-08-18 18:38:31 -07:00
Chen Shen
89d83228f6
[Core][Plasma-store] add stats-collector that eagerly collect stats
2021-08-18 13:47:50 -07:00
Chong-Li
a9b4545502
[GCS] GCS Based Actor Scheduler (#16580)
2021-08-18 13:44:59 -07:00
Guyang Song
8227e24424
[event] event framework integration in raylet, gcs server and core worker (#17671)
2021-08-17 11:21:23 +08:00
Yi Cheng
03a82d733a
Revert "Revert "Export useful metrics"" (#17755)
* Revert "Revert "[Observability] Export useful metrics (#17578)" (#17752)"

This reverts commit 02e79f3fe5.

* Update metric.h

* up

* up

* Update server_call.h

* Update test_metrics_agent.py

* up

* fix comment
2021-08-16 17:05:56 -07:00
Chen Shen
b349c6bc4f
[object store refactor 4/n] object lifecycle manager (#17344)
* lifecycle

* address comments
2021-08-16 09:58:35 -07:00
Eric Liang
ce171f10a1
Remove legacy plasma unlimited and pull manager pinning flag (#17753)
2021-08-11 20:19:12 -07:00
Yi Cheng
02e79f3fe5
Revert "[Observability] Export useful metrics (#17578)" (#17752)
This reverts commit bd4db53df2.
2021-08-11 12:21:50 -07:00
Yi Cheng
bd4db53df2
[Observability] Export useful metrics (#17578)
* up

* up

* up

* up

* up

* up

* up

* up

* up

* up

* up

* up

* up

* checkpoint

* up

* up

* up

* up

* fix

* up

* up

* up

* up

* up

* up

* up

* up

* up

* up

* add comments

* up

* up

* up

* up

* add tests
2021-08-10 17:14:42 -07:00
Chen Shen
4ff35d43b3
[object store refactor 3/n] introduce object_store (#17332)
refactor-allocator

add object_store
2021-08-05 17:36:27 -07:00
SongGuyang
79bec61e12
[event] support WithField option in RAY_EVENT api (#17476)
2021-08-05 20:45:55 +08:00
Chen Shen
1b89fa8624
[object store refactor 2/n] More refactor on PlasmaAllocator, and add unit tests
2021-08-01 22:10:03 -07:00
Chen Shen
96c69f8c77
[object store refactor 1/n] Introduce IAllocator and PlasmaAllocator (#17307)
* initial commit

* address comments
2021-07-30 19:08:20 -07:00
Tao Wang
d98ec7fc4d
Remove libray_redis_module (#17283)
2021-07-25 23:15:29 -07:00
Edward Oakes
f6375cbb7c
[core] Fix bazel test sizes for C++ unit tests (#17272)
2021-07-22 17:38:56 -05:00
Amog Kamsetty
8dfd471823
Revert "Revert "[Dashboard][event] Basic event module (#16985)" (#17068)" (#17107)
This reverts commit c17e171f92.

Co-authored-by: 刘宝 <po.lb@antfin.com>
2021-07-18 12:59:04 +08:00