307 commits

Author SHA1 Message Date
Yi Cheng
ea1d081aac
[core] Simple chaos testing for asio (#19970)
Many gRPC-related edge cases in Ray are currently untested. This PR adds a simple way for developers to delay gRPC requests. Because delays can be configured per gRPC method, it can be used for both manual testing and e2e tests.

To use this feature, set the environment variable `RAY_TESTING_ASIO_DELAY_US="method1=10:20,method2=20:30,*=200:200"`.

This delays `method1` by 10-20us and `method2` by 20-30us. All other methods are delayed by 200us.
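
The spec format above can be illustrated with a small parser. This is only a sketch of how such a string might be interpreted, not Ray's actual implementation: `DelayRange`, `DelayTable`, `ParseDelaySpec`, `LookupDelay`, and `DemoDelaySpec` are hypothetical names, and silently skipping malformed entries is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical types: a {min, max} delay in microseconds, keyed by method name.
using DelayRange = std::pair<int64_t, int64_t>;
using DelayTable = std::unordered_map<std::string, DelayRange>;

// Parses a spec such as "method1=10:20,method2=20:30,*=200:200".
// Assumption of this sketch: malformed entries are skipped silently.
DelayTable ParseDelaySpec(const std::string &spec) {
  DelayTable table;
  std::istringstream entries(spec);
  std::string entry;
  while (std::getline(entries, entry, ',')) {
    const auto eq = entry.find('=');
    if (eq == std::string::npos || eq == 0) continue;
    const auto colon = entry.find(':', eq + 1);
    if (colon == std::string::npos) continue;
    try {
      const int64_t min_us = std::stoll(entry.substr(eq + 1, colon - eq - 1));
      const int64_t max_us = std::stoll(entry.substr(colon + 1));
      if (min_us < 0 || min_us > max_us) continue;
      table[entry.substr(0, eq)] = DelayRange(min_us, max_us);
    } catch (const std::exception &) {
      continue;  // Non-numeric bound: skip this entry.
    }
  }
  return table;
}

// Returns the range for `method`, falling back to the "*" wildcard entry,
// and to no delay at all if neither is present.
DelayRange LookupDelay(const DelayTable &table, const std::string &method) {
  auto it = table.find(method);
  if (it == table.end()) it = table.find("*");
  if (it == table.end()) return DelayRange(0, 0);
  return it->second;
}

// Usage example with the spec from the commit message.
void DemoDelaySpec() {
  const DelayTable table =
      ParseDelaySpec("method1=10:20,method2=20:30,*=200:200");
  assert(LookupDelay(table, "method1") == DelayRange(10, 20));
  assert(LookupDelay(table, "method2") == DelayRange(20, 30));
  assert(LookupDelay(table, "other") == DelayRange(200, 200));
}
```

A caller would then pick a delay uniformly from the returned range before dispatching the handler; that step is omitted here.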
2021-12-07 14:47:07 -08:00
Qing Wang
116bda8f05
[Core] Remove duplicated implementations of concurrency group executor. (#20467)
## Why are these changes needed?
ThreadPoolManager and FiberStateManager have the same functionality and logic. This PR removes the duplicated implementations.

This PR adds a ConcurrencyGroupExecutor class template that holds the shared logic. `ConcurrencyGroupExecutor<FiberState>` is used as FiberStateManager, and `ConcurrencyGroupExecutor<BoundedExecutor>` is used as ThreadPoolManager.
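
A minimal sketch of the deduplication idea, assuming the shared logic is "one executor per named concurrency group plus a default". The `FiberState` and `BoundedExecutor` structs here are simplified stand-ins that only record a capacity, and the constructor and `GetExecutor` signatures are hypothetical rather than taken from the Ray source.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Stand-ins for the two executor types. The real classes run tasks on fibers
// or on a bounded thread pool; these only record their capacity.
struct FiberState {
  explicit FiberState(int capacity) : max_concurrency(capacity) {}
  int max_concurrency;
};
struct BoundedExecutor {
  explicit BoundedExecutor(int capacity) : max_concurrency(capacity) {}
  int max_concurrency;
};

// One template replaces two near-identical manager classes: it owns one
// executor per named concurrency group plus a default executor.
template <typename ExecutorType>
class ConcurrencyGroupExecutor {
 public:
  ConcurrencyGroupExecutor(const std::map<std::string, int> &groups,
                           int default_max_concurrency)
      : default_executor_(
            std::make_shared<ExecutorType>(default_max_concurrency)) {
    for (const auto &group : groups) {
      executors_[group.first] = std::make_shared<ExecutorType>(group.second);
    }
  }

  // Returns the executor for `group`, or the default one if it is unknown.
  std::shared_ptr<ExecutorType> GetExecutor(const std::string &group) const {
    auto it = executors_.find(group);
    return it == executors_.end() ? default_executor_ : it->second;
  }

 private:
  std::map<std::string, std::shared_ptr<ExecutorType>> executors_;
  std::shared_ptr<ExecutorType> default_executor_;
};

// The two former classes become aliases of the same template.
using FiberStateManager = ConcurrencyGroupExecutor<FiberState>;
using ThreadPoolManager = ConcurrencyGroupExecutor<BoundedExecutor>;

// Usage example with hypothetical group names.
void DemoConcurrencyGroups() {
  const std::map<std::string, int> groups = {{"io", 2}, {"compute", 4}};
  FiberStateManager fibers(groups, 1);
  ThreadPoolManager pools(groups, 1);
  assert(fibers.GetExecutor("io")->max_concurrency == 2);
  assert(pools.GetExecutor("compute")->max_concurrency == 4);
  assert(pools.GetExecutor("unknown")->max_concurrency == 1);
}
```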
2021-11-27 12:57:40 +08:00
Gagandeep Singh
f22a24aca4
Replace time based seed generation with absl::BitGen and absl::Uniform (#20696) 2021-11-24 14:36:35 -08:00
Guyang Song
53630ee03b
Revert "Revert "[runtime env] redefine runtime env to protobuf"" and fix windows compiling (#20692)
- Fix Windows compilation and revert https://github.com/ray-project/ray/pull/20641
- It seems that https://github.com/ray-project/ray/pull/20670 resolves the Windows compilation issue.
2021-11-24 09:01:01 -08:00
Alex Wu
9388d28233
Revert "[runtime env] redefine runtime env to protobuf" (#20641)
Reverts #19511

Breaks Windows compilation.
2021-11-22 13:11:30 -08:00
Guyang Song
ad56b9b432
[runtime env] redefine runtime env to protobuf (#19511) 2021-11-20 16:54:42 +08:00
Chen Shen
f02b53a810
[Core][actor out-of-order execution 3/n] Introducing out-of-order actor submit queue (#20150)
Why are these changes needed?
This is the third PR in the stack that supports out-of-order execution for threaded/async actors. Previous PR: #20149. Next PR: #20160.
At a high level, threaded and async actors already don't guarantee execution order, and the current "sequential" order implementation has caused some confusion and inconvenience. Please refer to #19822 for a detailed discussion.

This PR implements an out-of-order submit queue that supports out-of-order execution. Conceptually it's very simple: it sends each request as soon as its dependencies are resolved.
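
The idea can be sketched as follows. This is an illustration under simplifying assumptions, not the queue added by the PR: the class name, method names, and the use of a plain string as the task payload are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch: requests become sendable the moment their
// dependencies resolve, regardless of their sequence numbers.
class OutOfOrderSubmitQueueSketch {
 public:
  // Registers a request whose dependencies are not resolved yet.
  void Emplace(uint64_t seq_no, std::string task) {
    pending_.emplace(seq_no, std::move(task));
  }

  // Marks the request's dependencies as resolved. It becomes sendable right
  // away, even if requests with lower sequence numbers are still pending.
  void MarkDependencyResolved(uint64_t seq_no) {
    auto it = pending_.find(seq_no);
    if (it == pending_.end()) return;
    sendable_.emplace(seq_no, std::move(it->second));
    pending_.erase(it);
  }

  // Pops one sendable request into `task`. Returns false if none is ready.
  bool PopNextTaskToSend(std::string *task) {
    if (sendable_.empty()) return false;
    auto it = sendable_.begin();
    *task = std::move(it->second);
    sendable_.erase(it);
    return true;
  }

 private:
  std::map<uint64_t, std::string> pending_;   // Waiting on dependencies.
  std::map<uint64_t, std::string> sendable_;  // Ready to be sent.
};

// Usage example: request 1 is sent before request 0.
void DemoOutOfOrderQueue() {
  OutOfOrderSubmitQueueSketch queue;
  queue.Emplace(0, "first");
  queue.Emplace(1, "second");
  std::string task;
  assert(!queue.PopNextTaskToSend(&task));  // Nothing resolved yet.
  queue.MarkDependencyResolved(1);
  assert(queue.PopNextTaskToSend(&task) && task == "second");
  queue.MarkDependencyResolved(0);
  assert(queue.PopNextTaskToSend(&task) && task == "first");
}
```

A sequential queue would instead hold back request 1 until request 0 had been sent, which is the behavior this PR moves away from for threaded and async actors.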
2021-11-16 10:48:49 -08:00
Tao Wang
507bd9186b
[Core]Make convertion between ray/grpc status more specific (#20047)
* [Core]Make convertion between ray/grpc status more specific

* per comments

* lint

* per comments

* use ABORT instead of UNKNOWN, add some tests

* lint

* lint
2021-11-10 00:48:05 -08:00
Lingxuan Zuo
97259e33b2
Relink grpc/absl for streaming.so (#20136)
To avoid exporting third-party library symbols globally, the absl/grpc libraries are linked into _streaming.so.

Side effect:
Static variables might be uninitialized if the core worker lib and the streaming lib both use them.
2021-11-09 14:13:53 +08:00
mwtian
ef4b6e4648
[Core][GCS] remove gcs object manager (#19963) 2021-11-02 16:20:53 -07:00
SangBin Cho
99b5932d06
Add a simple node failure integration test + clean up spammy logs upon node failures (#19695)
* .

* Done

* clean up

* lint

* fix a bug

* lint

* fix issue

* Remove no-op from StartRayLog

* Addressed code review.
2021-10-29 18:42:35 -04:00
Lixin Wei
56301e34b2
[Refactor] Remove ServiceBased Abstraction (#19694)
## Why are these changes needed?

Prior to this PR, we have:
```cpp
class XxxAccessor {}
class ServiceBasedXxxAccessor : public XxxAccessor{}

class GcsClient {}
class ServiceBasedGcsClient : public GcsClient{}
```

However, XxxAccessor has only one implementation: ServiceBasedXxxAccessor. And GcsClient has only one implementation: ServiceBasedGcsClient.

I think this abstraction is unnecessary and makes development harder (I have to modify two files every time).

This PR removes all ServiceBasedXxx classes and moves their implementations into the base classes.

Now we only have:
```cpp
class XxxAccessor {}
class GcsClient {}
```
2021-10-29 10:16:14 -07:00
mwtian
b238297bfb
[Core][Pubsub] Support subscribing to GCS via Ray pubsub (#19687)
This PR adds more infrastructure for subscribing to GCS via ray::pubsub instead of Redis.

The most important pieces of added logic are:
- GCS subscriber RPC interface in src/ray/protobuf/gcs_service.proto
- GCS subscriber handler in src/ray/gcs/gcs_server/pubsub_handler.{h,cc}
- GCS wrapper for ray::pubsub subscriber in src/ray/gcs/pubsub/gcs_pub_sub.{h,cc}

Other files are modified to add boilerplate and plumbing, remove dead code, and clean up.
This PR can also be reviewed commit by commit. 418f065 and 3279430 are cleanups. 028939c is a pure refactoring of how GCS clients subscribe to GCS updates that should not change behavior yet, similar to #19600 ([Pubsub] Wrap Redis-based publisher in GCS to allow incrementally switching to the GCS-based publisher). 286161f parameterizes gcs_server_test to test GCS pubsub. The remaining commits add new logic.
All new logic is behind the gcs_grpc_based_pubsub flag, so this PR should not affect Ray's default behavior.
The added subscriber logic was tested by enabling gcs_grpc_based_pubsub in service_based_gcs_client_test.cc and adding basic handling logic for TaskLease. Since TaskLease pubsub will be removed, the change will not be checked in.

Next step is to support SubscribeAll entities for a channel in ray::pubsub, and test migrating more channels.
2021-10-28 01:18:54 +08:00
Yi Cheng
48fb86a978
[core] Fix the spilling back failure in case of node missing (#19564)
## Why are these changes needed?
When Ray spills back a task, it checks through the GCS whether the target node exists. This creates a race condition that sometimes crashes the raylet.

This PR filters out unavailable nodes when selecting a node.
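
The fix can be pictured as a filter applied during node selection. The sketch below is an assumption-heavy illustration: `NodeCandidate`, `SelectSpillbackNode`, and the "most available CPUs wins" policy are hypothetical and are not taken from the raylet's scheduling code.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical view of a node as seen by the scheduler.
struct NodeCandidate {
  std::string node_id;
  double available_cpus;
};

// Picks the alive node with the most available CPUs that can fit the
// request. Nodes missing from `alive_nodes` are skipped up front, so the
// caller never has to handle a selected node that no longer exists.
// Returns an empty string if no node qualifies.
std::string SelectSpillbackNode(
    const std::vector<NodeCandidate> &candidates,
    const std::unordered_set<std::string> &alive_nodes,
    double required_cpus) {
  const NodeCandidate *best = nullptr;
  for (const auto &candidate : candidates) {
    if (alive_nodes.count(candidate.node_id) == 0) continue;  // Node is gone.
    if (candidate.available_cpus < required_cpus) continue;   // Does not fit.
    if (best == nullptr || candidate.available_cpus > best->available_cpus) {
      best = &candidate;
    }
  }
  return best == nullptr ? std::string() : best->node_id;
}

// Usage example: the largest node is dead, so it is never selected.
void DemoSpillbackSelection() {
  const std::vector<NodeCandidate> candidates = {
      {"dead", 16.0}, {"a", 4.0}, {"b", 8.0}};
  assert(SelectSpillbackNode(candidates, {"a", "b"}, 2.0) == "b");
  assert(SelectSpillbackNode(candidates, {"a"}, 2.0) == "a");
  assert(SelectSpillbackNode(candidates, {"a", "b"}, 32.0).empty());
}
```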

## Related issue number
#19438
2021-10-22 11:22:07 -07:00
mwtian
530f2d7c5e
[Pubsub] Wrap Redis-based publisher in GCS to allow incrementally switching to the GCS-based publisher (#19600)
## Why are these changes needed?
The most significant change of the PR is the `GcsPublisher` wrapper added to `src/ray/gcs/pubsub/gcs_pub_sub.h`. It forwards publishing to the underlying `GcsPubSub` (Redis-based) or `pubsub::Publisher` (GCS-based) depending on the migration status, so it allows incremental migration by channel.
   -  Since it was decided that we want to use typed IDs and messages for GCS-based publishing, each member function of `GcsPublisher` accepts a typed message.

Most of the modified files are from migrating publishing logic in GCS to use `GcsPublisher` instead of `GcsPubSub`.

Later on, `GcsPublisher` member functions will be migrated to use GCS-based publishing.

This change should make no functional difference. If this looks OK, a similar change will be made for subscribers in the GCS client.
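
The forwarding pattern described above can be sketched as follows. This is a simplified illustration, not the real `GcsPublisher`: it uses untyped string messages instead of typed ones, a single `RecordingBackend` stand-in for both `GcsPubSub` and `pubsub::Publisher`, and hypothetical channel names.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Stand-in for a publishing backend: it only records what was published.
struct RecordingBackend {
  std::vector<std::pair<std::string, std::string>> published;
  void Publish(const std::string &channel, const std::string &message) {
    published.emplace_back(channel, message);
  }
};

// Hypothetical wrapper: callers publish through one interface, and each
// channel is routed to the old or the new backend depending on whether it
// has been migrated. Channels can therefore be switched one at a time.
class GcsPublisherSketch {
 public:
  GcsPublisherSketch(RecordingBackend *redis_backend,
                     RecordingBackend *gcs_backend,
                     std::unordered_set<std::string> migrated_channels)
      : redis_backend_(redis_backend),
        gcs_backend_(gcs_backend),
        migrated_channels_(std::move(migrated_channels)) {}

  void Publish(const std::string &channel, const std::string &message) {
    RecordingBackend *backend = migrated_channels_.count(channel) > 0
                                    ? gcs_backend_
                                    : redis_backend_;
    backend->Publish(channel, message);
  }

 private:
  RecordingBackend *redis_backend_;  // Not owned.
  RecordingBackend *gcs_backend_;    // Not owned.
  std::unordered_set<std::string> migrated_channels_;
};

// Usage example: only the "ACTOR" channel has been migrated.
void DemoGcsPublisher() {
  RecordingBackend redis;
  RecordingBackend gcs;
  GcsPublisherSketch publisher(&redis, &gcs, {"ACTOR"});
  publisher.Publish("ACTOR", "actor-update");
  publisher.Publish("JOB", "job-update");
  assert(gcs.published.size() == 1 && gcs.published[0].first == "ACTOR");
  assert(redis.published.size() == 1 && redis.published[0].first == "JOB");
}
```

With an empty migrated set, every message goes to the Redis-based backend, which matches the PR's statement that behavior is unchanged for now.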

## Related issue number
2021-10-22 10:52:36 -07:00
Yi Cheng
a3dc07b1ee
[core] Fix some legacy issues (#19392)
## Why are these changes needed?
There are some issues left from previous PRs.

- Put the gcs_actor_scheduler_mock_test back
- Add comment for named actor creation behavior
- Fix the comments for some flags.

## Related issue number
2021-10-15 18:06:01 -07:00
Matti Picus
9ca34c7192
add dependencies to BUILD.bazel and update windows bazel to 4.2.1 (#19132)
* add dependencies to BUILD.bazel and update windows bazel to 4.2.1

* fixes from review
2021-10-11 10:25:19 -07:00
Yi Cheng
056c3af699
[core] Update placement group retry implementation (#18842)
* exp backoff

* up

* format

* up

* up

* up

* up

* up

* format

* fix

* up

* format

* adjust ordering

* up

* Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)"

This reverts commit 2e99fb215f.

* up

* update

* format

* up

* format

* fix

* Revert "Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)""

This reverts commit 93425fdb986059e53699623a0fc8590c062e139b.

* up

* format

* fix lint

* up

* up

* up

* up

* check

* add test1

* format

* up

* add test

* up

* up

* up

* fix

* up

* up

* up

* add test

* format

* up

* up

* fix lint

* format

* fix

* format

* fix

* up
2021-10-04 21:31:56 -07:00
Yi Cheng
16cf719aff
[core] hot fix of build failure (#18963) 2021-09-28 20:29:28 -07:00
Yi Cheng
e3dd1e3751
Revert "Revert "[test] add unit test for PR #17634 (#18585)" (#18830)" (#18871)
* Revert "Revert "[test] add unit test for PR #17634 (#18585)" (#18830)"

This reverts commit 8dd3057644.

* up
2021-09-28 05:53:52 -07:00
Jiajun Yao
e79f271b05
Fix nil redis array element (#18813) 2021-09-24 20:11:43 -07:00
Yi Cheng
8dd3057644
Revert "[test] add unit test for PR #17634 (#18585)" (#18830)
This reverts commit 73c3cff18b.
2021-09-22 16:51:02 -07:00
Yi Cheng
73c3cff18b
[test] add unit test for PR #17634 (#18585) 2021-09-22 14:39:30 -07:00
Yi Cheng
07babd807c
Revert "Revert "[core] Async submitting actor registerring (#18009)" (#18719)" (#18722) 2021-09-20 19:17:00 -07:00
Yi Cheng
cf64ab5b90
Revert "[core] Async submitting actor registerring (#18009)" (#18719)
This reverts commit 8ce01ea2cc.
2021-09-17 13:34:12 -07:00
Yi Cheng
8ce01ea2cc
[core] Async submitting actor registerring (#18009) 2021-09-17 10:03:35 -07:00
Simon Mo
317a34c523
[Serve] Use BackendConfig Protobuf (#17835) 2021-09-16 11:08:23 -07:00
Stephanie Wang
be7cb70c30
[core] Fix ref counting during actor construction (#18646)
* test

* fix

* cpp

* skip windows

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-09-15 22:16:53 -07:00
Edward Oakes
7736cdd91d
[dashboard] Rename "new_dashboard" -> "dashboard" (#18214) 2021-09-15 11:17:15 -05:00
Chong-Li
d314d0c10e
[GCS] Fix the Windows build of GCS actor scheduling (#18012) 2021-09-10 17:17:25 -07:00
mwtian
26fd10c9e8
[CI] Add clang-tidy to lint (#18124)
* clang-tidy

* fix

* fix script

* test clang compiler

* fix clang-tidy rules

* Fix windows and other issues.

* Fix

* Improve information when running check-git-clang-tidy-output.sh on different OS
2021-09-09 00:41:53 -07:00
Lixin Wei
df803cee98
Revert "Revert "[Core] Fix ServerCall Leaking (#17863)" (#18410)" (#18424) 2021-09-08 19:55:06 -07:00
Yi Cheng
7126d01c91
[core] upgrade gtest (#18288)
* up

* up

* format

* up

* flaky fix

* format

* up

* up

* format

* add debug

* up

* up

* up

* up

* up

* format

* fix

* format

* up

* up

* format
2021-09-08 11:15:34 -07:00
Lingxuan Zuo
46b941b702
[Streaming] Support streaming metric reporter (#17981)
* Streaming support metric reporter

* fix lint

* fix bazel format lint

* fix lint

* metric deps lint

* lint

* and comments for runtime reporter

* unordered_map instead

* comments

* fix visibility flag

* deps local .so target

* make stats public visibility

* stats lib in public

* add antgroup team tag
2021-09-08 14:36:00 +08:00
Chen Shen
d65d291579
Revert "[Core] Fix ServerCall Leaking (#17863)" (#18410)
This reverts commit 4f6b50dc46.
2021-09-07 15:47:58 -07:00
Lixin Wei
4f6b50dc46
[Core] Fix ServerCall Leaking (#17863)
* fix backpressure bug

* update comments

* stash

* add test

* add basic tests

* add fixture

* stash

* fix

* draft

* fix

* test added

* fixed

* fixed

* lint

* Update src/ray/rpc/test/grpc_server_test.cc

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* add copyright

* move test service to separate file

* add ClientCallManager timeout tests

* fix

* lint

* lint

* lint

* test windows CI

* fix

* lint

* lint

* retry windows

* retry windows

* fix mac

* lint

* lint

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2021-09-07 12:15:43 -07:00
wanxing
60f84fa051
Abstract plasma store get request queue (#18064)
* begin

* build

* add test

* add first test

* add test

* fix build

* lint bazel

* fix build

* fix build

* fix crash

* fix some comment

* revert shared_ptr ObjectLifecycleManager

* fix RemoveGetRequest lost

* no defer

* fix lots of comments

* fix build

* fix data race

* fix comments

* Revert "fix data race"

This reverts commit 8f58e3d70b73af864566e056211ff1b70cab870c.

* refine

* fix mac build

* fix unit test

* fix unit test
2021-09-02 14:16:50 -07:00
Yi Cheng
d470e679df
[core] Add some mock headers for ray core (#18265)
* up

* up

* up

* format

* up

* up

* format
2021-09-01 13:04:35 -07:00
SangBin Cho
2ee1b90c17
[Core] Batch obod location updates (#18016)
* Batch impl

* done

* Remove a client pool

* in progress

* Added unit tests.

* Handle owner failure case.

* Fix unit tests

* Addressed code review.
2021-08-30 11:04:08 -07:00
Eric Liang
1adce7da4e
Revert "Auto discover dashboard agent port (#17855)" (#18217)
This reverts commit 53ddb551d5.
2021-08-30 10:46:37 -07:00
fyrestone
53ddb551d5
Auto discover dashboard agent port (#17855) 2021-08-30 12:06:28 +08:00
wanxing
abb46de4dc
[object store refactor 5/n] Add eviction policy tests (#17984)
* add eviction policy tests

* fix object_lifecycle_manager_test build

* make IsObjectExists private
2021-08-24 00:50:28 -07:00
Eric Liang
236b772465
Revert "[GCS] GCS Based Actor Scheduler (#16580)" (#17941)
This reverts commit a9b4545502.
2021-08-19 21:46:52 -07:00
Chong-Li
5e22257cec
[GCS] Fix: GCS Based Actor Scheduler (#17944) 2021-08-18 23:40:35 -07:00
Simon Mo
b573864928
[CI] Add test owners (#17893) 2021-08-18 18:38:31 -07:00
Chen Shen
89d83228f6
[Core][Plasma-store] add stats-collector that eagerly collect stats 2021-08-18 13:47:50 -07:00
Chong-Li
a9b4545502
[GCS] GCS Based Actor Scheduler (#16580) 2021-08-18 13:44:59 -07:00
Guyang Song
8227e24424
[event] event framework integration in raylet, gcs server and core worker (#17671) 2021-08-17 11:21:23 +08:00
Yi Cheng
03a82d733a
Revert "Revert "Export useful metrics"" (#17755)
* Revert "Revert "[Observability] Export useful metrics (#17578)" (#17752)"

This reverts commit 02e79f3fe5.

* Update metric.h

* up

* up

* Update server_call.h

* Update test_metrics_agent.py

* up

* fix comment
2021-08-16 17:05:56 -07:00
Chen Shen
b349c6bc4f
[object store refactor 4/n] object lifecycle manager (#17344)
* lifecycle

* address comments
2021-08-16 09:58:35 -07:00