Right now in ray, a lot of edge cases related to grpc are not tested. This PR is just a simple try to give the developer some way to delay grpc request. It could be used with manual testing and also e2e test since it's supporting delay for specific grpc method.
To use this feature, just simple set os env `RAY_TESTING_ASIO_DELAY_US="method1=10:20,method2=20:30,*=200:200"`
This means, for `method1` it'll delay 10-20us, for method2 it'll delay 20-30us. For all the rest, it'll delay 200us.
## Why are these changes needed?
ThreadPoolManager and FiberStateManager have the same functionality and logic. This PR aims to remove the duplicate implementations of them.
Add a ConcurrencyGroupExecutor class to do that logic. `ConcurrencyGroupExecutor<FiberState>` is used as FiberStateManager, `ConcurrencyGroupExecutor<BoundedExecutor>` is used as ThreadPoolManager.
Why are these changes needed?
This is the third PR in the stack that supports out or order execution for threaded/async actors. Previous PR #20149 Next PR #20160
At a high level, threaded actor/async actor already don't guarantee execution order, and the current "sequential" order implementation has caused some confusion and inconvenience. Please refer to #19822 for detailed discussion.
In this PR, we implemented the out-of-order of queue that supports out of order execution. Conceptually it's very simple: it sends the requests as soon as the dependency is resolved.
* [Core]Make convertion between ray/grpc status more specific
* per comments
* lint
* per comments
* use ABORT instead of UNKNOWN, add some tests
* lint
* lint
To avoid exporting thrirdparty library symbol globally, these absl/grpc libs have been applied in _streaming.so.
Side-effect:
Static variables might be uninitialized if core worker lib and streaming lib both use them.
## Why are these changes needed?
Prior to this PR, we have:
```cpp
class XxxAccessor {}
class ServiceBasedXxxAccessor : public XxxAccessor{}
class GcsClient {}
class ServiceBasedGcsClient : public GcsClient{}
```
However, XxxAccessor has only one implementation: ServiceBasedXxxAccessor. And GcsClient has only one implementation: ServiceBasedGcsClient.
I think this abstraction is not necessary and will make development hard(I have to modify two files every time).
This PR removes all ServiceBasedXxx and moves its implementations to the base class.
Now we only have:
```cpp
class XxxAccessor {}
class GcsClient {}
```
This PR adds more infrastructure for subscribing to GCS via ray::pubsub instead of Redis.
Most important logic added are
GCS subscriber RPC interface in src/ray/protobuf/gcs_service.proto
GCS subscriber handler in src/ray/gcs/gcs_server/pubsub_handler.{h,cc}
GCS wrapper for ray::pubsub subscriber in src/ray/gcs/pubsub/gcs_pub_sub.{h,cc}
Other files are modified for adding boilerplates, plumbing, removing dead code and cleanups.
This PR can also be reviewed commit by commit. 418f065, 3279430 are cleanups. 028939c is a pure-refactoring of how GCS clients subscribe to GCS updates that should not change behavior yet, similar to [Pubsub] Wrap Redis-based publisher in GCS to allow incrementally switching to the GCS-based publisher #19600. 286161f parameterized gcs_server_test to test GCS pubsub. The rest of commits have new logic added.
All new logic are behind the gcs_grpc_based_pubsub flag, so this PR should not affect Ray's default behavior.
The added subscriber logic was tested by enabling gcs_grpc_based_pubsub in service_based_gcs_client_test.cc and adding basic handling logic for TaskLease. Since TaskLease pubsub will be removed, the change will not be checked in.
Next step is to support SubscribeAll entities for a channel in ray::pubsub, and test migrating more channels.
## Why are these changes needed?
When ray spill back, it'll check whether the node exists or not through gcs, so there is a race condition and sometimes raylet crashes due to this.
This PR filter out the node that's not available when select the node.
## Related issue number
#19438
## Why are these changes needed?
The most significant change of the PR is the `GcsPublisher` wrapper added to `src/ray/gcs/pubsub/gcs_pub_sub.h`. It forwards publishing to the underlying `GcsPubSub` (Redis-based) or `pubsub::Publisher` (GCS-based) depending on the migration status, so it allows incremental migration by channel.
- Since it was decided that we want to use typed ID and messages for GCS-based publishing, each member function of `GcsPublisher` accepts a typed message.
Most of the modified files are from migrating publishing logic in GCS to use `GcsPublisher` instead of `GcsPubSub`.
Later on, `GcsPublisher` member functions will be migrated to use GCS-based publishing.
This change should make no functionality difference. If this looks ok, a similar change would be made for subscribers in GCS client.
## Related issue number
## Why are these changes needed?
There are some issues left from previous PRs.
- Put the gcs_actor_scheduler_mock_test back
- Add comment for named actor creation behavior
- Fix the comment for some flags.
## Related issue number
* exp backoff
* up
* format
* up
* up
* up
* up
* up
* format
* fix
* up
* format
* adjust ordering
* up
* Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)"
This reverts commit 2e99fb215f.
* up
* update
* format
* up
* format
* fix
* Revert "Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)""
This reverts commit 93425fdb986059e53699623a0fc8590c062e139b.
* up
* format
* fix lint
* up
* up
* up
* up
* check
* add test1
* format
* up
* add test
* up
* up
* up
* fix
* up
* up
* up
* add test
* format
* up
* up
* fix lint
* format
* fix
* format
* fix
* up
* clang-tidy
* fix
* fix script
* test clang compiler
* fix clang-tidy rules
* Fix windows and other issues.
* Fix
* Improve information when running check-git-clang-tidy-output.sh on different OS
* Streaming support metric reporter
* fix lint
* fix bazel format lint
* fix lint
* metric deps lint
* lint
* and comments for runtime reporter
* unordered_map instead
* comments
* fix visibility flag
* deps local .so target
* make stats public visibility
* stats lib in public
* add antgroup team tag
* begin
* build
* add test
* add first test
* add test
* fix build
* lint bazel
* fix build
* fix build
* fix crash
* fix some comment
* revert shared_ptr ObjectLifecycleManager
* fix RemoveGetRequest lost
* no defer
* fix lots of comments
* fix build
* fix data race
* fix comments
* Revert "fix data race"
This reverts commit 8f58e3d70b73af864566e056211ff1b70cab870c.
* refine
* fix mac build
* fix unit test
* fix unit test