This is the second PR in the stack that supports out-of-order execution for threaded/async actors. Previous PR: #20148. Next PR: #20150.
At a high level, threaded and async actors already don't guarantee execution order, and the current "sequential" order implementation has caused some confusion and inconvenience. Please refer to #19822 for a detailed discussion.
In this PR, we further separate out the logic for ordering actor requests on the client side. In the next PR, we will implement a different type of queue that supports out-of-order execution.
This is part of the stack that enables out-of-order execution for actors. Previous PR: #20150. Next PR: #20176.
Refactor the actor receiver code by separating classes into their own header/cc files. Specifically:
scheduling_queue.h for the SchedulingQueue interface
actor_scheduling_util.h for InboundRequest/DependencyWaiter/DependencyWaiterImpl
actor_scheduling_queue.h for ActorSchedulingQueue (the sequential execution queue)
normal_scheduling_queue.h for NormalSchedulingQueue (the task execution queue)
fiber_state_manager.h for FiberStateManager
thread_pool_manager.h for PoolManager and BoundedExecutor
## Why are these changes needed?
Since we are using the GCS client as the KV backend, we need it to reconnect automatically in case of a failure. This PR adds that feature.
This PR adds an auto_reconnect decorator to gcs-utils; in case of a failure, it keeps trying to reconnect to GCS until it succeeds.
The feature currently also supports Redis. That support should be removed once the bootstrap work is finished, since KV will then always go to GCS.
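As a rough illustration of the retry-until-success behavior described above, here is a minimal sketch of such a decorator. It is not the code from this PR: `FakeConnectionError`, `FlakyKvClient`, the `connect()` method, and the `retry_interval_s` parameter are all hypothetical names chosen for the example.

```python
import functools
import time


class FakeConnectionError(Exception):
    """Hypothetical stand-in for the error raised when the backend is unreachable."""


def auto_reconnect(retry_interval_s=1.0, sleep=time.sleep):
    """Retry the wrapped method, reconnecting after every connection failure."""

    def decorator(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            while True:
                try:
                    return method(self, *args, **kwargs)
                except FakeConnectionError:
                    # Connection lost: wait, try to rebuild it, then retry the call.
                    sleep(retry_interval_s)
                    try:
                        self.connect()
                    except FakeConnectionError:
                        # Still unreachable; the next loop iteration tries again.
                        pass

        return wrapper

    return decorator


class FlakyKvClient:
    """Hypothetical KV client that fails a fixed number of times before working."""

    def __init__(self, failures_before_success):
        self._remaining_failures = failures_before_success
        self.connect_calls = 0
        self._store = {}

    def connect(self):
        self.connect_calls += 1

    @auto_reconnect(retry_interval_s=0.0)
    def put(self, key, value):
        if self._remaining_failures > 0:
            self._remaining_failures -= 1
            raise FakeConnectionError("backend unavailable")
        self._store[key] = value
        return True

    @auto_reconnect(retry_interval_s=0.0)
    def get(self, key):
        return self._store.get(key)
```

Note that a loop like this blocks the caller for as long as the backend stays down, which matches the "until it succeeds" behavior described above; a bounded retry count would be a different design.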
## Related issue number
Some Ray client users are likely seeing an issue similar to #7084. Inside a container, connecting to localhost: fails but connecting to 127.0.0.1: succeeds. Changing Ray client to use 127.0.0.1 for localhost connection / serving should fix the issue.
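The proposed change amounts to rewriting the host part of an address before connecting or serving. A minimal sketch, assuming a plain `host:port` string; `normalize_localhost` is a hypothetical helper, not a Ray function, and the port in the usage note is only an example value:

```python
def normalize_localhost(address: str) -> str:
    """Rewrite 'localhost[:port]' to '127.0.0.1[:port]'; leave other hosts alone."""
    host, sep, port = address.rpartition(":")
    if not sep:
        # No port present, so the whole string is the host.
        host, port = address, ""
    if host.lower() == "localhost":
        host = "127.0.0.1"
    return f"{host}:{port}" if sep else host
```

For example, `normalize_localhost("localhost:10001")` yields `"127.0.0.1:10001"`, while an address such as `"10.0.0.5:10001"` is returned unchanged.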
This is the stack of PRs that supports out-of-order execution for threaded/async actors. Next PR: #20149
At a high level, threaded and async actors already don't guarantee execution order, and the current "sequential" order implementation has caused some confusion and inconvenience. Please refer to #19822 for a detailed discussion.
The major changes in this stack are:
introduce OutOfOrderActorSubmitQueue ([Core][actor out-of-order execution 3/n] Introducing out-of-order actor submit queue #20150)
Specifically, we have a per-client task submit queue, which guarantees the sequential order of task submission. In [Core][actor out-of-order execution 3/n] Introducing out-of-order actor submit queue #20150 we introduce OutOfOrderActorSubmitQueue, which relaxes this guarantee: it sends a task over the network as soon as its dependencies are resolved.
-- two PRs ([Core][actor out-of-order execution 1/n] Move CoreWorkerDirectActorTaskSubmitter into a separate file #20148 and [Core][actor out-of-order execution 2/n] create abstraction for the queuing logic on the client/actor submission. #20149) precede this PR; they refactor the actor submission logic to make the abstraction possible.
OutOfOrderActorSchedulingQueue ([Core][actor out-of-order execution 5/n] implement out-of-order scheduling queue #20176)
Similarly, we also have a per-client task scheduling queue on the actor to ensure tasks are executed according to the submission order (sequence_no). OutOfOrderActorSchedulingQueue relaxes this guarantee by enqueuing a task as soon as all of its dependencies are resolved.
-- one PR ([Core][actor out-of-order execution 4/n] refactor the actor receiver code #20160) precedes this PR; it refactors the actor scheduling logic to make the code easier to read. That PR is optional.
plumbing PR ([Core][actor out-of-order execution 6/n] plumbing work to make it work e2e #20177)
This PR enables out-of-order execution by introducing options(execute_out_of_order=True), and creates the actor client/server components according to the configuration.
One thing the PR hasn't touched upon is the restart/retry guarantees, which might need some discussion.
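To make the difference between the sequential and out-of-order queues concrete, the following toy model replays the same dependency-resolution events through both policies. It is a simplified illustration, not the C++ implementation: the `Toy*` classes, `on_dependencies_resolved`, and `replay` are hypothetical names, and "sending" is modeled as appending to a list.

```python
class ToySequentialQueue:
    """Releases tasks strictly in sequence_no order, even if a later one is ready first."""

    def __init__(self):
        self._ready = {}  # sequence_no -> task whose dependencies are resolved
        self._next_to_send = 0
        self.sent = []

    def on_dependencies_resolved(self, sequence_no, task):
        self._ready[sequence_no] = task
        # Drain every consecutive task starting at the next expected sequence_no.
        while self._next_to_send in self._ready:
            self.sent.append(self._ready.pop(self._next_to_send))
            self._next_to_send += 1


class ToyOutOfOrderQueue:
    """Releases each task as soon as its dependencies are resolved."""

    def __init__(self):
        self.sent = []

    def on_dependencies_resolved(self, sequence_no, task):
        # sequence_no is ignored: readiness alone decides when a task goes out.
        self.sent.append(task)


def replay(queue, events):
    """Feed (sequence_no, task) resolution events to a queue; return the send order."""
    for sequence_no, task in events:
        queue.on_dependencies_resolved(sequence_no, task)
    return queue.sent


# Task "c" (sequence_no 2) becomes ready before "a" (0) and "b" (1).
events = [(2, "c"), (0, "a"), (1, "b")]
sequential_order = replay(ToySequentialQueue(), events)
out_of_order = replay(ToyOutOfOrderQueue(), events)
```

In the sequential model, "c" is held back until "a" and "b" have gone out; in the out-of-order model it is released immediately. A slow dependency on an early task therefore no longer delays later tasks.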
In this PR, we separate the client code from the server code for actor task submission. This makes the follow-up changes easier.