A follow-up to https://github.com/ray-project/ray/pull/24628. The previous PR fixed the resubscribing issue for the raylet, but the core worker also needs to resubscribe.

There are two ways of doing resubscription:

1. The client side detects the failure and resubscribes on its own.
2. The server side asks the client to resubscribe.

1) is the cleaner and better solution, but it is hard to implement right now for the following reasons:

- We are using long polling, so in some extreme cases we won't be able to detect the failure at all. For example, the client receives a message, but before it sends the next request, the server restarts; the client then misses its chance to detect the failure. This can happen if a standby GCS starts very quickly while the client side carries a lot of traffic and runs slowly.
- The current gRPC framework doesn't give the user a way to handle failures, so this would need some refactoring. We can switch to this approach once we have gRPC streaming.

This PR implements 2), which consists of three parts (a minimal client-side sketch follows at the end of this description):

- raylet: https://github.com/ray-project/ray/pull/24628
- core worker: this PR
- python

Correctness: whenever a worker starts, it registers with the raylet immediately (a sync call) before connecting to GCS. So it is enough to send the restart RPCs to all registered workers (see the server-side sketch at the end of this description), because:

- If the worker has just started and hasn't registered with the raylet yet: that's fine, because the worker hasn't connected to GCS either, so there is nothing to resubscribe.
- If the worker has registered with the raylet: it is covered by the code path here.
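To make approach 2) concrete, here is a minimal client-side sketch. All names here (`ResubscribingSubscriber`, `handle_resubscribe_request`, the `rpc_client` stub) are hypothetical illustrations, not Ray's actual API: the idea is simply that the client caches every active subscription so it can replay the whole set when the restarted server asks.

```python
# Hypothetical sketch of approach 2) from the client's perspective.
# None of these names are Ray's actual API.

class ResubscribingSubscriber:
    def __init__(self, rpc_client):
        self._rpc = rpc_client        # hypothetical RPC stub to the GCS
        self._subscriptions = {}      # channel -> callback, replayed on demand

    def subscribe(self, channel, callback):
        # Remember the subscription locally before telling the server,
        # so a later replay restores exactly the same state.
        self._subscriptions[channel] = callback
        self._rpc.subscribe(channel)

    def handle_resubscribe_request(self):
        # Invoked when the restarted GCS asks clients to resubscribe.
        # Replaying the cached channels rebuilds the server-side state.
        for channel in self._subscriptions:
            self._rpc.subscribe(channel)
```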
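And a matching sketch of the correctness argument, again with hypothetical names: because a worker registers with its raylet synchronously before it ever connects to GCS, broadcasting the restart RPC to the registered set covers every worker that could have state to restore.

```python
# Hypothetical sketch of the correctness argument; names are
# illustrative, not Ray's actual API.

class Raylet:
    def __init__(self):
        self._registered_workers = []   # workers that completed registration

    def register_worker(self, worker):
        # Sync registration happens before the worker connects to the GCS,
        # so this list is a superset of all GCS-connected workers.
        self._registered_workers.append(worker)

    def on_gcs_restarted(self):
        # A worker not yet in the list hasn't connected to the GCS and has
        # nothing to resubscribe; every registered worker gets the RPC.
        for worker in self._registered_workers:
            worker.handle_resubscribe_request()
```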