ray/src
SangBin Cho 7baf62386a
[Core] Shorten the GCS dead detection to 60 seconds instead of 10 minutes. (#20900)
Currently, when a GCS RPC fails with a gRPC unavailable error because the GCS is dead, the client retries forever.

b3a9d4d87d/src/ray/rpc/gcs_server/gcs_rpc_client.h (L57)

It takes about 10 minutes to detect the GCS server failure, so if the GCS is dead, users only notice after 10 minutes.

This can easily give the impression that the cluster is hanging, since users are rarely that patient. Also, since the GCS is not fault tolerant in OSS for now, 10 minutes is too long a timeout for detecting GCS death.

This PR changes the value to 60 seconds, which I believe is much more reasonable, since it matches our blocking RPC call timeout.
2021-12-14 07:50:45 -08:00
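
To illustrate the behavior described in the commit message, here is a minimal Python sketch of a retry loop that gives up once the server has been unavailable for a fixed detection window. It is not Ray's implementation (the real logic lives in the C++ GCS RPC client). The names `Unavailable`, `ServerDead`, `call_with_dead_detection`, and all parameters are hypothetical, and the 1-second retry interval is an assumption; only the 60-second window comes from the commit message.

```python
import time


class Unavailable(Exception):
    """Hypothetical stand-in for a gRPC UNAVAILABLE error."""


class ServerDead(Exception):
    """Raised once the server stayed unreachable for the whole window."""


def call_with_dead_detection(rpc, detection_timeout_s=60.0,
                             retry_interval_s=1.0,
                             clock=time.monotonic, sleep=time.sleep):
    """Call rpc(), retrying on Unavailable until the detection window ends.

    detection_timeout_s is measured from the first failed attempt.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    first_failure = None
    while True:
        try:
            return rpc()
        except Unavailable:
            now = clock()
            if first_failure is None:
                first_failure = now
            # Stop retrying once the server has been down for the full window.
            if now - first_failure >= detection_timeout_s:
                raise ServerDead(
                    f"server unavailable for {detection_timeout_s} seconds")
            sleep(retry_interval_s)
```

With `detection_timeout_s=600.0` the same loop models the old behavior, where a dead server is only reported after 10 minutes. A caller can catch `ServerDead` and fail fast instead of appearing to hang.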
..
mock [Core] Support back pressure for actor tasks. (#20894) 2021-12-13 23:56:07 -08:00
ray [Core] Shorten the GCS dead detection to 60 seconds instead of 10 minutes. (#20900) 2021-12-14 07:50:45 -08:00
shims/windows Support copyright format for c++ files (#14348) 2021-08-04 17:19:38 +08:00