[Core] Suppress gRPC server alerting on too many keep-alive pings (#27769)

# Why are these changes needed?
We currently see the following error in many of the large nightly tests:

(map pid=516, ip=172.31.64.223) E0526 12:32:19.203322360     675 chttp2_transport.cc:1103]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings".

See [this](https://github.com/ray-project/ray/issues/25367#issuecomment-1189421372) for more details.

# Root Cause
The root cause is, with fairly high confidence, a keepalive misconfiguration between gRPC servers and clients: the client sends keepalive pings more frequently than the server is configured to accept.
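
To make the failure mode concrete, here is a minimal sketch of how a server-side ping limit can turn a too-eager client into a `too_many_pings` GOAWAY. It is a simplified model, not gRPC's actual code: `ServerPingPolicy` and `FirstPingTriggeringGoaway` are hypothetical names, the client is assumed to send pings at a fixed interval with no data in between, and the 300000 ms / 2 strikes values used in the usage note are what I understand the gRPC defaults to be.

```cpp
#include <cstdint>

// Hypothetical, simplified model of server-side keepalive enforcement.
// Not gRPC's implementation; the names are made up for illustration.
struct ServerPingPolicy {
  std::int64_t min_recv_ping_interval_ms;  // minimum gap the server accepts
  int max_ping_strikes;                    // tolerated "too soon" pings
};

// Simulates a client that sends a data-less ping every client_interval_ms.
// Returns the 1-based index of the ping that would trigger GOAWAY, or 0 if
// none of the first num_pings pings does.
int FirstPingTriggeringGoaway(const ServerPingPolicy& policy,
                              std::int64_t client_interval_ms,
                              int num_pings) {
  int strikes = 0;
  bool seen_ping = false;
  std::int64_t last_ping_ms = 0;
  for (int i = 1; i <= num_pings; ++i) {
    const std::int64_t now_ms = i * client_interval_ms;
    // The very first ping is always accepted; later pings are compared
    // against the arrival time of the previous ping.
    const bool too_soon =
        seen_ping &&
        (now_ms - last_ping_ms < policy.min_recv_ping_interval_ms);
    seen_ping = true;
    last_ping_ms = now_ms;
    if (too_soon) {
      ++strikes;
      if (strikes > policy.max_ping_strikes) {
        return i;  // the server would send GOAWAY here
      }
    }
  }
  return 0;
}
```

Under this model, a client pinging every 60000 ms against a 300000 ms minimum with 2 allowed strikes is cut off on its fourth ping, while the same client against a 60000 ms minimum is never cut off.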

# Mitigation
This PR is only a mitigation. I will keep looking for the exact client/server pair later, but I probably don't have the bandwidth right now, largely because each test iteration takes quite a while and verbose logging in gRPC and the Ray backend has not revealed much useful information. The error only shows up at the end of a long-running map phase, and the verbose logs don't tell me which client is sending the pings.
Ricky Xu 2022-08-17 04:53:47 -04:00 committed by GitHub
parent 9330d8f244
commit 68b5d4302c

@@ -59,6 +59,15 @@ void GrpcServer::Run() {
RayConfig::instance().grpc_keepalive_timeout_ms());
builder.AddChannelArgument(GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS, 0);
// NOTE(rickyyx): This argument changes how frequent the gRPC server expects a keepalive
// ping from the client. See https://github.com/grpc/grpc/blob/HEAD/doc/keepalive.md#faq
// We set this to 1min because GCS gRPC client currently sends keepalive every 1min:
// https://github.com/ray-project/ray/blob/releases/2.0.0/python/ray/_private/gcs_utils.py#L72
// Setting this value larger will trigger GOAWAY from the gRPC server to be sent to the
// client to back-off keepalive pings. (https://github.com/ray-project/ray/issues/25367)
builder.AddChannelArgument(GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS,
60000);
if (RayConfig::instance().USE_TLS()) {
// Create credentials from locations specified in config
std::string rootcert = ReadCert(RayConfig::instance().TLS_CA_CERT());