mirror of
https://github.com/vale981/ray
synced 2025-03-05 10:01:43 -05:00
[Core] Suppress gRPC server alerting on too many keep-alive pings (#27769)
# Why are these changes needed?

(map pid=516, ip=172.31.64.223) E0526 12:32:19.203322360 675 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings".

See [this](https://github.com/ray-project/ray/issues/25367#issuecomment-1189421372) for more details.

We currently see this error in many of the large nightly tests.

# Root Cause

The root cause (with a fairly high level of confidence) is a configuration mismatch between the gRPC server and its clients. Essentially, the client sends keepalive pings more frequently than the server is configured to accept.

# Mitigation

This PR is merely a mitigation step. I will keep looking for the exact client/server pair later, but I probably don't have the bandwidth for now, largely because each test iteration takes quite a while and verbose logging in gRPC and the Ray backend has not revealed much useful information. The error only appears at the end of a long-running map phase, and verbose logging doesn't tell me which client is sending the pings.
parent
9330d8f244
commit
68b5d4302c
1 changed files with 9 additions and 0 deletions
@@ -59,6 +59,15 @@ void GrpcServer::Run() {
                              RayConfig::instance().grpc_keepalive_timeout_ms());
   builder.AddChannelArgument(GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS, 0);
 
+  // NOTE(rickyyx): This argument changes how frequent the gRPC server expects a keepalive
+  // ping from the client. See https://github.com/grpc/grpc/blob/HEAD/doc/keepalive.md#faq
+  // We set this to 1min because GCS gRPC client currently sends keepalive every 1min:
+  // https://github.com/ray-project/ray/blob/releases/2.0.0/python/ray/_private/gcs_utils.py#L72
+  // Setting this value larger will trigger GOAWAY from the gRPC server to be sent to the
+  // client to back-off keepalive pings. (https://github.com/ray-project/ray/issues/25367)
+  builder.AddChannelArgument(GRPC_ARG_HTTP2_MIN_RECV_PING_INTERVAL_WITHOUT_DATA_MS,
+                             60000);
+
   if (RayConfig::instance().USE_TLS()) {
     // Create credentials from locations specified in config
     std::string rootcert = ReadCert(RayConfig::instance().TLS_CA_CERT());
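To illustrate why a client pinging every minute can be met with `too_many_pings`, here is a minimal sketch of a server-side "ping strike" policy. It is a simplified model, not gRPC's actual implementation: the names `PingStrikePolicy` and `FirstGoawayPing` are hypothetical, and the model ignores details such as strikes being reset when the server sends data and the interaction with `GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS`. It assumes gRPC's documented defaults of a 5-minute minimum receive interval and a limit of 2 ping strikes.

```cpp
#include <cstdint>
#include <optional>

// Simplified, hypothetical model of a server-side keepalive ping policy.
// This is NOT gRPC's implementation; it only mirrors the documented idea:
// a ping that arrives sooner than the minimum interval counts as a strike,
// and exceeding the strike limit makes the server answer with GOAWAY.
class PingStrikePolicy {
 public:
  PingStrikePolicy(int64_t min_recv_interval_ms, int max_ping_strikes)
      : min_recv_interval_ms_(min_recv_interval_ms),
        max_ping_strikes_(max_ping_strikes) {}

  // Records a ping received at `now_ms`. Returns true if, in this model,
  // the server would respond with GOAWAY ("too_many_pings").
  bool OnPing(int64_t now_ms) {
    const bool too_soon = last_ping_ms_.has_value() &&
                          now_ms - *last_ping_ms_ < min_recv_interval_ms_;
    last_ping_ms_ = now_ms;
    if (!too_soon) return false;
    ++strikes_;
    // A limit of 0 is treated as "tolerate any number of strikes".
    return max_ping_strikes_ != 0 && strikes_ > max_ping_strikes_;
  }

 private:
  int64_t min_recv_interval_ms_;
  int max_ping_strikes_;
  int strikes_ = 0;
  std::optional<int64_t> last_ping_ms_;
};

// Simulates a client pinging at a fixed interval. Returns the 1-based index
// of the ping that triggers GOAWAY, or 0 if none of `num_pings` does.
int FirstGoawayPing(int64_t client_interval_ms, int64_t server_min_interval_ms,
                    int max_ping_strikes, int num_pings) {
  PingStrikePolicy policy(server_min_interval_ms, max_ping_strikes);
  for (int i = 0; i < num_pings; ++i) {
    if (policy.OnPing(i * client_interval_ms)) return i + 1;
  }
  return 0;
}
```

In this model, `FirstGoawayPing(60000, 300000, 2, 10)` reports a GOAWAY on the fourth ping, because every ping after the first arrives too early for a 5-minute minimum. With the server minimum lowered to 60000 ms, as in the diff above, the same client never accumulates a strike.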