ray/dashboard/modules/healthz/utils.py
Yi Cheng dac7bf17d9
[serve] Make serve agent not blocking when GCS is down. (#27526)
This PR fixes several issues that block the Serve agent when GCS is down. The Serve agent must always stay alive so that external requests can still reach it and check its status.

- The internal KV used in the dashboard/agent blocks the agent. We use the async version instead.
- The Serve controller uses ray.nodes, which is a blocking call that can block forever. It now uses the GCS client with a timeout.
- The agent uses the Serve controller client, which is a blocking call with max retries = -1. This blocks until the controller is back.
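The general idea behind these fixes, bounding a call that might otherwise block forever, can be sketched with plain asyncio. This is not the PR's code: `blocking_lookup` and `lookup_with_timeout` are hypothetical names, and the sleep stands in for a blocking call such as a node lookup.

```python
import asyncio
import time
from typing import Optional


def blocking_lookup(delay_s: float) -> str:
    # Hypothetical stand-in for a blocking call (not a Ray API).
    time.sleep(delay_s)
    return "ok"


async def lookup_with_timeout(delay_s: float, timeout_s: float) -> Optional[str]:
    # Run the blocking call in a worker thread so the event loop stays responsive.
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(None, blocking_lookup, delay_s)
    try:
        # Give up waiting after timeout_s seconds instead of waiting forever.
        return await asyncio.wait_for(future, timeout=timeout_s)
    except asyncio.TimeoutError:
        # The worker thread still runs to completion; only the wait is abandoned.
        return None
```

A caller can treat `None` as "status unknown" and keep serving requests, which is the behavior the agent needs while GCS is unreachable.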

To enable Serve HA, we also need to set up:

- RAY_gcs_server_request_timeout_seconds=5
- RAY_SERVE_KV_TIMEOUT_S=5

which we should set in KubeRay.
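As a rough illustration, the two settings above could be added to the environment of a process before Ray is started. The helper name `with_serve_ha_env` is hypothetical; only the two variable names and the value 5 come from the commit message.

```python
import os
from typing import Dict, Mapping, Optional

# Values taken from the commit message; both are timeouts in seconds.
SERVE_HA_ENV = {
    "RAY_gcs_server_request_timeout_seconds": "5",
    "RAY_SERVE_KV_TIMEOUT_S": "5",
}


def with_serve_ha_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    # Hypothetical helper: copy an environment and add the Serve HA settings.
    env = dict(os.environ if base is None else base)
    env.update(SERVE_HA_ENV)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.Popen` when launching the Ray processes; in a KubeRay deployment the same variables would go in the container environment instead.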
2022-08-08 16:29:42 -07:00


from typing import Optional

from ray._private.gcs_utils import GcsAioClient


class HealthChecker:
    def __init__(
        self, gcs_aio_client: GcsAioClient, local_node_address: Optional[str] = None
    ):
        self._gcs_aio_client = gcs_aio_client
        self._local_node_address = local_node_address

    async def check_local_raylet_liveness(self) -> bool:
        # Without a known local address there is nothing to query.
        if self._local_node_address is None:
            return False

        # Ask GCS about the local raylet; the second argument (0.1) is the timeout.
        liveness = await self._gcs_aio_client.check_alive(
            [self._local_node_address.encode()], 0.1
        )
        return liveness[0]

    async def check_gcs_liveness(self) -> bool:
        # An empty query only confirms that GCS answers within the timeout;
        # if it does not, the exception propagates to the caller.
        await self._gcs_aio_client.check_alive([], 0.1)
        return True
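A minimal sketch of how the checker behaves, runnable without Ray. `FakeGcsAioClient` is a hypothetical stand-in that only mimics the `check_alive(addresses, timeout)` call used above, and the `HealthChecker` logic is repeated (without the `GcsAioClient` annotation) so the example is self-contained. The addresses are made up.

```python
import asyncio
from typing import List, Optional, Tuple


class FakeGcsAioClient:
    """Hypothetical stand-in for GcsAioClient; not part of Ray."""

    def __init__(self, alive_addresses: List[str]):
        self._alive = {a.encode() for a in alive_addresses}

    async def check_alive(self, node_ips: List[bytes], timeout: float) -> List[bool]:
        # One boolean per queried address, in the same order.
        return [ip in self._alive for ip in node_ips]


class HealthChecker:
    # Same logic as the class above, repeated so the sketch runs without Ray.
    def __init__(self, gcs_aio_client, local_node_address: Optional[str] = None):
        self._gcs_aio_client = gcs_aio_client
        self._local_node_address = local_node_address

    async def check_local_raylet_liveness(self) -> bool:
        if self._local_node_address is None:
            return False
        liveness = await self._gcs_aio_client.check_alive(
            [self._local_node_address.encode()], 0.1
        )
        return liveness[0]

    async def check_gcs_liveness(self) -> bool:
        await self._gcs_aio_client.check_alive([], 0.1)
        return True


async def demo() -> Tuple[bool, bool, bool, bool]:
    client = FakeGcsAioClient(["10.0.0.1:6379"])
    known = HealthChecker(client, "10.0.0.1:6379")
    unknown = HealthChecker(client, "10.0.0.2:6379")
    no_address = HealthChecker(client)
    return (
        await known.check_local_raylet_liveness(),       # address reported alive
        await unknown.check_local_raylet_liveness(),     # address not reported alive
        await no_address.check_local_raylet_liveness(),  # no address configured
        await known.check_gcs_liveness(),                # fake GCS always answers
    )
```

Running `asyncio.run(demo())` exercises all three raylet cases plus the GCS check. With a real client, `check_gcs_liveness` would raise instead of returning when GCS does not answer in time, so callers need to handle that exception.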