mirror of https://github.com/vale981/ray, synced 2025-03-06 02:21:39 -05:00
## Why are these changes needed?

In https://github.com/ray-project/ray/pull/26405 we added health checks for the GCS and raylets. This PR exposes them as endpoints in the dashboard and the dashboard agent:

- Dashboard: `http://host:port/api/gcs_healthz` sends an RPC directly to the GCS to check whether the GCS is alive.
- Agent: `http://host:port/api/local_raylet_healthz` sends an RPC to the GCS to check whether the local raylet is alive.

We consider the raylet live if either:

- the GCS is dead, or
- the GCS is alive and reports the raylet as alive.

Because the agent reports the raylet as live while the GCS is dead, the raylet crashes itself if the GCS stays dead for more than X seconds (60 by default), so KubeRay can still detect the failure.
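A client such as a liveness probe could poll these endpoints over plain HTTP. The sketch below uses only the Python standard library. The helper name `probe_healthz` is hypothetical, and the code assumes that a passing check answers with a 2xx status while a failing check answers with a non-2xx status (for example 503); an unreachable server is also treated as unhealthy.

```python
import urllib.error
import urllib.request

# Endpoint paths as named in the PR description.
GCS_HEALTHZ_PATH = "/api/gcs_healthz"
RAYLET_HEALTHZ_PATH = "/api/local_raylet_healthz"


def probe_healthz(base_url: str, path: str, timeout: float = 1.0) -> bool:
    """Return True if GET base_url + path answers with a 2xx status.

    Assumption (not confirmed by the PR text): a failed health check is
    reported as a non-2xx HTTP status.
    """
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except urllib.error.HTTPError:
        # urlopen raises HTTPError for 4xx/5xx responses, e.g. 503.
        return False
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: treat as unhealthy.
        return False
```

For example, `probe_healthz("http://127.0.0.1:8265", GCS_HEALTHZ_PATH)` would check the GCS through a dashboard on Ray's default dashboard port 8265; the raylet endpoint is probed the same way against the agent's own host and port. Counting connection errors as failures keeps the probe conservative, since from the outside an unreachable endpoint cannot be told apart from a failed check.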
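The liveness rule and the self-crash timeout work as a pair: the agent gives the raylet the benefit of the doubt while the GCS is dead, and the raylet bounds that window by exiting on its own. The Python model below is only an illustration of that decision logic, not Ray's implementation (the real raylet is written in C++). The names `raylet_is_live`, `GcsWatchdog`, and `GCS_UNAVAILABLE_TIMEOUT_S` are hypothetical; the 60-second default mirrors the one stated in the description.

```python
import time
from typing import Callable

# Hypothetical constant mirroring the "60 by default" in the PR description.
GCS_UNAVAILABLE_TIMEOUT_S = 60.0


def raylet_is_live(gcs_reachable: bool, gcs_reports_alive: bool) -> bool:
    """Agent-side verdict for /api/local_raylet_healthz (sketch).

    If the GCS cannot be reached, the agent cannot ask it, so it reports the
    raylet as live and relies on the raylet's own watchdog. Otherwise it
    trusts the GCS's view of the raylet.
    """
    if not gcs_reachable:
        return True
    return gcs_reports_alive


class GcsWatchdog:
    """Raylet-side sketch: signal exit once the GCS has been unreachable
    for longer than `timeout_s` seconds."""

    def __init__(
        self,
        timeout_s: float = GCS_UNAVAILABLE_TIMEOUT_S,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_ok: float = clock()

    def record_gcs_ok(self) -> None:
        # Call after every successful RPC to the GCS.
        self._last_ok = self._clock()

    def should_exit(self) -> bool:
        # True once the GCS has been silent for longer than the timeout.
        return self._clock() - self._last_ok > self._timeout_s
```

A raylet loop built on this model would call `record_gcs_ok()` after each successful GCS contact and terminate the process when `should_exit()` returns True, at which point the node's failure becomes visible to KubeRay even though the agent never reported it as dead.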
| Name |
|---|
| actor |
| event |
| healthz |
| job |
| log |
| node |
| reporter |
| runtime_env |
| serve |
| snapshot |
| state |
| test |
| tests |
| tune |
| usage_stats |
| __init__.py |
| dashboard_sdk.py |
| version.py |