ray/dashboard
Yi Cheng a68c02a15d
[dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs (#26408)
## Why are these changes needed?
In https://github.com/ray-project/ray/pull/26405 we added health checks for the GCS and raylets.

This PR exposes them through endpoints in the dashboard and the dashboard agent.

For the dashboard, we added `http://host:port/api/gcs_healthz`, which sends an RPC directly to the GCS to check whether the GCS is alive.

For the agent, we added `http://host:port/api/local_raylet_healthz`, which sends an RPC to the GCS to check whether the local raylet is alive.
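
A minimal sketch of how a client might poll these two endpoints, using only the Python standard library. The helper names `healthz_url` and `probe` are hypothetical, the host and port in the usage note are placeholders, and treating HTTP 200 as "healthy" and anything else as "unhealthy" is an assumption, since the response format is not described here.

```python
import urllib.error
import urllib.request


def healthz_url(host: str, port: int, endpoint: str) -> str:
    """Build a URL of the form http://host:port/api/<endpoint> (hypothetical helper)."""
    return f"http://{host}:{port}/api/{endpoint}"


def probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200.

    Assumption: a 200 response means healthy. Any HTTP error status,
    connection failure, or timeout is treated as unhealthy.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # HTTPError (non-2xx) is a subclass of URLError; timeouts and
        # refused connections surface as URLError or OSError.
        return False
```

For example, `probe(healthz_url("127.0.0.1", 8265, "gcs_healthz"))` would check the dashboard endpoint, and the same call with `"local_raylet_healthz"` and the agent's port would check the local raylet. Both address values are illustrative.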

We consider the raylet alive if either:
- the GCS is dead, or
- the GCS is alive and reports the raylet as alive.

If the GCS is dead for more than X seconds (60 by default), the raylet crashes itself, so KubeRay can still catch the failure.
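
The liveness rule above can be sketched as a small pure function. This is an illustration rather than the code in this PR: the function name and parameters are hypothetical, and it assumes the second condition means the GCS is reachable and reports the raylet as alive.

```python
from typing import Optional


def raylet_considered_alive(
    gcs_reachable: bool,
    gcs_reports_raylet_alive: Optional[bool],
) -> bool:
    """Decide whether the local raylet health check should pass.

    gcs_reachable: whether the RPC to the GCS succeeded (hypothetical input).
    gcs_reports_raylet_alive: the GCS's answer, or None if the GCS was unreachable.
    """
    if not gcs_reachable:
        # The GCS is dead, so the raylet's state is unknown. Report it as
        # alive to avoid a false positive; if the GCS stays down past the
        # timeout, the raylet exits on its own and is caught that way.
        return True
    # The GCS is alive: trust its view of the raylet.
    return bool(gcs_reports_raylet_alive)
```

Passing the check while the GCS is unreachable keeps a GCS outage from being misreported as a failure of every raylet at once.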
2022-07-09 13:09:48 -07:00
..
client [Dashboard][Frontend] Worker table enhancement (#25934) 2022-06-21 14:09:48 +08:00
modules [dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs (#26408) 2022-07-09 13:09:48 -07:00
tests Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00
__init__.py [Dashboard] New dashboard skeleton (#9099) 2020-07-27 11:34:47 +08:00
agent.py Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00
BUILD [job submission] Allow passing job_id, return DOES_NOT_EXIST when applicable (#20164) 2021-11-08 23:10:27 -08:00
consts.py [dashboard] Add RAY_CLUSTER_ACTIVITY_HOOK to /api/component_activities (#26297) 2022-07-08 10:51:59 -07:00
dashboard.py [Core][cli][usability] ray stop prints errors during graceful shutdown (#25686) 2022-06-27 08:14:59 -07:00
datacenter.py [Dashboard] fix iterating over GPU processes (#23562) 2022-03-31 17:16:53 -07:00
head.py [dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs (#26408) 2022-07-09 13:09:48 -07:00
http_server_agent.py Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00
http_server_head.py [Core][cli][usability] ray stop prints errors during graceful shutdown (#25686) 2022-06-27 08:14:59 -07:00
k8s_utils.py [dashboard][kubernetes] Dashboard CPU and memory adjustments. (#21688) 2022-03-01 17:15:59 -08:00
memory_utils.py [State Observability] Summary APIs (#25672) 2022-06-22 06:21:50 -07:00
optional_deps.py [Dashboard] Agent in minimal ray installation (#21817) 2022-01-26 04:03:54 -08:00
optional_utils.py Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00
state_aggregator.py [State Observability] Truncate data when there are too many entries to return (#26124) 2022-06-28 18:33:57 -07:00
utils.py Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00