ray/dashboard/modules
Yi Cheng a68c02a15d
[dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs (#26408)
## Why are these changes needed?
https://github.com/ray-project/ray/pull/26405 added health checks for the GCS and raylets.

This PR exposes them through endpoints on the dashboard and the dashboard agent.

For the dashboard, we added `http://host:port/api/gcs_healthz`, which sends an RPC directly to the GCS to check whether the GCS is alive.

For the agent, we added `http://host:port/api/local_raylet_healthz`, which sends an RPC to the GCS to check whether the local raylet is alive.
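
A minimal sketch of how an external prober might call these two endpoints. It assumes that a healthy component answers with HTTP 200 and that anything else (a non-2xx status, a refused connection, a timeout) means unhealthy; the status codes are not stated here. The `host:port` values and the helper names `is_healthy` and `healthz_urls` are hypothetical.

```python
import urllib.error
import urllib.request


def healthz_urls(dashboard: str, agent: str) -> dict:
    """Build the two probe URLs from hypothetical "host:port" strings."""
    return {
        "gcs": f"http://{dashboard}/api/gcs_healthz",
        "local_raylet": f"http://{agent}/api/local_raylet_healthz",
    }


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers with HTTP 200 (assumed success code)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Non-2xx responses, refused connections and timeouts all count as unhealthy.
        return False
```

A caller would pass the dashboard address and the agent address of the node it wants to probe, for example `healthz_urls("10.0.0.1:8265", "10.0.0.2:52365")`, where both addresses are placeholders.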

We consider the raylet alive if either:
- GCS is dead
- GCS is alive and GCS thinks the raylet is alive
If GCS is dead for more than X seconds (60 by default), the raylet terminates itself, so KubeRay can still catch the failure.
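
The liveness rule can be sketched as a small pure function. This is an illustration only, not the code in this PR: it reads the second condition as "GCS is reachable and reports the raylet as alive", and the function and parameter names are hypothetical.

```python
from typing import Optional


def raylet_considered_alive(
    gcs_reachable: bool,
    gcs_reports_raylet_alive: Optional[bool] = None,
) -> bool:
    """Decide what the local raylet health check should report.

    gcs_reachable: whether the RPC to the GCS succeeded.
    gcs_reports_raylet_alive: the GCS's view of the raylet; ignored
        when the GCS could not be reached.
    """
    if not gcs_reachable:
        # GCS is dead: do not blame the raylet. If the outage lasts long
        # enough, the raylet exits on its own and the failure surfaces then.
        return True
    # GCS is alive: trust its view of the raylet.
    return bool(gcs_reports_raylet_alive)
```

Treating an unreachable GCS as "raylet alive" avoids false positives during a GCS outage, at the cost of delaying detection until the raylet's own timeout fires.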
2022-07-09 13:09:48 -07:00
..
actor [api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695) 2022-06-21 15:13:29 -07:00
event [api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695) 2022-06-21 15:13:29 -07:00
healthz [dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs (#26408) 2022-07-09 13:09:48 -07:00
job Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00
log Revert Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds" #26162 (#26163) 2022-06-28 16:07:32 -07:00
node [api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695) 2022-06-21 15:13:29 -07:00
reporter [Dashboard] Fix dashboard RAM and CPU with cgroups2 (#25710) 2022-06-26 14:01:26 -07:00
runtime_env [State Observability] Truncate data when there are too many entries to return (#26124) 2022-06-28 18:33:57 -07:00
serve Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00
snapshot [dashboard] Add RAY_CLUSTER_ACTIVITY_HOOK to /api/component_activities (#26297) 2022-07-08 10:51:59 -07:00
state [State Observability] Truncate data when there are too many entries to return (#26124) 2022-06-28 18:33:57 -07:00
test [api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695) 2022-06-21 15:13:29 -07:00
tests [serve] Reject Ray client addresses when submitting via Dashboard (#23339) 2022-03-21 11:17:51 -05:00
tune [tune] fix set_tune_experiment (#26298) 2022-07-05 15:04:51 -07:00
usage_stats [Usage stats] Add tags & number of nodes to the report. (#25852) 2022-07-07 08:31:04 -07:00
__init__.py [Dashboard] New dashboard skeleton (#9099) 2020-07-27 11:34:47 +08:00
dashboard_sdk.py Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00
version.py Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00