hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

Fork 0

mirror of https://github.com/vale981/ray synced 2025-03-05 10:01:43 -05:00

Commit graph

Author	SHA1	Message	Date
Yi Cheng	dac7bf17d9	[serve] Make serve agent not blocking when GCS is down. (#27526 ) This PR fixed several issue which block serve agent when GCS is down. We need to make sure serve agent is always alive and can make sure the external requests can be sent to the agent and check the status. - internal kv used in dashboard/agent blocks the agent. We use the async one instead - serve controller use ray.nodes which is a blocking call and blocking forever. change to use gcs client with timeout - agent use serve controller client which is a blocking call with max retries = -1. This blocks until controller is back. To enable Serve HA, we also need to setup: - RAY_gcs_server_request_timeout_seconds=5 - RAY_SERVE_KV_TIMEOUT_S=5 which we should set in KubeRay.	2022-08-08 16:29:42 -07:00
Yi Cheng	a68c02a15d	[dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs (#26408 ) ## Why are these changes needed? As in this https://github.com/ray-project/ray/pull/26405 we added the health check for gcs and raylets. This PR expose them in the endpoint in dashboard and dashboard agent. For dashboard, we added `http://host:port/api/gcs_healthz` and it'll send RPC to GCS directly to see whether the GCS is alive or not. For agent, we added `http://host:port/api/local_raylet_healthz` and it'll send RPC to GCS to check whether raylet is alive or not. We think raylet is live if - GCS is dead - GCS is alive but GCS think the raylet is dead If GCS is dead for more than X seconds (60 by default), raylet will just crash itself, so KubeRay can still catch it.	2022-07-09 13:09:48 -07:00

Author

SHA1

Message

Date

Yi Cheng

dac7bf17d9

[serve] Make serve agent not blocking when GCS is down. (#27526 )

This PR fixed several issue which block serve agent when GCS is down. We need to make sure serve agent is always alive and can make sure the external requests can be sent to the agent and check the status.

- internal kv used in dashboard/agent blocks the agent. We use the async one instead
- serve controller use ray.nodes which is a blocking call and blocking forever. change to use gcs client with timeout
- agent use serve controller client which is a blocking call with max retries = -1. This blocks until controller is back.

To enable Serve HA, we also need to setup:

- RAY_gcs_server_request_timeout_seconds=5
- RAY_SERVE_KV_TIMEOUT_S=5

which we should set in KubeRay.

2022-08-08 16:29:42 -07:00

Yi Cheng

a68c02a15d

[dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs (#26408 )

## Why are these changes needed?
As in this https://github.com/ray-project/ray/pull/26405 we added the health check for gcs and raylets.

This PR expose them in the endpoint in dashboard and dashboard agent.

For dashboard, we added `http://host:port/api/gcs_healthz` and it'll send RPC to GCS directly to see whether the GCS is alive or not.

For agent, we added `http://host:port/api/local_raylet_healthz` and it'll send RPC to GCS to check whether raylet is alive or not.

We think raylet is live if
- GCS is dead
- GCS is alive but GCS think the raylet is dead

If GCS is dead for more than X seconds (60 by default), raylet will just crash itself, so KubeRay can still catch it.

2022-07-09 13:09:48 -07:00

2 commits