ray/dashboard
Yi Cheng dac7bf17d9
[serve] Make serve agent not blocking when GCS is down. (#27526)
This PR fixes several issues that block the serve agent when GCS is down. The serve agent must always stay alive so that external requests can still reach it and check its status.

- The internal kv used in dashboard/agent blocks the agent. We use the async one instead.
- The serve controller uses ray.nodes, which is a blocking call that can block forever. Change it to use the gcs client with a timeout.
- The agent uses the serve controller client, which is a blocking call with max retries = -1. This blocks until the controller is back.
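
The common idea behind these fixes, replacing an unbounded blocking call with an awaitable one bounded by a timeout, can be sketched roughly as follows. This is a minimal illustration using only `asyncio`, not the code from this PR: `fetch_kv` and `get_with_timeout` are hypothetical names, and the sleep merely simulates a GCS that is slow or down.

```python
import asyncio
from typing import Optional


async def fetch_kv(key: str, delay_s: float) -> str:
    # Hypothetical stand-in for an async KV lookup (not a Ray API).
    # The sleep simulates how long the backing store takes to answer.
    await asyncio.sleep(delay_s)
    return f"value-for-{key}"


async def get_with_timeout(key: str, delay_s: float, timeout_s: float) -> Optional[str]:
    # Bound the wait so the caller's event loop is never stuck forever.
    try:
        return await asyncio.wait_for(fetch_kv(key, delay_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The caller can report "unavailable" instead of hanging.
        return None


def run_demo():
    fast = asyncio.run(get_with_timeout("a", 0.0, 0.5))    # answers in time
    slow = asyncio.run(get_with_timeout("b", 1.0, 0.05))   # exceeds the timeout
    return fast, slow
```

With a bounded wait like this, a status endpoint can still respond while the backing store is unreachable, which is the behavior the bullets above aim for.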

To enable Serve HA, we also need to set:

- RAY_gcs_server_request_timeout_seconds=5
- RAY_SERVE_KV_TIMEOUT_S=5

which we should set in KubeRay.
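
As a rough sketch of what setting these two variables amounts to, the snippet below merges them into an environment mapping that could be handed to whatever process starts Ray. The variable names and the value 5 come from the list above; `SERVE_HA_ENV` and `build_env` are hypothetical helpers for illustration only, and how KubeRay itself injects the variables is not shown here.

```python
import os
from typing import Dict, Mapping, Optional

# Values taken from the list above; both timeouts are in seconds.
SERVE_HA_ENV = {
    "RAY_gcs_server_request_timeout_seconds": "5",
    "RAY_SERVE_KV_TIMEOUT_S": "5",
}


def build_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    # Hypothetical helper: copy the base environment (defaults to the
    # current process environment) and overlay the Serve HA settings.
    env = dict(os.environ if base is None else base)
    env.update(SERVE_HA_ENV)
    return env
```

The resulting mapping can be passed as the `env` argument of `subprocess.Popen` or `subprocess.run` when launching the process that should see these settings.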
2022-08-08 16:29:42 -07:00
client Add GPU info to new dashboard (#27074) 2022-08-02 15:32:55 -07:00
modules [serve] Make serve agent not blocking when GCS is down. (#27526) 2022-08-08 16:29:42 -07:00
tests Revert "Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (…" (#27308)" (#27613) 2022-08-08 06:38:19 -07:00
__init__.py [Dashboard] New dashboard skeleton (#9099) 2020-07-27 11:34:47 +08:00
agent.py Revert "Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (…" (#27308)" (#27613) 2022-08-08 06:38:19 -07:00
BUILD Revert "Revert "Bump pytest from 5.4.3 to 7.0.1"" (#26525) 2022-07-18 21:21:19 -07:00
consts.py [serve] Make serve agent not blocking when GCS is down. (#27526) 2022-08-08 16:29:42 -07:00
dashboard.py [Usage Stats] Record usage stats when dashboard disabled (#26042) 2022-07-28 23:01:49 -07:00
datacenter.py [Dashboard] Stop caching logs in memory. Use state observability api to fetch on demand. (#26818) 2022-07-26 03:10:57 -07:00
head.py [core] Support external ray dashboard URL (#27396) 2022-08-05 19:33:10 -07:00
http_server_agent.py Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00
http_server_head.py [Core][cli][usability] ray stop prints errors during graceful shutdown (#25686) 2022-06-27 08:14:59 -07:00
k8s_utils.py [dashboard][kubernetes] Dashboard CPU and memory adjustments. (#21688) 2022-03-01 17:15:59 -08:00
memory_utils.py [State Observability] pre-alpha documentation (#26560) 2022-07-26 05:49:28 -07:00
optional_deps.py Make it so pydantic is required before we launch dashboard api server (#27345) 2022-08-03 14:24:51 -07:00
optional_utils.py [serve] Make serve agent not blocking when GCS is down. (#27526) 2022-08-08 16:29:42 -07:00
state_aggregator.py Convert job_manager to be async (#27123) 2022-08-05 19:33:49 -07:00
utils.py Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00