This is a follow-up to https://github.com/ray-project/ray/pull/24813 and https://github.com/ray-project/ray/pull/24628.

Unlike the change in the C++ layer, where resubscription is triggered by GCS broadcasting a request to the raylet/core worker and the client side then resubscribes, in the Python layer we detect the failure on the client side. On failure, the protocol is (see the sketch after this description):

1. Call subscribe.
2. If resubscribing times out, throw an exception, which crashes the process. This is acceptable because if GCS has been down longer than expected, we expect the whole Ray cluster to be down as well.
3. Once subscribe succeeds, continue to poll.

There is one extreme case where this can break: the client might miss detecting a failure entirely. This can happen if a long-poll has returned and the Python layer is doing its own work, and GCS restarts and recovers before the next long-poll is sent. We are not going to handle this case because:

1. GCS usually takes several seconds to come back up, while the Python layer's work between polls is simply pushing data into a queue (sync version). The async version is only used in the dashboard, which is not a critical component.
2. Pubsub in the Python layer does no critical work: it handles logs/errors for Ray jobs.
3. The dashboard can simply be restarted to fix the issue.

A known issue is that we may lose logs during a GCS failure, for two reasons:

- The Python-layer pubsub only does best-effort publishing: if publishing fails too many times, it skips the message (messages lost on the producer side).
- If a message is pushed to GCS before the worker has resubscribed, that message is lost (messages lost on the consumer side).

We consider this reasonable and valid behavior, given that logs are not defined as a critical component and we want to keep the design of pubsub in GCS simple.

Another topic is `run_functions_on_all_workers`. We plan to stop using it within Ray core and to deprecate it in the longer term. It does not cause a problem for the current cases because:

1. It is only set in the driver, and we do not support creating a new driver while GCS is down.
2. While GCS is down, we do not support starting new Ray workers, and `run_functions_on_all_workers` is only used when initializing drivers/workers.
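For concreteness, here is a minimal sketch of the client-side protocol described above. All names here (`subscriber`, `GcsDisconnectedError`, `RESUBSCRIBE_TIMEOUT_S`) are hypothetical illustrations, not Ray's actual pubsub API:

```python
import time

# Assumed bound on how long GCS may be down before we give up (hypothetical).
RESUBSCRIBE_TIMEOUT_S = 30


class GcsDisconnectedError(Exception):
    """Raised when a subscribe/poll call fails because GCS is unreachable."""


def poll_loop(subscriber, handle_message):
    """Long-poll GCS and resubscribe on failure, per the protocol above."""
    subscriber.subscribe()
    while True:
        try:
            # Long-poll; returns a batch of published messages.
            for msg in subscriber.poll():
                handle_message(msg)  # e.g. push into a queue (sync version)
        except GcsDisconnectedError:
            # Step 1: failure detected; retry subscribe until the deadline.
            deadline = time.monotonic() + RESUBSCRIBE_TIMEOUT_S
            while True:
                try:
                    subscriber.subscribe()
                    break  # Step 3: resubscribed, resume polling.
                except GcsDisconnectedError:
                    if time.monotonic() >= deadline:
                        # Step 2: GCS has been down longer than expected;
                        # crash, since the cluster is assumed to be down.
                        raise
                    time.sleep(1)
```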