Signed-off-by: Alan Guo <aguo@anyscale.com>
## Why are these changes needed?
Reduces memory footprint of the dashboard.
Also adds some cleanup to the errors data.
Also cleans up the actor cache by removing dead actors from it.
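A minimal sketch of that cleanup, with illustrative names (not the actual dashboard code): drop actors that reach the DEAD state from the in-memory cache so they no longer contribute to the dashboard's footprint.

```python
# Illustrative sketch only; names and structure are assumptions.
actor_cache: dict = {}

def on_actor_update(actor_id: str, actor_info: dict) -> None:
    if actor_info.get("state") == "DEAD":
        actor_cache.pop(actor_id, None)   # evict dead actors from the cache
    else:
        actor_cache[actor_id] = actor_info
```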
Dashboard UI no longer allows you to see logs for all workers in a node. You must click into each worker's logs individually.
<img width="1739" alt="Screen Shot 2022-07-20 at 9 13 00 PM" src="https://user-images.githubusercontent.com/711935/180128633-1633c187-39c9-493e-b694-009fbb27f73b.png">
## Related issue number
Fixes #23680, fixes #22027, fixes #24272
This PR does 3 things.
1. Warn if callsite is disabled when running `ray list objects` and `ray summary objects` (see the sketch after this list).
2. Decode `owner_id` for `ray list actors`.
3. Support `raise_on_missing_output`.
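A minimal sketch of the callsite warning from item 1, assuming the `RAY_record_ref_creation_sites` environment variable is what enables callsite recording; the warning text is illustrative, not the actual CLI output.

```python
import os
import warnings

def warn_if_callsite_disabled() -> None:
    # Callsite recording is assumed to be gated by this env var.
    if os.environ.get("RAY_record_ref_creation_sites") != "1":
        warnings.warn(
            "Callsite is not recorded, so `ray list objects` / "
            "`ray summary objects` will show unknown callsites. Set "
            "RAY_record_ref_creation_sites=1 before starting Ray to record them.",
            UserWarning,
        )
```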
This PR does 2 things.
(1) Use `api_server_url` for addressing, which is consistent with other submission APIs.
(2) When the API does not respond in a timely manner, print a warning every 5 seconds; below is an example (a minimal sketch of this behavior follows it). This is useful when the API responds slowly (e.g., when there are partial failures). Without this, users would see the API hang for 30 seconds, which is pretty bad UX.
(0.12 / 10 seconds) Waiting for the response from the API server address http://127.0.0.1:8265/api/v0/delay/5.
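A minimal sketch of this behavior, with illustrative function names and timeouts (not the actual client code): run the request in a background thread and print a progress warning while waiting.

```python
import concurrent.futures
import time
import urllib.request

def get_with_progress_warning(url: str, timeout: float = 30.0, warn_interval: float = 5.0):
    start = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(urllib.request.urlopen, url, timeout=timeout)
        while True:
            try:
                # Wake up every `warn_interval` seconds to report progress.
                return future.result(timeout=warn_interval)
            except concurrent.futures.TimeoutError:
                elapsed = time.time() - start
                if elapsed >= timeout:
                    raise TimeoutError(f"No response from {url} after {timeout} seconds")
                print(
                    f"({elapsed:.2f} / {timeout} seconds) Waiting for the response "
                    f"from the API server address {url}."
                )
```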
This is to limit the max number of HTTP requests the dashboard (API server) will accept before rejecting more requests.
This makes sure observability requests do not overload the downstream systems (raylet/GCS) by delegating too many concurrent state observability requests to the cluster.
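A minimal sketch of such a limit, assuming an aiohttp-based server; the middleware name, counter key, limit, and 503 response body are illustrative.

```python
from aiohttp import web

MAX_CONCURRENT_REQUESTS = 100  # illustrative limit, not the actual default

@web.middleware
async def limit_inflight_requests(request, handler):
    # Reject new requests once too many are already in flight, so downstream
    # systems (raylet/GCS) are not flooded with delegated state requests.
    app = request.app
    if app.get("inflight_requests", 0) >= MAX_CONCURRENT_REQUESTS:
        return web.json_response(
            {"result": False, "msg": "Server is overloaded, please retry later."},
            status=503,
        )
    app["inflight_requests"] = app.get("inflight_requests", 0) + 1
    try:
        return await handler(request)
    finally:
        app["inflight_requests"] -= 1

app = web.Application(middlewares=[limit_inflight_requests])
```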
In Ray 2.0, we want to achieve API server HA.
Originally, the Serve endpoints lived on the head node.
This PR moves the Serve endpoints to the dashboard agents, so they become highly available thanks to the multiple replicas of the dashboard agent.
Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.
## Why are these changes needed?
This refactors the interaction between the state CLI and the API server from a hard-coded request workflow to one based on `SubmissionClient`.
See #24956 for more details.
## Summary
<!-- Please give a short summary of the change and the problem this solves. -->
- Created a `StateApiClient` that inherits from `SubmissionClient` and refactored the various listing commands into class methods.
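A minimal sketch of the shape of that refactor; the import path, the `_do_request` helper, and the endpoint layout are assumptions, not the actual Ray source.

```python
from ray.dashboard.modules.dashboard_sdk import SubmissionClient  # assumed path

class StateApiClient(SubmissionClient):
    """Issues state-observability requests against the API server."""

    def list(self, resource: str, options: dict) -> list:
        # `_do_request` is assumed to be the HTTP helper inherited from
        # SubmissionClient; the endpoint layout is illustrative.
        response = self._do_request("GET", f"/api/v0/{resource}", params=options)
        response.raise_for_status()
        return response.json()["data"]["result"]
```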
## Related issue number
Closes #24956, closes #25578
Currently when Raylets die, it is hard to figure out:
- whether a Raylet died at all in the cluster. Usually we have to check the nodes where a number of workers died and see if the Raylet died as well.
- the reason for the Raylet's death.
With this PR, if a Raylet dies for a reason other than SIGTERM, the dashboard agent will report the failure along with the last 20 lines of the Raylet log.
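A minimal sketch of tailing the log; the helper name and example path are illustrative, not the agent's actual code.

```python
from collections import deque
from typing import List

def tail_raylet_log(log_path: str, num_lines: int = 20) -> List[str]:
    # Read the file lazily and keep only the last `num_lines` lines.
    with open(log_path, "r", errors="replace") as f:
        return list(deque(f, maxlen=num_lines))

# Example: tail_raylet_log("/tmp/ray/session_latest/logs/raylet.out")
```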
This is a follow-up PR to https://github.com/ray-project/ray/pull/24813 and https://github.com/ray-project/ray/pull/24628.
Unlike the change in the C++ layer, where resubscription is triggered by GCS broadcasting a request to the raylet/core_worker and the client side doing the resubscription, in the Python layer we detect the failure on the client side.
In case of a failure, the protocol is:
1. call subscribe
2. if the resubscribe times out, throw an exception; this will crash the system. This is OK because when GCS has been down longer than expected, we expect the Ray cluster to be down.
3. continue to poll once the subscribe succeeds (see the sketch after this list).
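A minimal sketch of that protocol; the subscriber object, its methods, the exception type, and the timeout are all illustrative assumptions.

```python
import time

def poll_with_resubscribe(subscriber, handler, resubscribe_timeout_s: float = 60.0):
    while True:
        try:
            msg = subscriber.poll()          # long poll (step 3 once subscribed)
            handler(msg)
        except ConnectionError:
            # Failure detected on the client side: resubscribe (step 1), or
            # crash if GCS stays down longer than expected (step 2).
            deadline = time.time() + resubscribe_timeout_s
            while True:
                try:
                    subscriber.subscribe()
                    break
                except ConnectionError:
                    if time.time() > deadline:
                        raise  # crash the process; GCS has been down too long
                    time.sleep(1)
```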
However, there is an extreme case where things might break: the client might miss detecting a failure.
This could happen if the long poll has returned and the Python layer is doing its own work, and before it sends another long poll, GCS restarts and recovers.
Here we are not going to take care of this case because:
1. usually GCS takes several seconds to come back up, and the Python layer's work is simply pushing data into a queue (sync version). The async version is only used in the Dashboard, which is not a critical component.
2. pubsub in the Python layer is not doing critical work: it handles logs/errors for Ray jobs;
3. for the dashboard, it can just restart to fix the issue.
A known issue here is that we might miss logs in case of GCS failure due to the following reasons:
- Python's pubsub only does best-effort publishing. If publishing fails too many times, it skips the message (messages are lost on the producer side).
- if a message is pushed to GCS but the worker hasn't resubscribed yet, the pushed message will be lost (messages are lost on the consumer side).
We think it's reasonable and valid behavior given that the logs are not defined to be a critical component and we'd like to simplify the design of pubsub in GCS.
Another thing is `run_functions_on_all_workers`. We plan to stop using it within Ray core and deprecate it in the longer term. It won't cause a problem for the current cases because:
1. It's only set in driver and we don't support creating a new driver when GCS is down.
2. When GCS is down, we don't support starting new ray workers.
And `run_functions_on_all_workers` is only used when we initialize driver/workers.
GCS pubsub has been the default for a while. There is little chance that we would need to revert back to Redis pubsub in the future. This is the first step in removing Redis pubsub: removing the `enable_gcs_pubsub()` feature guard.
As we are turning on Redis-less Ray by default, the dashboard doesn't need to talk to Redis anymore. Instead, it should talk to GCS, and GCS can talk to Redis.
This PR moves the SDK to its own folder, then includes everything in `import ray.autoscaler.sdk` in Ray's import path.
Note that doing this naively created circular dependencies, because Ray core now uses constants that were defined in the autoscaler for internal KV operations (and the autoscaler similarly calls into Ray core). The solution was to move those internal KV keys into Ray core constants so the imports flow (more) one way.
Co-authored-by: Alex Wu <alex@anyscale.com>
Currently `wait_until_succeeded_without_exception` is used in the dashboard, and it returns True/False. Unfortunately, there is a lot of code that doesn't assert on this method's return value (which means things are not actually tested).
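A minimal sketch of the pattern, with an assumed helper signature (not the actual implementation); the key point is that tests must assert on the boolean result.

```python
import time

def wait_until_succeeded_without_exception(
    func, exceptions, *args, timeout_ms=1000, retry_interval_ms=100
):
    # Retry `func` until it stops raising `exceptions` or the timeout expires;
    # return True on success, False otherwise (signature assumed).
    deadline = time.time() + timeout_ms / 1000
    while time.time() < deadline:
        try:
            func(*args)
            return True
        except exceptions:
            time.sleep(retry_interval_ms / 1000)
    return False

# Intended usage in tests: assert on the result so failures actually fail.
# assert wait_until_succeeded_without_exception(check_route, (AssertionError,))
```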
This is part of the GCS HA project. This PR tries to bootstrap the dashboard with the GCS address instead of Redis.
Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
## Why are these changes needed?
This is part of the Redis removal project. In this PR, all direct usage of Redis is removed except the function table.
The function table will be migrated in the next PR.
## Related issue number
#19443