hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Yi Cheng	dac7bf17d9	[serve] Make serve agent not blocking when GCS is down. (#27526 ) This PR fixed several issue which block serve agent when GCS is down. We need to make sure serve agent is always alive and can make sure the external requests can be sent to the agent and check the status. - internal kv used in dashboard/agent blocks the agent. We use the async one instead - serve controller use ray.nodes which is a blocking call and blocking forever. change to use gcs client with timeout - agent use serve controller client which is a blocking call with max retries = -1. This blocks until controller is back. To enable Serve HA, we also need to setup: - RAY_gcs_server_request_timeout_seconds=5 - RAY_SERVE_KV_TIMEOUT_S=5 which we should set in KubeRay.	2022-08-08 16:29:42 -07:00
SangBin Cho	6084ee5a63	Revert "Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (…" (#27308 )" (#27613 ) This reverts commit `ccf411604e`.	2022-08-08 06:38:19 -07:00
Jialing He	ccf411604e	Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (…" (#27308 )	2022-08-05 16:32:48 +08:00
SangBin Cho	ec69fec1e0	Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (#26302 )" (#27242 ) This reverts commit `14dee5f6a3`.	2022-07-30 00:08:23 -07:00
SangBin Cho	c1ac2bb80f	[Test] Try fixing a flaky gcs heartbeat manager test. (#27096 ) Heartbeat manager starts its own thread to run its background task and that shares the same data structured used within HandleReportHeartbeat (heartbeats_). That said, both methods should run in the same thread. This achieves it by running HandleReportHeartbeat within the io_service thread	2022-07-28 22:42:13 -07:00
Jialing He	14dee5f6a3	[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (#26302 ) This is the first PR of #25963 : 1. Moved the agent information from `internal KV to `GCSNodeInfo`, 2. raylet registers itself after the agent process finished register. Motivation: Storing agent information in `internal KV` and registering nodes in GCS (write node information to `GCSNodeInfo`) are two asynchronous operations, which will bring some complex timing problems, especially after `raylet` failover	2022-07-28 22:20:28 +08:00
SangBin Cho	2ca11d61b3	[State Observability] Set the default detail formatting as yaml + quicker head node register (#26946 ) ## Why are these changes needed? This PR does 2 things. 1. When `--detail` is specified, set the default formatting as yaml. 2. It seems like it takes 5 seconds to register the head node to the API server (because it gets node info every 5 second, and when the API server just starts, the head node is not registered to GCS). It decreases the node ping frequency until the head node is registered to API server. ## Related issue number Closes https://github.com/ray-project/ray/issues/26939	2022-07-26 13:49:30 -07:00
Alan Guo	50b20809b8	[Dashboard] Stop caching logs in memory. Use state observability api to fetch on demand. (#26818 ) Signed-off-by: Alan Guo <aguo@anyscale.com> ## Why are these changes needed? Reduces memory footprint of the dashboard. Also adds some cleanup to the errors data. Also cleans up actor cache by removing dead actors from the cache. Dashboard UI no longer allows you to see logs for all workers in a node. You must click into each worker's logs individually. <img width="1739" alt="Screen Shot 2022-07-20 at 9 13 00 PM" src="https://user-images.githubusercontent.com/711935/180128633-1633c187-39c9-493e-b694-009fbb27f73b.png"> ## Related issue number fixes #23680 fixes #22027 fixes #24272	2022-07-26 03:10:57 -07:00
Eric Liang	43aa2299e6	[api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695 ) Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.	2022-06-21 15:13:29 -07:00
Kai Yang	4a999777fa	[Core] Allow accepting gRPC HTTP proxy via env variable (#23526 )	2022-05-10 11:30:46 +08:00
Archit Kulkarni	cc864401fb	[Dashboard] Add environment variable flag to skip dashboard log processing (#24263 )	2022-04-27 15:33:08 -07:00
SangBin Cho	73ed67e9e6	[State API] State api limit + Removing unnecessary modules (#24098 ) This PR does Move all routes into the same module, state_head.py Support a limit feature.	2022-04-22 15:59:46 -07:00
SangBin Cho	30ab5458a7	[State Observability] Tasks and Objects API (#23912 ) This PR implements ray list tasks and ray list objects APIs. NOTE: You can ignore the merge conflict for now. It is because the first PR was reverted. There's a fix PR open now.	2022-04-21 18:45:03 -07:00
jon-chuang	ddcc252b51	[Core] Ray logs API (1/n) (#23435 ) Expose HTTP endpoint to retrieve logs from ray cluster	2022-04-20 23:11:02 -07:00
SangBin Cho	1c3329fa38	Revert "Revert "[State Observability] Basic functionality for central… (#23933 ) …ized data (#23744)" (#23918)" This reverts commit `fb14e82`.	2022-04-18 21:15:43 -07:00
Amog Kamsetty	fb14e82242	Revert "[State Observability] Basic functionality for centralized data (#23744 )" (#23918 ) This reverts commit `51a4a1a802`. breaking tune multinode tests and kuberay:test_autoscaling_e2e	2022-04-14 14:28:42 -07:00
SangBin Cho	51a4a1a802	[State Observability] Basic functionality for centralized data (#23744 ) Support listing actor/pg/job/node/workers Design doc: https://docs.google.com/document/d/1IeEsJOiurg-zctOcBjY-tQVbsCmURFSnUCTkx_4a7Cw/edit#heading=h.9ub9e6yvu9p2 Note that this PR doesn't contain any output except ids. I will update them in the follow-up PRs.	2022-04-14 07:33:18 -07:00
Yi Cheng	11bbf00338	[dashboard] Remove redis in dashboard (#22788 ) As we are turning redisless ray by default, dashboard doesn't need to talk with redis anymore. Instead it should talk with gcs and gcs can talk with redis.	2022-03-04 12:32:17 -08:00
Balaji Veeramani	7f1bacc7dc	[CI] Format Python code with Black (#21975 ) See #21316 and #21311 for the motivation behind these changes.	2022-01-29 18:41:57 -08:00
SangBin Cho	e62c0052a0	[Dashboard] Agent in minimal ray installation (#21817 ) This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation. Note that this PR requires to introduce "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them to minimal makes thing very easy. Please see the below for the reasoning.	2022-01-26 04:03:54 -08:00
SangBin Cho	2010f13175	Fix dashboard test bug (#21742 ) Currently `wait_until_succeeded_without_exception` is used in the dashboard, and it returns True/False. Unfortunately, there are lots of code that doesn't assert on this method (which means things are not actually tested).	2022-01-24 11:38:51 -06:00
SangBin Cho	1ae14ec513	[Dashboard] Make dashboard / agent work in minimal ray installation 1/3. (#21774 ) This is the doc that explains how to achieve this: https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit?usp=sharing The fully working e2e prototype is here (it passes all tests): `cdad913883` This PR is pure refactoring. Basically it moves some of util functions that require optional_deps to `optional_utils` so that optional deps' util functions are not used in the minimal installation. Look below to see the steps. <img width="693" alt="Screen Shot 2022-01-21 at 4 38 44 AM" src="https://user-images.githubusercontent.com/18510752/150528494-c3cdedf4-3a66-4557-b540-61436b1dbab6.png">	2022-01-23 21:11:32 -08:00
mwtian	f18a8bd87f	[Dashboard] turn a noisy `info` log into `debug` (#21746 ) Currently, dashboard log contains many repeated entries like `Received a log for 172.31.47.219 and autoscaler` which is too noisy.	2022-01-20 22:08:23 -08:00
mwtian	a4581e58ee	[Pubsub] improve error handling for GCS AIO subscribers in dashboard (#21712 ) - Tolerate GRPC deadline exceeded and transient failures in Python GCS AIO subscribers, which becomes consistent with Python GCS synchronous subscribers. - Tolerate any exception in dashboard for subscribing to logs and error info, which becomes consistent with how dashboard handles GRPC errors for obtaining node stats.	2022-01-20 07:04:54 -08:00
mwtian	6871a72a5c	[Core][Dashboard Pubsub 3/n] Migrate pubsub usages in dashboard to GCS pubsub (#20860 ) Add support for Ray pubsub in dashboard. https://github.com/ray-project/ray/pull/20954 is the prerequisite, and contains more complete change under src/.	2021-12-10 14:36:57 -08:00
mwtian	d751a242d8	[Core][Dashboard Pubsub 2/n] Add resource reporter and actor to Python GCS pubsub (#21001 ) Dashboard contains resource reporter and actor subscribers. Dashboard agent has resource report publisher. So GCS pubsub needs to support these channel types. Also refactor GCS AIO subscribers to have each subscriber per channel. This matches the API of GCS sync subscribers, and make subscribing with multiple channels easier.	2021-12-09 23:10:10 -08:00
mwtian	95c26eec26	[Core][Pubsub][Logging 2/n] Use GCS pubsub for logs (#20492 ) Using Ray pubsub for publishing and subscribing logs via GCS, from Python worker, log importer, dashboard and unit tests. This change is guarded behind the RAY_gcs_grpc_based_pubsub feature flag.	2021-11-30 12:00:44 -08:00
mwtian	da79f24e8c	[Core][Pubsub] Refactor to prepare for migrating logging to Ray pubsub (#20560 ) ## Why are these changes needed? Publisher and subscriber for logs, in driver, dashboard and tests are refactored to make it easier to support using Ray pubsub for logs. Actual support of Ray pubsub for logs will be added later in #20492. This PR does not intend to introduce any behavior change. ## Related issue number	2021-11-19 12:28:37 -08:00
mwtian	0330852baf	[Core][Pubsub] Implement Python GCS publisher and subscriber (#20111 ) ## Why are these changes needed? This change adds Python publisher and subscriber in `gcs_utils.py`, and GRPC handler on GCS for publishing iva GCS. Error info is migrated to use the GCS-based pubsub, if feature flag `RAY_gcs_grpc_based_pubsub=true`. Also, add a `--gcs-address` flag to some Python processes. It is not set anywhere yet, but will be set aftering Redis-less bootstrapping work. Unit tests are added for the Python publisher and subscriber. Migrated error info publishers and subscribers are tested with existing unit tests, e.g. tests calling `ray._private.test_utils.get_error_message()` to ensure error info is published. GCS based pubsub has gaps in handling deadline, cancelled requests and GCS restarts. So 3 more unit tests are disabled in the `HA GCS` mode. They will be addressed in a separate change. ## Related issue number	2021-11-11 14:59:57 -08:00
Yi Cheng	e54d3117a4	[gcs] Update all redis kv usage in python except function table (#20014 ) ## Why are these changes needed? This is part of redis removal project. In this PR all direct usage of redis got removed except function table. Function table will be migrated in the next PR ## Related issue number #19443	2021-11-10 20:24:53 -08:00
Oscar Knagg	5a05e89267	[Core] Add TLS/SSL support to gRPC channels (#18631 )	2021-10-20 22:39:11 -07:00
SangBin Cho	3222d39fb8	[Dashboard] Dashboard memory improvement (#19385 ) * many ppo profiling * completed * improve memory usage lint * revert temporarily * Addressed code review * Fix a test	2021-10-19 19:34:42 -07:00
Edward Oakes	7736cdd91d	[dashboard] Rename "new_dashboard" -> "dashboard" (#18214 )	2021-09-15 11:17:15 -05:00
Clark Zinzow	d958457d07	[Core] Second pass at privatizing APIs. (#17885 ) * gcs_utils * resource_spec * profiling * ray_perf and ray_cluster_perf * test_utils	2021-08-18 20:56:33 -07:00
fyrestone	57b9b1bb0f	[Dashboard] Use a dedicated RPC to check the GCS is alive (#16330 ) * Dashboard check gcs is alive * Fix dashboard hangs at exit * ray health-check call GCS CheckAlive * Minor fixes Co-authored-by: 刘宝 <po.lb@antfin.com>	2021-07-27 14:05:44 +08:00
Amog Kamsetty	8dfd471823	Revert "Revert "[Dashboard][event] Basic event module (#16985 )" (#17068 )" (#17107 ) This reverts commit `c17e171f92`. Co-authored-by: 刘宝 <po.lb@antfin.com>	2021-07-18 12:59:04 +08:00
Amog Kamsetty	c17e171f92	Revert "[Dashboard][event] Basic event module (#16985 )" (#17068 ) This reverts commit `f1faa79a04`.	2021-07-13 23:18:43 -07:00
fyrestone	f1faa79a04	[Dashboard][event] Basic event module (#16985 ) * Basic event module * Fix comments * Set the SCAN_EVENT_DIR_INTERVAL_SECONDS defaults to 2 * Fix lint * Fix lint * Clean code * Try to fix flaky * Fix test * Disable event module by default * Make monitor events task cancellable * Fix error Co-authored-by: 刘宝 <po.lb@antfin.com>	2021-07-13 19:08:39 -07:00
fyrestone	dfadf33a94	[Dashboard] Reorganize dashboard modules - node (#16217 )	2021-06-07 19:50:46 -07:00

39 commits