hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-09 04:46:38 -04:00

Author	SHA1	Message	Date
mwtian	65d7a610ab	[Core] Push message to driver when a Raylet dies (#25516 ) Currently when Raylets die, it is hard to figure out: if a Raylet died at all in a cluster. Usually we have to check on nodes where a number of workers died and see if the Raylet has died as well. reason of Raylet's death. With this PR, if a Raylet dies from a reason other than SIGTERM, the dashboard agent will report the failure along with last 20 lines of the Raylet log.	2022-06-09 05:54:34 -07:00
mwtian	fa32cb7c40	Revert "[core] Resubscribe GCS in python when GCS restarts. (#24887 )" (#25168 ) This reverts commit `7cf4233858`.	2022-05-24 18:13:40 -07:00
Yi Cheng	7cf4233858	[core] Resubscribe GCS in python when GCS restarts. (#24887 ) This is a follow-up PRs of https://github.com/ray-project/ray/pull/24813 and https://github.com/ray-project/ray/pull/24628 Unlike the change in cpp layer, where the resubscription is done by GCS broadcast a request to raylet/core_worker and the client-side do the resubscription, in the python layer, we detect the failure in the client-side. In case of a failure, the protocol is: 1. call subscribe 2. if timeout when doing resubscribe, throw an exception and this will crash the system. This is ok because when GCS has been down for a time longer than expected, we expect the ray cluster to be down. 3. continue to poll once subscribe ok. However, there is an extreme case where things might be broken: the client might miss detecting a failure. This could happen if the long-polling has been returned and the python layer is doing its own work. And before it sends another long-polling, GCS restarts and recovered. Here we are not going to take care of this case because: 1. usually GCS is going to take several seconds to be up and the python layer's work is simply pushing data into a queue (sync version). For the async version, it's only used in Dashboard which is not a critical component. 2. pubsub in python layer is not doing critical work: it handles logs/errors for ray job; 3. for the dashboard, it can just restart to fix the issue. A known issue here is that we might miss logs in case of GCS failure due to the following reasons: - py's pubsub is only doing best effort publishing. If it failed too many times, it'll skip publishing the message (lose messages from producer side) - if message is pushed to GCS, but the worker hasn't done resubscription yet, the pushed message will be lost (lose messages from consumer side) We think it's reasonable and valid behavior given that the logs are not defined to be a critical component and we'd like to simplify the design of pubsub in GCS. Another things is `run_functions_on_all_workers`. We'll plan to stop using it within ray core and deprecate it in the longer term. But it won't cause a problem for the current cases because: 1. It's only set in driver and we don't support creating a new driver when GCS is down. 2. When GCS is down, we don't support starting new ray workers. And `run_functions_on_all_workers` is only used when we initialize driver/workers.	2022-05-23 13:06:33 -07:00
Kai Yang	4a999777fa	[Core] Allow accepting gRPC HTTP proxy via env variable (#23526 )	2022-05-10 11:30:46 +08:00
shrekris-anyscale	b00977b1b1	[serve] Remove dashboard's dependency on Serve (#23389 )	2022-03-21 22:14:41 -07:00
mwtian	72ef9f91aa	[Remove Redis Pubsub 1/n] Remove `enable_gcs_pubsub()` (#23189 ) GCS pubsub has been the default for awhile. There is little chance that we would need to revert back to Redis pubsub in future. This is the step in removing Redis pubsub, by first removing the `enable_gcs_pubsub()` feature guard.	2022-03-15 23:56:15 -07:00
Guyang Song	f65971756d	[dashboard agent] Catch agent port conflict (#23024 )	2022-03-15 16:09:15 +08:00
Yi Cheng	11bbf00338	[dashboard] Remove redis in dashboard (#22788 ) As we are turning redisless ray by default, dashboard doesn't need to talk with redis anymore. Instead it should talk with gcs and gcs can talk with redis.	2022-03-04 12:32:17 -08:00
Guyang Song	902243fb03	[runtime env] support raylet sharing fate with agent (#22382 ) - Remove the agent restart feature. - Raylet shares fate with agent to make the failover logic easier. Refer to issue https://github.com/ray-project/ray/issues/21695#issuecomment-1032161528	2022-02-21 18:16:21 +08:00
SangBin Cho	20ab9188c6	[Ray Usage Stats] Record cluster metadata + Refactoring. (#22170 ) This is the first PR to implement usage stats on Ray. Please refer to the file `usage_lib.py` for more details. The full specification is here https://docs.google.com/document/d/1ZT-l9YbGHh-iWRUC91jS-ssQ5Qe2UQ43Lsoc1edCalc/edit#heading=h.17dss3b9evbj. You can see the full PR for phase 1 from here; https://github.com/rkooo567/ray/pull/108/files. The PR is doing some basic refactoring + adding cluster metadata to GCS instead of the version numbers. After this PR, we will add code to enable usage report "off by default".	2022-02-08 22:12:36 -08:00
SangBin Cho	3566cfd279	[Dashboard] Enable dashboard in the minimal ray installation (#21896 ) This is the last PR to enable dashboard in the minimal ray installation. Look https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit# for more details;	2022-01-31 22:34:40 -08:00
Balaji Veeramani	7f1bacc7dc	[CI] Format Python code with Black (#21975 ) See #21316 and #21311 for the motivation behind these changes.	2022-01-29 18:41:57 -08:00
Alex Wu	7a45f60dbc	[autoscaler] Fix ray.autoscaler.sdk import issue (#21795 ) This PR moves the sdk to its own folder, then includes everything in `import ray.autoscaler.sdk` in ray's import path. Note: that there were circular dependencies in naively doing this because the ray core now uses constants that were defined in the autoscaler for internal kv operations (and the autoscaler similarly calls into the ray core). The solution was to move those internal kv keys into ray core constants so the imports flow (more) one way. Co-authored-by: Alex Wu <alex@anyscale.com>	2022-01-25 14:43:24 -08:00
SangBin Cho	2010f13175	Fix dashboard test bug (#21742 ) Currently `wait_until_succeeded_without_exception` is used in the dashboard, and it returns True/False. Unfortunately, there are lots of code that doesn't assert on this method (which means things are not actually tested).	2022-01-24 11:38:51 -06:00
SangBin Cho	1ae14ec513	[Dashboard] Make dashboard / agent work in minimal ray installation 1/3. (#21774 ) This is the doc that explains how to achieve this: https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit?usp=sharing The fully working e2e prototype is here (it passes all tests): `cdad913883` This PR is pure refactoring. Basically it moves some of util functions that require optional_deps to `optional_utils` so that optional deps' util functions are not used in the minimal installation. Look below to see the steps. <img width="693" alt="Screen Shot 2022-01-21 at 4 38 44 AM" src="https://user-images.githubusercontent.com/18510752/150528494-c3cdedf4-3a66-4557-b540-61436b1dbab6.png">	2022-01-23 21:11:32 -08:00
Yi Cheng	6dccfbffa9	Revert "Revert "[gcs] turn on grpc pubsub by default"" (#21585 ) Reverts ray-project/ray#21584 and turn the flag off	2022-01-13 16:12:03 -08:00
Yi Cheng	bc696212d2	Revert "[gcs] turn on grpc pubsub by default" (#21584 ) test-reconnect seems flaky. Reverts ray-project/ray#21513	2022-01-13 12:34:02 -08:00
Yi Cheng	6194783312	[gcs] turn on grpc pubsub by default (#21513 ) Turn on grpc pubsub by default. This PR also fixed several tests which are failed before. Co-authored-by: Mingwei Tian <mwtian@anyscale.com>	2022-01-12 22:13:03 -08:00
Yi Cheng	09421a4ca6	[2/gcs] Bootstrap dashboard for gcs ha (#21179 ) This is part of gcs ha project. This PR try to bootstrap dashboard with gcs address instead of redis. Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>	2021-12-21 16:58:03 -08:00
Yi Cheng	e54d3117a4	[gcs] Update all redis kv usage in python except function table (#20014 ) ## Why are these changes needed? This is part of redis removal project. In this PR all direct usage of redis got removed except function table. Function table will be migrated in the next PR ## Related issue number #19443	2021-11-10 20:24:53 -08:00
Matti Picus	f372bb07aa	Enable dashboard on Windows (#19319 )	2021-10-14 14:42:22 -07:00
SangBin Cho	7fcf1bf57e	[Dashboard] Refine the dashboard restart logic. (#18973 ) * in progress * Refine the dashboard agent retry logic * refine * done * lint	2021-10-04 05:01:51 -07:00
Eric Liang	11a2dfcaab	Improve unschedulable task warning messages by integrating with the autoscaler (#18724 )	2021-09-24 12:19:58 -07:00
Edward Oakes	7736cdd91d	[dashboard] Rename "new_dashboard" -> "dashboard" (#18214 )	2021-09-15 11:17:15 -05:00
Clark Zinzow	d958457d07	[Core] Second pass at privatizing APIs. (#17885 ) * gcs_utils * resource_spec * profiling * ray_perf and ray_cluster_perf * test_utils	2021-08-18 20:56:33 -07:00
fyrestone	57b9b1bb0f	[Dashboard] Use a dedicated RPC to check the GCS is alive (#16330 ) * Dashboard check gcs is alive * Fix dashboard hangs at exit * ray health-check call GCS CheckAlive * Minor fixes Co-authored-by: 刘宝 <po.lb@antfin.com>	2021-07-27 14:05:44 +08:00
Amog Kamsetty	8dfd471823	Revert "Revert "[Dashboard][event] Basic event module (#16985 )" (#17068 )" (#17107 ) This reverts commit `c17e171f92`. Co-authored-by: 刘宝 <po.lb@antfin.com>	2021-07-18 12:59:04 +08:00
Amog Kamsetty	c17e171f92	Revert "[Dashboard][event] Basic event module (#16985 )" (#17068 ) This reverts commit `f1faa79a04`.	2021-07-13 23:18:43 -07:00
fyrestone	f1faa79a04	[Dashboard][event] Basic event module (#16985 ) * Basic event module * Fix comments * Set the SCAN_EVENT_DIR_INTERVAL_SECONDS defaults to 2 * Fix lint * Fix lint * Clean code * Try to fix flaky * Fix test * Disable event module by default * Make monitor events task cancellable * Fix error Co-authored-by: 刘宝 <po.lb@antfin.com>	2021-07-13 19:08:39 -07:00
fyrestone	4ca316a0f4	Move test_snapshot from test_dashboard.py to modules/snapshot/tests/test_snapshot.py (#16306 ) Co-authored-by: 刘宝 <po.lb@antfin.com>	2021-06-08 10:26:03 -07:00
fyrestone	dfadf33a94	[Dashboard] Reorganize dashboard modules - node (#16217 )	2021-06-07 19:50:46 -07:00
Alex Wu	cd2fc7792f	[dashboard] Snapshot of cluster state (#15868 )	2021-05-20 08:10:32 -07:00
Amog Kamsetty	ebc44c3d76	[CI] Upgrade flake8 to 3.9.1 (#15527 ) * formatting * format util * format release * format rllib/agents * format rllib/env * format rllib/execution * format rllib/evaluation * format rllib/examples * format rllib/policy * format rllib utils and tests * format streaming * more formatting * update requirements files * fix rllib type checking * updates * update * fix circular import * Update python/ray/tests/test_runtime_env.py * noqa	2021-05-03 14:23:28 -07:00
fyrestone	43de7f48a7	Fix reported dashboard ip when using 0.0.0.0 (#15506 )	2021-04-27 23:48:22 +08:00
fyrestone	5e76a51d56	[Dashboard] Select port in dashboard (#13763 ) * Dashboard select port; Fix dashboard may hangs when exit * Add test case * Fix * Fix test_stats_collector.py::test_get_all_node_details * Refine dashboard error messages * Refine code * Refine code * Show last 10 lines of dashboard log if start dashboard failed * Fix ValueError: too many values to unpack (expected 2) when getsockname * Fix test_multi_node_3.py::test_calling_start_ray_head may fail * Fix Windows CI * Disable dashboard in C++ test * Refine code * Fix issue 7084 Co-authored-by: 刘宝 <po.lb@antfin.com>	2021-02-23 16:27:48 -08:00
Xianyang Liu	4ecd29ea2b	[dashboard] Fixes dashboard issues when environments have set http_proxy (#12598 ) * fixes ray start with http_proxy * format * fixes * fixes * increase timeout * address comments	2021-01-21 20:10:01 -08:00
Simon Mo	dac8b3d58a	[CI] Enable Dashboard tests for master (#13425 )	2021-01-15 09:43:34 -08:00
Alex Wu	8df94e33e0	[Autoscaler] New output log format (#12772 )	2020-12-23 12:02:55 -08:00
Edward Oakes	261b2f9053	Check for raylet PID as ppid in dashboard agent fate-sharing (#12867 )	2020-12-15 12:13:11 -06:00
Max Fitton	d0813c1c58	[Dashboard] Add dashboard multi-node churn test (#11768 )	2020-12-14 17:03:33 -06:00
Stephanie Wang	a776209aec	Revert "Fix dashboard agent check ppid is raylet pid (#12256 )" (#12729 ) This reverts commit `3ce9286977`.	2020-12-09 17:20:38 -05:00
fyrestone	3ce9286977	Fix dashboard agent check ppid is raylet pid (#12256 ) * Dashboard agent check ppid is raylet pid * Improve implementation * Refine code * Make the RAY_NODE_PID environment required for dashboard agent Co-authored-by: 刘宝 <po.lb@antfin.com>	2020-12-09 09:12:34 -05:00
Max Fitton	2708b3abbc	[Dashboard][Bug] Fix duplicate node total rows in dashboard (#12410 ) * Fix duplicate node total rows in dashboard by changing the react key of the NodeTotalRow component from the node IP to the node ID (node IP can be duplicated in the case of docker). * simplify a piece of test code and fix a flaky time out * lint	2020-11-30 18:43:09 -08:00
SangBin Cho	753cda2f28	[Dashboard] Delete old dashboard (#12144 ) * Delete old dashboard from repo. * Delete old dashboard from repo. 2	2020-11-25 11:31:02 -08:00
fyrestone	05ad4c7499	[Dashboard] Optimize dashboard datacenter (#11391 ) * Optimize dashboard datacenter * Fix tests * Fix tests * Fix * Fix CI * python/build-wheel-macos.sh Co-authored-by: 刘宝 <po.lb@antfin.com> Co-authored-by: Max Fitton <maxfitton@anyscale.com>	2020-10-27 23:49:31 -07:00
Edward Oakes	798bd6a359	[dashboard] Add /api/cluster_status endpoint (#11456 )	2020-10-19 11:00:47 -05:00
fyrestone	defd41aad7	[Dashboard] http route handler cache (#10921 ) * Add aiohttp_cache to dashboard * Add comments; Refine code * Keep NODE_STATS_UPDATE_INTERVAL_SECONDS 1 second; Change AIOHTTP_CACHE_TTL_SECONDS to 2 seconds * Update merge Co-authored-by: 刘宝 <po.lb@antfin.com>	2020-10-09 22:27:05 -07:00
fyrestone	50784e2496	[Dashboard] Dashboard node grouping (#10528 ) * Add RAY_NODE_ID environment var to agent * Node ralated data use node id as key * ray.init() return node id; Pass test_reporter.py * Fix lint & CI * Fix comments * Minor fixes * Fix CI * Add const to ClientID in AgentManager::Options * Use fstring * Add comments * Fix lint * Add test_multi_nodes_info Co-authored-by: 刘宝 <po.lb@antfin.com>	2020-09-16 10:17:29 -07:00
Edward Oakes	523705ac0f	Fix new dashboard test process check (#10584 )	2020-09-04 22:04:44 -05:00
fyrestone	e9b046306a	[Dashboard] Dashboard basic modules (#10303 ) * Improve reporter module * Add test_node_physical_stats to test_reporter.py * Add test_class_method_route_table to test_dashboard.py * Add stats_collector module for dashboard * Subscribe actor table data * Add log module for dashboard * Only enable test module in some test cases * CI run all dashboard tests * Reduce test timeout to 10s * Use fstring * Remove unused code * Remove blank line * Fix dashboard tests * Fix asyncio.create_task not available in py36; Fix lint * Add format_web_url to ray.test_utils * Update dashboard/modules/reporter/reporter_head.py Co-authored-by: Max Fitton <mfitton@berkeley.edu> * Add DictChangeItem type for Dict change * Refine logger.exception * Refine GET /api/launch_profiling * Remove disable_test_module fixture * Fix test_basic may fail Co-authored-by: 刘宝 <po.lb@antfin.com> Co-authored-by: Max Fitton <mfitton@berkeley.edu>	2020-08-29 23:09:34 -07:00

1 2

54 commits