Commit graph

32 commits

Author SHA1 Message Date
Alan Guo
50b20809b8
[Dashboard] Stop caching logs in memory. Use state observability api to fetch on demand. (#26818)
Signed-off-by: Alan Guo <aguo@anyscale.com>

## Why are these changes needed?
Reduces memory footprint of the dashboard.
Also adds some cleanup to the errors data.

Also cleans up actor cache by removing dead actors from the cache.

Dashboard UI no longer allows you to see logs for all workers in a node. You must click into each worker's logs individually.
<img width="1739" alt="Screen Shot 2022-07-20 at 9 13 00 PM" src="https://user-images.githubusercontent.com/711935/180128633-1633c187-39c9-493e-b694-009fbb27f73b.png">


## Related issue number
fixes #23680 
fixes #22027
fixes #24272
2022-07-26 03:10:57 -07:00
mwtian
1a4c3c07f7
[Dashboard] fix iterating over GPU processes (#23562)
Current logic looks broken, as reported in #22954 (comment)

I fixed the logic as best as I can, and tested it on Anyscale platform with GPU. No process info was reported from gpustat. But the logic works under this case.
2022-03-31 17:16:53 -07:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
mwtian
e8ce01c525
[Dashboard] offload blocking work to a thread pool (#21762)
Currently, GCS KV client only has blocking API. Calling them from dashboard event loop can block other operations for many seconds, leading to failures such as taking too long (> 2min) to submit a job and making nightly tests fail (#21699). This PR offloads the blocking work to a separate thread. Implementing async GCS KV API will be done in the future.
2022-01-21 17:55:11 -08:00
SangBin Cho
3222d39fb8
[Dashboard] Dashboard memory improvement (#19385)
* many ppo profiling

* completed

* improve memory usage lint

* revert temporarily

* Addressed code review

* Fix a test
2021-10-19 19:34:42 -07:00
Edward Oakes
7736cdd91d
[dashboard] Rename "new_dashboard" -> "dashboard" (#18214) 2021-09-15 11:17:15 -05:00
Amog Kamsetty
8dfd471823
Revert "Revert "[Dashboard][event] Basic event module (#16985)" (#17068)" (#17107)
This reverts commit c17e171f92.

Co-authored-by: 刘宝 <po.lb@antfin.com>
2021-07-18 12:59:04 +08:00
Amog Kamsetty
c17e171f92
Revert "[Dashboard][event] Basic event module (#16985)" (#17068)
This reverts commit f1faa79a04.
2021-07-13 23:18:43 -07:00
fyrestone
f1faa79a04
[Dashboard][event] Basic event module (#16985)
* Basic event module

* Fix comments

* Set the SCAN_EVENT_DIR_INTERVAL_SECONDS defaults to 2

* Fix lint

* Fix lint

* Clean code

* Try to fix flaky

* Fix test

* Disable event module by default

* Make monitor events task cancellable

* Fix error

Co-authored-by: 刘宝 <po.lb@antfin.com>
2021-07-13 19:08:39 -07:00
Amog Kamsetty
a14342ce6f
Revert "[Dashboard][event] Basic event module (#16698)" (#17004)
This reverts commit 66ea099897.
2021-07-12 11:22:46 -07:00
fyrestone
66ea099897
[Dashboard][event] Basic event module (#16698)
* Basic event module

* Fix comments

* Set the SCAN_EVENT_DIR_INTERVAL_SECONDS defaults to 2

* Fix lint

* Fix lint

* Clean code

* Try to fix flaky

* Fix test

* Disable event module by default

Co-authored-by: 刘宝 <po.lb@antfin.com>
2021-07-09 10:25:30 -07:00
architkulkarni
06dfd8dddb
Revert "[Dashboard][event] Basic event module (#16283)" (#16676)
This reverts commit 5afa53aa64.
2021-06-25 09:38:18 -07:00
fyrestone
5afa53aa64
[Dashboard][event] Basic event module (#16283) 2021-06-25 13:59:02 +08:00
fyrestone
c53893cb13
[Dashboard] Reorganize dashboard modules - actor (#16170) 2021-06-02 06:58:30 -07:00
SongGuyang
b8ff86adb9
Add objectStore stats to dashboard API. (#15677) 2021-05-10 11:32:14 -05:00
fyrestone
4853aa96cb
[Dashboard] Fix missing actor pid (#13229) 2021-01-13 16:45:12 +08:00
fyrestone
6a54897577
Job module without submission (#13081)
Co-authored-by: 刘宝 <po.lb@antfin.com>
2020-12-31 11:12:17 +08:00
fyrestone
62a5832007
[Dashboard] Add GET /logical/actors API (#12913) 2020-12-23 11:14:23 +08:00
Max Fitton
ac24d1db30
[Dashboard][Bugfix] Fix GPU List Bug (#12666)
* Fix bug where None was passed as the empty value for ActorInfo.gpu_stats instead of an empty list

* lint

* dashboard/modules/logical_view

* fix test

* trigger build
2020-12-12 23:34:24 -08:00
Max Fitton
2e95552f0c
[Dashboard] Defensive change to make sure we do not iterate over "None" in the case that workers is not present in node physical stats for a given node (#12358) 2020-11-25 11:06:45 -08:00
SangBin Cho
5fb410cfbf
[Dashboard] New dashboard view data doesn't exist. (#12129)
* Fix.

* Fix the issue.
2020-11-19 11:04:59 -08:00
Max Fitton
f545418c3f
[Dashboard] Fix dashboard regression caused by logCount and errCount being removed from worker payload (#11954) 2020-11-11 14:55:54 -08:00
Max Fitton
368b14a0da
Stop dashboard from erroring when an actor does not have a corresponding core worker (#11870) 2020-11-09 11:36:34 -06:00
fyrestone
05ad4c7499
[Dashboard] Optimize dashboard datacenter (#11391)
* Optimize dashboard datacenter

* Fix tests

* Fix tests

* Fix

* Fix CI

* python/build-wheel-macos.sh

Co-authored-by: 刘宝 <po.lb@antfin.com>
Co-authored-by: Max Fitton <maxfitton@anyscale.com>
2020-10-27 23:49:31 -07:00
Max Fitton
caf3b04b27
[Dashboard] Turn on new dashboard by default pt 2 (#11510) 2020-10-23 15:52:14 -05:00
Max Fitton
cdca5af53b
Revert "[Dashboard] Turn on New Dashboard by Default (#11321)" (#11502)
This reverts commit f500292d41.
2020-10-20 10:53:10 -05:00
Max Fitton
f500292d41
[Dashboard] Turn on New Dashboard by Default (#11321) 2020-10-19 12:31:11 -05:00
Max Fitton
cd9dcfca0d
[Dashboard] CPU/GPU usage details in actor pane (#11269) 2020-10-13 20:23:23 -05:00
Max Fitton
ff6d412ad9
[Dashboard] Add API support for the logical view and machine view in new backend (#11012)
* Add API support for the logical view and machine view, which lean on datacenter in common.

* Update dashboard/datacenter.py

Co-authored-by: fyrestone <fyrestone@outlook.com>

* Update dashboard/modules/logical_view/logical_view_head.py

Co-authored-by: fyrestone <fyrestone@outlook.com>

* Address PR comments

* lint

* Add dashboard tests to CI build

* Fix integration issues

* lint

Co-authored-by: Max Fitton <max@semprehealth.com>
Co-authored-by: fyrestone <fyrestone@outlook.com>
2020-10-02 17:58:44 -07:00
fyrestone
50784e2496
[Dashboard] Dashboard node grouping (#10528)
* Add RAY_NODE_ID environment var to agent

* Node ralated data use node id as key

* ray.init() return node id; Pass test_reporter.py

* Fix lint & CI

* Fix comments

* Minor fixes

* Fix CI

* Add const to ClientID in AgentManager::Options

* Use fstring

* Add comments

* Fix lint

* Add test_multi_nodes_info

Co-authored-by: 刘宝 <po.lb@antfin.com>
2020-09-16 10:17:29 -07:00
fyrestone
e9b046306a
[Dashboard] Dashboard basic modules (#10303)
* Improve reporter module

* Add test_node_physical_stats to test_reporter.py

* Add test_class_method_route_table to test_dashboard.py

* Add stats_collector module for dashboard

* Subscribe actor table data

* Add log module for dashboard

* Only enable test module in some test cases

* CI run all dashboard tests

* Reduce test timeout to 10s

* Use fstring

* Remove unused code

* Remove blank line

* Fix dashboard tests

* Fix asyncio.create_task not available in py36; Fix lint

* Add format_web_url to ray.test_utils

* Update dashboard/modules/reporter/reporter_head.py

Co-authored-by: Max Fitton <mfitton@berkeley.edu>

* Add DictChangeItem type for Dict change

* Refine logger.exception

* Refine GET /api/launch_profiling

* Remove disable_test_module fixture

* Fix test_basic may fail

Co-authored-by: 刘宝 <po.lb@antfin.com>
Co-authored-by: Max Fitton <mfitton@berkeley.edu>
2020-08-29 23:09:34 -07:00
fyrestone
4d08ddbf24
[Dashboard] New dashboard skeleton (#9099) 2020-07-27 11:34:47 +08:00