73 commits

Author SHA1 Message Date
Amog Kamsetty
c17e171f92
Revert "[Dashboard][event] Basic event module (#16985)" (#17068)
This reverts commit f1faa79a04.
2021-07-13 23:18:43 -07:00
Amog Kamsetty
7ec18f671a
[Core] Remove gpustat from core dependencies (#17059) 2021-07-13 21:22:02 -07:00
fyrestone
f1faa79a04
[Dashboard][event] Basic event module (#16985)
* Basic event module

* Fix comments

* Set the SCAN_EVENT_DIR_INTERVAL_SECONDS defaults to 2

* Fix lint

* Fix lint

* Clean code

* Try to fix flaky

* Fix test

* Disable event module by default

* Make monitor events task cancellable

* Fix error

Co-authored-by: 刘宝 <po.lb@antfin.com>
2021-07-13 19:08:39 -07:00
Amog Kamsetty
a14342ce6f
Revert "[Dashboard][event] Basic event module (#16698)" (#17004)
This reverts commit 66ea099897.
2021-07-12 11:22:46 -07:00
fyrestone
66ea099897
[Dashboard][event] Basic event module (#16698)
* Basic event module

* Fix comments

* Set the SCAN_EVENT_DIR_INTERVAL_SECONDS defaults to 2

* Fix lint

* Fix lint

* Clean code

* Try to fix flaky

* Fix test

* Disable event module by default

Co-authored-by: 刘宝 <po.lb@antfin.com>
2021-07-09 10:25:30 -07:00
architkulkarni
06dfd8dddb
Revert "[Dashboard][event] Basic event module (#16283)" (#16676)
This reverts commit 5afa53aa64.
2021-06-25 09:38:18 -07:00
SongGuyang
e74d9d3ded
[runtime env] Download runtime env(conda) in agent instead of setup_worker (#16525) 2021-06-25 19:39:05 +08:00
fyrestone
5afa53aa64
[Dashboard][event] Basic event module (#16283) 2021-06-25 13:59:02 +08:00
SongGuyang
874e947d6f
[runtime env] support create or delete runtime envs in agent (#15904) 2021-06-09 20:22:25 +08:00
fyrestone
4ca316a0f4
Move test_snapshot from test_dashboard.py to modules/snapshot/tests/test_snapshot.py (#16306)
Co-authored-by: 刘宝 <po.lb@antfin.com>
2021-06-08 10:26:03 -07:00
fyrestone
dfadf33a94
[Dashboard] Reorganize dashboard modules - node (#16217) 2021-06-07 19:50:46 -07:00
Alex Wu
e1da31f149
[dashboard] Include ray session name in dashboard snapshot (#16199)
* .

* .

* .

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-06-02 15:07:06 -07:00
fyrestone
c53893cb13
[Dashboard] Reorganize dashboard modules - actor (#16170) 2021-06-02 06:58:30 -07:00
Alex Wu
f080911d9b
[dashboard] include worker id in actor snapshot (#15967)
Co-authored-by: Alex Wu <alex@anyscale.com>
2021-05-21 09:26:37 -07:00
Alex Wu
cd2fc7792f
[dashboard] Snapshot of cluster state (#15868) 2021-05-20 08:10:32 -07:00
fyrestone
56c309416e
[Job submission] Basic job submission structure (#15103) 2021-05-12 15:08:20 +08:00
Ian Rodney
90ce25cb35
[dashboard] Avoid global min_workers (#15660) 2021-05-10 15:47:51 -07:00
SongGuyang
b8ff86adb9
Add objectStore stats to dashboard API. (#15677) 2021-05-10 11:32:14 -05:00
Ian Rodney
546e5f6f13
[API] Remove non-API top Level function imports (#15440) 2021-04-27 12:33:59 -07:00
Dmitri Gekhtman
410f768046
[Kubernetes] [Dashboard] Remove disk data from dashboard when running on K8s. (#14676) 2021-04-05 17:16:20 -07:00
Clark Zinzow
1a9ba19012
[Core] Adds deprecation decorator and fixes privatization of a few APIs. (#14811) 2021-03-22 10:31:50 -07:00
Ian Rodney
eb12033612
[Code Cleanup] Switch to use ray.util.get_node_ip_address() (#14741)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-03-18 13:10:57 -07:00
Lixin Wei
72d87093b9
[Core] Make Actor DEAD and Save Exceptions in GCS When Error Happens in Constructor (#14211) 2021-03-17 12:50:28 -07:00
Kathryn Zhou
01dda99b8c
Export cluster statistics to Prometheus (#14612) 2021-03-15 19:28:13 -07:00
Dmitri Gekhtman
6babd1928c
[Kubernetes][dashboard][minor] Fix uptime (#14655) 2021-03-12 18:30:13 -06:00
Dmitri Gekhtman
a90cffe26c
[dashboard][k8s] Better CPU reporting when running on K8s (#14593) 2021-03-12 12:02:15 -06:00
Clark Zinzow
5a788474aa
[Core] First pass at privatizing non-public Python APIs. (#14607)
* async_compat

* utils

* cluster_utils

* compat

* function_manager

* import_thread

* memory_monitor

* monitor, log_monitor, ray_process_reaper

* metrics_agent

* parameter

* prometheus_exporter

* ray_logging

* signature
2021-03-10 22:47:28 -08:00
Dmitri Gekhtman
4a7d9e71bb
[dashboard][kubernetes] Show container's memory info on K8s, not the physical host's. (#14499)
* random doc typo

* more reasonable memory output

* no if

* get rid of comment
2021-03-08 18:59:41 -08:00
fyrestone
3616424f10
Disable dashboard tune module if pandas version is incorrect (#14381) 2021-03-08 20:40:59 -06:00
fyrestone
2da58bb021
[Dashboard] Fix reporter agent (#14378) 2021-03-08 13:12:34 -06:00
fyrestone
5e76a51d56
[Dashboard] Select port in dashboard (#13763)
* Dashboard select port; fix dashboard hanging on exit

* Add test case

* Fix

* Fix test_stats_collector.py::test_get_all_node_details

* Refine dashboard error messages

* Refine code

* Refine code

* Show last 10 lines of dashboard log if start dashboard failed

* Fix ValueError: too many values to unpack (expected 2) when getsockname

* Fix test_multi_node_3.py::test_calling_start_ray_head may fail

* Fix Windows CI

* Disable dashboard in C++ test

* Refine code

* Fix issue 7084

Co-authored-by: 刘宝 <po.lb@antfin.com>
2021-02-23 16:27:48 -08:00
Kathryn Zhou
d6521be7ef
Export GPU metrics, CPU count, and additional Memory metrics to Prometheus (#14170) 2021-02-22 10:04:18 -08:00
Kathryn Zhou
f6b5e838fe
Add disk and network metrics to Prometheus and fix dashboard (#14144) 2021-02-17 10:27:14 -08:00
Simon Mo
33316d4f8f
Revert "Export additional metrics to Prometheus (#14061)" (#14134)
This reverts commit 82539f2da4.
2021-02-16 12:49:12 -08:00
Kathryn Zhou
82539f2da4
Export additional metrics to Prometheus (#14061) 2021-02-14 23:16:26 -08:00
Tao Wang
56ee6ef55f
[GCS]only update states related fields when publish actor table data (#13448) 2021-01-28 11:12:57 +08:00
Ameer Haj Ali
b7dd7ddb52
deprecate useless fields in the cluster yaml. (#13637)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first commit

* lint

* example

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

* joblib strikes again on windows

* add ability to not start autoscaler/monitor

* a

* remove worker_default

* Remove default pod type from operator

* Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types

* deprecate useless fields

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>
Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2021-01-23 12:06:51 -08:00
Tao Wang
aa5d7a5e6c
[Dashboard]Don't set node actors when node_id of actor is Nil (#13573)
* Don't set node actors when node_id of actor is Nil

* add test per comment
2021-01-21 20:18:34 -08:00
Xianyang Liu
4ecd29ea2b
[dashboard] Fixes dashboard issues when environments have set http_proxy (#12598)
* fixes ray start with http_proxy

* format

* fixes

* fixes

* increase timeout

* address comments
2021-01-21 20:10:01 -08:00
Simon Mo
dac8b3d58a
[CI] Enable Dashboard tests for master (#13425) 2021-01-15 09:43:34 -08:00
fyrestone
4853aa96cb
[Dashboard] Fix missing actor pid (#13229) 2021-01-13 16:45:12 +08:00
fyrestone
a6d135a072
[Dashboard] Add GET /log_proxy API (#13165) 2021-01-08 11:45:07 +08:00
SangBin Cho
32dc5676b4
[Metrics] Record per node and raylet cpu / mem usage (#12982)
* Record per node and raylet cpu / mem usage

* Add comments.

* Addressed code review.
2021-01-05 21:57:21 -08:00
fyrestone
6a54897577
Job module without submission (#13081)
Co-authored-by: 刘宝 <po.lb@antfin.com>
2020-12-31 11:12:17 +08:00
Alex Wu
8df94e33e0
[Autoscaler] New output log format (#12772) 2020-12-23 12:02:55 -08:00
fyrestone
62a5832007
[Dashboard] Add GET /logical/actors API (#12913) 2020-12-23 11:14:23 +08:00
Eric Liang
03a5b90ed6
Revert "Revert "Increase the number of unique bits for actors to avoi… (#12990) 2020-12-21 15:16:42 -08:00
Eric Liang
64c97d25d3
Enable by default new scheduler (#12735) 2020-12-19 13:22:24 -08:00
Eric Liang
5d987f5988
Revert "Increase the number of unique bits for actors to avoid handle collisions (#12894)" (#12988)
This reverts commit 3e492a79ec.
2020-12-18 23:51:44 -08:00
Eric Liang
3e492a79ec
Increase the number of unique bits for actors to avoid handle collisions (#12894) 2020-12-18 15:59:03 -08:00