Commit graph

67 commits

Author SHA1 Message Date
Eric Liang
43aa2299e6
[api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695)
Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.
2022-06-21 15:13:29 -07:00
mwtian
65d7a610ab
[Core] Push message to driver when a Raylet dies (#25516)
Currently when Raylets die, it is hard to figure out:

if a Raylet died at all in a cluster. Usually we have to check on nodes where a number of workers died and see if the Raylet has died as well.
reason of Raylet's death.
With this PR, if a Raylet dies from a reason other than SIGTERM, the dashboard agent will report the failure along with last 20 lines of the Raylet log.
2022-06-09 05:54:34 -07:00
mwtian
1ce0ab7b7c
[Core] Export additional metrics for workers and Raylet memory (#25418)
Add visibility into the following to help Ray users and developers debug performance and OOM issues:

    Raylet memory usage broken down by USS vs remaining RSS.
    Total workers' count, CPU percentage usage, and memory usage.
2022-06-06 10:58:14 -07:00
Eric Liang
905258dbc1
Clean up docstyle in python modules and add LINT rule (#25272) 2022-06-01 11:27:54 -07:00
Kai Yang
4a999777fa
[Core] Allow accepting gRPC HTTP proxy via env variable (#23526) 2022-05-10 11:30:46 +08:00
Dmitri Gekhtman
6d09244a7e
[Dashboard][K8s] Add toggle to enable showing node disk usage on K8s (#24416)
https://github.com/ray-project/ray/pull/14676 disabled the disk usage/total display for Ray nodes on K8s, because Ray nodes on K8s are run as pods, which in general do not use up the entire machine.

However, in some situations, it is useful to run one Ray pod per K8s node and report the disk usage.

This PR adds a flag to enable displaying disk usage in those situations.
2022-05-03 10:58:05 -05:00
SangBin Cho
6f192b6e17
[Metrics] Allow to completely disable metrics collection (#24333)
This PR allows for Ray to disable metrics collection. It was possible with RAY_enable_metrics_collection, but it didn't fully disable collection because there was a metrics collection happening from agent that wasn't properly disabled. This PR also adds tests.
2022-05-02 05:33:03 -07:00
Stephanie Wang
b43426bc33
[core] Add metrics for disk and network I/O (#23546)
Adds some metrics useful for object-intensive workloads:

    Per raylet/object manager:
        Add num bytes pending restore to spill manager
        Add num requests cumulative to PullManager
        Num bytes pushed/pulled from other nodes cumulative
        Histogram for request latencies in PullManager:
            total life time of request, from start to cancel
            request satisfaction time, from start to object local
            pull time, from object activation to object local
    Per-node disk read/write speed, IOPS
2022-04-01 11:15:34 -07:00
Matti Picus
77c4c1e48e
WINDOWS: enable and fix failures in test_runtime_env_complicated (#22449) 2022-03-29 00:56:42 -07:00
mwtian
c2404cce62
avoid adding gpu utilization when unavailable (#23468)
From #22954, GPU utilization can be unavailable for consumer hardware. So dashboard should not assume the value cannot be None.

There might be a better way to represent "not reported". But currently utilizations are summed up which makes using non-zero to represent "not reported" hard to do.
2022-03-24 22:48:17 -07:00
mwtian
72ef9f91aa
[Remove Redis Pubsub 1/n] Remove enable_gcs_pubsub() (#23189)
GCS pubsub has been the default for awhile. There is little chance that we would need to revert back to Redis pubsub in future. This is the step in removing Redis pubsub, by first removing the `enable_gcs_pubsub()` feature guard.
2022-03-15 23:56:15 -07:00
Guyang Song
f65971756d
[dashboard agent] Catch agent port conflict (#23024) 2022-03-15 16:09:15 +08:00
Yi Cheng
bb5fa6b851
Remove redis in setup.py (#22979) 2022-03-10 11:05:03 -08:00
Yi Cheng
11bbf00338
[dashboard] Remove redis in dashboard (#22788)
As we are turning redisless ray by default, dashboard doesn't need to talk with redis anymore. Instead it should talk with gcs and gcs can talk with redis.
2022-03-04 12:32:17 -08:00
SangBin Cho
3566cfd279
[Dashboard] Enable dashboard in the minimal ray installation (#21896)
This is the last PR to enable dashboard in the minimal ray installation.

Look https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit# for more details;
2022-01-31 22:34:40 -08:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
SangBin Cho
e62c0052a0
[Dashboard] Agent in minimal ray installation (#21817)
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation.

Note that this PR requires to introduce "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them to minimal makes thing very easy. Please see the below for the reasoning.
2022-01-26 04:03:54 -08:00
Alex Wu
7a45f60dbc
[autoscaler] Fix ray.autoscaler.sdk import issue (#21795)
This PR moves the sdk to its own folder, then includes everything in `import ray.autoscaler.sdk` in ray's import path. 

Note: that there were circular dependencies in naively doing this because the ray core now uses constants that were defined in the autoscaler for internal kv operations (and the autoscaler similarly calls into the ray core). The solution was to move those internal kv keys into ray core constants so the imports flow (more) one way.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-01-25 14:43:24 -08:00
SangBin Cho
1ae14ec513
[Dashboard] Make dashboard / agent work in minimal ray installation 1/3. (#21774)
This is the doc that explains how to achieve this: https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit?usp=sharing

The fully working e2e prototype is here (it passes all tests): cdad913883

This PR is pure refactoring. Basically it moves some of util functions that require optional_deps to `optional_utils` so that optional deps' util functions are not used in the minimal installation. Look below to see the steps. 

<img width="693" alt="Screen Shot 2022-01-21 at 4 38 44 AM" src="https://user-images.githubusercontent.com/18510752/150528494-c3cdedf4-3a66-4557-b540-61436b1dbab6.png">
2022-01-23 21:11:32 -08:00
Shantanu
ae60548ef3
Silence "cut: write error: Broken pipe" log spew (#21686)
On machines without GPUs, this can run subprocesses that spew to
stderr. Then with log_to_driver=True, we get log spew from every
single raylet. To avoid this, disable the GPU usage check on
certain errors.

Resolves #14305

Co-authored-by: hauntsaninja <>
2022-01-19 23:01:10 -08:00
Yi Cheng
09421a4ca6
[2/gcs] Bootstrap dashboard for gcs ha (#21179)
This is part of gcs ha project. This PR try to bootstrap dashboard with gcs address instead of redis.

Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
2021-12-21 16:58:03 -08:00
mwtian
6871a72a5c
[Core][Dashboard Pubsub 3/n] Migrate pubsub usages in dashboard to GCS pubsub (#20860)
Add support for Ray pubsub in dashboard. https://github.com/ray-project/ray/pull/20954 is the prerequisite, and contains more complete change under src/.
2021-12-10 14:36:57 -08:00
Yi Cheng
e54d3117a4
[gcs] Update all redis kv usage in python except function table (#20014)
## Why are these changes needed?
This is part of redis removal project. In this PR all direct usage of redis got removed except function table.
Function table will be migrated in the next PR

## Related issue number
#19443
2021-11-10 20:24:53 -08:00
Jiajun Yao
6acf276959
Listen to 127.0.0.1 if node ip is 127.0.0.1 (#19918)
* Listen to 127.0.0.1 if node ip is 127.0.0.1

* Listen to 127.0.0.1 if node ip is 127.0.0.1

* Listen to 127.0.0.1 if node ip is 127.0.0.1
2021-11-03 12:17:55 +09:00
Oscar Knagg
5a05e89267
[Core] Add TLS/SSL support to gRPC channels (#18631) 2021-10-20 22:39:11 -07:00
Matti Picus
f372bb07aa
Enable dashboard on Windows (#19319) 2021-10-14 14:42:22 -07:00
Edward Oakes
7736cdd91d
[dashboard] Rename "new_dashboard" -> "dashboard" (#18214) 2021-09-15 11:17:15 -05:00
Clark Zinzow
d958457d07
[Core] Second pass at privatizing APIs. (#17885)
* gcs_utils

* resource_spec

* profiling

* ray_perf and ray_cluster_perf

* test_utils
2021-08-18 20:56:33 -07:00
Amog Kamsetty
add6ceb3ec
[Dependencies] Fix missing dependency UX (#17420)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-08-05 20:18:42 -07:00
Richard Liaw
597dc08dfe
Revert "Revert "[core] remove opencensus/prometheus_exporter dependencies"" (#17254)
* Revert "Revert "[core] remove opencensus/prometheus_exporter dependencies" (#17251)"

This reverts commit 7b44dd8ecb.

* Lint

* Fix more imports

Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-07-26 21:09:25 -07:00
Amog Kamsetty
cb74053ee5
Retry remove gpustat dependency (#17115)
* remove gpustat

* move psutil imports
2021-07-19 11:14:10 -07:00
Amog Kamsetty
caa78a3cff
Revert "[Core] Remove gpustat from core dependencies (#17059)" (#17106)
This reverts commit 7ec18f671a.
2021-07-14 20:19:33 -07:00
Amog Kamsetty
7ec18f671a
[Core] Remove gpustat from core dependencies (#17059) 2021-07-13 21:22:02 -07:00
Ian Rodney
90ce25cb35
[dashboard] Avoid global min_workers (#15660) 2021-05-10 15:47:51 -07:00
Dmitri Gekhtman
410f768046
[Kubernetes] [Dashboard] Remove disk data from dashboard when running on K8s. (#14676) 2021-04-05 17:16:20 -07:00
Ian Rodney
eb12033612
[Code Cleanup] Switch to use ray.util.get_node_ip_address() (#14741)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-03-18 13:10:57 -07:00
Kathryn Zhou
01dda99b8c
Export cluster statistics to Prometheus (#14612) 2021-03-15 19:28:13 -07:00
Dmitri Gekhtman
6babd1928c
[Kubernetes][dashboard][minor] Fix uptime (#14655) 2021-03-12 18:30:13 -06:00
Dmitri Gekhtman
a90cffe26c
[dashboard][k8s] Better CPU reporting when running on K8s (#14593) 2021-03-12 12:02:15 -06:00
Clark Zinzow
5a788474aa
[Core] First pass at privatizing non-public Python APIs. (#14607)
* async_compat

* utils

* cluster_utils

* compat

* function_manager

* import_thread

* memory_monitor

* monitor, log_monitor, ray_process_reaper

* metrics_agent

* parameter

* prometheus_exporter

* ray_logging

* signature
2021-03-10 22:47:28 -08:00
Dmitri Gekhtman
4a7d9e71bb
[dashboard][kubernetes] Show container's memory info on K8s, not the physical host's. (#14499)
* random doc typo

* more reasonable memory output

* no if

* get rid of comment
2021-03-08 18:59:41 -08:00
fyrestone
2da58bb021
[Dashboard] Fix reporter agent (#14378) 2021-03-08 13:12:34 -06:00
Kathryn Zhou
d6521be7ef
Export GPU metrics, CPU count, and additional Memory metrics to Prometheus (#14170) 2021-02-22 10:04:18 -08:00
Kathryn Zhou
f6b5e838fe
Add disk and network metrics to Prometheus and fix dashboard (#14144) 2021-02-17 10:27:14 -08:00
Simon Mo
33316d4f8f
Revert "Export additional metrics to Prometheus (#14061)" (#14134)
This reverts commit 82539f2da4.
2021-02-16 12:49:12 -08:00
Kathryn Zhou
82539f2da4
Export additional metrics to Prometheus (#14061) 2021-02-14 23:16:26 -08:00
Ameer Haj Ali
b7dd7ddb52
deprecate useless fields in the cluster yaml. (#13637)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first ocmmit

* lint

* exmaple

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

* joblib strikes again on windows

* add ability to not start autoscaler/monitor

* a

* remove worker_default

* Remove default pod type from operator

* Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types

* deprecate useless fields

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>
Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2021-01-23 12:06:51 -08:00
Xianyang Liu
4ecd29ea2b
[dashboard] Fixes dashboard issues when environments have set http_proxy (#12598)
* fixes ray start with http_proxy

* format

* fixes

* fixes

* increase timeout

* address comments
2021-01-21 20:10:01 -08:00
SangBin Cho
32dc5676b4
[Metrics] Record per node and raylet cpu / mem usage (#12982)
* Record per node and raylet cpu / mem usage

* Add comments.

* Addressed code review.
2021-01-05 21:57:21 -08:00
Alex Wu
8df94e33e0
[Autoscaler] New output log format (#12772) 2020-12-23 12:02:55 -08:00