Commit graph

24 commits

Author SHA1 Message Date
Alan Guo
be92dd72d5
[Dashboard] Fix edge cases for log file names in the dashboard log viewer (#27772) 2022-08-12 09:39:54 -07:00
SangBin Cho
def02bd4c9
Revert Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds" #26162 (#26163)
* Revert "Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds (#26080)" (#26162)"

This reverts commit 3017128d5e.
2022-06-28 16:07:32 -07:00
Stephanie Wang
3017128d5e
Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds (#26080)" (#26162)
This reverts commit 2d58bd5a50.
2022-06-28 10:04:58 -07:00
SangBin Cho
2d58bd5a50
[Observability] Fix --follow lost connection when it is used for > 30 seconds (#26080)
## Why are these changes needed?

This PR fixes the issue where --follow lost connection when it is used for > 30 seconds because the gRPC timeout is configured to be 30 seconds, and we don't reset it when --follow is set.

This fixes the issue by setting timeout=None when keepalive==True

## Related issue number

Closes https://github.com/ray-project/ray/issues/25721
2022-06-28 05:48:25 -07:00
Eric Liang
43aa2299e6
[api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695)
Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.
2022-06-21 15:13:29 -07:00
SangBin Cho
856bea31fb
[State Observability] Ray log CLI / API (#25481)
This PR implements the basic log APIs. For the better APIs (like higher level APIs like ray logs actors), it will be implemented after the internal API review is done.

# If there's only 1 match, print a file content. Otherwise, print all files that match glob.
ray logs [glob_filter] --node-id=[head node by default]

Args:
    --tail: Tail the last X lines
    --follow: Follow the new logs
    --actor-id: The actor id
    --pid --node-ip: For worker logs
    --node-id: The node id of the log
    --interval: When --follow is specified, logs are printed with this interval. (should we remove it?)
2022-06-13 05:52:57 -07:00
SangBin Cho
00e3fd75f3
[State Observability] Ray log alpha API (#24964)
This is the PR to implement ray log to the server side. The PR is continued from #24068.

The PR supports two endpoints;

/api/v0/logs # list logs of the node id filtered by the given glob. 
/api/v0/logs/{[file | stream]}?filename&pid&actor_id&task_id&interval&lines # Stream the requested file log. The filename can be inferred by pid/actor_id/task_id
Some tests need to be re-written, I will do it soon.

As a follow-up after this PR, there will be 2 PRs.

PR to add actual CLI
PR to remove in-memory cached logs and do on-demand query for actor/worker logs
2022-06-04 05:10:23 -07:00
Kai Yang
4a999777fa
[Core] Allow accepting gRPC HTTP proxy via env variable (#23526) 2022-05-10 11:30:46 +08:00
jon-chuang
ddcc252b51
[Core] Ray logs API (1/n) (#23435)
Expose HTTP endpoint to retrieve logs from ray cluster
2022-04-20 23:11:02 -07:00
SangBin Cho
082baa2342
[Test] Fix test_log (#24004)
The test verifies the first line 43~51 bytes are "dashboard"

But due to recent code addition to head.py, the line where logs are written became 2 digits -> 3 digits

Previously,
2022-04-18 23:23:56,946	INFO head.py:[less than 100] -- Dashboard head grpc address: 127.0.0.1:57208
 
Now
2022-04-18 23:23:56,946	INFO head.py:101 -- Dashboard head grpc address: 127.0.0.1:57208
So we should increase the bytes range.
2022-04-19 04:59:30 -07:00
mwtian
d5d2ef4249
[Core] Add a utility to check GCS / Ray cluster health (#23382)
* Provide a utility to ping a Ray cluster and verify it has the same Ray version. This is useful to check if a Ray cluster is available at a given address, without connecting to the cluster with the more heavyweight ray.init(). This utility is integrated with ray memory to provide a better error message when the Ray cluster is unavailable. There seem to be user demand for exposing this as an API as well.
* Improve the error message when the address provided to Ray does not contain port.
2022-04-18 09:58:45 -07:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
SangBin Cho
e62c0052a0
[Dashboard] Agent in minimal ray installation (#21817)
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation.

Note that this PR requires to introduce "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them to minimal makes thing very easy. Please see the below for the reasoning.
2022-01-26 04:03:54 -08:00
SangBin Cho
1ae14ec513
[Dashboard] Make dashboard / agent work in minimal ray installation 1/3. (#21774)
This is the doc that explains how to achieve this: https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit?usp=sharing

The fully working e2e prototype is here (it passes all tests): cdad913883

This PR is pure refactoring. Basically it moves some of util functions that require optional_deps to `optional_utils` so that optional deps' util functions are not used in the minimal installation. Look below to see the steps. 

<img width="693" alt="Screen Shot 2022-01-21 at 4 38 44 AM" src="https://user-images.githubusercontent.com/18510752/150528494-c3cdedf4-3a66-4557-b540-61436b1dbab6.png">
2022-01-23 21:11:32 -08:00
Edward Oakes
7736cdd91d
[dashboard] Rename "new_dashboard" -> "dashboard" (#18214) 2021-09-15 11:17:15 -05:00
Simon Mo
e61160d514
[Dashboard] Move gcs health check to a separate thread to avoid crashing due to excessive CPU usage. (#18236) 2021-09-03 14:23:56 -07:00
Clark Zinzow
d958457d07
[Core] Second pass at privatizing APIs. (#17885)
* gcs_utils

* resource_spec

* profiling

* ray_perf and ray_cluster_perf

* test_utils
2021-08-18 20:56:33 -07:00
fyrestone
a6d135a072
[Dashboard] Add GET /log_proxy API (#13165) 2021-01-08 11:45:07 +08:00
SangBin Cho
753cda2f28
[Dashboard] Delete old dashboard (#12144)
* Delete old dashboard from repo.

* Delete old dashboard from repo. 2
2020-11-25 11:31:02 -08:00
fyrestone
05ad4c7499
[Dashboard] Optimize dashboard datacenter (#11391)
* Optimize dashboard datacenter

* Fix tests

* Fix tests

* Fix

* Fix CI

* python/build-wheel-macos.sh

Co-authored-by: 刘宝 <po.lb@antfin.com>
Co-authored-by: Max Fitton <maxfitton@anyscale.com>
2020-10-27 23:49:31 -07:00
Max Fitton
caf3b04b27
[Dashboard] Turn on new dashboard by default pt 2 (#11510) 2020-10-23 15:52:14 -05:00
fyrestone
defd41aad7
[Dashboard] http route handler cache (#10921)
* Add aiohttp_cache to dashboard

* Add comments; Refine code

* Keep NODE_STATS_UPDATE_INTERVAL_SECONDS 1 second; Change AIOHTTP_CACHE_TTL_SECONDS to 2 seconds

* Update merge

Co-authored-by: 刘宝 <po.lb@antfin.com>
2020-10-09 22:27:05 -07:00
fyrestone
50784e2496
[Dashboard] Dashboard node grouping (#10528)
* Add RAY_NODE_ID environment var to agent

* Node ralated data use node id as key

* ray.init() return node id; Pass test_reporter.py

* Fix lint & CI

* Fix comments

* Minor fixes

* Fix CI

* Add const to ClientID in AgentManager::Options

* Use fstring

* Add comments

* Fix lint

* Add test_multi_nodes_info

Co-authored-by: 刘宝 <po.lb@antfin.com>
2020-09-16 10:17:29 -07:00
fyrestone
e9b046306a
[Dashboard] Dashboard basic modules (#10303)
* Improve reporter module

* Add test_node_physical_stats to test_reporter.py

* Add test_class_method_route_table to test_dashboard.py

* Add stats_collector module for dashboard

* Subscribe actor table data

* Add log module for dashboard

* Only enable test module in some test cases

* CI run all dashboard tests

* Reduce test timeout to 10s

* Use fstring

* Remove unused code

* Remove blank line

* Fix dashboard tests

* Fix asyncio.create_task not available in py36; Fix lint

* Add format_web_url to ray.test_utils

* Update dashboard/modules/reporter/reporter_head.py

Co-authored-by: Max Fitton <mfitton@berkeley.edu>

* Add DictChangeItem type for Dict change

* Refine logger.exception

* Refine GET /api/launch_profiling

* Remove disable_test_module fixture

* Fix test_basic may fail

Co-authored-by: 刘宝 <po.lb@antfin.com>
Co-authored-by: Max Fitton <mfitton@berkeley.edu>
2020-08-29 23:09:34 -07:00