## Why are these changes needed?
This PR fixes an issue where --follow lost its connection after running for more than 30 seconds. The gRPC timeout is configured to be 30 seconds, and it was not reset when --follow was set.
The fix sets timeout=None when keepalive==True.
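
A minimal sketch of the idea, not the actual change: the helper name `resolve_grpc_timeout` and the constant `DEFAULT_GRPC_TIMEOUT_S` are hypothetical and only illustrate how the timeout could be dropped for a long-lived stream.

```python
from typing import Optional

# Assumed default, mirroring the 30-second gRPC timeout described above.
DEFAULT_GRPC_TIMEOUT_S = 30


def resolve_grpc_timeout(
    keepalive: bool, timeout: Optional[float] = DEFAULT_GRPC_TIMEOUT_S
) -> Optional[float]:
    """Pick the timeout for a log-streaming call (hypothetical helper).

    When keepalive is True (--follow), return None so the call has no
    deadline; otherwise keep the configured timeout.
    """
    if keepalive:
        return None
    return timeout
```

With this shape, a one-shot log fetch still fails fast after 30 seconds, while a followed stream stays open until the client disconnects.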
## Related issue number
Closes https://github.com/ray-project/ray/issues/25721
Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.
This PR implements the basic log APIs. Higher-level APIs (such as ray logs actors) will be implemented after the internal API review is done.
# If there's only 1 match, print the file's content. Otherwise, print all files that match the glob.
ray logs [glob_filter] --node-id=[head node by default]
Args:
--tail: Tail the last X lines
--follow: Follow the new logs
--actor-id: The actor id
--pid --node-ip: For worker logs
--node-id: The node id of the log
--interval: When --follow is specified, logs are printed at this interval. (should we remove it?)
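
The single-match rule above can be sketched as follows. This is an illustration under assumptions, not the CLI's implementation: `select_logs` and its return shape are hypothetical, and the file names in the usage note are made up.

```python
import fnmatch
from typing import Dict, List, Union


def select_logs(
    log_files: List[str], glob_filter: str = "*"
) -> Dict[str, Union[str, List[str]]]:
    """Decide what to print for a glob filter (hypothetical helper).

    Exactly one match: print that file's content.
    Otherwise: list every file name that matches the glob.
    """
    matches = sorted(fnmatch.filter(log_files, glob_filter))
    if len(matches) == 1:
        return {"action": "print_content", "files": matches}
    return {"action": "list_files", "files": matches}
```

For example, with files `raylet.out`, `raylet.err`, and `gcs_server.out`, the filter `gcs*` selects a single file to print, while `raylet*` lists both raylet files.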
This PR implements ray log on the server side. It continues from #24068.
The PR supports two endpoints:
/api/v0/logs # list logs of the node id filtered by the given glob.
/api/v0/logs/{[file | stream]}?filename&pid&actor_id&task_id&interval&lines # Stream the requested file log. The filename can be inferred by pid/actor_id/task_id
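
As a rough sketch of how a client might build a request URL for the second endpoint: the helper `build_log_url` is hypothetical, and the base URL in the usage note is an assumed dashboard address, not something this PR defines.

```python
from urllib.parse import urlencode

# Query parameters named in the endpoint description above.
ALLOWED_PARAMS = ("filename", "pid", "actor_id", "task_id", "interval", "lines")


def build_log_url(base_url: str, mode: str = "file", **params) -> str:
    """Build a /api/v0/logs/{file|stream} URL (hypothetical helper)."""
    if mode not in ("file", "stream"):
        raise ValueError(f"mode must be 'file' or 'stream', got {mode!r}")
    unknown = set(params) - set(ALLOWED_PARAMS)
    if unknown:
        raise ValueError(f"Unsupported query parameters: {sorted(unknown)}")
    # Keep a stable parameter order and skip values that were not provided.
    query = urlencode(
        [(key, params[key]) for key in ALLOWED_PARAMS if params.get(key) is not None]
    )
    url = f"{base_url.rstrip('/')}/api/v0/logs/{mode}"
    return f"{url}?{query}" if query else url
```

For instance, `build_log_url("http://127.0.0.1:8265", "stream", filename="raylet.out", lines=100)` yields a URL ending in `/api/v0/logs/stream?filename=raylet.out&lines=100`.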
Some tests need to be rewritten; I will do that soon.
Two follow-up PRs are planned after this one:
* PR to add the actual CLI
* PR to remove in-memory cached logs and do on-demand queries for actor/worker logs
The test verifies that bytes 43~51 of the first line are "dashboard".
However, because of recent code additions to head.py, the line number of the logging call grew from 2 digits to 3 digits.
Previously,
2022-04-18 23:23:56,946 INFO head.py:[less than 100] -- Dashboard head grpc address: 127.0.0.1:57208
Now
2022-04-18 23:23:56,946 INFO head.py:101 -- Dashboard head grpc address: 127.0.0.1:57208
So we should widen the byte range.
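
The shift can be reproduced with a small sketch. The line number 99 in `OLD_LINE` is an illustrative stand-in for "less than 100", `has_dashboard` is a hypothetical helper, and the widened slice end of 53 is only one possible choice, not necessarily the range used by the test.

```python
# Illustrative two-digit line number (99); the original only says "less than 100".
OLD_LINE = "2022-04-18 23:23:56,946 INFO head.py:99 -- Dashboard head grpc address: 127.0.0.1:57208"
NEW_LINE = "2022-04-18 23:23:56,946 INFO head.py:101 -- Dashboard head grpc address: 127.0.0.1:57208"


def has_dashboard(line: str, start: int, end: int) -> bool:
    """Check whether 'dashboard' appears in line[start:end], ignoring case."""
    return "dashboard" in line[start:end].lower()


# Bytes 43~51 (inclusive) correspond to the slice [43:52].
old_ok = has_dashboard(OLD_LINE, 43, 52)   # two-digit line number: match
new_bad = has_dashboard(NEW_LINE, 43, 52)  # three-digit line number: shifted by one byte
new_ok = has_dashboard(NEW_LINE, 43, 53)   # widened range covers both layouts
```

With a three-digit line number, the slice `[43:52]` holds `" Dashboar"`, so the original check fails until the range is extended by at least one byte.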
* Provide a utility to ping a Ray cluster and verify it has the same Ray version. This is useful to check if a Ray cluster is available at a given address, without connecting to the cluster with the more heavyweight ray.init(). This utility is integrated with ray memory to provide a better error message when the Ray cluster is unavailable. There seems to be user demand for exposing this as an API as well.
* Improve the error message when the address provided to Ray does not contain a port.
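
A minimal sketch of the kind of address check the second bullet describes. The function `parse_address` and its error wording are hypothetical and not taken from Ray; the port 6379 in the message is only an example value.

```python
from typing import Tuple


def parse_address(address: str) -> Tuple[str, int]:
    """Split '<host>:<port>' and fail with a descriptive error (hypothetical helper)."""
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(
            f"Invalid address {address!r}: expected the form <host>:<port>, "
            "for example 127.0.0.1:6379."
        )
    return host, int(port)
```

Calling `parse_address("127.0.0.1")` raises a `ValueError` that names the expected format, rather than failing later with a less specific connection error.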
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with a minimal Ray installation.
Note that this PR requires introducing "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them in the minimal installation makes things much easier. Please see below for the reasoning.
* Add RAY_NODE_ID environment var to agent
* Node related data use node id as key
* ray.init() returns node id; Pass test_reporter.py
* Fix lint & CI
* Fix comments
* Minor fixes
* Fix CI
* Add const to ClientID in AgentManager::Options
* Use fstring
* Add comments
* Fix lint
* Add test_multi_nodes_info
Co-authored-by: 刘宝 <po.lb@antfin.com>
* Improve reporter module
* Add test_node_physical_stats to test_reporter.py
* Add test_class_method_route_table to test_dashboard.py
* Add stats_collector module for dashboard
* Subscribe actor table data
* Add log module for dashboard
* Only enable test module in some test cases
* CI run all dashboard tests
* Reduce test timeout to 10s
* Use fstring
* Remove unused code
* Remove blank line
* Fix dashboard tests
* Fix asyncio.create_task not available in py36; Fix lint
* Add format_web_url to ray.test_utils
* Update dashboard/modules/reporter/reporter_head.py
Co-authored-by: Max Fitton <mfitton@berkeley.edu>
* Add DictChangeItem type for Dict change
* Refine logger.exception
* Refine GET /api/launch_profiling
* Remove disable_test_module fixture
* Fix test_basic may fail
Co-authored-by: 刘宝 <po.lb@antfin.com>
Co-authored-by: Max Fitton <mfitton@berkeley.edu>