Enable API-annotation checking of the Ray core module (excluding serve, workflows, and tune) in ./ci/lint/check_api_annotations.py. This required moving many files into ray._private, along with associated fixes.
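For context, the checker asserts that symbols reachable from the public `ray` namespace carry an API annotation, while anything under ray._private is skipped, which is why the files had to move. Below is a minimal sketch of the annotation pattern being enforced, using the real decorators from ray.util.annotations; the class and function names are made up for illustration:

```python
from ray.util.annotations import DeveloperAPI, PublicAPI

@PublicAPI(stability="beta")  # exported, user-facing: satisfies the check
class ObjectStats:
    """Hypothetical user-facing helper."""

@DeveloperAPI  # exported for advanced users: also satisfies the check
def dump_raylet_stats() -> dict:
    """Hypothetical introspection hook; may change between releases."""
    return {}

# Anything exported from `ray.*` without one of these annotations fails
# the lint; internal-only code belongs under ray._private, which the
# checker ignores.
```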
Add visibility into the following to help Ray users and developers debug performance and OOM issues (see the sketch after this list):
- Raylet memory usage, broken down by USS vs. the remaining RSS.
- Total worker count, CPU percentage usage, and memory usage.
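A minimal sketch of how such a breakdown can be computed with psutil, which the dashboard's reporter builds on; the function names and exact aggregation are illustrative, not Ray's actual reporter code:

```python
import psutil

def raylet_memory_breakdown(raylet_pid: int) -> dict:
    # memory_full_info() exposes USS, the pages unique to this process.
    # The rest of RSS is mostly memory shared with other processes,
    # e.g. the mapped plasma object store.
    mem = psutil.Process(raylet_pid).memory_full_info()
    return {"uss": mem.uss, "remaining_rss": mem.rss - mem.uss}

def workers_summary(worker_pids: list) -> dict:
    # Aggregate worker count, CPU percentage, and memory usage.
    # Note: cpu_percent(interval=None) returns 0.0 on the first call;
    # a real reporter samples it periodically.
    procs = [psutil.Process(pid) for pid in worker_pids]
    return {
        "count": len(procs),
        "cpu_percent": sum(p.cpu_percent(interval=None) for p in procs),
        "mem_rss": sum(p.memory_info().rss for p in procs),
    }
```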
https://github.com/ray-project/ray/pull/14676 disabled the disk usage/total display for Ray nodes on K8s, because Ray nodes on K8s run as pods, which in general do not use the entire machine.
However, it is sometimes useful to run one Ray pod per K8s node, in which case reporting disk usage is meaningful again.
This PR adds a flag that enables the disk usage display in those situations.
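As an illustration of the gating logic only; the actual flag name is defined in the PR, and RAY_REPORT_K8S_DISK_USAGE below is a hypothetical stand-in:

```python
import os
import psutil

def disk_usage_for_dashboard(path: str = "/"):
    # KUBERNETES_SERVICE_HOST is set inside every K8s pod. The opt-in
    # variable below is hypothetical; the real flag comes from the PR.
    on_k8s = "KUBERNETES_SERVICE_HOST" in os.environ
    opted_in = os.environ.get("RAY_REPORT_K8S_DISK_USAGE") == "1"
    if on_k8s and not opted_in:
        return None  # a pod usually does not own the whole node's disk
    usage = psutil.disk_usage(path)
    return {"used": usage.used, "total": usage.total, "percent": usage.percent}
```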
Adds some metrics useful for object-intensive workloads (illustrated in the sketch after this list):
Per raylet/object manager:
- Number of bytes pending restore in the spill manager.
- Cumulative number of requests in the PullManager.
- Cumulative number of bytes pushed to/pulled from other nodes.
- Histograms of request latencies in the PullManager:
  - total lifetime of a request, from start to cancel
  - request satisfaction time, from start until the object is local
  - pull time, from object activation until the object is local
Per node:
- Disk read/write speed and IOPS.
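The raylet emits these metrics from C++, but their shape is easy to illustrate with Ray's Python metrics API (ray.util.metrics, which requires a running Ray session to export). The metric names, bucket boundaries, and helper functions below are illustrative, not the identifiers the raylet actually exports:

```python
import time
import psutil
from ray.util import metrics

# Illustrative counterparts of the new PullManager/object manager metrics.
pull_request_lifetime = metrics.Histogram(
    "pull_manager_request_lifetime_s",
    description="Total lifetime of a pull request, start to cancel.",
    boundaries=[0.1, 1, 5, 10, 30, 60, 300],
)
bytes_pulled = metrics.Counter(
    "object_manager_bytes_pulled",
    description="Cumulative bytes pulled from other nodes.",
)

def record_request_cancelled(start_s: float) -> None:
    pull_request_lifetime.observe(time.time() - start_s)

def record_pull(num_bytes: int) -> None:
    bytes_pulled.inc(num_bytes)

def disk_io_rates(interval_s: float = 1.0) -> dict:
    # Per-node disk throughput and IOPS, derived from two samples of
    # psutil.disk_io_counters() taken interval_s seconds apart.
    a = psutil.disk_io_counters()
    time.sleep(interval_s)
    b = psutil.disk_io_counters()
    return {
        "read_Bps": (b.read_bytes - a.read_bytes) / interval_s,
        "write_Bps": (b.write_bytes - a.write_bytes) / interval_s,
        "read_iops": (b.read_count - a.read_count) / interval_s,
        "write_iops": (b.write_count - a.write_count) / interval_s,
    }
```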