This PR does two things.
(1) Rename `api_server_url` to `address`, which is consistent with the other submission APIs.
(2) When the API does not respond in time, print a warning every 5 seconds (example below). This is useful when the API responds slowly (e.g., when there are partial failures). Without this, users see the API hang for up to 30 seconds, which is a pretty bad UX.
(0.12 / 10 seconds) Waiting for the response from the API server address http://127.0.0.1:8265/api/v0/delay/5.
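The periodic warning can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the helper name `call_with_progress_warnings` and its parameters are hypothetical, and the blocking request is assumed to run in a worker thread.

```python
import concurrent.futures
import time


def call_with_progress_warnings(fn, address, timeout=30.0, warn_interval=5.0, emit=print):
    """Run the blocking call fn() and emit a warning every warn_interval seconds.

    Hypothetical helper: fn stands in for the HTTP request to the API server.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    start = time.monotonic()
    try:
        while True:
            remaining = timeout - (time.monotonic() - start)
            if remaining <= 0:
                raise TimeoutError(f"No response from {address} within {timeout:g} seconds.")
            try:
                # Wait for at most one warning interval at a time.
                return future.result(timeout=min(warn_interval, remaining))
            except concurrent.futures.TimeoutError:
                if future.done():
                    # fn finished in the meantime (or raised a timeout itself).
                    return future.result()
                elapsed = time.monotonic() - start
                emit(
                    f"({elapsed:.2f} / {timeout:g} seconds) Waiting for the "
                    f"response from the API server address {address}."
                )
    finally:
        # Do not block on the worker thread when giving up.
        pool.shutdown(wait=False)
```

The caller still gets a hard timeout, but sees progress messages in between instead of a silent hang.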
This is to limit the max number of HTTP requests the dashboard (API server) will accept before rejecting more requests.
This will make sure the observability requests do not overload the downstream systems (raylet/gcs) when delegating too many concurrent state observability requests to the cluster.
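One way to picture the limit is a decorator that counts in-flight calls and rejects new ones once a cap is reached. The sketch below is an assumption-laden illustration: the names `limit_concurrent_calls` and `TooManyRequests` are hypothetical and the PR's actual mechanism may differ.

```python
import asyncio
import functools


class TooManyRequests(Exception):
    """Raised when the in-flight request cap is reached (hypothetical name)."""


def limit_concurrent_calls(max_in_flight):
    """Reject calls to an async handler once max_in_flight calls are running."""

    def decorator(handler):
        in_flight = 0

        @functools.wraps(handler)
        async def wrapper(*args, **kwargs):
            nonlocal in_flight
            if in_flight >= max_in_flight:
                # Fail fast instead of queueing, so downstream systems are protected.
                raise TooManyRequests(f"More than {max_in_flight} requests in flight.")
            in_flight += 1
            try:
                return await handler(*args, **kwargs)
            finally:
                in_flight -= 1

        return wrapper

    return decorator
```

Rejecting rather than queueing keeps the load on raylet/GCS bounded even when clients retry aggressively.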
See #23676 for context. This is another attempt at that as I figured out what's going wrong in `bazel test`. Supersedes #24828.
Now that there are Python 3.10 wheels for Ray 1.13 and this is no longer a blocker for supporting Python 3.10, I still want to make `bazel test //python/ray/tests/...` work for developing in a 3.10 env, and make it easier to add Python 3.10 tests to CI in future.
The change contains three commits with rather descriptive commit messages, which I repeat here:
Pass deps to py_test in py_test_module_list
Bazel macro py_test_module_list takes a `deps` argument, but completely
ignores it instead of passing it to `native.py_test`. Fixing that as we
are going to use deps of py_test_module_list in BUILD in later changes.
cpp/BUILD.bazel depends on the broken behaviour: it deps-on a cc_library
from a py_test, which isn't working, see upstream issue:
https://github.com/bazelbuild/bazel/issues/701.
This is fixed by simply removing the (non-working) deps.
Depend on conftest and data files in Python tests BUILD files
Bazel requires that all the files used in a test run should be
represented in the transitive dependencies specified for the test
target. For py_test, it means srcs, deps and data.
Bazel enforces this constraint by creating a "runfiles" directory,
symlinking the files in the dependency closure into it, and running the
test there, so that the test shouldn't see files not in the dependency
graph.
Unfortunately, the constraint is not enforced for a large number of
Python tests, because pytest (>=3.9.0, <6.0) resolves these symbolic
links during test collection and effectively "breaks out" of the
runfiles tree.
pytest >= 6.0 introduced a breaking change that removed the symbolic
link resolving behaviour, see pytest pull request
https://github.com/pytest-dev/pytest/pull/6523 for more context.
Currently, we are underspecifying dependencies in a lot of BUILD files,
which blocks us from updating to a newer pytest (for Python 3.10
support). This change hopefully fixes all of them, or at least those in
CI, by adding data or source dependencies (mostly for conftest.py files)
where needed.
Bump pytest version from 5.4.3 to 7.0.1
We want at least pytest 6.2.5 for Python 3.10 support, but not past
7.1.0 since it drops Python 3.6 support (which Ray still supports), thus
the version constraint is set to <7.1.
Updating pytest, combined with earlier BUILD fixes, changed the ground
truth of a few unit tests that match on error messages; these tests are
updated to reflect the change.
There are also two small drive-by changes for making test_traceback and
test_cli pass under Python 3.10. These were discovered while debugging CI
failures (on earlier Python versions) with a local Python 3.10 install.
Expect more such issues when adding Python 3.10 to CI.
The old dashboard UI made it much easier to see all the work across all workers because workers were shown alongside nodes in the main nodes page. This change brings the same functionality to the new dashboard UI.
Some changes in this PR:
- Factors out NodeRow into its own component and its own file.
- Introduces WorkerRow, which shows information about a worker.
- Updates the heading of the table column, because the column shows different data depending on whether it's a node row or a worker row.
- Makes sure we're rounding percentages to a single decimal place.
- The Logs button on a worker row goes to the logs page and filters down to just the log files related to that worker.
- Updates the API for fetching nodes to fetch nodes and workers.
- Fixes a bug where object store memory showed the remaining size instead of the total size.
## Why are these changes needed?
In https://github.com/ray-project/ray/pull/26405 we added health checks for GCS and raylets.
This PR exposes them as endpoints in the dashboard and the dashboard agent.
For dashboard, we added `http://host:port/api/gcs_healthz` and it'll send RPC to GCS directly to see whether the GCS is alive or not.
For agent, we added `http://host:port/api/local_raylet_healthz` and it'll send RPC to GCS to check whether raylet is alive or not.
We consider the raylet alive if:
- GCS is dead, or
- GCS is alive and GCS thinks the raylet is alive.
If GCS is dead for more than X seconds (60 by default), the raylet will just crash itself, so KubeRay can still catch it.
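The decision rule for the agent endpoint can be written as a small pure function. This is only a sketch under the assumption that an unreachable GCS is treated as "raylet healthy" (to avoid false positives) and that otherwise the GCS view of the raylet decides; the function name is hypothetical.

```python
from typing import Optional


def local_raylet_is_healthy(gcs_reachable: bool, raylet_alive_per_gcs: Optional[bool]) -> bool:
    """Return True if the local raylet should be reported as healthy.

    Assumption: when GCS cannot be reached we report healthy, because the
    raylet kills itself if GCS stays down for too long anyway.
    """
    if not gcs_reachable:
        return True
    return bool(raylet_alive_per_gcs)
```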
Add external hook to /api/component_activities endpoint in dashboard snapshot router
Change is_active field of RayActivityResponse to take an enum RayActivityStatus instead of bool. This is a backward incompatible change, but should be ok because [dashboard] Add component_activities API #25996 wasn't included in any branch cuts. RayActivityResponse now supports informing when there was an error getting the activity observation and the reason.
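A rough sketch of the new response shape, for illustration only: the names `RayActivityStatus`, `RayActivityResponse`, and `is_active` come from the description above, while the enum members, the `reason` field layout, and the `observe` helper are assumptions.

```python
import enum
from dataclasses import dataclass
from typing import Callable, Optional


class RayActivityStatus(str, enum.Enum):
    # Member names are assumed for this sketch.
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"
    ERROR = "ERROR"


@dataclass
class RayActivityResponse:
    is_active: RayActivityStatus
    reason: Optional[str] = None


def observe(check: Callable[[], bool]) -> RayActivityResponse:
    """Run an activity check and map failures to an ERROR status (hypothetical helper)."""
    try:
        active = check()
    except Exception as e:
        return RayActivityResponse(RayActivityStatus.ERROR, reason=repr(e))
    status = RayActivityStatus.ACTIVE if active else RayActivityStatus.INACTIVE
    return RayActivityResponse(status)
```

With a bool there is no way to distinguish "inactive" from "could not determine"; the enum makes that third state explicit.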
In Ray 2.0, we want to achieve API server HA.
Originally, the Serve endpoints were on the head node.
This PR moves the Serve endpoints to the dashboard agents, so they become HA thanks to the multiple replicas of the dashboard agent.
Add /api/component_activities to the dashboard snapshot router which returns whether various Ray components are considered active
This currently only contains a response entry for drivers, but will add entries for other components on request as followups
## Why are these changes needed?
This PR adds data truncation when there are more than N entries. The policy is as follows:
- By default, we return at most 100 entries. Users can adjust this value, but we won't allow raising it above 10K.
- By default, all internal RPCs truncate data if there are more than 10K entries.
- For distributed sources, we query each source with a 10K limit and apply the limit again at the end.
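The policy can be sketched like this. It is a simplified illustration: the constant and function names are hypothetical, and each "source" is modeled as an in-memory list rather than an RPC.

```python
from typing import List, Sequence, Tuple

DEFAULT_LIMIT = 100    # default number of entries returned to the user
MAX_LIMIT = 10_000     # hard cap for user limits and per-source queries


def merge_with_limit(per_source: Sequence[List[dict]], limit: int = DEFAULT_LIMIT) -> Tuple[List[dict], bool]:
    """Merge entries from several sources and truncate to `limit`.

    Returns the truncated entries and a flag telling whether data was dropped.
    """
    if limit > MAX_LIMIT:
        raise ValueError(f"limit must be <= {MAX_LIMIT}, got {limit}")
    merged: List[dict] = []
    for entries in per_source:
        # Each source is queried with the 10K cap.
        merged.extend(entries[:MAX_LIMIT])
    # The user-facing limit is applied once more after merging.
    return merged[:limit], len(merged) > limit
```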
## Related issue number
Closes https://github.com/ray-project/ray/issues/25984#issue-1279280673
Part of https://github.com/ray-project/ray/issues/25718#issue-1268968400
## Why are these changes needed?
This PR fixes the issue where `--follow` lost its connection when used for more than 30 seconds: the gRPC timeout is configured to be 30 seconds, and we didn't reset it when `--follow` was set.
The fix sets timeout=None when keepalive==True.
## Related issue number
Closes https://github.com/ray-project/ray/issues/25721
## Why are these changes needed?
This PR implements `!=` predicate for filtering. As a result of this PR, two APIs are changed.
```
--filter key value -> --filter "key=val" or --filter "key!=val"
list_actors(filters=[(key, val), (key2, val2)]) -> list_actors(filters=[(key, "=", val), (key2, "=", val2)])
```
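To illustrate the new format, here is a minimal sketch of how a `"key=val"` / `"key!=val"` string could be turned into a `(key, predicate, value)` tuple and applied to entries. The helper names `parse_filter` and `matches` are hypothetical, and values are compared as strings for simplicity.

```python
from typing import Iterable, Tuple

Filter = Tuple[str, str, str]


def parse_filter(expr: str) -> Filter:
    """Parse "key=val" or "key!=val" into (key, predicate, value)."""
    idx = expr.find("=")
    if idx <= 0:
        raise ValueError(f"Invalid filter {expr!r}; expected key=val or key!=val.")
    if expr[idx - 1] == "!":
        key, op = expr[: idx - 1], "!="
    else:
        key, op = expr[:idx], "="
    if not key:
        raise ValueError(f"Invalid filter {expr!r}; the key is empty.")
    return key.strip(), op, expr[idx + 1:].strip()


def matches(entry: dict, filters: Iterable[Filter]) -> bool:
    """Return True if the entry satisfies every filter."""
    for key, op, value in filters:
        actual = str(entry.get(key))
        if op == "=" and actual != value:
            return False
        if op == "!=" and actual == value:
            return False
    return True
```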
## Why are these changes needed?
This is a first implementation of GET APIs for:
- nodes
- actors
- placement groups
- workers
- tasks
- objects
E.g.
```
# CLI
(dev) ➜ ray git:(ricky/obs-get) ray get nodes cab26304d105caa6f2100908f7b461ef9ed244984ec30b4b46f953f9
---
node_id: cab26304d105caa6f2100908f7b461ef9ed244984ec30b4b46f953f9
node_ip: 172.31.47.143
node_name: 172.31.47.143
resources_total:
  CPU: 8.0
  memory: 16700517582.0
  node:172.31.47.143: 1.0
  object_store_memory: 8350258790.0
state: ALIVE
```
```
# Python
from ray.experimental.state.api import get_node
from ray.experimental.state.common import NodeState

node: NodeState = get_node("<id>")
print(node)
```
We currently do not support getting specific resources by id for 'jobs' and 'runtime-envs':
- jobs: the id is not yet exposed in a way that is easy to query.
- runtime envs: they don't have an associated id.

TODO:
- For now, this uses list endpoints + filtering. Future iterations will implement GET-specific endpoints and interact with raylet/GCS through point query APIs.
- Unit testing for state_manager for GET endpoints when implemented.
- Getting jobs by id.
## Why are these changes needed?
This is to address false alarms on subprocesses exiting when killed by ray stop with SIGTERM.
## What has been changed?
- Added signal handlers for some of the subprocesses:
  - dashboard (head)
  - log monitor
  - ray client server
- Changed the --block semantics and prompt messages.
## Related issue number
Closes #25518
Closes #25283.
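For illustration, a SIGTERM handler for such a subprocess could look like the sketch below. This is not the PR's handler: the function name, the optional cleanup callback, and the choice of exit code 0 are all assumptions.

```python
import signal
import sys


def install_sigterm_handler(cleanup=None):
    """Install a SIGTERM handler that cleans up and exits quietly.

    Assumption: a SIGTERM sent by `ray stop` is an expected shutdown, so the
    process exits with code 0 instead of looking like a crash.
    """

    def handler(signum, frame):
        if cleanup is not None:
            cleanup()
        sys.exit(0)

    # Must be called from the main thread of the process.
    signal.signal(signal.SIGTERM, handler)
    return handler
```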
The dashboard shows inaccurate memory and cpu data when run inside of a docker container, in particular when using cgroups v2. This PR fixes that.
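As background, cgroups v2 exposes container limits through files such as `memory.max` and `cpu.max`. The sketch below shows how those two files can be parsed; it assumes the standard cgroup v2 file formats, the helper names are hypothetical, and it is not the code from this PR.

```python
import os
from typing import Optional


def parse_memory_max(text: str) -> Optional[int]:
    """Parse cgroup v2 memory.max: a byte count, or "max" for no limit."""
    text = text.strip()
    return None if text == "max" else int(text)


def parse_cpu_max(text: str) -> Optional[float]:
    """Parse cgroup v2 cpu.max ("<quota> <period>") into a number of CPUs."""
    quota, period = text.split()
    return None if quota == "max" else int(quota) / int(period)


def read_cgroup_v2_limits(root: str = "/sys/fs/cgroup") -> dict:
    """Read memory and CPU limits from a cgroup v2 directory, if present."""
    limits = {"memory_bytes": None, "cpus": None}
    parsers = (("memory.max", "memory_bytes", parse_memory_max),
               ("cpu.max", "cpus", parse_cpu_max))
    for name, key, parser in parsers:
        path = os.path.join(root, name)
        if os.path.exists(path):
            with open(path) as f:
                limits[key] = parser(f.read())
    return limits
```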
Task/actor/object summary
- Tasks: Group by the func name. In the future, we will also allow grouping by task_group.
- Actors: Group by actor class name. In the future, we will also allow grouping by actor_group.
- Objects: Group by callsite. In the future, we will allow grouping by reference type or task state.
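The grouping itself is a simple count per group key. A minimal sketch, assuming entries are dicts and using hypothetical field names (`func_name`, `state`):

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def summarize(entries: Iterable[dict], group_key: str, state_key: str = "state") -> Dict[str, Dict[str, int]]:
    """Count entries per group (e.g. per function name), broken down by state."""
    summary = defaultdict(Counter)
    for entry in entries:
        summary[entry[group_key]][entry[state_key]] += 1
    return {group: dict(counts) for group, counts in summary.items()}
```

The same helper would cover actors (grouped by class name) and objects (grouped by callsite) by changing `group_key`.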
Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.
Uses the async KV API for downloading in the runtime env agent. This avoids the complexity of running the runtime env creation functions in a separate thread.
Some functions are still sync, including the working_dir/py_modules upload, installing wheels, and possibly others.
I’d like to propose a small change to the API. Currently, the list APIs return a dict mapping ID -> value. I am thinking of changing this to a list, because sorting becomes ineffective if we return a dictionary. So it’s ideal to use a list to keep the order (it’s important for deterministic ordering).
Also, for some APIs, entries don’t have a unique ID. For example, list objects can return multiple entries with the same object ID, which doesn’t work with a dict return type (e.g., there can be more than one entry per object ID if the object is locally referenced and also borrowed by a task or pinned in memory).
Also, users can easily build a dict index on their own if necessary.
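For example, a user-side index over a returned list could be built like this. The helper name and the field names are hypothetical; the index maps each key to a list so that duplicate IDs are preserved.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def index_by(entries: Iterable[dict], key: str) -> Dict[str, List[dict]]:
    """Build a dict index over a list result, keeping duplicates and order."""
    index = defaultdict(list)
    for entry in entries:
        index[entry[key]].append(entry)
    return dict(index)
```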
Followup from #24622. This is another step towards pluggability for runtime_env. Previously some plugin classes had `get_uri` which returned a single URI, while others had `get_uris` which returned a list. This PR makes all plugins use `get_uris`, which simplifies the code overall.
Most of the lines in the diff just come from the new `format.sh` which sorts the imports.
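The benefit of a uniform `get_uris` can be sketched as follows. Only the method name `get_uris` comes from the description above; the class names, the `name` attribute usage, and `collect_uris` are simplified, hypothetical stand-ins for the real plugin classes.

```python
from typing import List


class RuntimeEnvPlugin:
    """Simplified stand-in for a plugin base class."""

    name: str = ""

    def get_uris(self, runtime_env: dict) -> List[str]:
        return []


class WorkingDirPlugin(RuntimeEnvPlugin):
    name = "working_dir"

    def get_uris(self, runtime_env: dict) -> List[str]:
        # A single-URI plugin simply returns a one-element list.
        uri = runtime_env.get(self.name)
        return [uri] if uri else []


class PyModulesPlugin(RuntimeEnvPlugin):
    name = "py_modules"

    def get_uris(self, runtime_env: dict) -> List[str]:
        return list(runtime_env.get(self.name, []))


def collect_uris(plugins: List[RuntimeEnvPlugin], runtime_env: dict) -> List[str]:
    """With one method on every plugin, callers need no per-plugin special cases."""
    uris: List[str] = []
    for plugin in plugins:
        uris.extend(plugin.get_uris(runtime_env))
    return uris
```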
Followup PR to https://github.com/ray-project/ray/pull/20273.
- Hides cache logic behind a class.
- Adds "name" field to runtime env plugin class and makes existing conda, pip, working_dir, and py_modules inherit from the plugin class.
Future work will unify the codepath for these "base plugins" with the codepath for third-party plugins; currently these are different, and URI support is missing for third-party plugins.
## Why are these changes needed?
This refactors the interaction between the state CLI and the API server from a hard-coded request workflow to one based on `SubmissionClient`.
See #24956 for more details.
## Summary
- Created a `StateApiClient` that inherits from the `SubmissionClient` and refactor various listing commands into class methods.
## Related issue number
Closes #24956
Closes #25578
This PR implements the basic log APIs. Higher-level APIs (like `ray logs actors`) will be implemented after the internal API review is done.
```
# If there's only 1 match, print a file content. Otherwise, print all files that match glob.
ray logs [glob_filter] --node-id=[head node by default]

Args:
    --tail: Tail the last X lines
    --follow: Follow the new logs
    --actor-id: The actor id
    --pid --node-ip: For worker logs
    --node-id: The node id of the log
    --interval: When --follow is specified, logs are printed with this interval. (should we remove it?)
```
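The "one match prints the content, several matches print the file names" rule can be sketched as below. This is an illustration only: `resolve_logs` is a hypothetical helper and the available file names are passed in rather than fetched from a node.

```python
import fnmatch
from typing import List, Tuple


def resolve_logs(available: List[str], glob_filter: str = "*") -> Tuple[str, List[str]]:
    """Decide what to do for a glob: print one file's content or list the matches."""
    matched = sorted(fnmatch.filter(available, glob_filter))
    if len(matched) == 1:
        return "print_content", matched
    return "list_files", matched
```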
Currently, when Raylets die, it is hard to figure out:
- whether a Raylet died at all in a cluster. Usually we have to check the nodes where a number of workers died and see if the Raylet has died as well.
- the reason for the Raylet's death.

With this PR, if a Raylet dies for a reason other than SIGTERM, the dashboard agent will report the failure along with the last 20 lines of the Raylet log.
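Collecting the last lines of a log file is straightforward with a bounded deque. A minimal sketch, with a hypothetical helper name and no claim about how the agent actually reads the file:

```python
from collections import deque
from typing import List


def tail_lines(path: str, n: int = 20) -> List[str]:
    """Return the last n lines of a text file without loading it all into memory."""
    with open(path, "r", errors="replace") as f:
        # A deque with maxlen keeps only the most recent n lines while iterating.
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]
```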
The current inheritance behavior for runtime_envs enables the following workflow for Jobs: A working_dir can be set in the Jobs API, and then inside the driver script, if a new per-task runtime_env is defined, it will automatically inherit the driver's working_dir.
There is an ongoing discussion about the best approach for runtime_env inheritance going forward: https://github.com/ray-project/ray/issues/25484, in which we noted that there were no tests covering this behavior.
This PR adds integration tests for the above behavior. If we ultimately decide to abandon the current inheritance behavior and instead have child runtime envs completely overwrite the parent runtime env, this test will fail, reminding us to do the following:
- Update the internal runtime_env usage in Ray Tune to use the `ray.get_runtime_context().runtime_env.update` API
- Update the documentation for Ray Jobs telling users to use `ray.get_runtime_context().runtime_env.update` and update this test
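The difference between the two behaviors under discussion can be pictured with plain dicts. This is a deliberately simplified sketch: the function names are hypothetical, and Ray's real merging logic may treat individual fields differently.

```python
def inherit_runtime_env(parent: dict, child: dict) -> dict:
    """Current behavior (simplified): child fields win, missing fields come from the parent."""
    return {**parent, **child}


def overwrite_runtime_env(parent: dict, child: dict) -> dict:
    """Alternative under discussion: the child env completely replaces the parent env."""
    return dict(child)


job_env = {"working_dir": "s3://bucket/job.zip"}   # set through the Jobs API
task_env = {"pip": ["requests"]}                    # per-task runtime_env in the driver
```

Under inheritance the task still sees the job's `working_dir`; under overwrite it would not, which is the regression the integration test is meant to catch.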
Add visibility into the following to help Ray users and developers debug performance and OOM issues:
- Raylet memory usage broken down by USS vs remaining RSS.
- Total workers' count, CPU percentage usage, and memory usage.
This PR implements the server side of ray log. It continues from #24068.
The PR supports two endpoints:
- `/api/v0/logs`: lists the logs of the node id, filtered by the given glob.
- `/api/v0/logs/{[file | stream]}?filename&pid&actor_id&task_id&interval&lines`: streams the requested log file. The filename can be inferred from pid/actor_id/task_id.

Some tests need to be rewritten; I will do it soon.
As a follow-up after this PR, there will be 2 PRs:
- A PR to add the actual CLI.
- A PR to remove in-memory cached logs and do on-demand queries for actor/worker logs.