Commit graph

18 commits

Author SHA1 Message Date
SangBin Cho
8837a4593f
[State Observability] Truncate data when there are too many entries to return (#26124)
## Why are these changes needed?

This PR adds data truncation when there are more than N entries. The policy is as follows:

By default, we return at most 100 entries. Users can adjust this value, but it cannot be raised above 10K.

By default, all internal RPCs truncate data if it exceeds 10K entries.

For distributed sources, we query each source with a 10K limit and then apply the limit again at the end.
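
As a rough illustration of the policy above (not the code from this PR), the limit handling could look like the sketch below. The names `DEFAULT_LIMIT`, `MAX_LIMIT`, `resolve_limit`, and `merge_and_truncate` are hypothetical, and raising an error for an over-large limit is an assumption about how the cap is enforced.

```python
from typing import List, Optional, Tuple

DEFAULT_LIMIT = 100   # default number of entries returned (from the policy above)
MAX_LIMIT = 10_000    # hard cap for user limits and per-source queries


def resolve_limit(requested: Optional[int] = None) -> int:
    """Return the effective limit, rejecting values outside 1..MAX_LIMIT."""
    if requested is None:
        return DEFAULT_LIMIT
    if requested < 1 or requested > MAX_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_LIMIT}, got {requested}")
    return requested


def merge_and_truncate(
    per_source: List[List[dict]], limit: int
) -> Tuple[List[dict], int]:
    """Cap each source at MAX_LIMIT, merge, then apply the user limit again.

    Returns the truncated entries and the number of entries collected
    before the final truncation.
    """
    merged: List[dict] = []
    for entries in per_source:
        merged.extend(entries[:MAX_LIMIT])
    return merged[:limit], len(merged)
```

Returning the pre-truncation count alongside the entries is one way to let a caller report that data was dropped; whether the real API exposes such a count is not stated here.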

## Related issue number

Closes https://github.com/ray-project/ray/issues/25984#issue-1279280673
Part of https://github.com/ray-project/ray/issues/25718#issue-1268968400
2022-06-28 18:33:57 -07:00
SangBin Cho
def02bd4c9
Revert Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds" #26162 (#26163)
* Revert "Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds (#26080)" (#26162)"

This reverts commit 3017128d5e.
2022-06-28 16:07:32 -07:00
Stephanie Wang
3017128d5e
Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds (#26080)" (#26162)
This reverts commit 2d58bd5a50.
2022-06-28 10:04:58 -07:00
SangBin Cho
68336abf13
[State Observability] Support --detail flag. (#26071)
## Why are these changes needed?

This PR adds --detail flag to the list APIs.
2022-06-28 07:56:44 -07:00
SangBin Cho
2d58bd5a50
[Observability] Fix --follow lost connection when it is used for > 30 seconds (#26080)
## Why are these changes needed?

This PR fixes an issue where --follow loses its connection after being used for more than 30 seconds. The gRPC timeout is configured to be 30 seconds, and it was not reset when --follow is set.

The fix sets timeout=None when keepalive==True.
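
A minimal sketch of the idea, assuming a helper that picks the timeout before the RPC is issued. The function name `rpc_timeout` and the constant are hypothetical and not taken from the PR. In gRPC's Python API, passing `timeout=None` to a call means no deadline is applied.

```python
from typing import Optional

DEFAULT_RPC_TIMEOUT_S = 30  # the 30-second gRPC timeout mentioned above


def rpc_timeout(
    keep_alive: bool, default: float = DEFAULT_RPC_TIMEOUT_S
) -> Optional[float]:
    """Return None (no deadline) for long-lived streams, else the default."""
    return None if keep_alive else default
```

With this shape, a streaming call made for --follow would receive `rpc_timeout(True)`, while one-shot calls keep the 30-second deadline.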

## Related issue number

Closes https://github.com/ray-project/ray/issues/25721
2022-06-28 05:48:25 -07:00
SangBin Cho
4b957e99b5
[State Observability] != predicate for filtering. (#26079)
## Why are these changes needed?

This PR implements `!=` predicate for filtering. As a result of this PR, two APIs are changed.

```
--filter key value -> --filter "key=val" or --filter "key!=val"

list_actors(filters=[(key, val), (key2, val2)]) -> list_actors(filters=[(key, "=", val), (key2, "=", val2)])
```
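
A minimal sketch of how the new filter syntax could be parsed and applied. This is an illustration only: `parse_filter` and `apply_filters` are hypothetical helpers, and comparing values as strings is an assumption.

```python
from typing import Any, Dict, List, Tuple

Filter = Tuple[str, str, str]  # (key, predicate, value)


def parse_filter(expr: str) -> Filter:
    """Split 'key=val' or 'key!=val' into a (key, predicate, value) tuple."""
    idx = expr.find("=")
    if idx <= 0:
        raise ValueError(f"expected 'key=val' or 'key!=val', got {expr!r}")
    # A "!" directly before the first "=" marks the "!=" predicate.
    if expr[idx - 1] == "!":
        key, predicate = expr[: idx - 1], "!="
    else:
        key, predicate = expr[:idx], "="
    key = key.strip()
    if not key:
        raise ValueError(f"missing key in filter {expr!r}")
    return key, predicate, expr[idx + 1 :].strip()


def apply_filters(
    entries: List[Dict[str, Any]], filters: List[Filter]
) -> List[Dict[str, Any]]:
    """Keep the entries that satisfy every (key, predicate, value) filter."""

    def matches(entry: Dict[str, Any]) -> bool:
        for key, predicate, value in filters:
            actual = str(entry.get(key))  # assumption: compare as strings
            if predicate == "=" and actual != value:
                return False
            if predicate == "!=" and actual == value:
                return False
        return True

    return [entry for entry in entries if matches(entry)]
```

The tuples produced by `parse_filter` have the same `(key, "=", val)` shape as the `filters` argument shown above.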
2022-06-28 05:42:19 -07:00
Ricky Xu
44daf3ecd7
[Core][State Observability] Get API using List endpoints + filtering on ids (#25894)
## Why are these changes needed?
This is a first implementation of GET APIs for nodes, actors, placement groups, workers, tasks, and objects.

E.g.

# CLI
(dev) ➜  ray git:(ricky/obs-get) ray get nodes cab26304d105caa6f2100908f7b461ef9ed244984ec30b4b46f953f9
---
node_id: cab26304d105caa6f2100908f7b461ef9ed244984ec30b4b46f953f9
node_ip: 172.31.47.143
node_name: 172.31.47.143
resources_total:
    CPU: 8.0
    memory: 16700517582.0
    node:172.31.47.143: 1.0
    object_store_memory: 8350258790.0
state: ALIVE


# Python 
from ray.experimental.state.api import get_node
from ray.experimental.state.common import NodeState

node: NodeState = get_node(<id>)
print(node)

We currently do not support getting specific resources by id for 'jobs' and 'runtime-envs':

jobs: the id is not yet exposed in a way that can be queried easily.
runtime envs: they do not have an associated id.

TODO:
For now, this uses the list endpoints plus filtering. Future iterations will implement GET-specific endpoints and interact with the raylet/GCS through point query APIs.
Unit tests for state_manager GET endpoints, once they are implemented.
Getting jobs by id.
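
The list-plus-filter approach mentioned in the TODO can be sketched as follows. This is a minimal illustration under stated assumptions: `get_by_id` and `fake_list_nodes` are hypothetical, and the real implementation may filter on the server rather than in the caller.

```python
from typing import Any, Callable, Dict, List, Optional

Entry = Dict[str, Any]


def get_by_id(
    list_fn: Callable[[], List[Entry]], id_key: str, wanted_id: str
) -> Optional[Entry]:
    """Emulate a GET by listing everything and filtering on the id column."""
    matches = [entry for entry in list_fn() if entry.get(id_key) == wanted_id]
    return matches[0] if matches else None


def fake_list_nodes() -> List[Entry]:
    """Stand-in for a list endpoint; the data here is made up."""
    return [
        {"node_id": "aaa", "state": "ALIVE"},
        {"node_id": "bbb", "state": "DEAD"},
    ]


node = get_by_id(fake_list_nodes, "node_id", "bbb")
```

The cost of this approach is that every lookup scans the full list, which is why point query APIs are listed as future work.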
2022-06-27 17:14:29 -07:00
SangBin Cho
6552e096e6
[State Observability] Summary APIs (#25672)
Task/actor/object summary

Tasks: Grouped by function name. In the future, we will also allow grouping by task_group.
Actors: Grouped by actor class name. In the future, we will also allow grouping by actor_group.
Objects: Grouped by callsite. In the future, we will allow grouping by reference type or task state.
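
A minimal sketch of the grouping described above, counting entries per group key. The helper `summarize` and the field names `func_name`, `class_name`, and `callsite` are assumptions for illustration, not the actual schema.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def summarize(entries: Iterable[Dict[str, Any]], group_key: str) -> Dict[str, int]:
    """Count entries per value of group_key (missing keys count as 'None')."""
    return dict(Counter(str(entry.get(group_key)) for entry in entries))


tasks = [
    {"task_id": "1", "func_name": "train"},
    {"task_id": "2", "func_name": "train"},
    {"task_id": "3", "func_name": "evaluate"},
]
task_summary = summarize(tasks, "func_name")
```

Switching the grouping to another column, such as a future `task_group`, would only require passing a different `group_key`.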
2022-06-22 06:21:50 -07:00
SangBin Cho
411b1d8d2d
[State Observability] Return list instead of dict (#25888)
I’d like to propose a small change to the API. Currently, the list APIs return a dict of ID -> value mappings. I am thinking of changing this to a list, because sorting becomes ineffective if we return a dictionary. A list keeps the order, which is important for deterministic output.

Also, for some APIs, entries don’t have a unique id. For example, list objects can return multiple entries with the same object ID, which does not work with a dict return type (e.g., there can be more than one entry for an object ID if the object is locally referenced and also borrowed by a task or pinned in memory).
Also, users can easily build a dict index on their own if necessary.
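
For example, a caller who wants dict-style access could build an index from the returned list. This is a sketch only: `index_by` is a hypothetical helper and the entry fields are made up. Each id maps to a list of entries so that duplicate object IDs are preserved.

```python
from collections import defaultdict
from typing import Any, Dict, List


def index_by(entries: List[Dict[str, Any]], key: str) -> Dict[Any, List[Dict[str, Any]]]:
    """Group entries by key, keeping list order and duplicate ids."""
    index: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for entry in entries:
        index[entry[key]].append(entry)
    return dict(index)


objects = [
    {"object_id": "o1", "reference_type": "LOCAL_REFERENCE"},
    {"object_id": "o1", "reference_type": "PINNED_IN_MEMORY"},
    {"object_id": "o2", "reference_type": "LOCAL_REFERENCE"},
]
by_id = index_by(objects, "object_id")
```

A plain `{id: entry}` dict would silently keep only one of the two `o1` entries, which is the problem described above.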
2022-06-20 22:49:29 -07:00
SangBin Cho
856bea31fb
[State Observability] Ray log CLI / API (#25481)
This PR implements the basic log APIs. Higher-level APIs (such as ray logs actors) will be implemented after the internal API review is done.

# If there's only 1 match, print the file content. Otherwise, print all files that match the glob.
ray logs [glob_filter] --node-id=[head node by default]

Args:
    --tail: Tail the last X lines
    --follow: Follow the new logs
    --actor-id: The actor id
    --pid --node-ip: For worker logs
    --node-id: The node id of the log
    --interval: When --follow is specified, logs are printed at this interval. (should we remove it?)
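
The matching rule and --tail behavior described above can be sketched as follows. This is a minimal illustration, not the CLI's code: `select_logs`, `tail`, and `show` are hypothetical helpers, and logs are modeled as an in-memory dict of filename to content.

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def select_logs(filenames: List[str], glob_filter: str = "*") -> List[str]:
    """Return the filenames that match the glob, in their original order."""
    return [name for name in filenames if fnmatchcase(name, glob_filter)]


def tail(text: str, lines: int) -> str:
    """Return the last `lines` lines of text (empty if lines <= 0)."""
    if lines <= 0:
        return ""
    return "\n".join(text.splitlines()[-lines:])


def show(files: Dict[str, str], glob_filter: str = "*", tail_lines: int = 1000) -> str:
    """Print-style rule: one match shows content, otherwise list the matches."""
    matches = select_logs(list(files), glob_filter)
    if len(matches) == 1:
        return tail(files[matches[0]], tail_lines)
    return "\n".join(matches)
```

The default of 1000 tail lines is an arbitrary choice for the sketch.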
2022-06-13 05:52:57 -07:00
SangBin Cho
00e3fd75f3
[State Observability] Ray log alpha API (#24964)
This PR implements ray log on the server side. It continues from #24068.

The PR supports two endpoints:

/api/v0/logs # list logs of the node id filtered by the given glob. 
/api/v0/logs/{[file | stream]}?filename&pid&actor_id&task_id&interval&lines # Stream the requested file log. The filename can be inferred by pid/actor_id/task_id
Some tests need to be rewritten; I will do that soon.
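
As a sketch of how a client might build a request URL for the second endpoint, using only the query parameters listed above. The helper `log_url` and the base address are assumptions for illustration, and the server may require parameters that are not shown here.

```python
from typing import Optional, Union
from urllib.parse import urlencode

Param = Optional[Union[str, int, float]]


def log_url(base: str, mode: str, **params: Param) -> str:
    """Build /api/v0/logs/{file|stream} with the non-None query parameters."""
    if mode not in ("file", "stream"):
        raise ValueError(f"mode must be 'file' or 'stream', got {mode!r}")
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base}/api/v0/logs/{mode}"
    return f"{url}?{query}" if query else url


# The base address below is an assumption, not taken from the PR.
url = log_url("http://127.0.0.1:8265", "file", filename="raylet.out", lines=100, pid=None)
```

Dropping `None` values lets a caller pass only one of filename, pid, actor_id, or task_id and leave the rest unset.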

As a follow-up after this PR, there will be 2 PRs:

A PR to add the actual CLI
A PR to remove the in-memory cached logs and query actor/worker logs on demand
2022-06-04 05:10:23 -07:00
SangBin Cho
54496d7705
[State Observability API] Support Filtering (#25281)
This PR adds filtering support. Filtering is done on the API server side (not on the source side). Source-side filtering is harder to implement elegantly, and we will handle it in the future (no optimization for alpha APIs).

We will also support limited types of columns for each API.

The API is as follows

ray list [resources] --filter [key] [value] => return only the data where key==value.
In the future, we can also support more complicated filtering such as !=, And, Or, etc.
2022-06-03 17:17:30 -07:00
SangBin Cho
a7e759317b
[State Observability API] Error handling (#24413)
This improves error handling per https://docs.google.com/document/d/1IeEsJOiurg-zctOcBjY-tQVbsCmURFSnUCTkx_4a7Cw/edit#heading=h.pdzl9cil9e8z (the RPC part).

Semantics
If all queries to the sources fail, raise a RayStateApiException.

If only some queries fail, warnings.warn the partial failure when print_api_stats=True. It is True for the CLI. It is False when used from the Python API or when json / yaml format is required.
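
These semantics can be sketched as follows. This is an illustration only: `query_all` is a hypothetical helper, and the `RayStateApiException` class defined here is a local stand-in rather than the one from Ray.

```python
import warnings
from typing import Callable, Dict, List


class RayStateApiException(Exception):
    """Local stand-in for the exception named above (not imported from Ray)."""


def query_all(
    sources: Dict[str, Callable[[], List[dict]]], print_api_stats: bool = False
) -> List[dict]:
    """Query every source; raise if all fail, optionally warn if some fail."""
    results: List[dict] = []
    failed: List[str] = []
    for name, query in sources.items():
        try:
            results.extend(query())
        except Exception:
            failed.append(name)
    if sources and len(failed) == len(sources):
        raise RayStateApiException(f"all {len(failed)} queries failed")
    if failed and print_api_stats:
        warnings.warn(f"{len(failed)}/{len(sources)} queries failed: {failed}")
    return results
```

Keeping the warning behind `print_api_stats` means machine-readable output (json / yaml) is not mixed with diagnostic text.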
2022-05-24 03:56:49 -07:00
SangBin Cho
2bce07d4ce
[State API] List runtime env API (#24126)
This PR adds support for the list runtime env API.
2022-05-02 14:01:00 -07:00
SangBin Cho
73ed67e9e6
[State API] State api limit + Removing unnecessary modules (#24098)
This PR does the following:

Moves all routes into the same module, state_head.py.
Supports a limit feature.
2022-04-22 15:59:46 -07:00
SangBin Cho
1c3329fa38
Revert "Revert "[State Observability] Basic functionality for central… (#23933)
…ized data (#23744)" (#23918)"

This reverts commit fb14e82.
2022-04-18 21:15:43 -07:00
Amog Kamsetty
fb14e82242
Revert "[State Observability] Basic functionality for centralized data (#23744)" (#23918)
This reverts commit 51a4a1a802.

breaking tune multinode tests and kuberay:test_autoscaling_e2e
2022-04-14 14:28:42 -07:00
SangBin Cho
51a4a1a802
[State Observability] Basic functionality for centralized data (#23744)
Support listing actor/pg/job/node/workers

Design doc: https://docs.google.com/document/d/1IeEsJOiurg-zctOcBjY-tQVbsCmURFSnUCTkx_4a7Cw/edit#heading=h.9ub9e6yvu9p2

Note that this PR doesn't contain any output except ids. I will update them in the follow-up PRs.
2022-04-14 07:33:18 -07:00