Commit graph

25 commits

Author SHA1 Message Date
Alan Guo
326b5bd1ac
Convert job_manager to be async (#27123)
Updates jobs api
Updates snapshot api
Updates state api

Increases jobs api version to 2

Signed-off-by: Alan Guo <aguo@anyscale.com>

Why are these changes needed?
follow-up for #25902 (comment)
2022-08-05 19:33:49 -07:00
SangBin Cho
028684032b
[State Observability] Add warnings for data truncation + order columns as it is defined in StateSchema (#27018)
# Why are these changes needed?

This PR does 3 things.
1. Add warnings for data truncation (which is a follow-up)
2. Improve some confusing warning messages
3. Order columns as they are defined in StateSchema (so that we can customize the column order for better usability). I did this only for list because I thought it wasn't that important for summary, but I might be wrong
2022-07-27 06:56:30 -07:00
SangBin Cho
39b9c44c8d
[State Observability] pre-alpha documentation (#26560)
Adds

Documentation for state APIs
API reference
2022-07-26 05:49:28 -07:00
Ricky Xu
259473c221
[Core][State Observability] Truncate warning message is incorrect when filter is used (#26801)
Signed-off-by: rickyyx <rickyx@anyscale.com>

# Why are these changes needed?
When we return fewer or incomplete results to users, there can be 3 reasons:

1. Data being truncated at the data source (raylets -> API server)
2. Data being filtered at the API server
3. Data being limited at the API server

We do not distinguish those 3 scenarios, but we should. This is why we reported data as truncated when it was actually filtered or limited.

This PR distinguishes these scenarios and prompts warnings accordingly.
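
A minimal sketch of how the three cases could be told apart by comparing entry counts at each stage. The function name and its parameters are hypothetical and are not taken from the Ray codebase.

```python
from typing import List


def explain_incomplete(
    num_at_source: int,
    num_after_truncation: int,
    num_after_filter: int,
    num_after_limit: int,
) -> List[str]:
    """Return the reasons why fewer entries were returned than exist at the source."""
    reasons = []
    # Entries dropped between the data source and the API server.
    if num_after_truncation < num_at_source:
        reasons.append("truncated")
    # Entries dropped by user-supplied filters at the API server.
    if num_after_filter < num_after_truncation:
        reasons.append("filtered")
    # Entries dropped by the user-supplied limit at the API server.
    if num_after_limit < num_after_filter:
        reasons.append("limited")
    return reasons
```

With counts like `explain_incomplete(100, 100, 10, 10)`, only filtering is reported, so a truncation warning would not be shown.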

# Related issue number
Closes #26570
Closes #26923
2022-07-25 23:31:49 -07:00
SangBin Cho
15b711ae6a
[State Observability] Warn if callsite is disabled when ray list objects + raise exception on missing output (#26880)
This PR does 3 things.
1. Warn if callsite is disabled when `ray list objects` and `ray summary objects`
2. Decode owner_id for ray list actors
3. Support raise_on_missing_output
2022-07-24 19:55:36 -07:00
SangBin Cho
37f4692aa8
[State Observability] Fix "No result for get crashing the formatting" and "Filtering not handled properly when key missing in the datum" #26881
Fix two issues

No result for get crashing the formatting
Filtering not handled properly when key missing in the datum
2022-07-23 21:33:07 -07:00
Ricky Xu
6ee37d4ad7
[Core][State Observability] Fix is_alive column with wrong column type that breaks filtering (#26739)
The is_alive column of WorkerState has the wrong column type, which breaks filtering on is_alive.
2022-07-20 16:38:15 -07:00
SangBin Cho
adf24bfa97
[State Observability] Use a table format by default (#26159)
NOTE: tabulate is copied/pasted to the codebase for table formatting.

This PR changes the default layout to be the table format for both summary and list APIs.
2022-07-19 00:54:16 -07:00
SangBin Cho
e9f6ffc5a5
[Core][State Observability] Use address arg + print warning if API responds slowly (#26008)
This PR does 2 things.

(1) Use address instead of api_server_url, which is consistent with the other submission APIs.
(2) When the API does not respond in time, print a warning every 5 seconds. Below is an example. This is useful when the API responds slowly (e.g., when there are partial failures). Without this, users see the API hang for 30 seconds, which is a pretty bad UX.

(0.12 / 10 seconds) Waiting for the response from the API server address http://127.0.0.1:8265/api/v0/delay/5.
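
A rough sketch of the periodic-warning behavior, assuming the request runs in a worker thread. The helper name `call_with_slow_warnings` and its parameters are hypothetical, and this is not how the Ray client is actually implemented.

```python
import concurrent.futures
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_slow_warnings(
    request: Callable[[], T],
    timeout_s: float = 30.0,
    warn_every_s: float = 5.0,
    warn: Callable[[str], None] = print,
) -> T:
    """Run `request` in a worker thread and warn periodically until it returns."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    try:
        future = pool.submit(request)
        while True:
            remaining = timeout_s - (time.monotonic() - start)
            if remaining <= 0:
                raise TimeoutError(f"No response within {timeout_s:g} seconds.")
            # Wait for at most one warning interval, or for the time left.
            done, _ = concurrent.futures.wait(
                [future], timeout=min(warn_every_s, remaining)
            )
            if done:
                return future.result()
            elapsed = time.monotonic() - start
            warn(
                f"({elapsed:.2f} / {timeout_s:g} seconds) "
                "Waiting for the response from the API server."
            )
    finally:
        # Do not block on a request that is still running.
        pool.shutdown(wait=False)
```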
2022-07-14 06:44:07 -07:00
SangBin Cho
8837a4593f
[State Observability] Truncate data when there are too many entries to return (#26124)
## Why are these changes needed?

This PR adds data truncation when there are more than N entries. The policy is as follows:

By default, we return at most 100 entries. Users can adjust this value, but we don't allow increasing it beyond 10K.

By default, all internal RPCs truncate data if there are more than 10K entries.

For distributed sources, we query each source with a 10K limit and apply the limit again at the end.
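
The policy above can be sketched as follows. The constants mirror the numbers in this message, while the function name and the in-memory "sources" are hypothetical stand-ins for the real RPCs.

```python
from typing import List

DEFAULT_LIMIT = 100    # entries returned when the user does not set a limit
MAX_LIMIT = 10_000     # users may not raise the limit beyond this
SOURCE_LIMIT = 10_000  # each per-source query truncates beyond this


def list_from_sources(sources: List[List[dict]], limit: int = DEFAULT_LIMIT) -> List[dict]:
    """Merge entries from several sources, truncating per source and again at the end."""
    if limit > MAX_LIMIT:
        raise ValueError(f"limit must not exceed {MAX_LIMIT}, got {limit}")
    merged: List[dict] = []
    for entries in sources:
        # Each source is queried with its own cap.
        merged.extend(entries[:SOURCE_LIMIT])
    # The user-facing limit is applied once more on the merged result.
    return merged[:limit]
```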

## Related issue number

Closes https://github.com/ray-project/ray/issues/25984#issue-1279280673
Part of https://github.com/ray-project/ray/issues/25718#issue-1268968400
2022-06-28 18:33:57 -07:00
SangBin Cho
68336abf13
[State Observability] Support --detail flag. (#26071)
## Why are these changes needed?

This PR adds --detail flag to the list APIs.
2022-06-28 07:56:44 -07:00
SangBin Cho
4b957e99b5
[State Observability] != predicate for filtering. (#26079)
## Why are these changes needed?

This PR implements `!=` predicate for filtering. As a result of this PR, two APIs are changed.

```
--filter key value -> --filter "key=val" or --filter "key!=val"

list_actors(filters=[(key, val), (key2, val2)]) -> list_actors(filters=[(key, "=", val), (key2, "=", val2)])
```
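
As an illustration, a CLI filter string could be turned into the `(key, predicate, value)` tuple form like this. `parse_filter` is a hypothetical helper, not the parser used by Ray.

```python
from typing import Tuple


def parse_filter(expr: str) -> Tuple[str, str, str]:
    """Split 'key=val' or 'key!=val' into (key, predicate, value)."""
    idx = expr.find("=")
    if idx <= 0:
        raise ValueError(f"Invalid filter: {expr!r}")
    # A '!' directly before the first '=' means the predicate is '!='.
    if expr[idx - 1] == "!":
        key, predicate = expr[: idx - 1], "!="
    else:
        key, predicate = expr[:idx], "="
    if not key:
        raise ValueError(f"Invalid filter: {expr!r}")
    return key, predicate, expr[idx + 1 :]
```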
2022-06-28 05:42:19 -07:00
SangBin Cho
6552e096e6
[State Observability] Summary APIs (#25672)
Task/actor/object summary

Tasks: Group by the func name. In the future, we will also allow grouping by task_group.
Actors: Group by actor class name. In the future, we will also allow grouping by actor_group.
Objects: Group by callsite. In the future, we will allow grouping by reference type or task state.
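
A small sketch of the group-by idea, counting entries per group key. The function and the field name `func_name` are hypothetical and do not describe the actual summary schema.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(entries: Iterable[dict], group_key: str) -> Dict[str, int]:
    """Count entries per value of `group_key`."""
    return dict(Counter(entry[group_key] for entry in entries))


# Hypothetical task entries grouped by function name.
tasks = [{"func_name": "f"}, {"func_name": "g"}, {"func_name": "f"}]
task_summary = summarize(tasks, "func_name")
```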
2022-06-22 06:21:50 -07:00
SangBin Cho
411b1d8d2d
[State Observability] Return list instead of dict (#25888)
I'd like to propose a small change to the API. Currently, the list API returns a dict mapping ID -> value. I'd like to change this to a list, because sorting becomes ineffective if we return a dictionary. A list keeps the order, which is important for deterministic output.

Also, for some APIs, entries don't have a unique ID. For example, list objects can return several entries with the same object ID (e.g., when the object is locally referenced and also borrowed by a task or pinned in memory), which doesn't work with a dict return type.
Also, users can easily build a dict index on their own if necessary.
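
For example, a caller could build such an index from the returned list while still handling duplicate IDs. `index_by` and the field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def index_by(entries: List[dict], key: str) -> Dict[str, List[dict]]:
    """Group list entries by `key`, keeping every entry that shares an ID."""
    index: Dict[str, List[dict]] = defaultdict(list)
    for entry in entries:
        index[entry[key]].append(entry)
    return dict(index)


# Two entries share an object ID, which a plain ID -> value dict could not hold.
objects = [
    {"object_id": "a", "reference_type": "LOCAL_REFERENCE"},
    {"object_id": "a", "reference_type": "PINNED_IN_MEMORY"},
    {"object_id": "b", "reference_type": "LOCAL_REFERENCE"},
]
objects_by_id = index_by(objects, "object_id")
```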
2022-06-20 22:49:29 -07:00
SangBin Cho
856bea31fb
[State Observability] Ray log CLI / API (#25481)
This PR implements the basic log APIs. Higher-level APIs (like ray logs actors) will be implemented after the internal API review is done.

# If there's only 1 match, print the file content. Otherwise, print all files that match the glob.
ray logs [glob_filter] --node-id=[head node by default]

Args:
    --tail: Tail the last X lines
    --follow: Follow the new logs
    --actor-id: The actor id
    --pid --node-ip: For worker logs
    --node-id: The node id of the log
    --interval: When --follow is specified, logs are printed with this interval. (should we remove it?)
2022-06-13 05:52:57 -07:00
SangBin Cho
54496d7705
[State Observability API] Support Filtering (#25281)
This PR adds filtering support. Filtering is done on the API server side (not on the source side). An elegant source-side solution is somewhat complicated to write, so we will handle it in the future (no optimization for alpha APIs).

We will also support only limited types of columns for each API.

The API is as follows:

ray list [resources] --filter [key] [value] => keep only data where key==value.
In the future, we can also support more complicated filtering like !=, AND, OR, etc.
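
A minimal sketch of equality filtering on the API server side, assuming entries are plain dicts and that CLI values arrive as strings. `filter_entries` is a hypothetical helper.

```python
from typing import List


def filter_entries(entries: List[dict], key: str, value: str) -> List[dict]:
    """Keep entries whose `key` equals `value`; entries without the key are dropped."""
    # Compare as strings because values passed on the command line are strings.
    return [e for e in entries if key in e and str(e[key]) == value]
```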
2022-06-03 17:17:30 -07:00
SangBin Cho
a7e759317b
[State Observability API] Error handling (#24413)
This improves error handling per https://docs.google.com/document/d/1IeEsJOiurg-zctOcBjY-tQVbsCmURFSnUCTkx_4a7Cw/edit#heading=h.pdzl9cil9e8z (the RPC part).

Semantics
If all queries to the source failed, raise a RayStateApiException.

If some of the queries fail, the partial failure is reported via warnings.warn when print_api_stats=True. It is true for the CLI. It is false when called from the Python API or when json / yaml format is required.
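
These semantics can be sketched as follows. `StateApiError` is a local stand-in for RayStateApiException, and `query_all` with its callable-per-source argument is hypothetical.

```python
import warnings
from typing import Callable, Dict, List


class StateApiError(Exception):
    """Local stand-in for RayStateApiException in this sketch."""


def query_all(queries: Dict[str, Callable[[], list]], print_api_stats: bool) -> list:
    """Run every source query; raise if all fail, warn if only some fail."""
    results: list = []
    failed: List[str] = []
    for name, query in queries.items():
        try:
            results.extend(query())
        except Exception:
            failed.append(name)
    if queries and len(failed) == len(queries):
        raise StateApiError("All queries to the sources failed.")
    if failed and print_api_stats:
        warnings.warn(f"{len(failed)}/{len(queries)} queries failed: {failed}")
    return results
```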
2022-05-24 03:56:49 -07:00
SangBin Cho
ec653e3196
[Nightly test] Move two line downloads to one line. (#25061)
It fixes the mysterious error where every cluster env build fails when pip uninstall / pip install is written on 2 lines. The root cause will be fixed later.
2022-05-22 00:07:03 -07:00
SangBin Cho
b9c30529d8
[Core/Observability 1/N] Add a "running" state to task status (#24651)
This PR adds 2 more states into TaskStatus

enum TaskStatus {
  // The task is scheduled properly and waiting for execution.
  // It includes time to deliver the task to the remote worker + queueing time
  // from the execution side.
  WAITING_FOR_EXECUTION = 5;
  // The task that is running.
  RUNNING = 6;
}
2022-05-16 05:39:05 -07:00
SangBin Cho
2bce07d4ce
[State API] List runtime env API (#24126)
This PR adds support for the list runtime env API.
2022-05-02 14:01:00 -07:00
SangBin Cho
73ed67e9e6
[State API] State api limit + Removing unnecessary modules (#24098)
This PR does the following:

Move all routes into the same module, state_head.py
Support a limit feature.
2022-04-22 15:59:46 -07:00
SangBin Cho
30ab5458a7
[State Observability] Tasks and Objects API (#23912)
This PR implements ray list tasks and ray list objects APIs.

NOTE: You can ignore the merge conflict for now. It is because the first PR was reverted. There's a fix PR open now.
2022-04-21 18:45:03 -07:00
SangBin Cho
1c3329fa38
Revert "Revert "[State Observability] Basic functionality for central… (#23933)
…ized data (#23744)" (#23918)"

This reverts commit fb14e82.
2022-04-18 21:15:43 -07:00
Amog Kamsetty
fb14e82242
Revert "[State Observability] Basic functionality for centralized data (#23744)" (#23918)
This reverts commit 51a4a1a802.

breaking tune multinode tests and kuberay:test_autoscaling_e2e
2022-04-14 14:28:42 -07:00
SangBin Cho
51a4a1a802
[State Observability] Basic functionality for centralized data (#23744)
Support listing actor/pg/job/node/workers

Design doc: https://docs.google.com/document/d/1IeEsJOiurg-zctOcBjY-tQVbsCmURFSnUCTkx_4a7Cw/edit#heading=h.9ub9e6yvu9p2

Note that this PR doesn't contain any output except ids. I will update them in the follow-up PRs.
2022-04-14 07:33:18 -07:00