Commit graph

433 commits

Author SHA1 Message Date
shrekris-anyscale
010a3566e6
[Serve] Allow and remove trailing slashes in Ray submission address (#26093) 2022-06-30 16:04:53 -07:00
Nikita Vemuri
8fc3409676
[dashboard] Add component_activities API (#25996)
Add /api/component_activities to the dashboard snapshot router which returns whether various Ray components are considered active
This currently only contains a response entry for drivers, but will add entries for other components on request as followups
2022-06-30 13:39:01 -07:00
shrekris-anyscale
6e800cc2df
[Serve] Disable test_serve_head.py on OSX (#26178)
`test_serve_head.py` has been very flaky recently on OSX, so this change disables it there.
2022-06-29 11:21:53 -07:00
SangBin Cho
8837a4593f
[State Observability] Truncate data when there are too many entries to return (#26124)
## Why are these changes needed?

This PR adds data truncation when there are more than N number of entries. The policy is as follow;

By default, we return 100 entries at max. Users can adjust this value, but we won't allow to increase more than 10K.

By default, all internal RPCs truncate data if it's > 10K. 

For distributed sources, we query each source with 10K limit and we apply limit again at the end. 

## Related issue number

Closes https://github.com/ray-project/ray/issues/25984#issue-1279280673
Part of https://github.com/ray-project/ray/issues/25718#issue-1268968400
2022-06-28 18:33:57 -07:00
SangBin Cho
def02bd4c9
Revert Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds" #26162 (#26163)
* Revert "Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds (#26080)" (#26162)"

This reverts commit 3017128d5e.
2022-06-28 16:07:32 -07:00
Stephanie Wang
3017128d5e
Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds (#26080)" (#26162)
This reverts commit 2d58bd5a50.
2022-06-28 10:04:58 -07:00
SangBin Cho
68336abf13
[State Observability] Support --detail flag. (#26071)
## Why are these changes needed?

This PR adds --detail flag to the list APIs.
2022-06-28 07:56:44 -07:00
SangBin Cho
2d58bd5a50
[Observability] Fix --follow lost connection when it is used for > 30 seconds (#26080)
## Why are these changes needed?

This PR fixes the issue where --follow lost connection when it is used for > 30 seconds because the gRPC timeout is configured to be 30 seconds, and we don't reset it when --follow is set.

This fixes the issue by setting timeout=None when keepalive==True

## Related issue number

Closes https://github.com/ray-project/ray/issues/25721
2022-06-28 05:48:25 -07:00
SangBin Cho
4b957e99b5
[State Observability] != predicate for filtering. (#26079)
## Why are these changes needed?

This PR implements `!=` predicate for filtering. As a result of this PR, two APIs are changed.

```
--filter key value -> --filter "key=val" or ---filter "key!=val"

list_actors(filters=[(key, val), (key2, val2)]) -> list_actors(filters=[(key, "=", val), (key2, "=", val2)])
```
2022-06-28 05:42:19 -07:00
Guyang Song
58bfad84d3
[runtime env] plugin refactor[1/n] (#26077) 2022-06-28 14:09:05 +08:00
Ricky Xu
44daf3ecd7
[Core][State Observability] Get API using List endpoints + filtering on ids (#25894)
## Why are these changes needed?
This is a first implementation of GET APIs for

nodes
actors
placement groups
workers
tasks
objects
E.g.

# CLI
(dev) ➜  ray git:(ricky/obs-get) ray get nodes cab26304d105caa6f2100908f7b461ef9ed244984ec30b4b46f953f9
---
node_id: cab26304d105caa6f2100908f7b461ef9ed244984ec30b4b46f953f9
node_ip: 172.31.47.143
node_name: 172.31.47.143
resources_total:
    CPU: 8.0
    memory: 16700517582.0
    node:172.31.47.143: 1.0
    object_store_memory: 8350258790.0
state: ALIVE


# Python 
from ray.experimental.state.api import get_node
from ray.experimental.state.common import NodeState

node :NodeState = get_node(<id>)
print(node)

We currently do not support getting specific resources by id for 'jobs' and 'runtime-envs'

jobs: it is not exposing id to be queried easily yet
runtime envs: it doesn't have an id associated.

TODO:
it uses list endpoints + filtering as for now, future iterations will implement GET-specific endpoints and interaction with raylet/GCS with point query APIs.
Unit testing for state_manager for GET endpoints when implemented.
Getting jobs by id
2022-06-27 17:14:29 -07:00
Ricky Xu
3d8ca6cf0f
[Core][cli][usability] ray stop prints errors during graceful shutdown (#25686)
Why are these changes needed?
This is to address false alarms on subprocesses exiting when killed by ray stop with SIGTERM.

What has been changed?
Added signal handlers for some of the subprocesses:
dashboard (head)
log monitor
ray client server
Changed the --block semantics and prompt messages.

Related issue number
Closes #25518
2022-06-27 08:14:59 -07:00
Dmitri Gekhtman
1055eadde0
[Dashboard] Fix dashboard RAM and CPU with cgroups2 (#25710)
Closes #25283.

The dashboard shows inaccurate memory and cpu data when run inside of a docker container, in particular when using cgroups v2. This PR fixes that.
2022-06-26 14:01:26 -07:00
SangBin Cho
6552e096e6
[State Observability] Summary APIs (#25672)
Task/actor/object summary

Tasks: Group by the func name. In the future, we will also allow to group by task_group.
Actors: Group by actor class name. In the future, we will also allow to group by actor_group.
Object: Group by callsite. In the future, we will allow to group by reference type or task state.
2022-06-22 06:21:50 -07:00
Eric Liang
43aa2299e6
[api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695)
Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.
2022-06-21 15:13:29 -07:00
Archit Kulkarni
565e366529
[runtime env] Use async internal kv in package download and plugins (#25788)
Uses the async KV API for downloading in the runtime env agent. This avoids the complexity of running the runtime env creation functions in a separate thread.

Some functions are still sync, including the working_dir/py_modules upload, installing wheels, and possibly others.
2022-06-21 15:02:36 -07:00
shrekris-anyscale
3d6a5450c9
[Serve] Stop Ray in test_serve_head.py fixture (#25893) 2022-06-21 11:28:07 -07:00
shrekris-anyscale
ad12f0cd02
[Serve] Deprecate outdated REST API settings (#25932) 2022-06-21 11:06:45 -07:00
Guyang Song
d1d5fe61c2
[Dashboard][Frontend] Worker table enhancement (#25934) 2022-06-21 14:09:48 +08:00
SangBin Cho
411b1d8d2d
[State Observability] Return list instead of dict (#25888)
I’d like to propose a bit changes to the API. Currently we are returning the dict of ID -> value mapping when the list API is returned. But I am thinking to change this to a list because the sort will become ineffective if we return the dictionary. So, it’s ideal we use the list to keep the order (it’s important for deterministic order)

Also, for some APIs, each entry doesn’t have a unique id. For example, list objects will have duplicated object IDs from their entries, which is not working with dict return type (e.g., there can be more than 1 Object ID entry if the object is locally referenced & borrowed by task/pinned in memory)
Also, users can easily build dict index on their own if it is necessary.
2022-06-20 22:49:29 -07:00
Guyang Song
e13cc4088a
[Dashboard] Don't sort node list by defult (#25884) 2022-06-20 11:35:12 +08:00
Archit Kulkarni
85be093a84
[runtime env] Make all plugins return a List of URIs (#25825)
Followup from #24622.  This is another step towards pluggability for runtime_env.  Previously some plugin classes had `get_uri` which returned a single URI, while others had `get_uris` which returned a list.  This PR makes all plugins use `get_uris`, which simplifies the code overall.

Most of the lines in the diff just come from the new `format.sh` which sorts the imports.
2022-06-17 14:13:44 -05:00
Simon Mo
e560bce3a4
[Serve] bind to 0.0.0.0 in serve_head (#25862) 2022-06-16 11:45:11 -07:00
Archit Kulkarni
23030dbcaa
[runtime env] Hide URI cache behind class (#24622)
Followup PR to https://github.com/ray-project/ray/pull/20273.

- Hides cache logic behind a class.
- Adds "name" field to runtime env plugin class and makes existing conda, pip, working_dir, and py_modules inherit from the plugin class. 

Future work will unify the codepath for these "base plugins" with the codepath for third-party plugins; currently these are different, and URI support is missing for third-party plugins.
2022-06-15 16:14:06 -05:00
shrekris-anyscale
a371756b3c
[Serve] Update Serve CLI and REST API behavior to use new config (#25691) 2022-06-14 19:01:51 -07:00
clarng
badf444eda
Respect import order for psutil and setproctitle (#25780)
Sort imports in a way that preserves the ordering requirements. This PR is needed for any file changes that imports psutil or setproctitle.
2022-06-14 17:44:41 -07:00
Ricky Xu
b1d0b12b4e
[Core \ State Observability] Use Submission client (#25557)
## Why are these changes needed?
This is to refactor the interaction of state cli to API server from a hard-coded request workflow to `SubmissionClient` based. 

See #24956 for more details. 

## Summary
<!-- Please give a short summary of the change and the problem this solves. -->
- Created a `StateApiClient` that inherits from the `SubmissionClient` and refactor various listing commands into class methods. 

## Related issue number
Closes #24956
Closes #25578
2022-06-13 17:11:19 -07:00
shrekris-anyscale
3278763dd7
[Serve] Start all Serve actors in the "serve" namespace only (#25575) 2022-06-13 10:31:28 -07:00
Simon Mo
feb8c29063
Revert "Revert "Revert "use an agent-id rather than the process PID (#24968)"… (#25376)" (#25669)
This reverts commit cb151d5ad6.
2022-06-13 09:22:52 -07:00
SangBin Cho
856bea31fb
[State Observability] Ray log CLI / API (#25481)
This PR implements the basic log APIs. For the better APIs (like higher level APIs like ray logs actors), it will be implemented after the internal API review is done.

# If there's only 1 match, print a file content. Otherwise, print all files that match glob.
ray logs [glob_filter] --node-id=[head node by default]

Args:
    --tail: Tail the last X lines
    --follow: Follow the new logs
    --actor-id: The actor id
    --pid --node-ip: For worker logs
    --node-id: The node id of the log
    --interval: When --follow is specified, logs are printed with this interval. (should we remove it?)
2022-06-13 05:52:57 -07:00
mwtian
65d7a610ab
[Core] Push message to driver when a Raylet dies (#25516)
Currently when Raylets die, it is hard to figure out:

if a Raylet died at all in a cluster. Usually we have to check on nodes where a number of workers died and see if the Raylet has died as well.
reason of Raylet's death.
With this PR, if a Raylet dies from a reason other than SIGTERM, the dashboard agent will report the failure along with last 20 lines of the Raylet log.
2022-06-09 05:54:34 -07:00
shrekris-anyscale
f3c2bd6718
[Serve] Make REST API deployments inherit top-level runtime_env (#25502) 2022-06-08 15:58:00 -07:00
Archit Kulkarni
6d2806f951
[Jobs] [Test] Add integration tests to cover runtime_env inheritance with working_dir and with Tune (#25562)
The current inheritance behavior for runtime_envs enables the following workflow for Jobs:  A working_dir can be set in the Jobs API, and then inside the driver script, if a new per-task runtime_env is defined, it will automatically inherit the driver's working_dir.

There is an ongoing discussion about the best approach for runtime_env inheritance going forward: https://github.com/ray-project/ray/issues/25484, in which we noted that there were no tests covering this behavior.

This PR adds integration tests for the above behavior. If we ultimately decide to abandon the current inheritance behavior and instead have child runtime envs completely overwrite the parent runtime env, this test will fail, reminding us to do the following:

- Update the internal runtime_env usage in Ray Tune to use the `ray.get_runtime_context().runtime_env.update` API
- Update the documentation for Ray Jobs telling users to use `ray.get_runtime_context().runtime_env.update` and update this test
2022-06-08 13:54:06 -07:00
mwtian
1ce0ab7b7c
[Core] Export additional metrics for workers and Raylet memory (#25418)
Add visibility into the following to help Ray users and developers debug performance and OOM issues:

    Raylet memory usage broken down by USS vs remaining RSS.
    Total workers' count, CPU percentage usage, and memory usage.
2022-06-06 10:58:14 -07:00
SangBin Cho
00e3fd75f3
[State Observability] Ray log alpha API (#24964)
This is the PR to implement ray log to the server side. The PR is continued from #24068.

The PR supports two endpoints;

/api/v0/logs # list logs of the node id filtered by the given glob. 
/api/v0/logs/{[file | stream]}?filename&pid&actor_id&task_id&interval&lines # Stream the requested file log. The filename can be inferred by pid/actor_id/task_id
Some tests need to be re-written, I will do it soon.

As a follow-up after this PR, there will be 2 PRs.

PR to add actual CLI
PR to remove in-memory cached logs and do on-demand query for actor/worker logs
2022-06-04 05:10:23 -07:00
SangBin Cho
54496d7705
[State Observability API] Support Filtering (#25281)
This PR adds a filtering support. The filtering is done from the API server side (not from the source side). Source side filtering is a bit complicated to write an elegant solution, and we will handle it in the future (no optimization for alpha APIs).

We will also support limited types of columns for each API.

The API is as follows

ray list [resources] -- filter [key] [value] => filter data that's key==value. 
In the future, we can also support more complicated filtering like !=, And, Or , or etc.
2022-06-03 17:17:30 -07:00
shrekris-anyscale
16bdfe6a39
Restore "[Serve] Deploy Serve deployment graphs via REST API" (#25073) (#25333) 2022-06-02 11:06:53 -07:00
SangBin Cho
cb151d5ad6
Revert "Revert "use an agent-id rather than the process PID (#24968)"… (#25376) 2022-06-01 16:28:48 -07:00
Simon Mo
61099faa58
[CI] Fix dashboard tests broken due to dep version upgrade (#25357) 2022-06-01 12:14:49 -07:00
Eric Liang
905258dbc1
Clean up docstyle in python modules and add LINT rule (#25272) 2022-06-01 11:27:54 -07:00
Eric Liang
517f78e2b8
[minor] Add a job submission hook by env var (#25343) 2022-06-01 11:15:43 -07:00
SangBin Cho
3385d19cbb
Revert "use an agent-id rather than the process PID (#24968)" (#25342)
This reverts commit 02f220b755.

<!-- Thank you for your contribution! Please review https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR. If you don't have the access to it, we will shortly find a reviewer and assign them to your PR. -->

## Why are these changes needed?

Looks like this commit makes `test_ray_shutdown` way more flaky.  cc @mattip for further investigation after revert
<img width="760" alt="Screen Shot 2022-05-31 at 11 14 48 PM" src="https://user-images.githubusercontent.com/18510752/171339737-f48e6e90-391a-4235-bfac-a0aa0e563eb7.png">


## Related issue number

<!-- For example: "Closes #1234" -->

## Checks

- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
2022-06-01 05:35:30 -07:00
shrekris-anyscale
7754645c83
Revert "[Serve] Deploy Serve deployment graphs via REST API (#25073)" (#25330)
This reverts commit 47709b3300.
2022-05-31 15:37:55 -07:00
shrekris-anyscale
47709b3300
[Serve] Deploy Serve deployment graphs via REST API (#25073) 2022-05-31 10:57:08 -07:00
Matti Picus
02f220b755
use an agent-id rather than the process PID (#24968)
When using ray inside a virtualenv on windows, python.exe as reported by sys.executable is a PEP397 launcher to the actual python as reported by os.getpid():

>>> import sys, os, psutil
>>> >>> print(sys.executable)
C:\temp\issue24361\Scripts\python.exe
>>> os.getpid()
2208
>>> child = psutil.Process(2208)
>>> child.cmdline()
['C:\\oss\\CPython38\\python.exe']
>>> child.parent().cmdline()
['C:\\temp\\issue24361\\Scripts\\python.exe']
>>> child.parent().pid
6424
When the agent_manager launches the agent process via Process::Process(), it gets the PID of the launcher process (6424), which is what is expected as an ID when registering the agent in the gRPC callback. But inside agent.py, the child process reports the PID via os.getpid(), which is 2208, and this is the wrong PID to register the agent.

The solution proposed here is another version of #24905 that creates a int agent_id = rand(); before starting the python process, and passes the agent_id to the process.
2022-05-26 22:10:35 -07:00
mwtian
fa32cb7c40
Revert "[core] Resubscribe GCS in python when GCS restarts. (#24887)" (#25168)
This reverts commit 7cf4233858.
2022-05-24 18:13:40 -07:00
mwtian
f79b826f31
[Dashboard] avoid showing disk info when it is unavailable (#24992) 2022-05-24 17:13:47 -07:00
Philipp Moritz
323605d169
Support file:// for runtime_env working directories in jobs (#25062)
This makes it possible to use an NFS file system that is shared on a cluster for runtime_env working directories.

Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-05-24 16:17:18 -07:00
shrekris-anyscale
8b3451318c
[Serve] Update Serve status formatting and processing (#24839) 2022-05-24 11:07:41 -07:00
Edward Oakes
65d21b7ae6
[job submission] Handle env_vars: None case properly in supervisor runtime_env logic (#25087) 2022-05-24 11:01:19 -05:00