Commit graph

474 commits

Author SHA1 Message Date
shrekris-anyscale
4ab97399cd
[Serve] Only start Serve in the CLI through the serve deploy command (#27063)
These Serve CLI commands start Serve if it's not already running:

* `serve deploy`
* `serve config`
* `serve status`
* `serve shutdown`

#27026 introduces the ability to specify a `host` and `port` in the Serve config file. However, once Serve starts running, changing these options requires tearing down the entire Serve application and relaunching it. This limitation is an issue because users can inadvertently start Serve by running one of the `GET`-based CLI commands (i.e. `serve config` or `serve status`) before running `serve deploy`.

This change makes `serve deploy` the only CLI command that can start a Serve application on a Ray cluster. The other commands have updated behavior when Serve is not yet running on the cluster.

* `serve config`: prints an empty config body.

```yaml
import_path: ''
runtime_env: {}
deployments: []
```

* `serve status`: prints an empty status body, with a new `app_status` `status` value: `NOT_STARTED`.

```yaml
app_status:
  status: NOT_STARTED
  message: ''
  deployment_timestamp: 0
deployment_statuses: []
```

* `serve shutdown`: performs a no-op.
2022-07-27 13:21:19 -05:00
Archit Kulkarni
60f33777a2
[runtime env] Add URI support for plugins (#26746) 2022-07-28 00:28:19 +08:00
Simon Mo
e5a8b1dd55
[Serve] Add API Annotations And Move to _private (#27058) 2022-07-27 09:08:26 -07:00
Archit Kulkarni
0e47fb4ed9
[Jobs] [runtime env] Allow RuntimeEnvConfig object in Job Submission (#26989) 2022-07-27 11:06:23 -05:00
Alan Guo
a7dca17973
Make New Dashboard the default dashboard (#26996)
Add UsageStats alert to new dashboard
Update wording of "back to legacy dashboard", "try new dashboard" buttons

Signed-off-by: Alan Guo <aguo@anyscale.com>
2022-07-27 07:04:34 -07:00
SangBin Cho
028684032b
[State Observability] Add warnings for data truncation + order columns as it is defined in StateSchema (#27018)
# Why are these changes needed?

This PR does 3 things:

- Add warnings for data truncation (which is a follow-up).
- Improve some confusing warning messages.
- Order columns as they are defined in StateSchema (so that we can customize the column order for better usability). I did this only for list because I thought it wasn't that important for summary, but I might be wrong.
2022-07-27 06:56:30 -07:00
Alan Guo
5d6bc5360d
Fix the jobs tab in the beta dashboard and fill it with data from both "submission" jobs and "driver" jobs (#25902)
## Why are these changes needed?
- Fixes the jobs tab in the new dashboard. Previously it didn't load.
- Combines the old job concept, "driver jobs", and the new job submission concept into a single concept called "jobs". The Jobs tab shows information about both kinds of jobs.

- Updates all job APIs: they now return both submission jobs and driver jobs. They also contain additional data in the response, including "id", "job_id", "submission_id", and "driver". They also accept either job_id or submission_id as input.

- Job ID is the same as the "ray core job id" concept. It is in the form of "0100000" and is the primary id to represent jobs.
- Submission ID is an ID that is generated for each ray job submission. It is in the form of "raysubmit_12345...". It is a secondary id that can be used if a client needs to provide a self-generated id, or if the job id doesn't exist (e.g., if the submission job doesn't create a ray driver).

This PR has 2 deprecations
- The `submit_job` sdk now accepts a new kwarg `submission_id`. `job_id` is deprecated.
- The `ray job submit` CLI now accepts `--submission-id`. `--job-id` is deprecated.

**This PR has 4 backwards incompatible changes:**
- list_jobs sdk now returns a list instead of a dictionary
- the `ray job list` CLI now prints a list instead of a dictionary
- The `/api/jobs` endpoint returns a list instead of a dictionary
- The `POST api/jobs` endpoint (submit job) now returns a json with `submission_id` field instead of `job_id`.
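
Callers that indexed the old dictionary-shaped response by ID may want to rebuild that mapping from the new list. The following is a minimal sketch, not part of the PR: the helper name `index_jobs` and the sample records are hypothetical, and only the field names `job_id`, `submission_id`, and `driver` come from the description above.

```python
from typing import Dict, List, Optional


def index_jobs(jobs: List[dict]) -> Dict[str, dict]:
    """Rebuild an ID-keyed mapping from a list-shaped jobs response.

    Prefers job_id and falls back to submission_id, because a submission
    job that never created a driver may not have a job_id.
    """
    indexed: Dict[str, dict] = {}
    for job in jobs:
        key: Optional[str] = job.get("job_id") or job.get("submission_id")
        if key is not None:
            indexed[key] = job
    return indexed


# Hypothetical records, shaped after the fields named in the description.
sample_jobs = [
    {"job_id": "0100000", "submission_id": "raysubmit_abc", "driver": None},
    {"job_id": None, "submission_id": "raysubmit_def", "driver": None},
]
jobs_by_id = index_jobs(sample_jobs)
```

The second record is keyed by its submission ID because it has no job ID.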
2022-07-27 02:39:52 -07:00
SangBin Cho
2ca11d61b3
[State Observability] Set the default detail formatting as yaml + quicker head node register (#26946)
## Why are these changes needed?

This PR does 2 things.

1. When `--detail` is specified, set the default formatting as yaml. 
2. It seems to take 5 seconds to register the head node with the API server (because it gets node info every 5 seconds, and when the API server has just started, the head node is not yet registered with GCS). This PR pings for node info more frequently until the head node is registered with the API server.

## Related issue number

Closes https://github.com/ray-project/ray/issues/26939
2022-07-26 13:49:30 -07:00
SangBin Cho
39b9c44c8d
[State Observability] pre-alpha documentation (#26560)
Adds:

- Documentation for state APIs
- API reference
2022-07-26 05:49:28 -07:00
Alan Guo
50b20809b8
[Dashboard] Stop caching logs in memory. Use state observability api to fetch on demand. (#26818)
Signed-off-by: Alan Guo <aguo@anyscale.com>

## Why are these changes needed?
Reduces memory footprint of the dashboard.
Also adds some cleanup to the errors data.

Also cleans up actor cache by removing dead actors from the cache.

Dashboard UI no longer allows you to see logs for all workers in a node. You must click into each worker's logs individually.
<img width="1739" alt="Screen Shot 2022-07-20 at 9 13 00 PM" src="https://user-images.githubusercontent.com/711935/180128633-1633c187-39c9-493e-b694-009fbb27f73b.png">


## Related issue number
fixes #23680 
fixes #22027
fixes #24272
2022-07-26 03:10:57 -07:00
Archit Kulkarni
084f06f49a
[Doc] [Job submission] [Dashboard] Add tip for long runtime_env installation and improve error (#26911)
# Why are these changes needed?
- The dashboard can display the message "<actor> cannot be created because the Ray cluster cannot satisfy its resource requirements" in the case where the runtime env setup is stalled. This PR updates this message to include the possibility of the runtime env setup failing.
- This PR adds a tip to the Job Submission doc saying that if a job is stalled in PENDING, the runtime env setup may have stalled. It adds a pointer to the log files, which should have more information.
- The runtime env cannot stall forever; it fails after 10 minutes. This is a new feature added after the Ray 1.13 branch cut. In Ray <= 1.13, the runtime env can still stall forever.

# Related issue number
Closes #26332
2022-07-25 23:32:27 -07:00
Ricky Xu
259473c221
[Core][State Observability] Truncate warning message is incorrect when filter is used (#26801)
Signed-off-by: rickyyx <rickyx@anyscale.com>

# Why are these changes needed?
When we return fewer or incomplete results to users, there can be 3 reasons:

- Data being truncated at the data source (raylets -> API server)
- Data being filtered at the API server
- Data being limited at the API server

We were not distinguishing those 3 scenarios, but we should. This is why we reported data as truncated when it was actually filtered or limited.

This PR distinguishes these scenarios and prompts warnings accordingly.
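
One way to keep the three causes apart is to track the entry count at each stage and compare neighbouring counts. This is only an illustrative sketch: the function name, its parameters, and the message wording are all hypothetical and are not taken from the Ray codebase.

```python
from typing import List


def explain_missing(total: int, received: int, after_filter: int, returned: int) -> List[str]:
    """Name each reason why fewer entries were returned than exist.

    total:        entries that exist at the data sources
    received:     entries that reached the API server
    after_filter: entries left after server-side filtering
    returned:     entries left after the server-side limit
    """
    reasons = []
    if received < total:
        reasons.append(f"{total - received} entries truncated at the data source")
    if after_filter < received:
        reasons.append(f"{received - after_filter} entries removed by filters")
    if returned < after_filter:
        reasons.append(f"{after_filter - returned} entries dropped by the limit")
    return reasons


reasons = explain_missing(total=100, received=80, after_filter=50, returned=10)
```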

# Related issue number
Closes #26570
Closes #26923
2022-07-25 23:31:49 -07:00
Alan Guo
e8222ff600
[dashboard] Update cluster_activities endpoint to use pydantic. (#26609)
Update cluster_activities endpoint to use pydantic so we have better data validation.

- Make timestamp a required field.
- Add pydantic to ray[default] requirements.
2022-07-25 10:54:22 -07:00
Guyang Song
bf97a6944b
[Dashboard] Actor Table UI Optimize (#26785)
Co-authored-by: 多牧 <xuzhi.mxz@antfin.com>
2022-07-25 18:49:48 +08:00
SangBin Cho
15b711ae6a
[State Observability] Warn if callsite is disabled when ray list objects + raise exception on missing output (#26880)
This PR does 3 things.
1. Warn if callsite is disabled when running `ray list objects` and `ray summary objects`
2. Decode owner_id for ray list actors
3. Support raise_on_missing_output
2022-07-24 19:55:36 -07:00
SangBin Cho
37f4692aa8
[State Observability] Fix "No result for get crashing the formatting" and "Filtering not handled properly when key missing in the datum" #26881
Fix two issues:

- No result for get crashing the formatting
- Filtering not handled properly when key missing in the datum
2022-07-23 21:33:07 -07:00
Stephanie Wang
55a0f7bb2d
[core] ray.init defaults to an existing Ray instance if there is one (#26678)
ray.init() will currently start a new Ray instance even if one is already existing, which is very confusing if you are a new user trying to go from local development to a cluster. This PR changes it so that, when no address is specified, we first try to find an existing Ray cluster that was created through `ray start`. If none is found, we will start a new one.

This makes two changes to the ray.init() resolution order:
1. When `ray start` is called, the started cluster address was already written to a file called `/tmp/ray/ray_current_cluster`. For ray.init() and ray.init(address="auto"), we will first check this local file for an existing cluster address. The file is deleted on `ray stop`. If the file is empty, ray.init(address="auto") autodetects any running cluster (legacy behavior), while ray.init() with address=None starts a new local Ray instance.
2. When ray.init(address="local") is called, we will create a new local Ray instance, even if one is already existing. This behavior seems to be necessary mainly for `ray.client` use cases.

This also surfaces the logs about which Ray instance we are connecting to. Previously these were hidden because we didn't set up the log until after connecting to Ray. So now Ray will log one of the following messages during ray.init:
```
(Connecting to existing Ray cluster at address: <IP>...)
...connection...
(Started a local Ray cluster.| Connected to Ray Cluster.)( View the dashboard at <URL>)
```

Note that this changes the dashboard URL to be printed with `ray.init()` instead of when the dashboard is first started.
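
The resolution order described above can be sketched as a pure function. This is a simplified illustration, not Ray's implementation: `resolve_address` and its parameters are hypothetical, the cluster file contents and the autodetection result are passed in rather than read from the system, and raising `ConnectionError` when autodetection finds nothing is an assumption the description does not cover.

```python
from typing import Optional


def resolve_address(
    address: Optional[str],
    cluster_file_contents: str,
    autodetected: Optional[str],
) -> Optional[str]:
    """Return the address to connect to, or None to start a new local instance."""
    if address == "local":
        # Always start a new local instance, even if a cluster exists.
        return None
    if address is None or address == "auto":
        # First consult the address recorded by `ray start`.
        recorded = cluster_file_contents.strip()
        if recorded:
            return recorded
        if address is None:
            # Nothing recorded and no address given: start a new local instance.
            return None
        # address == "auto": fall back to legacy autodetection.
        if autodetected is None:
            # Assumption: the description does not say what happens here.
            raise ConnectionError("no running Ray cluster found")
        return autodetected
    # An explicit address is used as given.
    return address


chosen = resolve_address(None, "10.0.0.1:6379\n", None)
```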

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-07-23 11:27:22 -07:00
Jiajun Yao
3a48a79fd7
[Usage stats] Report total number of running jobs for usage stats purpose. (#26787)
- Report total number of running jobs
- Fix total number of nodes to include only alive nodes

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-07-21 01:37:58 -07:00
Ricky Xu
6ee37d4ad7
[Core][State Observability] Fix is_alive column with wrong column type that breaks filtering (#26739)
is_alive column of the WorkerState has wrong column type that breaks filtering on is_alive
2022-07-20 16:38:15 -07:00
Matti Picus
b835cb944d
redo agent_pid -> agent_id (#25806)
Redo the agent-id changes from #24968. The original PR is in the first commit, the second commit fixes a fatal flaw when using RAY_BACKEND_LOG_LEVEL=debug, which caused the "Ray C++, Java" tests to fail on macOS.
2022-07-19 20:26:49 -07:00
Guyang Song
f96f5a1c18
[runtime env] plugin refactor [5/n]: support priority (#26659) 2022-07-20 10:07:06 +08:00
Jiajun Yao
2b37c32d43
Auto reconnect for gcs aio client (#26673)
#20299 adds auto reconnect for sync gcs client and this PR does the same thing for async gcs client.
2022-07-19 13:11:09 -07:00
SangBin Cho
adf24bfa97
[State Observability] Use a table format by default (#26159)
NOTE: tabulate is copied/pasted to the codebase for table formatting.

This PR changes the default layout to be the table format for both summary and list APIs.
2022-07-19 00:54:16 -07:00
Riatre
591cd22be7
Revert "Revert "Bump pytest from 5.4.3 to 7.0.1"" (#26525)
* Revert "Revert "Bump pytest from 5.4.3 to 7.0.1""

This reverts commit ab10890e90.

Signed-off-by: Riatre Foo <foo@riat.re>

* Fix missing test data files dependency in rllib/BUILD

See #26334 and #26517 for context.

Once this is in, it should be good to roll forward again.

Signed-off-by: Riatre Foo <foo@riat.re>

* debug: run all tests

Signed-off-by: Riatre Foo <foo@riat.re>

* Revert "debug: run all tests"

This reverts commit 0c5e796b0eb437d64922f66749c61b0412486970.

Signed-off-by: Riatre Foo <foo@riat.re>

* fix new tests since last rebase

Signed-off-by: Riatre Foo <foo@riat.re>
2022-07-18 21:21:19 -07:00
Jules S. Damji
55368402ee
added summary why and when to use bulk vs streaming data ingest (#26637) 2022-07-17 18:46:58 -07:00
Simon Mo
63d3ccf81e
[Serve] Default to EveryNode when starting Serve from REST API (#26588) 2022-07-15 15:47:54 -07:00
Guyang Song
1949f35901
[runtime env] plugin refactor[4/n]: remove runtime env protobuf (#26522) 2022-07-15 13:56:12 +08:00
brucez-anyscale
d98a2482de
[Dashboard] Fix test dashboard flaky by catch an expected exception (#26555) 2022-07-14 20:57:46 -07:00
SangBin Cho
e9f6ffc5a5
[Core][State Observability] Use address arg + print warning if API responds slowly (#26008)
This PR is doing 2 things.

(1) Use address instead of api_server_url, which is consistent with the other submission APIs.
(2) When the API does not respond in time, print a warning every 5 seconds. Below is an example. This is useful when the API responds slowly (e.g., when there are partial failures). Without this, users see the API hang for 30 seconds, which is a pretty bad UX.

(0.12 / 10 seconds) Waiting for the response from the API server address http://127.0.0.1:8265/api/v0/delay/5.
2022-07-14 06:44:07 -07:00
Sven Mika
ab10890e90
Revert "Bump pytest from 5.4.3 to 7.0.1" (breaks lots of RLlib tests for unknown reasons) (#26517) 2022-07-13 11:19:30 -07:00
Ricky Xu
365ffe21e5
[Core | State Observability] Implement API Server (Dashboard) HTTP Requests Throttling (#26257)
This is to limit the max number of HTTP requests the dashboard (API server) will accept before rejecting more requests.
This will make sure the observability requests do not overload the downstream systems (raylet/gcs) when delegating too many concurrent state observability requests to the cluster.
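
A minimal sketch of this kind of throttling, assuming a simple in-flight counter that rejects rather than queues excess requests. The class name `RequestLimiter`, the limit of 2, and the use of status 429 are illustrative assumptions, not details taken from the PR.

```python
import asyncio


class RequestLimiter:
    """Reject new requests once max_in_flight requests are being served."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    async def handle(self, work):
        # Reject immediately instead of queueing, so downstream systems
        # never see more than max_in_flight concurrent requests.
        if self.in_flight >= self.max_in_flight:
            return 429, None
        self.in_flight += 1
        try:
            return 200, await work()
        finally:
            self.in_flight -= 1


async def demo():
    limiter = RequestLimiter(max_in_flight=2)

    async def slow_query():
        # Stand-in for a state query delegated to the cluster.
        await asyncio.sleep(0.01)
        return "ok"

    # Three requests arrive at once; only two fit under the limit.
    return await asyncio.gather(*(limiter.handle(slow_query) for _ in range(3)))


results = asyncio.run(demo())
```

Rejecting instead of queueing keeps the backlog from building up on the API server while the cluster is slow.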
2022-07-13 09:05:26 -07:00
Riatre
2cdb76789e
Bump pytest from 5.4.3 to 7.0.1 (#26334)
See #23676 for context. This is another attempt at that as I figured out what's going wrong in `bazel test`. Supersedes #24828.

Now that there are Python 3.10 wheels for Ray 1.13 and this is no longer a blocker for supporting Python 3.10, I still want to make `bazel test //python/ray/tests/...` work for developing in a 3.10 env, and make it easier to add Python 3.10 tests to CI in future.

The change contains three commits with rather descriptive commit messages, which I repeat here:

Pass deps to py_test in py_test_module_list

    Bazel macro py_test_module_list takes a `deps` argument, but completely
    ignores it instead of passing it to `native.py_test`. Fixing that as we
    are going to use deps of py_test_module_list in BUILD in later changes.

    cpp/BUILD.bazel depends on the broken behaviour: it deps-on a cc_library
    from a py_test, which isn't working, see upstream issue:
    https://github.com/bazelbuild/bazel/issues/701.
    This is fixed by simply removing the (non-working) deps.

Depend on conftest and data files in Python tests BUILD files

    Bazel requires that all the files used in a test run should be
    represented in the transitive dependencies specified for the test
    target. For py_test, it means srcs, deps and data.

    Bazel enforces this constraint by creating a "runfiles" directory,
    symlinking the files in the dependency closure into it, and running
    the test in the "runfiles" directory, so that the test shouldn't see
    files not in the dependency graph.

    Unfortunately, the constraint does not apply for a large number of
    Python tests, due to pytest (>=3.9.0, <6.0) resolving these symbolic
    links during test collection and effectively "breaking out" of the
    runfiles tree.

    pytest >= 6.0 introduces a breaking change and removed the symbolic link
    resolving behaviour, see pytest pull request
    https://github.com/pytest-dev/pytest/pull/6523 for more context.

    Currently, we are underspecifying dependencies in a lot of BUILD files
    and thus blocking us from updating to newer pytest (for Python 3.10
    support). This change hopefully fixes all of them, and at least those in
    CI, by adding data or source dependencies (mostly for conftest.py-s)
    where needed.

Bump pytest version from 5.4.3 to 7.0.1

    We want at least pytest 6.2.5 for Python 3.10 support, but not past
    7.1.0 since it drops Python 3.6 support (which Ray still supports), thus
    the version constraint is set to <7.1.

    Updating pytest, combined with earlier BUILD fixes, changed the ground
    truth of a few error-message-based unit tests; these tests are updated
    to reflect the change.

    There are also two small drive-by changes for making test_traceback and
    test_cli pass under Python 3.10. These are discovered while debugging CI
    failures (on earlier Python) with a Python 3.10 install locally.  Expect
    more such issues when adding Python 3.10 to CI.
2022-07-12 21:14:35 -07:00
brucez-anyscale
57258335bd
[Serve] Fix test_cli flakiness (#26471) 2022-07-12 17:57:08 -07:00
Alan Guo
7ad3a247bf
[Dashboard] [Frontend] Add workers to the main node tab in the New Dashboard UI (#26274)
The old dashboard UI made it much easier to see all the work across all workers because workers were shown alongside nodes in the main nodes page. This change brings the same functionality to the new Dashboard UI.

Some changes in this PR:

- Factor out the NodeRow into its own component and into its own file.
- Introduce WorkerRow, which shows information about a worker.
- Update the heading of the table column because the column will show different data depending on whether it's a node row or a worker row.
- Make sure we're rounding percentages to a single decimal place.
- Logs button for worker row will go to the logs page and filter to just the log files related to that worker.
- Update the API for fetching nodes into fetching nodes + workers.
- Fix bug where object store memory was not showing the total size but instead the remaining size.
2022-07-12 16:28:08 -07:00
Yi Cheng
a68c02a15d
[dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs (#26408)
## Why are these changes needed?
In https://github.com/ray-project/ray/pull/26405 we added health checks for GCS and raylets.

This PR exposes them through endpoints in the dashboard and dashboard agent.

For dashboard, we added `http://host:port/api/gcs_healthz` and it'll send RPC to GCS directly to see whether the GCS is alive or not.

For agent, we added `http://host:port/api/local_raylet_healthz` and it'll send RPC to GCS to check whether raylet is alive or not.

We think raylet is alive if
- GCS is dead
- GCS is alive and GCS thinks the raylet is alive

If GCS is dead for more than X seconds (60 by default), raylet will just crash itself, so KubeRay can still catch it.
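
The liveness decision for the agent endpoint can be sketched as a small pure function. This sketch reads the second condition as "GCS is alive and reports the raylet as alive", which is an assumption about the intended meaning. The function and parameter names are hypothetical.

```python
from typing import Optional


def raylet_considered_alive(
    gcs_reachable: bool,
    gcs_says_raylet_alive: Optional[bool],
) -> bool:
    """Decide what a local raylet health check should report."""
    if not gcs_reachable:
        # GCS cannot be asked, so do not report a raylet failure.
        # Per the description, the raylet exits on its own if GCS
        # stays dead for long enough.
        return True
    # GCS is alive: trust what it reports about the raylet.
    # Assumption: a missing answer is treated as not alive.
    return bool(gcs_says_raylet_alive)


healthy_when_gcs_down = raylet_considered_alive(False, None)
```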
2022-07-09 13:09:48 -07:00
Nikita Vemuri
56716a1c1b
[dashboard] Add RAY_CLUSTER_ACTIVITY_HOOK to /api/component_activities (#26297)
- Add external hook to /api/component_activities endpoint in dashboard snapshot router.
- Change is_active field of RayActivityResponse to take an enum RayActivityStatus instead of bool. This is a backward incompatible change, but should be ok because "[dashboard] Add component_activities API" (#25996) wasn't included in any branch cuts. RayActivityResponse now supports informing when there was an error getting the activity observation and the reason.
2022-07-08 10:51:59 -07:00
SangBin Cho
2dd5fdfdf1
[Usage stats] Add tags & number of nodes to the report. (#25852)
This PR adds the RAY_EXTRA_USAGE_TAGS to add additional tag metadata + number of nodes to the report.
2022-07-07 08:31:04 -07:00
brucez-anyscale
f76d7b23f2
Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00
Yi Cheng
12d147ff1f
Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent (#26107)" (#26333)
This reverts commit 84166ccb04.
2022-07-06 13:30:33 -07:00
brucez-anyscale
84166ccb04
[Dashboard][Serve] Move Serve related endpoints to dashboard agent (#26107)
In Ray 2.0, we want to achieve API server HA.
Originally, Serve endpoints were on the head node.
This PR moves Serve endpoints to dashboard agents, so they will be HA due to multiple replicas of the dashboard agent.
2022-07-06 10:58:00 -07:00
xwjiang2010
d0dfbe09e3
[tune] fix set_tune_experiment (#26298) 2022-07-05 15:04:51 -07:00
shrekris-anyscale
010a3566e6
[Serve] Allow and remove trailing slashes in Ray submission address (#26093) 2022-06-30 16:04:53 -07:00
Nikita Vemuri
8fc3409676
[dashboard] Add component_activities API (#25996)
Add /api/component_activities to the dashboard snapshot router which returns whether various Ray components are considered active
This currently only contains a response entry for drivers, but will add entries for other components on request as followups
2022-06-30 13:39:01 -07:00
shrekris-anyscale
6e800cc2df
[Serve] Disable test_serve_head.py on OSX (#26178)
`test_serve_head.py` has been very flaky recently on OSX, so this change disables it there.
2022-06-29 11:21:53 -07:00
SangBin Cho
8837a4593f
[State Observability] Truncate data when there are too many entries to return (#26124)
## Why are these changes needed?

This PR adds data truncation when there are more than N entries. The policy is as follows:

By default, we return at most 100 entries. Users can adjust this value, but we won't allow increasing it beyond 10K.

By default, all internal RPCs truncate data if it's > 10K.

For distributed sources, we query each source with a 10K limit and apply the limit again at the end.
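
The policy above can be sketched as follows. The constants mirror the numbers in the description. The function name `collect` and the choice to raise `ValueError` for an oversized limit (rather than clamp it) are assumptions made for illustration.

```python
from typing import Iterable, List

DEFAULT_LIMIT = 100   # entries returned by default
MAX_LIMIT = 10_000    # hard cap on the limit and on each internal query


def collect(sources: Iterable[List[dict]], limit: int = DEFAULT_LIMIT) -> List[dict]:
    """Merge entries from several sources under the truncation policy."""
    if limit > MAX_LIMIT:
        # Assumption: an oversized limit is rejected rather than clamped.
        raise ValueError(f"limit must not exceed {MAX_LIMIT}")
    merged: List[dict] = []
    for entries in sources:
        # Each source is queried with the 10K cap.
        merged.extend(entries[:MAX_LIMIT])
    # The user-facing limit is applied again at the end.
    return merged[:limit]


sources = [[{"i": i} for i in range(150)], [{"i": i} for i in range(20)]]
default_result = collect(sources)
```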

## Related issue number

Closes https://github.com/ray-project/ray/issues/25984#issue-1279280673
Part of https://github.com/ray-project/ray/issues/25718#issue-1268968400
2022-06-28 18:33:57 -07:00
SangBin Cho
def02bd4c9
Revert Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds" #26162 (#26163)
* Revert "Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds (#26080)" (#26162)"

This reverts commit 3017128d5e.
2022-06-28 16:07:32 -07:00
Stephanie Wang
3017128d5e
Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds (#26080)" (#26162)
This reverts commit 2d58bd5a50.
2022-06-28 10:04:58 -07:00
SangBin Cho
68336abf13
[State Observability] Support --detail flag. (#26071)
## Why are these changes needed?

This PR adds --detail flag to the list APIs.
2022-06-28 07:56:44 -07:00
SangBin Cho
2d58bd5a50
[Observability] Fix --follow lost connection when it is used for > 30 seconds (#26080)
## Why are these changes needed?

This PR fixes the issue where --follow lost connection when it is used for > 30 seconds because the gRPC timeout is configured to be 30 seconds, and we don't reset it when --follow is set.

This fixes the issue by setting timeout=None when keepalive==True

## Related issue number

Closes https://github.com/ray-project/ray/issues/25721
2022-06-28 05:48:25 -07:00
SangBin Cho
4b957e99b5
[State Observability] != predicate for filtering. (#26079)
## Why are these changes needed?

This PR implements `!=` predicate for filtering. As a result of this PR, two APIs are changed.

```
--filter key value -> --filter "key=val" or --filter "key!=val"

list_actors(filters=[(key, val), (key2, val2)]) -> list_actors(filters=[(key, "=", val), (key2, "=", val2)])
```
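
To illustrate the new `(key, predicate, value)` filter shape, here is a small self-contained sketch of parsing a `--filter` expression and applying it to rows. It is not the Ray implementation: `parse_filter`, `matches`, and `apply_filters` are hypothetical helpers, values are compared as strings, and rows missing the key are treated as non-matching by assumption.

```python
from typing import Any, Dict, List, Tuple

Filter = Tuple[str, str, Any]


def parse_filter(expr: str) -> Filter:
    """Split 'key=val' or 'key!=val' into (key, predicate, value)."""
    # Check "!=" first, because "=" is a substring of it.
    for op in ("!=", "="):
        if op in expr:
            key, value = expr.split(op, 1)
            return key, op, value
    raise ValueError(f"unsupported filter expression: {expr!r}")


def matches(row: Dict[str, Any], flt: Filter) -> bool:
    key, op, value = flt
    if key not in row:
        # Assumption: rows without the key never match.
        return False
    same = str(row[key]) == str(value)
    return same if op == "=" else not same


def apply_filters(rows: List[Dict[str, Any]], filters: List[Filter]) -> List[Dict[str, Any]]:
    """Keep only rows that satisfy every filter."""
    return [row for row in rows if all(matches(row, f) for f in filters)]


rows = [{"state": "ALIVE", "pid": 1}, {"state": "DEAD", "pid": 2}]
not_dead = apply_filters(rows, [parse_filter("state!=DEAD")])
```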
2022-06-28 05:42:19 -07:00