Commit graph

500 commits

Author SHA1 Message Date
Alan Guo
91cacd6214
Don't unfold first node in dashboard unless there is only one node in the cluster (#28108)
fixes #28107

Also moves the Host / Cmd Line column to be the first column so nodes and workers can be more easily distinguished.
2022-08-31 19:05:24 -07:00
Chen Shen
6be4bf8be3
[hotfix] Fix pytest dependency in test_utils (#27956)
Importing pytest in test_utils breaks a bunch of tests.
2022-08-17 12:16:08 -07:00
Nikita Vemuri
4692e8d802
[core] Don't override external dashboard URL in internal KV store (#27901)
Fix a 2.0.0 release-blocker bug where the Ray State API and Jobs are not accessible if the override URL doesn't support adding additional subpaths. This PR keeps the localhost dashboard URL in the internal KV store and only overrides it in values printed or returned to the user.
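
A minimal sketch of the idea in this message, assuming the override is passed in as a plain string: the internal address stays in the store, and the override only affects what is shown to the user. The function name, the `kv_store` dict, and the example URLs are hypothetical and not taken from the Ray codebase.

```python
from typing import Optional


def display_dashboard_url(internal_url: str, override_url: Optional[str]) -> str:
    """Return the URL to show to users; the stored internal URL is not modified."""
    return override_url or internal_url


# Hypothetical stand-in for the internal KV store.
kv_store = {"dashboard_url": "127.0.0.1:8265"}

# Internal callers keep reading kv_store; only user-facing output is overridden.
shown = display_dashboard_url(kv_store["dashboard_url"], "https://dashboard.example.com")
```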
2022-08-16 22:48:05 -07:00
Archit Kulkarni
058c239cf1
[runtime env] Test common failure scenarios (#25977)
Tests the following failure scenarios:
- Fail to upload data in `ray.init()` (`working_dir`, `py_modules`)
- Eager install fails in `ray.init()` for some other reason (bad `pip` package)
- Fail to download data from GCS (`working_dir`)

Improves the following error message cases:
- Return RuntimeEnvSetupError on failure to upload working_dir or py_modules
- Return RuntimeEnvSetupError on failure to download files from GCS during runtime env setup

Not covered in this PR:
- RPC to agent fails (This is extremely rare because the Raylet and agent are on the same node.)
- Agent is not started or dead (We don't need to worry about this because the Raylet fate shares with the agent.)

The approach is to use environment variables to induce failures in various places.  The alternative would be to refactor the packaging code to use dependency injection for the Internal KV client so that we can pass in a fake. I'm not sure how much of an improvement this would be.  I think we'd still have to set an environment variable to pass in the fake client, because these are essentially e2e tests of `ray.init()` and we don't have an API to pass it in.
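
The environment-variable approach described above might look roughly like the following sketch. The variable name, the exception class, and the `upload_package` helper are all hypothetical; they only illustrate how a test can induce a failure without dependency injection.

```python
import os

# Hypothetical variable name, used only for this illustration.
FAIL_UPLOAD_ENV_VAR = "EXAMPLE_FAIL_UPLOAD_FOR_TESTING"


class UploadError(RuntimeError):
    """Raised when a package upload fails (hypothetical error type)."""


def upload_package(data: bytes, store: dict, key: str) -> None:
    """Store data under key, unless a test asked for a simulated failure."""
    if os.environ.get(FAIL_UPLOAD_ENV_VAR) == "1":
        raise UploadError("Simulated upload failure")
    store[key] = data
```

A test would set the variable before calling the code under test and then assert on the error that surfaces to the user.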
2022-08-15 11:35:56 -05:00
Alan Guo
be92dd72d5
[Dashboard] Fix edge cases for log file names in the dashboard log viewer (#27772) 2022-08-12 09:39:54 -07:00
Alan Guo
c3a8ba0f8a
Add maximum number of characters in logs output for jobs status message (#27581)
We've seen the API server go down from trying to return 500 MB of log output.
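
As a rough illustration of such a cap, the sketch below keeps only the tail of a message. Keeping the tail rather than the head is an assumption, and `truncate_message` is a hypothetical helper, not the function used in the PR.

```python
def truncate_message(message: str, max_chars: int) -> str:
    """Return at most max_chars characters, keeping the end of the message."""
    if max_chars <= 0:
        return ""
    if len(message) <= max_chars:
        return message
    # Assumption: the most recent output (the tail) is the most useful part.
    return message[-max_chars:]
```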
2022-08-08 20:24:51 -07:00
Yi Cheng
dac7bf17d9
[serve] Make serve agent not blocking when GCS is down. (#27526)
This PR fixes several issues that block the Serve agent when GCS is down. The Serve agent must always stay alive so that external requests can still reach it and check the status.

- The internal KV used in the dashboard/agent blocks the agent. We use the async one instead.
- The Serve controller uses ray.nodes, which is a blocking call and can block forever. Change it to use the GCS client with a timeout.
- The agent uses the Serve controller client, which is a blocking call with max retries = -1. This blocks until the controller is back.

To enable Serve HA, we also need to set up:

- RAY_gcs_server_request_timeout_seconds=5
- RAY_SERVE_KV_TIMEOUT_S=5

which we should set in KubeRay.
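
The general pattern of bounding a call with a timeout so the agent stays responsive can be sketched with `asyncio.wait_for`. This is not the Serve agent's code; `get_with_timeout`, `slow_lookup`, and `fast_lookup` are hypothetical names used for illustration.

```python
import asyncio


async def get_with_timeout(coro, timeout_s: float, default=None):
    """Await coro, but give up and return default after timeout_s seconds."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return default


async def slow_lookup():
    # Stands in for a KV read that hangs while the backing store is down.
    await asyncio.sleep(10)
    return "value"


async def fast_lookup():
    # Stands in for a KV read that answers immediately.
    return "value"
```

With a timeout in place, the caller can report "unavailable" instead of blocking until the backing store returns.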
2022-08-08 16:29:42 -07:00
SangBin Cho
6084ee5a63
Revert "Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (…" (#27308)" (#27613)
This reverts commit ccf411604e.
2022-08-08 06:38:19 -07:00
Simon Mo
efee158cec
[Serve] Use Async Handle for DAG Execution (#27411) 2022-08-06 22:23:44 -07:00
Alan Guo
326b5bd1ac
Convert job_manager to be async (#27123)
Updates jobs api
Updates snapshot api
Updates state api

Increases jobs api version to 2

Signed-off-by: Alan Guo aguo@anyscale.com

Why are these changes needed?
follow-up for #25902 (comment)
2022-08-05 19:33:49 -07:00
Nikita Vemuri
a82af8602c
[core] Support external ray dashboard URL (#27396)
Signed-off-by: Nikita Vemuri nikitavemuri@gmail.com

Why are these changes needed?
Support printing a Ray dashboard URL that the user specifies through environment variable. This can be helpful if the Ray dashboard is hosted externally.
2022-08-05 19:33:10 -07:00
Alan Guo
05fca09f2d
Add query param to limit number of actors in api/snapshot (#27489)
Default the value to 1000 actors

Signed-off-by: Alan Guo aguo@anyscale.com

Why are these changes needed?
Reduces the latency of the api/snapshot, especially in cases where there is a ton of actors.
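
A sketch of how such a limit could be applied on the server side. The default of 1000 comes from the message above; the parameter name `actor_limit` and the helper itself are assumptions for illustration.

```python
from typing import Mapping, Sequence

DEFAULT_ACTOR_LIMIT = 1000  # default stated in the commit message


def limit_actors(
    actors: Sequence[dict],
    query: Mapping[str, str],
    param: str = "actor_limit",  # hypothetical query parameter name
) -> list:
    """Return at most the requested number of actors from the sequence."""
    limit = int(query.get(param, DEFAULT_ACTOR_LIMIT))
    if limit < 0:
        raise ValueError(f"{param} must be non-negative, got {limit}")
    return list(actors[:limit])
```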
2022-08-05 16:48:46 -07:00
Jialing He
ccf411604e
Revert "Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (…" (#27308) 2022-08-05 16:32:48 +08:00
Alan Guo
2cf9ecf48e
Make it so pydantic is required before we launch dashboard api server (#27345)
* Make it so pydantic is required before we launch dashboard api server

Signed-off-by: Alan Guo <aguo@anyscale.com>
2022-08-03 14:24:51 -07:00
Alan Guo
c083ca5871
Add GPU info to new dashboard (#27074)
Support a GPU column for the new dashboard

Have first node be default expanded

Signed-off-by: Alan Guo aguo@anyscale.com

fixes #13889

Addresses comment from #26996
2022-08-02 15:32:55 -07:00
Nikita Vemuri
9a0b9918e5
[dashboard] Add last_activity_at field to /api/component_activities (#27284)
Add optional last_activity_at field to /api/component_activities to record end time of most recently finished activity

Signed-off-by: Nikita Vemuri <nikitavemuri@gmail.com>
2022-08-02 11:02:15 -07:00
Ricky Xu
82a24f9319
[Doc][Core][State Observability] Adding Python SDK doc and docstring (#26997)
1. Add doc for python SDK and docstrings on public SDK
2. Rename list -> ray_list and get -> ray_get for better naming 
3. Fix some typos 
4. Auto translate address to api server url.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2022-08-02 11:24:59 -05:00
Alan Guo
729566d8ff
bump jobs version after making a backwards-incompatible change (#27281)
Backwards incompatible change was #25902

2.0.0 cherry-pick but not a rc0 blocker

Signed-off-by: Alan Guo <aguo@anyscale.com>
2022-07-30 00:11:29 -07:00
SangBin Cho
ec69fec1e0
Revert "[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (#26302)" (#27242)
This reverts commit 14dee5f6a3.
2022-07-30 00:08:23 -07:00
SangBin Cho
16aa102984
[Usage Stats] Record usage stats when dashboard disabled (#26042)
Since usage stats are recorded from the dashboard (which will become the API server), they are not collected when the dashboard is not included (include_dashboard=False).

This PR fixes the issue by:

- Changing dashboard -> API server (to avoid confusing users into thinking that the dashboard is still started when include_dashboard=False)
- Only loading modules that are irrelevant to the dashboard from the API server, so it has the same impact as no dashboard.
2022-07-28 23:01:49 -07:00
SangBin Cho
c1ac2bb80f
[Test] Try fixing a flaky gcs heartbeat manager test. (#27096)
Heartbeat manager starts its own thread to run its background task, and that task shares the same data structure used within HandleReportHeartbeat (heartbeats_). That said, both methods should run in the same thread. This achieves it by running HandleReportHeartbeat within the io_service thread.
2022-07-28 22:42:13 -07:00
Alan Guo
d25a3ff80a
[Dashboard] Fix node rows not being removed correctly when using filters (#27205) 2022-07-28 13:53:47 -07:00
shrekris-anyscale
510a0e038c
[Serve] Add host and port options to the Serve config file (#27026)
The Serve CLI and REST API always set the host to `0.0.0.0` and the port to Serve's default. This change adds `host` and `port` as top level options in the Serve config file, so users can manually set the host and port of their Serve application to different values.

This change introduces a new Serve config file format:

```yaml
import_path: ...

runtime_env: ...

host: ...

port: ...

deployments: ...
    ...
```

`host` and `port` are optional and can be omitted. A running Serve application's `host` and `port` cannot be changed. If a user tries to `serve deploy` a config file with different `host` and `port` options than an already-running Serve application, `serve deploy` will fail without making any changes to the application. The user must `serve shutdown` their application and restart it with `serve deploy` to change their `host` and `port`.

**Follow-Up Items**
* The following CLI commands should **not** start Serve automatically. They should check whether Serve is running and perform some sort of no-op if it's not. That would alleviate the concern that the user starts Serve by accident through a `GET` request and needs to deal with default `host`/`port` options. Corresponding docs should also be updated.
    * `serve status`
    * `serve config`
    * `serve shutdown`
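
The `host`/`port` rule described above (a deploy with different values fails without changing the application) can be sketched as a pre-deploy check. This is an illustration only: `check_host_port` is a hypothetical helper, and treating an omitted option as "not compared" is an assumption that the message does not confirm.

```python
from typing import Any, Mapping


def check_host_port(running: Mapping[str, Any], new_config: Mapping[str, Any]) -> None:
    """Raise ValueError if new_config asks for a different host or port."""
    for option in ("host", "port"):
        # Assumption: an omitted option is not compared against the running value.
        if option in new_config and new_config[option] != running[option]:
            raise ValueError(
                f"{option} cannot change from {running[option]!r} to "
                f"{new_config[option]!r}; shut down and redeploy instead."
            )
```

Running the check before applying anything keeps the failure free of side effects, which matches the "fail without making any changes" behavior.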
2022-07-28 11:26:46 -05:00
Alan Guo
c624d04842
Add back job_id to submit_job API to maintain backwards-compatibility (#27110)
Fix for an unintentional backwards-compatibility breakage from #25902.
The job submit API should still accept job_id as a parameter.

Signed-off-by: Alan Guo aguo@anyscale.com
2022-07-28 08:20:53 -07:00
Jialing He
14dee5f6a3
[Job Submission][refactor 1/N] Add AgentInfo to GCSNodeInfo (#26302)
This is the first PR of #25963 :
1. Moved the agent information from `internal KV` to `GCSNodeInfo`.
2. The raylet registers itself after the agent process has finished registering.

Motivation:
Storing agent information in `internal KV` and registering nodes in GCS (write node information to `GCSNodeInfo`) are two asynchronous operations, which will bring some complex timing problems, especially after `raylet` failover
2022-07-28 22:20:28 +08:00
Eric Liang
a4434fac7f
[docs] Fix the remaining style violations in docstrings and add lint rule (#27033) 2022-07-27 22:24:20 -07:00
shrekris-anyscale
4ab97399cd
[Serve] Only start Serve in the CLI through the serve deploy command (#27063)
These Serve CLI commands start Serve if it's not already running:

* `serve deploy`
* `serve config`
* `serve status`
* `serve shutdown`

#27026 introduces the ability to specify a `host` and `port` in the Serve config file. However, once Serve starts running, changing these options requires tearing down the entire Serve application and relaunching it. This limitation is an issue because users can inadvertently start Serve by running one of the `GET`-based CLI commands (i.e. `serve config` or `serve status`) before running `serve deploy`.

This change makes `serve deploy` the only CLI command that can start a Serve application on a Ray cluster. The other commands have updated behavior when Serve is not yet running on the cluster.

* `serve config`: prints an empty config body.

```yaml
import_path: ''
runtime_env: {}
deployments: []
```

* `serve status`: prints an empty status body, with a new `app_status` `status` value: `NOT_STARTED`.

```yaml
app_status:
  status: NOT_STARTED
  message: ''
  deployment_timestamp: 0
deployment_statuses: []
```

* `serve shutdown`: performs a no-op.
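
The `serve status` behavior above can be sketched as a small fallback: when nothing is running, return the empty body instead of starting Serve. The field names and values are copied from the YAML in this message; the `status_body` function is hypothetical.

```python
from typing import Optional


def status_body(running_status: Optional[dict]) -> dict:
    """Return the running status, or the empty NOT_STARTED body if Serve is down."""
    if running_status is None:
        return {
            "app_status": {
                "status": "NOT_STARTED",
                "message": "",
                "deployment_timestamp": 0,
            },
            "deployment_statuses": [],
        }
    return running_status
```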
2022-07-27 13:21:19 -05:00
Archit Kulkarni
60f33777a2
[runtime env] Add URI support for plugins (#26746) 2022-07-28 00:28:19 +08:00
Simon Mo
e5a8b1dd55
[Serve] Add API Annotations And Move to _private (#27058) 2022-07-27 09:08:26 -07:00
Archit Kulkarni
0e47fb4ed9
[Jobs] [runtime env] Allow RuntimeEnvConfig object in Job Submission (#26989) 2022-07-27 11:06:23 -05:00
Alan Guo
a7dca17973
Make New Dashboard the default dashboard (#26996)
Add UsageStats alert to new dashboard
Update wording of "back to legacy dashboard", "try new dashboard" buttons

Signed-off-by: Alan Guo aguo@anyscale.com
2022-07-27 07:04:34 -07:00
SangBin Cho
028684032b
[State Observability] Add warnings for data truncation + order columns as it is defined in StateSchema (#27018)
# Why are these changes needed?

This PR does 3 things:

- Add warnings for data truncation (which is a follow-up)
- Improve some confusing warning messages
- Order columns as they are defined in StateSchema (so that we can customize the column order for better usability). I did this only for list because I thought it wasn't that important for summary, but I might be wrong.
2022-07-27 06:56:30 -07:00
Alan Guo
5d6bc5360d
Fix the jobs tab in the beta dashboard and fill it with data from both "submission" jobs and "driver" jobs (#25902)
## Why are these changes needed?
- Fixes the jobs tab in the new dashboard. Previously it didn't load.
- Combines the old job concept, "driver jobs", and the new job submission concept into a single concept called "jobs". The Jobs tab shows information about both kinds of jobs.

- Updates all job APIs: They now return both submission jobs and driver jobs. They also contain additional data in the response, including "id", "job_id", "submission_id", and "driver". They also accept either job_id or submission_id as input.

- Job ID is the same as the "ray core job id" concept. It is in the form of "0100000" and is the primary id to represent jobs.
- Submission ID is an ID that is generated for each ray job submission. It is in the form of "raysubmit_12345...". It is a secondary id that can be used if a client needs to provide a self-generated id, or if the job id doesn't exist (ex: if the submission job doesn't create a ray driver).

This PR has 2 deprecations:
- The `submit_job` sdk now accepts a new kwarg `submission_id`. `job_id` is deprecated.
- The `ray job submit` CLI now accepts `--submission-id`. `--job-id` is deprecated.

**This PR has 4 backwards incompatible changes:**
- list_jobs sdk now returns a list instead of a dictionary
- the `ray job list` CLI now prints a list instead of a dictionary
- The `/api/jobs` endpoint returns a list instead of a dictionary
- The `POST api/jobs` endpoint (submit job) now returns a json with `submission_id` field instead of `job_id`.
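
For clients that must handle both response shapes during the transition, a normalizing helper along these lines may help. It is a sketch only: the old shape is assumed to be a mapping from an id to a job-info dict, and `normalize_jobs` is a hypothetical name that is not part of the Ray SDK.

```python
from typing import Any, Dict, List, Union

JobInfo = Dict[str, Any]


def normalize_jobs(response: Union[List[JobInfo], Dict[str, JobInfo]]) -> List[JobInfo]:
    """Return a list of job dicts for either the new (list) or old (dict) shape."""
    if isinstance(response, list):
        return response
    # Assumption: the old dictionary was keyed by the id given at submission.
    return [{"submission_id": key, **info} for key, info in response.items()]
```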
2022-07-27 02:39:52 -07:00
SangBin Cho
2ca11d61b3
[State Observability] Set the default detail formatting as yaml + quicker head node register (#26946)
## Why are these changes needed?

This PR does 2 things.

1. When `--detail` is specified, set the default formatting as yaml. 
2. It seems to take 5 seconds to register the head node with the API server (because it gets node info every 5 seconds, and when the API server has just started, the head node is not yet registered with GCS). This PR polls node info more frequently until the head node is registered with the API server.
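
The second point amounts to choosing the polling interval based on whether the head node has been seen yet. A minimal sketch, where the 5-second value comes from the message and the shorter interval and the function name are assumptions:

```python
def next_poll_interval(
    head_node_registered: bool,
    fast_s: float = 0.1,  # hypothetical short interval used during startup
    normal_s: float = 5.0,  # regular interval mentioned in the message
) -> float:
    """Poll quickly until the head node is registered, then back off."""
    return normal_s if head_node_registered else fast_s
```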

## Related issue number

Closes https://github.com/ray-project/ray/issues/26939
2022-07-26 13:49:30 -07:00
SangBin Cho
39b9c44c8d
[State Observability] pre-alpha documentation (#26560)
Adds:

- Documentation for state APIs
- API reference
2022-07-26 05:49:28 -07:00
Alan Guo
50b20809b8
[Dashboard] Stop caching logs in memory. Use state observability api to fetch on demand. (#26818)
Signed-off-by: Alan Guo <aguo@anyscale.com>

## Why are these changes needed?
Reduces memory footprint of the dashboard.
Also adds some cleanup to the errors data.

Also cleans up actor cache by removing dead actors from the cache.

Dashboard UI no longer allows you to see logs for all workers in a node. You must click into each worker's logs individually.

## Related issue number
fixes #23680 
fixes #22027
fixes #24272
2022-07-26 03:10:57 -07:00
Archit Kulkarni
084f06f49a
[Doc] [Job submission] [Dashboard] Add tip for long runtime_env installation and improve error (#26911)
# Why are these changes needed?
- The dashboard can display the message `<actor> cannot be created because the Ray cluster cannot satisfy its resource requirements` in the case where the runtime env setup is stalled. This PR updates this message to include the possibility of the runtime env setup failing.
- This PR adds a tip to the Job Submission doc saying that if a job is stalled in PENDING, the runtime env setup may have stalled. It adds a pointer to the log files which should have more information.
- The runtime env cannot stall forever; it fails after 10 minutes. This is a new feature added after the Ray 1.13 branch cut. In Ray <= 1.13, the runtime env can still stall forever.

# Related issue number
Closes #26332
2022-07-25 23:32:27 -07:00
Ricky Xu
259473c221
[Core][State Observability] Truncate warning message is incorrect when filter is used (#26801)
Signed-off-by: rickyyx rickyx@anyscale.com

# Why are these changes needed?
When we return fewer or incomplete results to users, there can be 3 reasons:

- Data being truncated at the data source (raylets -> API server)
- Data being filtered at the API server
- Data being limited at the API server

We were not distinguishing these 3 scenarios, but we should. This is why we reported data as truncated when it was actually filtered/limited.

This PR distinguishes these scenarios and prompts warnings accordingly.
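
One way to keep the three scenarios apart is to track the entry count after each stage and warn per stage. The sketch below assumes those counts are available; the function name, parameters, and message wording are hypothetical.

```python
from typing import List


def result_warnings(
    total: int, after_truncation: int, after_filter: int, returned: int
) -> List[str]:
    """Return one warning per stage that actually dropped entries."""
    warnings = []
    if after_truncation < total:
        warnings.append(
            f"{total - after_truncation} of {total} entries were truncated at the data source."
        )
    if after_filter < after_truncation:
        warnings.append(f"{after_truncation - after_filter} entries were removed by filters.")
    if returned < after_filter:
        warnings.append(f"{after_filter - returned} entries were dropped by the limit.")
    return warnings
```

With separate counts, a filtered or limited result no longer triggers a truncation warning.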

# Related issue number
Closes #26570
Closes #26923
2022-07-25 23:31:49 -07:00
Alan Guo
e8222ff600
[dashboard] Update cluster_activities endpoint to use pydantic. (#26609)
Update cluster_activities endpoint to use pydantic so we have better data validation.

Make timestamp a required field.
Add pydantic to ray[default] requirements
2022-07-25 10:54:22 -07:00
Guyang Song
bf97a6944b
[Dashboard] Actor Table UI Optimize (#26785)
Co-authored-by: 多牧 <xuzhi.mxz@antfin.com>
2022-07-25 18:49:48 +08:00
SangBin Cho
15b711ae6a
[State Observability] Warn if callsite is disabled when ray list objects + raise exception on missing output (#26880)
This PR does 3 things.
1. Warn if callsite is disabled when `ray list objects` and `ray summary objects`
2. Decode owner_id for ray list actors
3. Support raise_on_missing_output
2022-07-24 19:55:36 -07:00
SangBin Cho
37f4692aa8
[State Observability] Fix "No result for get crashing the formatting" and "Filtering not handled properly when key missing in the datum" #26881
Fixes two issues:

- No result for get crashing the formatting
- Filtering not handled properly when key missing in the datum
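
For the second issue, a filter predicate has to decide what a missing key means. The sketch below treats a missing key as a non-match, which is an assumption about the intended behavior; `matches` and `apply_filter` are hypothetical helpers.

```python
from typing import Any, Dict, List


def matches(datum: Dict[str, Any], key: str, value: str) -> bool:
    """Return True only if key is present and its string form equals value."""
    if key not in datum:
        # Assumption: a missing key is a non-match rather than an error.
        return False
    return str(datum[key]) == value


def apply_filter(data: List[Dict[str, Any]], key: str, value: str) -> List[Dict[str, Any]]:
    """Keep the entries whose key matches value."""
    return [datum for datum in data if matches(datum, key, value)]
```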
2022-07-23 21:33:07 -07:00
Stephanie Wang
55a0f7bb2d
[core] ray.init defaults to an existing Ray instance if there is one (#26678)
ray.init() will currently start a new Ray instance even if one is already existing, which is very confusing if you are a new user trying to go from local development to a cluster. This PR changes it so that, when no address is specified, we first try to find an existing Ray cluster that was created through `ray start`. If none is found, we will start a new one.

This makes two changes to the ray.init() resolution order:
1. When `ray start` is called, the started cluster address was already written to a file called `/tmp/ray/ray_current_cluster`. For ray.init() and ray.init(address="auto"), we will first check this local file for an existing cluster address. The file is deleted on `ray stop`. If the file is empty, we autodetect any running cluster (legacy behavior) if address="auto", or start a new local Ray instance if address=None.
2. When ray.init(address="local") is called, we will create a new local Ray instance, even if one is already existing. This behavior seems to be necessary mainly for `ray.client` use cases.
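
The resolution order in the two points above can be summarized as a small decision function. This is a sketch of the described behavior, not Ray's implementation: the file contents and the autodetection result are passed in as plain arguments, and raising `ConnectionError` when `address="auto"` finds nothing is an assumption.

```python
from typing import Optional


def resolve_address(
    address: Optional[str],
    cluster_file_address: Optional[str],
    autodetected: Optional[str],
) -> Optional[str]:
    """Return the address to connect to, or None to start a new local instance."""
    if address == "local":
        # Always start a new local instance, even if a cluster exists.
        return None
    if address is None or address == "auto":
        if cluster_file_address:
            return cluster_file_address
        if address == "auto":
            if autodetected is None:
                # Assumption: "auto" fails when no running cluster is found.
                raise ConnectionError("No running Ray cluster found.")
            return autodetected
        return None
    # An explicit address is used as given.
    return address
```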

This also surfaces the logs about which Ray instance we are connecting to. Previously these were hidden because we didn't set up the log until after connecting to Ray. So now Ray will log one of the following messages during ray.init:
```
(Connecting to existing Ray cluster at address: <IP>...)
...connection...
(Started a local Ray cluster.| Connected to Ray Cluster.)( View the dashboard at <URL>)
```

Note that this changes the dashboard URL to be printed with `ray.init()` instead of when the dashboard is first started.

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-07-23 11:27:22 -07:00
Jiajun Yao
3a48a79fd7
[Usage stats] Report total number of running jobs for usage stats purpose. (#26787)
- Report total number of running jobs
- Fix total number of nodes to include only alive nodes

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-07-21 01:37:58 -07:00
Ricky Xu
6ee37d4ad7
[Core][State Observability] Fix is_alive column with wrong column type that breaks filtering (#26739)
The is_alive column of WorkerState has the wrong column type, which breaks filtering on is_alive.
2022-07-20 16:38:15 -07:00
Matti Picus
b835cb944d
redo agent_pid -> agent_id (#25806)
Redo the agent-id changes from #24968. The original PR is in the first commit, the second commit fixes a fatal flaw when using RAY_BACKEND_LOG_LEVEL=debug, which caused the "Ray C++, Java" tests to fail on macOS.
2022-07-19 20:26:49 -07:00
Guyang Song
f96f5a1c18
[runtime env] plugin refactor [5/n]: support priority (#26659) 2022-07-20 10:07:06 +08:00
Jiajun Yao
2b37c32d43
Auto reconnect for gcs aio client (#26673)
#20299 adds auto reconnect for sync gcs client and this PR does the same thing for async gcs client.
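
A generic sketch of what auto reconnect means for an async client: retry the call after re-establishing the connection, up to a bounded number of attempts. `call_with_reconnect` and `FlakyClient` are hypothetical and do not reflect the actual GCS client API.

```python
import asyncio


async def call_with_reconnect(make_call, reconnect, max_attempts: int = 3, delay_s: float = 0.0):
    """Await make_call(); on ConnectionError, reconnect and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            await reconnect()
            await asyncio.sleep(delay_s)


class FlakyClient:
    """Test double that fails a fixed number of times before succeeding."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.reconnects = 0

    async def get(self) -> str:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("connection lost")
        return "ok"

    async def reconnect(self) -> None:
        self.reconnects += 1
```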
2022-07-19 13:11:09 -07:00
SangBin Cho
adf24bfa97
[State Observability] Use a table format by default (#26159)
NOTE: tabulate is copied/pasted to the codebase for table formatting.

This PR changes the default layout to be the table format for both summary and list APIs.
2022-07-19 00:54:16 -07:00
Riatre
591cd22be7
Revert "Revert "Bump pytest from 5.4.3 to 7.0.1"" (#26525)
* Revert "Revert "Bump pytest from 5.4.3 to 7.0.1""

This reverts commit ab10890e90.

Signed-off-by: Riatre Foo <foo@riat.re>

* Fix missing test data files dependency in rllib/BUILD

See # 26334 and # 26517 for context.

Once this is in, it should be good to roll-forward again.

Signed-off-by: Riatre Foo <foo@riat.re>

* debug: run all tests

Signed-off-by: Riatre Foo <foo@riat.re>

* Revert "debug: run all tests"

This reverts commit 0c5e796b0eb437d64922f66749c61b0412486970.

Signed-off-by: Riatre Foo <foo@riat.re>

* fix new tests since last rebase

Signed-off-by: Riatre Foo <foo@riat.re>
2022-07-18 21:21:19 -07:00