Fix 2.0.0 release blocker bug where Ray State API and Jobs not accessible if the override URL doesn't support adding additional subpaths. This PR keeps the localhost dashboard URL in the internal KV store and only overrides in values printed or returned to the user.
images.githubusercontent.com/6900234/184809934-8d150874-90fe-4b45-a13d-bce1807047de.png">
Tests the following failure scenarios:
- Fail to upload data in `ray.init()` (`working_dir`, `py_modules`)
- Eager install fails in `ray.init()` for some other reason (bad `pip` package)
- Fail to download data from GCS (`working_dir`)
Improves the following error message cases:
- Return RuntimeEnvSetupError on failure to upload working_dir or py_modules
- Return RuntimeEnvSetupError on failure to download files from GCS during runtime env setup
Not covered in this PR:
- RPC to agent fails (This is extremely rare because the Raylet and agent are on the same node.)
- Agent is not started or dead (We don't need to worry about this because the Raylet fate shares with the agent.)
The approach is to use environment variables to induce failures in various places. The alternative would be to refactor the packaging code to use dependency injection for the Internal KV client so that we can pass in a fake. I'm not sure how much of an improvement this would be. I think we'd still have to set an environment variable to pass in the fake client, because these are essentially e2e tests of `ray.init()` and we don't have an API to pass it in.
This PR fixed several issue which block serve agent when GCS is down. We need to make sure serve agent is always alive and can make sure the external requests can be sent to the agent and check the status.
- internal kv used in dashboard/agent blocks the agent. We use the async one instead
- serve controller use ray.nodes which is a blocking call and blocking forever. change to use gcs client with timeout
- agent use serve controller client which is a blocking call with max retries = -1. This blocks until controller is back.
To enable Serve HA, we also need to setup:
- RAY_gcs_server_request_timeout_seconds=5
- RAY_SERVE_KV_TIMEOUT_S=5
which we should set in KubeRay.
Updates jobs api
Updates snapshot api
Updates state api
Increases jobs api version to 2
Signed-off-by: Alan Guo aguo@anyscale.com
Why are these changes needed?
follow-up for #25902 (comment)
Default the value to 1000 actors
Signed-off-by: Alan Guo aguo@anyscale.com
Why are these changes needed?
Reduces the latency of the api/snapshot, especially in cases where there is a ton of actors.
Add optional last_activity_at field to /api/component_activities to record end time of most recently finished activity
Signed-off-by: Nikita Vemuri <nikitavemuri@gmail.com>
Since usage stats are recorded from the dashboard (which will become API server), it is not collected when the dashboard is not included (include_dashboard=False).
This PR fixes the issues by
change dashboard -> API server (to avoid confusing users that dashboard is still started when include_dashboard=False)
Only load modules that are irrelevant to the dashboard from the API server, so it will have the same impact as no dashboard.
Heartbeat manager starts its own thread to run its background task and that shares the same data structured used within HandleReportHeartbeat (heartbeats_). That said, both methods should run in the same thread. This achieves it by running HandleReportHeartbeat within the io_service thread
The Serve CLI and REST API always sets the host to `0.0.0.0` and the port to Serve's default. This change adds `host` and `port` as top level options in the Serve config file, so users can manually set the host and port of their Serve application to different values.
This change introduces a new Serve config file format:
```yaml
import_path: ...
runtime_env: ...
host: ...
port: ...
deployments: ...
...
```
`host` and `port` are optional and can be omitted. A running Serve application's `host` and `port` cannot be changed. If a user tries to `serve deploy` a config file with different `host` and `port` options than an already-running Serve application, `serve deploy` will fail without making any changes to the application. The user must `serve shutdown` their application and restart it with `serve deploy` to change their `host` and `port`.
**Follow-Up Items**
* The following CLI commands should **not** start Serve automatically. They should check whether Serve is running and perform some sort of no-op if it's not. That would alleviate the concern that the user starts Serve by accident through a `GET` request and needs to deal with default `host`/`port` options. Corresponding docs should also be updated.
* `serve status`
* `serve config`
* `serve shutdown`
Fix for a unintentional backwards-compatibility breakage for #25902
job submit api should still accept job_id as a parameter
Signed-off-by: Alan Guo aguo@anyscale.com
This is the first PR of #25963 :
1. Moved the agent information from `internal KV to `GCSNodeInfo`,
2. raylet registers itself after the agent process finished register.
Motivation:
Storing agent information in `internal KV` and registering nodes in GCS (write node information to `GCSNodeInfo`) are two asynchronous operations, which will bring some complex timing problems, especially after `raylet` failover
These Serve CLI commands start Serve if it's not already running:
* `serve deploy`
* `serve config`
* `serve status`
* `serve shutdown`
#27026 introduces the ability to specify a `host` and `port` in the Serve config file. However, once Serve starts running, changing these options requires tearing down the entire Serve application and relaunching it. This limitation is an issue because users can inadvertently start Serve by running one of the `GET`-based CLI commands (i.e. `serve config` or `serve status`) before running `serve deploy`.
This change makes `serve deploy` the only CLI command that can start a Serve application on a Ray cluster. The other commands have updated behavior when Serve is not yet running on the cluster.
* `serve config`: prints an empty config body.
```yaml
import_path: ''
runtime_env: {}
deployments: []
```
* `serve status`: prints an empty status body, with a new `app_status` `status` value: `NOT_STARTED`.
```yaml
app_status:
status: NOT_STARTED
message: ''
deployment_timestamp: 0
deployment_statuses: []
```
* `serve shutdown`: performs a no-op.
# Why are these changes needed?
This PR does 3 things
Add warnings for data truncation (which is a follow-up)
Improve some of confusing warning messages
order columns as it is defined in StateSchema (so that we can customize the column order for better usability). I did this only for list because i thought it wasn't that important for summary, but I might be wrong
## Why are these changes needed?
- Fixes the jobs tab in the new dashboard. Previously it didn't load.
- Combines the old job concept, "driver jobs" and the new job submission conception into a single concept called "jobs". Jobs tab shows information about both jobs.
- Updates all job APIs: They now returns both submission jobs and driver jobs. They also contains additional data in the response including "id", "job_id", "submission_id", and "driver". They also accept either job_id or submission_id as input.
- Job ID is the same as the "ray core job id" concept. It is in the form of "0100000" and is the primary id to represent jobs.
- Submission ID is an ID that is generated for each ray job submission. It is in the form of "raysubmit_12345...". It is a secondary id that can be used if a client needs to provide a self-generated id. or if the job id doesn't exist (ex: if the submission job doesn't create a ray driver)
This PR has 2 deprecations
- The `submit_job` sdk now accepts a new kwarg `submission_id`. `job_id is deprecated.
- The `ray job submit` CLI now accepts `--submission-id`. `--job-id` is deprecated.
**This PR has 4 backwards incompatible changes:**
- list_jobs sdk now returns a list instead of a dictionary
- the `ray job list` CLI now prints a list instead of a dictionary
- The `/api/jobs` endpoint returns a list instead of a dictionary
- The `POST api/jobs` endpoint (submit job) now returns a json with `submission_id` field instead of `job_id`.
## Why are these changes needed?
This PR does 2 things.
1. When `--detail` is specified, set the default formatting as yaml.
2. It seems like it takes 5 seconds to register the head node to the API server (because it gets node info every 5 second, and when the API server just starts, the head node is not registered to GCS). It decreases the node ping frequency until the head node is registered to API server.
## Related issue number
Closes https://github.com/ray-project/ray/issues/26939
Signed-off-by: Alan Guo <aguo@anyscale.com>
## Why are these changes needed?
Reduces memory footprint of the dashboard.
Also adds some cleanup to the errors data.
Also cleans up actor cache by removing dead actors from the cache.
Dashboard UI no longer allows you to see logs for all workers in a node. You must click into each worker's logs individually.
<img width="1739" alt="Screen Shot 2022-07-20 at 9 13 00 PM" src="https://user-images.githubusercontent.com/711935/180128633-1633c187-39c9-493e-b694-009fbb27f73b.png">
## Related issue number
fixes#23680fixes#22027fixes#24272
Signed-off-by: rickyyx rickyx@anyscale.com
# Why are these changes needed?
When we returned less/incomplete results to users, there could be 3 reasons:
Data being truncated at the data source (raylets -> API server)
Data being filtered at the API server
Data being limited at the API server
We are not distinguishing the those 3 scenarios, but we should. This is why we thought data being truncated when it's actually filtered/limited.
This PR distinguishes these scenarios and prompt warnings accordingly.
# Related issue number
Closes#26570Closes#26923
Update cluster_activities endpoint to use pydantic so we have better data validation.
Make timestamp a required field.
Add pydantic to ray[default] requirements
ray.init() will currently start a new Ray instance even if one is already existing, which is very confusing if you are a new user trying to go from local development to a cluster. This PR changes it so that, when no address is specified, we first try to find an existing Ray cluster that was created through `ray start`. If none is found, we will start a new one.
This makes two changes to the ray.init() resolution order:
1. When `ray start` is called, the started cluster address was already written to a file called `/tmp/ray/ray_current_cluster`. For ray.init() and ray.init(address="auto"), we will first check this local file for an existing cluster address. The file is deleted on `ray stop`. If the file is empty, autodetect any running cluster (legacy behavior) if address="auto", or we will start a new local Ray instance if address=None.
2. When ray.init(address="local") is called, we will create a new local Ray instance, even if one is already existing. This behavior seems to be necessary mainly for `ray.client` use cases.
This also surfaces the logs about which Ray instance we are connecting to. Previously these were hidden because we didn't set up the log until after connecting to Ray. So now Ray will log one of the following messages during ray.init:
```
(Connecting to existing Ray cluster at address: <IP>...)
...connection...
(Started a local Ray cluster.| Connected to Ray Cluster.)( View the dashboard at <URL>)
```
Note that this changes the dashboard URL to be printed with `ray.init()` instead of when the dashboard is first started.
Co-authored-by: Eric Liang <ekhliang@gmail.com>
This PR is doing 2 things.
(1) Use api_server_url to address which is consistent to other submission APIs.
(2) When the API is not responded timely, it prints a warning every 5 seconds. Below is an example. This is useful when the API is slowly responded (e.g., when there are partial failures). Without this users will see hanging API for 30 seconds, which is a pretty bad UX.
(0.12 / 10 seconds) Waiting for the response from the API server address http://127.0.0.1:8265/api/v0/delay/5.
This is to limit the max number of HTTP requests the dashboard (API server) will accept before rejecting more requests.
This will make sure the observability requests do not overload the downstream systems (raylet/gcs) when delegating too many concurrent state observability requests to the cluster.
## Why are these changes needed?
As in this https://github.com/ray-project/ray/pull/26405 we added the health check for gcs and raylets.
This PR expose them in the endpoint in dashboard and dashboard agent.
For dashboard, we added `http://host:port/api/gcs_healthz` and it'll send RPC to GCS directly to see whether the GCS is alive or not.
For agent, we added `http://host:port/api/local_raylet_healthz` and it'll send RPC to GCS to check whether raylet is alive or not.
We think raylet is live if
- GCS is dead
- GCS is alive but GCS think the raylet is dead
If GCS is dead for more than X seconds (60 by default), raylet will just crash itself, so KubeRay can still catch it.
Add external hook to /api/component_activities endpoint in dashboard snapshot router
Change is_active field of RayActivityResponse to take an enum RayActivityStatus instead of bool. This is a backward incompatible change, but should be ok because [dashboard] Add component_activities API #25996 wasn't included in any branch cuts. RayActivityResponse now supports informing when there was an error getting the activity observation and the reason.
In Ray 2.0, we want to achieve api server HA.
Originally serve endpoints are in head node.
This pr moves serve endpoints to dashboard agents, so they will be HA due to multiple replica of dashboard agent.
Add /api/component_activities to the dashboard snapshot router which returns whether various Ray components are considered active
This currently only contains a response entry for drivers, but will add entries for other components on request as followups
## Why are these changes needed?
This PR adds data truncation when there are more than N number of entries. The policy is as follow;
By default, we return 100 entries at max. Users can adjust this value, but we won't allow to increase more than 10K.
By default, all internal RPCs truncate data if it's > 10K.
For distributed sources, we query each source with 10K limit and we apply limit again at the end.
## Related issue number
Closes https://github.com/ray-project/ray/issues/25984#issue-1279280673
Part of https://github.com/ray-project/ray/issues/25718#issue-1268968400