
hiro/ray

Mirror of https://github.com/vale981/ray, synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Chen Shen	6be4bf8be3	[hotfix] Fix pytest dependency in test_utils (#27956 ) import pytest in test_utils breaks a bunch of test.	2022-08-17 12:16:08 -07:00
Nikita Vemuri	4692e8d802	[core] Don't override external dashboard URL in internal KV store (#27901 ) Fix 2.0.0 release blocker bug where Ray State API and Jobs not accessible if the override URL doesn't support adding additional subpaths. This PR keeps the localhost dashboard URL in the internal KV store and only overrides in values printed or returned to the user.	2022-08-16 22:48:05 -07:00
Yi Cheng	dac7bf17d9	[serve] Make serve agent not blocking when GCS is down. (#27526 ) This PR fixed several issue which block serve agent when GCS is down. We need to make sure serve agent is always alive and can make sure the external requests can be sent to the agent and check the status. - internal kv used in dashboard/agent blocks the agent. We use the async one instead - serve controller use ray.nodes which is a blocking call and blocking forever. change to use gcs client with timeout - agent use serve controller client which is a blocking call with max retries = -1. This blocks until controller is back. To enable Serve HA, we also need to setup: - RAY_gcs_server_request_timeout_seconds=5 - RAY_SERVE_KV_TIMEOUT_S=5 which we should set in KubeRay.	2022-08-08 16:29:42 -07:00
Alan Guo	326b5bd1ac	Convert job_manager to be async (#27123 ) Updates jobs api Updates snapshot api Updates state api Increases jobs api version to 2 Signed-off-by: Alan Guo aguo@anyscale.com Why are these changes needed? follow-up for #25902 (comment)	2022-08-05 19:33:49 -07:00
Alan Guo	05fca09f2d	Add query param to limit number of actors in api/snapshot (#27489 ) Default the value to 1000 actors Signed-off-by: Alan Guo aguo@anyscale.com Why are these changes needed? Reduces the latency of the api/snapshot, especially in cases where there is a ton of actors.	2022-08-05 16:48:46 -07:00
Nikita Vemuri	9a0b9918e5	[dashboard] Add `last_activity_at` field to `/api/component_activities` (#27284 ) Add optional last_activity_at field to /api/component_activities to record end time of most recently finished activity Signed-off-by: Nikita Vemuri <nikitavemuri@gmail.com>	2022-08-02 11:02:15 -07:00
Simon Mo	e5a8b1dd55	[Serve] Add API Annotations And Move to _private (#27058 )	2022-07-27 09:08:26 -07:00
Alan Guo	e8222ff600	[dashboard] Update cluster_activities endpoint to use pydantic. (#26609 ) Update cluster_activities endpoint to use pydantic so we have better data validation. Make timestamp a required field. Add pydantic to ray[default] requirements	2022-07-25 10:54:22 -07:00
Nikita Vemuri	56716a1c1b	[dashboard] Add `RAY_CLUSTER_ACTIVITY_HOOK` to `/api/component_activities` (#26297 ) Add external hook to /api/component_activities endpoint in dashboard snapshot router Change is_active field of RayActivityResponse to take an enum RayActivityStatus instead of bool. This is a backward incompatible change, but should be ok because [dashboard] Add component_activities API #25996 wasn't included in any branch cuts. RayActivityResponse now supports informing when there was an error getting the activity observation and the reason.	2022-07-08 10:51:59 -07:00
shrekris-anyscale	010a3566e6	[Serve] Allow and remove trailing slashes in Ray submission address (#26093 )	2022-06-30 16:04:53 -07:00
Nikita Vemuri	8fc3409676	[dashboard] Add `component_activities` API (#25996 ) Add /api/component_activities to the dashboard snapshot router which returns whether various Ray components are considered active This currently only contains a response entry for drivers, but will add entries for other components on request as followups	2022-06-30 13:39:01 -07:00
Eric Liang	43aa2299e6	[api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695 ) Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.	2022-06-21 15:13:29 -07:00
shrekris-anyscale	3278763dd7	[Serve] Start all Serve actors in the `"serve"` namespace only (#25575 )	2022-06-13 10:31:28 -07:00
Archit Kulkarni	1b67e6a8ae	[Jobs] [Dashboard] Add job submission id as field to job snapshot (#24303 ) Closes https://github.com/ray-project/ray/issues/24300 Adds a field to the job submission snapshot that matches the job name in the existing snapshot. Before this PR, the job submission name was camelcased because all snapshot keys are automatically camelcased. This PR allows jobs from the old job field to be linked to ones in the new job submission snapshot.	2022-04-29 10:10:24 -05:00
Archit Kulkarni	27e7c284ee	[Jobs] Change jobs start_time end_time from seconds to ms for consistency (#24123 ) In the snapshot, all timestamps are given in ms except for Jobs: ``` wget -q -O - http://127.0.0.1:8265/api/snapshot { "result":true, "msg":"hello", "data":{ "snapshot":{ "jobs":{ "01000000":{ "status":null, "statusMessage":null, "isDead":false, "startTime":1650315791249, "endTime":0, "config":{ "namespace":"_ray_internal_dashboard", "metadata":{ }, "runtimeEnv":{ } } } }, "jobSubmission":{ "raysubmit9Bsej1Rtxqqetxup":{ "status":"SUCCEEDED", "message":"Job finished successfully.", "errorType":null, "startTime":1650315925, "endTime":1650315926, "metadata":{ "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4" }, "runtimeEnv":{ "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "entrypoint":"ls" }, "raysubmitEibragqkyg16Hpcj":{ "status":"SUCCEEDED", "message":"Job finished successfully.", "errorType":null, "startTime":1650316039, "endTime":1650316041, "metadata":{ "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4" }, "runtimeEnv":{ "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "entrypoint":"echo hi" }, "raysubmitSh1U7Grdsbqrf6Je":{ "status":"SUCCEEDED", "message":"Job finished successfully.", "errorType":null, "startTime":1650316354, "endTime":1650316355, "metadata":{ "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4" }, "runtimeEnv":{ "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "entrypoint":"echo hi" } }, "actors":{ "8c8e28e642ba2cfd0457d45e01000000":{ "jobId":"01000000", "state":"DEAD", "name":"_ray_internal_job_actor_raysubmit_9BSeJ1rTXQqEtXuP", "namespace":"_ray_internal_dashboard", "runtimeEnv":{ "uris":{ "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "startTime":1650315926620, "endTime":1650315927499, "isDetached":true, "resources":{ "node:172.31.73.39":0.001 }, "actorClass":"JobSupervisor", "currentWorkerId":"9628b5eb54e98353601413845fbca0a8c4e5379d1469ce95f3dfbace", 
"currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7", "ipAddress":"172.31.73.39", "port":10003, "metadata":{ } }, "a7fd8354567129910c44298401000000":{ "jobId":"01000000", "state":"DEAD", "name":"_ray_internal_job_actor_raysubmit_sh1u7grDsBQRf6je", "namespace":"_ray_internal_dashboard", "runtimeEnv":{ "uris":{ "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "startTime":1650316355718, "endTime":1650316356620, "isDetached":true, "resources":{ "node:172.31.73.39":0.001 }, "actorClass":"JobSupervisor", "currentWorkerId":"f07fd7a393898bf7d9027a5de0b0f566bb64ae80c0fcbcc107185505", "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7", "ipAddress":"172.31.73.39", "port":10005, "metadata":{ } }, "19ca9ad190f47bae963592d601000000":{ "jobId":"01000000", "state":"DEAD", "name":"_ray_internal_job_actor_raysubmit_eibRAGqKyG16HpCj", "namespace":"_ray_internal_dashboard", "runtimeEnv":{ "uris":{ "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "startTime":1650316041089, "endTime":1650316041978, "isDetached":true, "resources":{ "node:172.31.73.39":0.001 }, "actorClass":"JobSupervisor", "currentWorkerId":"50b8e7e9a6981fe0270afd7f6387bc93788356822c9a664c2988f5ba", "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7", "ipAddress":"172.31.73.39", "port":10004, "metadata":{ } } }, "deployments":{ }, "sessionName":"session_2022-04-18_13-49-44_814862_139", "rayVersion":"1.12.0", "rayCommit":"f18fc31c7562990955556899090f8e8656b48d2d" } } } ``` This PR fixes the inconsistency by changing Jobs start/end timestamps to ms.	2022-04-26 08:37:41 -07:00
Tao Wang	6aefe9b36e	[Core]Save task spec in separate table (#22650 ) This is a rebase version of #11592. As task spec info is only needed when gcs create or start an actor, so we can remove it from actor table and save the serialization time and memory/network cost when gcs clients get actor infos from gcs. As internal repository varies very much from the community. This pr just add some manual check with simple cherry pick. Welcome to comment first and at the meantime I'll see if there's any test case failed or some points were missed.	2022-04-12 12:24:26 -07:00
Archit Kulkarni	77090144a2	[jobs] Add `entrypoint` field to JobInfo (#23253 )	2022-03-16 22:02:22 -05:00
Jialing He	98a69cbd90	[runtime env][strong-typed API] Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime.RuntimeEnv` (#22522 ) Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime.RuntimeEnv`, details: #21495 - The `new RuntimeEnv` includes all external interfaces of `ParsedRuntimeEnv` and `old RuntimeEnv`. - The `new RuntimeEnv` will be exposed directly to the user. - example: ```python runtime_env = ray.runtime_env.RuntimeEnv(working_dir="s3://workding_dir.zip", pip=["requests"], java_jars=["s3://jar1.zip"], java_jvm_options=["-Dxxx=xxx"]) ```	2022-02-28 16:18:10 +08:00
shrekris-anyscale	8548affdc2	Increase `test_failed_job_status` timeout in `test_job_submission` (#22643 ) `test_job_submission` has become [flakey](https://flakey-tests.ray.io/) due to timeout. This change increases the timeout in `test_failed_job_status` from 10 to 25 seconds.	2022-02-25 10:08:55 -08:00
Edward Oakes	58e5f0140d	[jobs] Rename JobData -> JobInfo (#22499 ) `JobData` could be confused with the actual output data of a job, `JobInfo` makes it more clear that this is status information + metadata.	2022-02-22 16:18:16 -06:00
Archit Kulkarni	df581c584a	[Job] [Dashboard] Add Job Submission data to cluster snapshot (#22225 ) The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection). In the new Job Submission protocol, a Job just specifies an entrypoint which can be any shell command. As such a Job can have zero or multiple Ray drivers. This means we should add a new snapshot entry corresponding to new jobs. We'll leave the old snapshot in place for legacy jobs. - Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID. It wasn't working before. - This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot. For backwards compatibility, the `status` and `message` fields are preserved.	2022-02-18 09:54:37 -06:00
Archit Kulkarni	50e2bef9d0	[Jobs] Hide `dashboard` from Job Submission import path (#22223 ) For public SDK APIs, change the import path from ```python from ray.dashboard.modules.job.common import JobStatus, JobStatusInfo from ray.dashboard.modules.job.sdk import JobSubmissionClient ``` to ```python from ray.job_submission import JobStatus, JobSubmissionClient ``` `JobStatus`, `JobStatusInfo` and `JobSubmissionClient` were the only names referenced in the SDK doc so far, but we can add more later as they appear.	2022-02-09 13:55:32 -06:00
Balaji Veeramani	7f1bacc7dc	[CI] Format Python code with Black (#21975 ) See #21316 and #21311 for the motivation behind these changes.	2022-01-29 18:41:57 -08:00
SangBin Cho	e62c0052a0	[Dashboard] Agent in minimal ray installation (#21817 ) This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation. Note that this PR requires to introduce "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them to minimal makes thing very easy. Please see the below for the reasoning.	2022-01-26 04:03:54 -08:00
SangBin Cho	1ae14ec513	[Dashboard] Make dashboard / agent work in minimal ray installation 1/3. (#21774 ) This is the doc that explains how to achieve this: https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit?usp=sharing The fully working e2e prototype is here (it passes all tests): `cdad913883` This PR is pure refactoring. Basically it moves some of util functions that require optional_deps to `optional_utils` so that optional deps' util functions are not used in the minimal installation. Look below to see the steps.	2022-01-23 21:11:32 -08:00
mwtian	e8ce01c525	[Dashboard] offload blocking work to a thread pool (#21762 ) Currently, GCS KV client only has blocking API. Calling them from dashboard event loop can block other operations for many seconds, leading to failures such as taking too long (> 2min) to submit a job and making nightly tests fail (#21699). This PR offloads the blocking work to a separate thread. Implementing async GCS KV API will be done in the future.	2022-01-21 17:55:11 -08:00
mwtian	70db5c5592	[GCS][Bootstrap n/n] Do not start Redis in GCS bootstrapping mode (#21232 ) After this change in GCS bootstrapping mode, Redis no longer starts and `address` is treated as the GCS address of the Ray cluster. Co-authored-by: Yi Cheng <chengyidna@gmail.com> Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>	2022-01-04 23:06:44 -08:00
Yi Cheng	09421a4ca6	[2/gcs] Bootstrap dashboard for gcs ha (#21179 ) This is part of gcs ha project. This PR try to bootstrap dashboard with gcs address instead of redis. Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>	2021-12-21 16:58:03 -08:00
iasoon	1c93beb490	[serve] use true nulls in snapshot (#21062 )	2021-12-20 16:07:09 -08:00
iasoon	33059cff3d	[serve] support not exposing deployments over http (#21042 )	2021-12-13 09:43:55 -08:00
Guyang Song	53630ee03b	Revert "Revert "[runtime env] redefine runtime env to protobuf"" and fix windows compiling (#20692 ) - Fix windows compiling and revert https://github.com/ray-project/ray/pull/20641 - Seems the pr https://github.com/ray-project/ray/pull/20670 can solve the windows compiling issue.	2021-11-24 09:01:01 -08:00
Alex Wu	9388d28233	Revert "[runtime env] redefine runtime env to protobuf" (#20641 ) Reverts #19511 Breaks windows compilation	2021-11-22 13:11:30 -08:00
Guyang Song	ad56b9b432	[runtime env] redefine runtime env to protobuf (#19511 )	2021-11-20 16:54:42 +08:00
Edward Oakes	d26c9e67e8	[job submission] Add a `message` to the JobStatus to return more detailed errors (#20491 )	2021-11-18 10:15:23 -06:00
Yi Cheng	a4e187c0e7	[gcs] Update function table to use internal kv (#20152 ) ## Why are these changes needed? This is a part of redis removal. This PR remove redis kv in function table. rpush related code is not updated in this PR. ## Related issue number	2021-11-15 23:34:41 -08:00
Yi Cheng	e54d3117a4	[gcs] Update all redis kv usage in python except function table (#20014 ) ## Why are these changes needed? This is part of redis removal project. In this PR all direct usage of redis got removed except function table. Function table will be migrated in the next PR ## Related issue number #19443	2021-11-10 20:24:53 -08:00
Edward Oakes	81f036d078	[job submission] Move job_manager to dashboard module, common parts to common.py (#20209 )	2021-11-10 14:14:55 -08:00
Edward Oakes	b2ddea255d	[job submission] Add job submission ID + status to /api/snapshot (#19994 )	2021-11-03 09:49:28 -05:00
Guyang Song	ab55b808c5	[runtime env] move worker env to runtime env in Java (#19060 )	2021-10-11 17:25:09 +08:00
Edward Oakes	73b8936aa8	[runtime_env] Unify rpc::RuntimeEnv with serialized_runtime_env field (#18641 )	2021-09-28 15:13:15 -05:00
Edward Oakes	7736cdd91d	[dashboard] Rename "new_dashboard" -> "dashboard" (#18214 )	2021-09-15 11:17:15 -05:00
Tanmay Chordia	bf1176311f	[dashboard] add an endpoint to force kill an actor (#18508 )	2021-09-13 20:03:15 -07:00
Edward Oakes	17dded543c	Support passing gcs_client to internal_kv (#18235 )	2021-08-31 12:46:41 -05:00
Nikita Vemuri	a9c731edd3	[serve] Remove requirement to specify namespace for serve.start(detached=True) (#17470 )	2021-08-25 10:39:32 -05:00
architkulkarni	97dd13be09	[Serve] [dashboard] Fix formatting bugs in cluster snapshot (#17977 ) * show "unversioned" in actor metadata * hash deployment names * update test * replace "Unversioned" with "None" * bypass convert to camelCase for deployment names * fix convert_case default to match previous setting * lint * replace deployment_name_hash with underscore	2021-08-24 12:06:26 -07:00
architkulkarni	5ed3f0ce35	[Serve] [Dashboard] Add end times and DELETED state for endpoints (#17898 )	2021-08-19 11:10:42 -05:00
Clark Zinzow	d958457d07	[Core] Second pass at privatizing APIs. (#17885 ) * gcs_utils * resource_spec * profiling * ray_perf and ray_cluster_perf * test_utils	2021-08-18 20:56:33 -07:00
architkulkarni	fcac416933	[Serve] [Dashboard] Add start times and replica tags to cluster snapshot (#17749 )	2021-08-13 09:49:12 -07:00
architkulkarni	00f6b30684	[Serve] [Dashboard] Support nondetached and multiple Serve instances in cluster snapshot (#17747 )	2021-08-11 22:26:54 -05:00
Jiao	e38db5875b	Add serve external kv store (#17622 )	2021-08-11 12:06:14 -07:00
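
Note on commit `e8ce01c525` ([Dashboard] offload blocking work to a thread pool): the message describes moving blocking KV calls off the dashboard event loop and onto a separate thread. The following is a minimal sketch of that general pattern using only the Python standard library. It is not the code from the PR: `blocking_kv_get` and `kv_get_async` are hypothetical names, and the sleep stands in for a slow synchronous client call.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_kv_get(key: str) -> str:
    """Hypothetical stand-in for a blocking key-value client call."""
    time.sleep(0.05)  # simulate a slow, synchronous RPC
    return f"value-for-{key}"


async def kv_get_async(executor: ThreadPoolExecutor, key: str) -> str:
    """Run the blocking call on a worker thread so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, blocking_kv_get, key)


async def main() -> list:
    # The executor is shared by all lookups; the three calls overlap
    # instead of blocking the event loop one after another.
    with ThreadPoolExecutor(max_workers=4) as executor:
        return await asyncio.gather(
            *(kv_get_async(executor, key) for key in ("a", "b", "c"))
        )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

While a lookup is waiting on its worker thread, other coroutines on the same loop (for example, a job submission handler) can keep making progress, which is the behavior the commit message is after.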


58 commits
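
Note on commit `27e7c284ee` ([Jobs] Change jobs start_time end_time from seconds to ms for consistency): the snapshot in that message shows job submission timestamps in seconds (`1650315925`) next to actor timestamps in milliseconds (`1650315926620`). The sketch below illustrates the unit conversion involved. The helper names are hypothetical and are not taken from the Ray code base.

```python
import time


def seconds_to_ms(timestamp_s: float) -> int:
    """Convert a Unix timestamp in seconds to integer milliseconds."""
    return int(round(timestamp_s * 1000))


def now_ms() -> int:
    """Current Unix time in milliseconds, matching the other snapshot fields."""
    return seconds_to_ms(time.time())


if __name__ == "__main__":
    # Value taken from the "startTime" of a job submission entry in the commit message.
    print(seconds_to_ms(1650315925))
```

After conversion, the job submission `startTime` has the same 13-digit form as the actor timestamps in the same snapshot.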