hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Dmitri Gekhtman	1055eadde0	[Dashboard] Fix dashboard RAM and CPU with cgroups2 (#25710 ) Closes #25283. The dashboard shows inaccurate memory and cpu data when run inside of a docker container, in particular when using cgroups v2. This PR fixes that.	2022-06-26 14:01:26 -07:00
SangBin Cho	6552e096e6	[State Observability] Summary APIs (#25672 ) Task/actor/object summary Tasks: Group by the func name. In the future, we will also allow to group by task_group. Actors: Group by actor class name. In the future, we will also allow to group by actor_group. Object: Group by callsite. In the future, we will allow to group by reference type or task state.	2022-06-22 06:21:50 -07:00
Eric Liang	43aa2299e6	[api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695 ) Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.	2022-06-21 15:13:29 -07:00
Archit Kulkarni	565e366529	[runtime env] Use async internal kv in package download and plugins (#25788 ) Uses the async KV API for downloading in the runtime env agent. This avoids the complexity of running the runtime env creation functions in a separate thread. Some functions are still sync, including the working_dir/py_modules upload, installing wheels, and possibly others.	2022-06-21 15:02:36 -07:00
shrekris-anyscale	3d6a5450c9	[Serve] Stop Ray in test_serve_head.py fixture (#25893 )	2022-06-21 11:28:07 -07:00
shrekris-anyscale	ad12f0cd02	[Serve] Deprecate outdated REST API settings (#25932 )	2022-06-21 11:06:45 -07:00
SangBin Cho	411b1d8d2d	[State Observability] Return list instead of dict (#25888 ) I’d like to propose a bit changes to the API. Currently we are returning the dict of ID -> value mapping when the list API is returned. But I am thinking to change this to a list because the sort will become ineffective if we return the dictionary. So, it’s ideal we use the list to keep the order (it’s important for deterministic order) Also, for some APIs, each entry doesn’t have a unique id. For example, list objects will have duplicated object IDs from their entries, which is not working with dict return type (e.g., there can be more than 1 Object ID entry if the object is locally referenced & borrowed by task/pinned in memory) Also, users can easily build dict index on their own if it is necessary.	2022-06-20 22:49:29 -07:00
Archit Kulkarni	85be093a84	[runtime env] Make all plugins return a `List` of URIs (#25825 ) Followup from #24622. This is another step towards pluggability for runtime_env. Previously some plugin classes had `get_uri` which returned a single URI, while others had `get_uris` which returned a list. This PR makes all plugins use `get_uris`, which simplifies the code overall. Most of the lines in the diff just come from the new `format.sh` which sorts the imports.	2022-06-17 14:13:44 -05:00
Archit Kulkarni	23030dbcaa	[runtime env] Hide URI cache behind class (#24622 ) Followup PR to https://github.com/ray-project/ray/pull/20273. - Hides cache logic behind a class. - Adds "name" field to runtime env plugin class and makes existing conda, pip, working_dir, and py_modules inherit from the plugin class. Future work will unify the codepath for these "base plugins" with the codepath for third-party plugins; currently these are different, and URI support is missing for third-party plugins.	2022-06-15 16:14:06 -05:00
shrekris-anyscale	a371756b3c	[Serve] Update Serve CLI and REST API behavior to use new config (#25691 )	2022-06-14 19:01:51 -07:00
Ricky Xu	b1d0b12b4e	[Core \ State Observability] Use Submission client (#25557 ) ## Why are these changes needed? This is to refactor the interaction of state cli to API server from a hard-coded request workflow to `SubmissionClient` based. See #24956 for more details. ## Summary <!-- Please give a short summary of the change and the problem this solves. --> - Created a `StateApiClient` that inherits from the `SubmissionClient` and refactor various listing commands into class methods. ## Related issue number Closes #24956 Closes #25578	2022-06-13 17:11:19 -07:00
shrekris-anyscale	3278763dd7	[Serve] Start all Serve actors in the `"serve"` namespace only (#25575 )	2022-06-13 10:31:28 -07:00
SangBin Cho	856bea31fb	[State Observability] Ray log CLI / API (#25481 ) This PR implements the basic log APIs. For the better APIs (like higher level APIs like ray logs actors), it will be implemented after the internal API review is done. # If there's only 1 match, print a file content. Otherwise, print all files that match glob. ray logs [glob_filter] --node-id=[head node by default] Args: --tail: Tail the last X lines --follow: Follow the new logs --actor-id: The actor id --pid --node-ip: For worker logs --node-id: The node id of the log --interval: When --follow is specified, logs are printed with this interval. (should we remove it?)	2022-06-13 05:52:57 -07:00
mwtian	65d7a610ab	[Core] Push message to driver when a Raylet dies (#25516 ) Currently when Raylets die, it is hard to figure out: if a Raylet died at all in a cluster. Usually we have to check on nodes where a number of workers died and see if the Raylet has died as well. reason of Raylet's death. With this PR, if a Raylet dies from a reason other than SIGTERM, the dashboard agent will report the failure along with last 20 lines of the Raylet log.	2022-06-09 05:54:34 -07:00
shrekris-anyscale	f3c2bd6718	[Serve] Make REST API deployments inherit top-level runtime_env (#25502 )	2022-06-08 15:58:00 -07:00
Archit Kulkarni	6d2806f951	[Jobs] [Test] Add integration tests to cover runtime_env inheritance with working_dir and with Tune (#25562 ) The current inheritance behavior for runtime_envs enables the following workflow for Jobs: A working_dir can be set in the Jobs API, and then inside the driver script, if a new per-task runtime_env is defined, it will automatically inherit the driver's working_dir. There is an ongoing discussion about the best approach for runtime_env inheritance going forward: https://github.com/ray-project/ray/issues/25484, in which we noted that there were no tests covering this behavior. This PR adds integration tests for the above behavior. If we ultimately decide to abandon the current inheritance behavior and instead have child runtime envs completely overwrite the parent runtime env, this test will fail, reminding us to do the following: - Update the internal runtime_env usage in Ray Tune to use the `ray.get_runtime_context().runtime_env.update` API - Update the documentation for Ray Jobs telling users to use `ray.get_runtime_context().runtime_env.update` and update this test	2022-06-08 13:54:06 -07:00
mwtian	1ce0ab7b7c	[Core] Export additional metrics for workers and Raylet memory (#25418 ) Add visibility into the following to help Ray users and developers debug performance and OOM issues: Raylet memory usage broken down by USS vs remaining RSS. Total workers' count, CPU percentage usage, and memory usage.	2022-06-06 10:58:14 -07:00
SangBin Cho	00e3fd75f3	[State Observability] Ray log alpha API (#24964 ) This is the PR to implement ray log to the server side. The PR is continued from #24068. The PR supports two endpoints; /api/v0/logs # list logs of the node id filtered by the given glob. /api/v0/logs/{[file \| stream]}?filename&pid&actor_id&task_id&interval&lines # Stream the requested file log. The filename can be inferred by pid/actor_id/task_id Some tests need to be re-written, I will do it soon. As a follow-up after this PR, there will be 2 PRs. PR to add actual CLI PR to remove in-memory cached logs and do on-demand query for actor/worker logs	2022-06-04 05:10:23 -07:00
SangBin Cho	54496d7705	[State Observability API] Support Filtering (#25281 ) This PR adds a filtering support. The filtering is done from the API server side (not from the source side). Source side filtering is a bit complicated to write an elegant solution, and we will handle it in the future (no optimization for alpha APIs). We will also support limited types of columns for each API. The API is as follows ray list [resources] -- filter [key] [value] => filter data that's key==value. In the future, we can also support more complicated filtering like !=, And, Or , or etc.	2022-06-03 17:17:30 -07:00
shrekris-anyscale	16bdfe6a39	Restore "[Serve] Deploy Serve deployment graphs via REST API" (#25073 ) (#25333 )	2022-06-02 11:06:53 -07:00
Eric Liang	905258dbc1	Clean up docstyle in python modules and add LINT rule (#25272 )	2022-06-01 11:27:54 -07:00
Eric Liang	517f78e2b8	[minor] Add a job submission hook by env var (#25343 )	2022-06-01 11:15:43 -07:00
shrekris-anyscale	7754645c83	Revert "[Serve] Deploy Serve deployment graphs via REST API (#25073 )" (#25330 ) This reverts commit `47709b3300`.	2022-05-31 15:37:55 -07:00
shrekris-anyscale	47709b3300	[Serve] Deploy Serve deployment graphs via REST API (#25073 )	2022-05-31 10:57:08 -07:00
Philipp Moritz	323605d169	Support file:// for runtime_env working directories in jobs (#25062 ) This makes it possible to use an NFS file system that is shared on a cluster for runtime_env working directories. Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com> Co-authored-by: Eric Liang <ekhliang@gmail.com>	2022-05-24 16:17:18 -07:00
shrekris-anyscale	8b3451318c	[Serve] Update Serve status formatting and processing (#24839 )	2022-05-24 11:07:41 -07:00
Edward Oakes	65d21b7ae6	[job submission] Handle `env_vars: None` case properly in supervisor runtime_env logic (#25087 )	2022-05-24 11:01:19 -05:00
SangBin Cho	a7e759317b	[State Observability API] Error handling (#24413 ) This improves error handling per https://docs.google.com/document/d/1IeEsJOiurg-zctOcBjY-tQVbsCmURFSnUCTkx_4a7Cw/edit#heading=h.pdzl9cil9e8z (the RPC part). Semantics If all queries to the source failed, raise a RayStateApiException. If partial queries are failed, warnings.warn the partial failure when print_api_stats=True. It is true for CLI. It is false when it is used within Python API or json / yaml format is required.	2022-05-24 03:56:49 -07:00
Archit Kulkarni	a67c8a0739	[runtime_env] Add temporary URI reference to prevent URI deletion before job starts (#24719 ) Packages are uploaded to the GCS for `runtime_env`. These packages are garbage collected when their refcount becomes zero. The problem is the reference doesn't get incremented until the job starts, which happens after the package is uploaded. It's possible for the package's refcount to go to zero in between the upload and when the job starts, causing the package to be deleted before it's needed by the job. It's likely the cause of https://github.com/ray-project/ray/issues/23423. We can't just increment the refcount at the time of upload, because if the script is killed before the job is started (e.g. via Ctrl-C) then the reference will never be decremented and the package will never be deleted. The solution in this PR is to increment the refcount at the time of upload, but automatically decrement after a configurable timeout (default 30s). This should be enough time for the job to start. When the job starts, it increments the refcount as usual and decrements it when the job finishes or is killed. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-05-23 10:25:04 -05:00
Edward Oakes	cb7bcbd651	[job submission] Fix address defaulting behavior (#24970 ) Per the discussion in https://github.com/ray-project/ray/issues/24858: - If an address without a port is provided, don't append a port. - Default to `http://localhost:8265` if nothing is provided.	2022-05-20 14:10:36 -05:00
Jiajun Yao	628f886af4	Don't show usage stats prompt in dashboard if prompt is disabled (#24700 )	2022-05-12 07:55:28 -07:00
Qing Wang	259661042c	[runtime env] [java] Support jars in runtime env for Java (#24170 ) This PR supports setting the jars for an actor in Ray API. The API looks like: ```java class A { public boolean findClass(String className) { try { Class.forName(className); } catch (ClassNotFoundException e) { return false; } return true; } } RuntimeEnv runtimeEnv = new RuntimeEnv.Builder() .addJars(ImmutableList.of("https://github.com/ray-project/test_packages/raw/main/raw_resources/java-1.0-SNAPSHOT.jar")) .build(); ActorHandle<A> actor1 = Ray.actor(A::new).setRuntimeEnv(runtimeEnv).remote(); boolean ret = actor1.task(A::findClass, "io.testpackages.Foo").remote().get(); System.out.println(ret); // true ```	2022-05-12 09:34:40 +08:00
Edward Oakes	4c1f27118a	[job submission] Don't set CUDA_VISIBLE_DEVICES in job driver (#24546 ) Currently job drivers cannot use GPUs due to `CUDA_VISIBLE_DEVICES` being set (no resource request for job driver's supervisor actor). This is a regression from `ray submit`. This is a temporary workaround -- in the future we should support a resource request for the job supervisor actor.	2022-05-10 11:43:04 -05:00
Kai Yang	4a999777fa	[Core] Allow accepting gRPC HTTP proxy via env variable (#23526 )	2022-05-10 11:30:46 +08:00
Dmitri Gekhtman	6d09244a7e	[Dashboard][K8s] Add toggle to enable showing node disk usage on K8s (#24416 ) https://github.com/ray-project/ray/pull/14676 disabled the disk usage/total display for Ray nodes on K8s, because Ray nodes on K8s are run as pods, which in general do not use up the entire machine. However, in some situations, it is useful to run one Ray pod per K8s node and report the disk usage. This PR adds a flag to enable displaying disk usage in those situations.	2022-05-03 10:58:05 -05:00
SangBin Cho	2bce07d4ce	[State API] List runtime env API (#24126 ) This PR supports list runtime env API	2022-05-02 14:01:00 -07:00
Sihan Wang	59debac670	[Serve] Move deployment clean up under serve.run() api (#24306 ) On the ServeHead level, it is talking to serve api and controller to do deployment and clean up now. With this pr, it hides the deployment clean up logic into server.run() for code cleanness and easy to refactor in the future.	2022-05-02 12:10:11 -05:00
SangBin Cho	6f192b6e17	[Metrics] Allow to completely disable metrics collection (#24333 ) This PR allows for Ray to disable metrics collection. It was possible with RAY_enable_metrics_collection, but it didn't fully disable collection because there was a metrics collection happening from agent that wasn't properly disabled. This PR also adds tests.	2022-05-02 05:33:03 -07:00
Philipp Moritz	27917f570d	[runtime_env] Extend runtime_env hook to also cover jobs (#24328 ) This extends https://github.com/ray-project/ray/pull/24036 to also cover job submission. Co-authored-by: Eric Liang <ekhliang@gmail.com>	2022-04-30 09:15:51 -07:00
Archit Kulkarni	1b67e6a8ae	[Jobs] [Dashboard] Add job submission id as field to job snapshot (#24303 ) Closes https://github.com/ray-project/ray/issues/24300 Adds a field to the job submission snapshot that matches the job name in the existing snapshot. Before this PR, the job submission name was camelcased because all snapshot keys are automatically camelcased. This PR allows jobs from the old job field to be linked to ones in the new job submission snapshot.	2022-04-29 10:10:24 -05:00
Jiajun Yao	8fdde12e9e	Delay 1 minutes for the first usage stats report (#24291 ) Delay the first report for 1 minutes so the system is probably set up and we can get the information to report.	2022-04-28 22:53:33 -07:00
Archit Kulkarni	cc864401fb	[Dashboard] Add environment variable flag to skip dashboard log processing (#24263 )	2022-04-27 15:33:08 -07:00
Archit Kulkarni	12b9383d52	[Jobs] Reenable `test_backwards_compatibility` using Ray 1.12 (#24124 ) Closes https://github.com/ray-project/ray/issues/23258	2022-04-26 13:53:51 -05:00
Archit Kulkarni	27e7c284ee	[Jobs] Change jobs start_time end_time from seconds to ms for consistency (#24123 ) In the snapshot, all timestamps are given in ms except for Jobs: ``` wget -q -O - http://127.0.0.1:8265/api/snapshot { "result":true, "msg":"hello", "data":{ "snapshot":{ "jobs":{ "01000000":{ "status":null, "statusMessage":null, "isDead":false, "startTime":1650315791249, "endTime":0, "config":{ "namespace":"_ray_internal_dashboard", "metadata":{ }, "runtimeEnv":{ } } } }, "jobSubmission":{ "raysubmit9Bsej1Rtxqqetxup":{ "status":"SUCCEEDED", "message":"Job finished successfully.", "errorType":null, "startTime":1650315925, "endTime":1650315926, "metadata":{ "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4" }, "runtimeEnv":{ "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "entrypoint":"ls" }, "raysubmitEibragqkyg16Hpcj":{ "status":"SUCCEEDED", "message":"Job finished successfully.", "errorType":null, "startTime":1650316039, "endTime":1650316041, "metadata":{ "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4" }, "runtimeEnv":{ "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "entrypoint":"echo hi" }, "raysubmitSh1U7Grdsbqrf6Je":{ "status":"SUCCEEDED", "message":"Job finished successfully.", "errorType":null, "startTime":1650316354, "endTime":1650316355, "metadata":{ "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4" }, "runtimeEnv":{ "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "entrypoint":"echo hi" } }, "actors":{ "8c8e28e642ba2cfd0457d45e01000000":{ "jobId":"01000000", "state":"DEAD", "name":"_ray_internal_job_actor_raysubmit_9BSeJ1rTXQqEtXuP", "namespace":"_ray_internal_dashboard", "runtimeEnv":{ "uris":{ "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "startTime":1650315926620, "endTime":1650315927499, "isDetached":true, "resources":{ "node:172.31.73.39":0.001 }, "actorClass":"JobSupervisor", "currentWorkerId":"9628b5eb54e98353601413845fbca0a8c4e5379d1469ce95f3dfbace", "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7", "ipAddress":"172.31.73.39", "port":10003, "metadata":{ } }, "a7fd8354567129910c44298401000000":{ "jobId":"01000000", "state":"DEAD", "name":"_ray_internal_job_actor_raysubmit_sh1u7grDsBQRf6je", "namespace":"_ray_internal_dashboard", "runtimeEnv":{ "uris":{ "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "startTime":1650316355718, "endTime":1650316356620, "isDetached":true, "resources":{ "node:172.31.73.39":0.001 }, "actorClass":"JobSupervisor", "currentWorkerId":"f07fd7a393898bf7d9027a5de0b0f566bb64ae80c0fcbcc107185505", "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7", "ipAddress":"172.31.73.39", "port":10005, "metadata":{ } }, "19ca9ad190f47bae963592d601000000":{ "jobId":"01000000", "state":"DEAD", "name":"_ray_internal_job_actor_raysubmit_eibRAGqKyG16HpCj", "namespace":"_ray_internal_dashboard", "runtimeEnv":{ "uris":{ "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "startTime":1650316041089, "endTime":1650316041978, "isDetached":true, "resources":{ "node:172.31.73.39":0.001 }, "actorClass":"JobSupervisor", "currentWorkerId":"50b8e7e9a6981fe0270afd7f6387bc93788356822c9a664c2988f5ba", "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7", "ipAddress":"172.31.73.39", "port":10004, "metadata":{ } } }, "deployments":{ }, "sessionName":"session_2022-04-18_13-49-44_814862_139", "rayVersion":"1.12.0", "rayCommit":"f18fc31c7562990955556899090f8e8656b48d2d" } } } ``` This PR fixes the inconsistency by changing Jobs start/end timestamps to ms.	2022-04-26 08:37:41 -07:00
Jiajun Yao	3fb63847e2	Show usage stats prompt (#23822 ) Show usage stats prompt when it's enabled. Current UX are: * The usage stats enabled or disabled message is shown every time in both terminal and dashboard. * If users don't explicitly enable or disable usage stats, the first time they start a ray cluster interactively, they will be asked to confirm and will enable if no user action within 10s. If it's non-interactive, collection is enabled by default without confirmation. * ray.init() doesn't collect usage stats * Usage stats can be disabled via three approaches: 1. RAY_USAGE_STATS_ENABLED env var, 2. ray xxx --disable-usage-stats, 3. ray disable-usage-stats	2022-04-25 16:01:24 -07:00
SangBin Cho	73ed67e9e6	[State API] State api limit + Removing unnecessary modules (#24098 ) This PR does Move all routes into the same module, state_head.py Support a limit feature.	2022-04-22 15:59:46 -07:00
SangBin Cho	30ab5458a7	[State Observability] Tasks and Objects API (#23912 ) This PR implements ray list tasks and ray list objects APIs. NOTE: You can ignore the merge conflict for now. It is because the first PR was reverted. There's a fix PR open now.	2022-04-21 18:45:03 -07:00
shrekris-anyscale	b51d0aa8b1	[serve] Introduce `context.py` and `client.py` (#24067 ) Serve stores context state, including the `_INTERNAL_REPLICA_CONTEXT` and the `_global_client` in `api.py`. However, these data structures are referenced throughout the codebase, causing circular dependencies. This change introduces two new files: * `context.py` * Intended to expose process-wide state to internal Serve code as well as `api.py` * Stores the `_INTERNAL_REPLICA_CONTEXT` and the `_global_client` global variables * `client.py` * Stores the definition for the Serve `Client` object, now called the `ServeControllerClient`	2022-04-21 18:35:09 -05:00
jon-chuang	ddcc252b51	[Core] Ray logs API (1/n) (#23435 ) Expose HTTP endpoint to retrieve logs from ray cluster	2022-04-20 23:11:02 -07:00
Chu Xiangyang	6f74040b15	[Job] Fix typo in job sdk docstring (#23940 )	2022-04-20 12:30:32 -05:00

1 2 3 4 5 ...

306 commits