hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Guyang Song	58bfad84d3	[runtime env] plugin refactor[1/n] (#26077 )	2022-06-28 14:09:05 +08:00
Ricky Xu	44daf3ecd7	[Core][State Observability] Get API using List endpoints + filtering on ids (#25894 ) ## Why are these changes needed? This is a first implementation of GET APIs for nodes, actors, placement groups, workers, tasks, and objects. E.g. # CLI (dev) ➜ ray git:(ricky/obs-get) ray get nodes cab26304d105caa6f2100908f7b461ef9ed244984ec30b4b46f953f9 --- node_id: cab26304d105caa6f2100908f7b461ef9ed244984ec30b4b46f953f9 node_ip: 172.31.47.143 node_name: 172.31.47.143 resources_total: CPU: 8.0 memory: 16700517582.0 node:172.31.47.143: 1.0 object_store_memory: 8350258790.0 state: ALIVE # Python from ray.experimental.state.api import get_node from ray.experimental.state.common import NodeState node :NodeState = get_node(<id>) print(node) We currently do not support getting specific resources by id for 'jobs' and 'runtime-envs'. jobs: it is not exposing id to be queried easily yet. runtime envs: it doesn't have an id associated. TODO: (1) it uses list endpoints + filtering for now; future iterations will implement GET-specific endpoints and interaction with raylet/GCS with point query APIs. (2) Unit testing for state_manager for GET endpoints when implemented. (3) Getting jobs by id.	2022-06-27 17:14:29 -07:00
Ricky Xu	3d8ca6cf0f	[Core][cli][usability] ray stop prints errors during graceful shutdown (#25686 ) Why are these changes needed? This is to address false alarms on subprocesses exiting when killed by ray stop with SIGTERM. What has been changed? Added signal handlers for some of the subprocesses: dashboard (head), log monitor, and ray client server. Changed the --block semantics and prompt messages. Related issue number: Closes #25518	2022-06-27 08:14:59 -07:00
Dmitri Gekhtman	1055eadde0	[Dashboard] Fix dashboard RAM and CPU with cgroups2 (#25710 ) Closes #25283. The dashboard shows inaccurate memory and cpu data when run inside of a docker container, in particular when using cgroups v2. This PR fixes that.	2022-06-26 14:01:26 -07:00
SangBin Cho	6552e096e6	[State Observability] Summary APIs (#25672 ) Task/actor/object summary. Tasks: grouped by the func name. In the future, we will also allow grouping by task_group. Actors: grouped by actor class name. In the future, we will also allow grouping by actor_group. Objects: grouped by callsite. In the future, we will allow grouping by reference type or task state.	2022-06-22 06:21:50 -07:00
Eric Liang	43aa2299e6	[api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695 ) Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.	2022-06-21 15:13:29 -07:00
Archit Kulkarni	565e366529	[runtime env] Use async internal kv in package download and plugins (#25788 ) Uses the async KV API for downloading in the runtime env agent. This avoids the complexity of running the runtime env creation functions in a separate thread. Some functions are still sync, including the working_dir/py_modules upload, installing wheels, and possibly others.	2022-06-21 15:02:36 -07:00
shrekris-anyscale	3d6a5450c9	[Serve] Stop Ray in test_serve_head.py fixture (#25893 )	2022-06-21 11:28:07 -07:00
shrekris-anyscale	ad12f0cd02	[Serve] Deprecate outdated REST API settings (#25932 )	2022-06-21 11:06:45 -07:00
Guyang Song	d1d5fe61c2	[Dashboard][Frontend] Worker table enhancement (#25934 )	2022-06-21 14:09:48 +08:00
SangBin Cho	411b1d8d2d	[State Observability] Return list instead of dict (#25888 ) This proposes a small change to the API. Currently the list APIs return a dict of ID -> value mappings. This PR changes the return type to a list, because sorting becomes ineffective if we return a dictionary; a list keeps the order, which is important for deterministic output. Also, for some APIs, each entry doesn’t have a unique id. For example, list objects can return duplicate object IDs across its entries, which does not work with a dict return type (e.g., there can be more than one entry per object ID if the object is locally referenced and also borrowed by a task or pinned in memory). Users can easily build a dict index on their own if necessary.	2022-06-20 22:49:29 -07:00
Guyang Song	e13cc4088a	[Dashboard] Don't sort node list by default (#25884 )	2022-06-20 11:35:12 +08:00
Archit Kulkarni	85be093a84	[runtime env] Make all plugins return a `List` of URIs (#25825 ) Followup from #24622. This is another step towards pluggability for runtime_env. Previously some plugin classes had `get_uri` which returned a single URI, while others had `get_uris` which returned a list. This PR makes all plugins use `get_uris`, which simplifies the code overall. Most of the lines in the diff just come from the new `format.sh` which sorts the imports.	2022-06-17 14:13:44 -05:00
Simon Mo	e560bce3a4	[Serve] bind to 0.0.0.0 in serve_head (#25862 )	2022-06-16 11:45:11 -07:00
Archit Kulkarni	23030dbcaa	[runtime env] Hide URI cache behind class (#24622 ) Followup PR to https://github.com/ray-project/ray/pull/20273. - Hides cache logic behind a class. - Adds "name" field to runtime env plugin class and makes existing conda, pip, working_dir, and py_modules inherit from the plugin class. Future work will unify the codepath for these "base plugins" with the codepath for third-party plugins; currently these are different, and URI support is missing for third-party plugins.	2022-06-15 16:14:06 -05:00
shrekris-anyscale	a371756b3c	[Serve] Update Serve CLI and REST API behavior to use new config (#25691 )	2022-06-14 19:01:51 -07:00
clarng	badf444eda	Respect import order for psutil and setproctitle (#25780 ) Sort imports in a way that preserves the ordering requirements. This PR is needed for any file changes that imports psutil or setproctitle.	2022-06-14 17:44:41 -07:00
Ricky Xu	b1d0b12b4e	[Core \ State Observability] Use Submission client (#25557 ) ## Why are these changes needed? This refactors the interaction between the state CLI and the API server from a hard-coded request workflow to one based on `SubmissionClient`. See #24956 for more details. ## Summary - Created a `StateApiClient` that inherits from the `SubmissionClient` and refactored various listing commands into class methods. ## Related issue number Closes #24956 Closes #25578	2022-06-13 17:11:19 -07:00
shrekris-anyscale	3278763dd7	[Serve] Start all Serve actors in the `"serve"` namespace only (#25575 )	2022-06-13 10:31:28 -07:00
Simon Mo	feb8c29063	Revert "Revert "Revert "use an agent-id rather than the process PID (#24968 )"… (#25376 )" (#25669 ) This reverts commit `cb151d5ad6`.	2022-06-13 09:22:52 -07:00
SangBin Cho	856bea31fb	[State Observability] Ray log CLI / API (#25481 ) This PR implements the basic log APIs. Higher-level APIs (such as ray logs actors) will be implemented after the internal API review is done. Usage: ray logs [glob_filter] --node-id=[head node by default]. If there's only 1 match, print the file's content; otherwise, print all files that match the glob. Args: --tail: Tail the last X lines; --follow: Follow the new logs; --actor-id: The actor id; --pid --node-ip: For worker logs; --node-id: The node id of the log; --interval: When --follow is specified, logs are printed with this interval. (should we remove it?)	2022-06-13 05:52:57 -07:00
mwtian	65d7a610ab	[Core] Push message to driver when a Raylet dies (#25516 ) Currently when Raylets die, it is hard to figure out (1) whether a Raylet died at all in a cluster (usually we have to check on nodes where a number of workers died and see if the Raylet has died as well), and (2) the reason for the Raylet's death. With this PR, if a Raylet dies for a reason other than SIGTERM, the dashboard agent will report the failure along with the last 20 lines of the Raylet log.	2022-06-09 05:54:34 -07:00
shrekris-anyscale	f3c2bd6718	[Serve] Make REST API deployments inherit top-level runtime_env (#25502 )	2022-06-08 15:58:00 -07:00
Archit Kulkarni	6d2806f951	[Jobs] [Test] Add integration tests to cover runtime_env inheritance with working_dir and with Tune (#25562 ) The current inheritance behavior for runtime_envs enables the following workflow for Jobs: A working_dir can be set in the Jobs API, and then inside the driver script, if a new per-task runtime_env is defined, it will automatically inherit the driver's working_dir. There is an ongoing discussion about the best approach for runtime_env inheritance going forward: https://github.com/ray-project/ray/issues/25484, in which we noted that there were no tests covering this behavior. This PR adds integration tests for the above behavior. If we ultimately decide to abandon the current inheritance behavior and instead have child runtime envs completely overwrite the parent runtime env, this test will fail, reminding us to do the following: - Update the internal runtime_env usage in Ray Tune to use the `ray.get_runtime_context().runtime_env.update` API - Update the documentation for Ray Jobs telling users to use `ray.get_runtime_context().runtime_env.update` and update this test	2022-06-08 13:54:06 -07:00
mwtian	1ce0ab7b7c	[Core] Export additional metrics for workers and Raylet memory (#25418 ) Add visibility into the following to help Ray users and developers debug performance and OOM issues: Raylet memory usage broken down by USS vs remaining RSS. Total workers' count, CPU percentage usage, and memory usage.	2022-06-06 10:58:14 -07:00
SangBin Cho	00e3fd75f3	[State Observability] Ray log alpha API (#24964 ) This PR implements ray log on the server side. It continues from #24068. The PR supports two endpoints: /api/v0/logs (lists logs of the node id filtered by the given glob) and /api/v0/logs/{[file | stream]}?filename&pid&actor_id&task_id&interval&lines (streams the requested file log; the filename can be inferred from pid/actor_id/task_id). Some tests need to be rewritten; I will do it soon. There will be 2 follow-up PRs after this one: a PR to add the actual CLI, and a PR to remove in-memory cached logs and do on-demand queries for actor/worker logs.	2022-06-04 05:10:23 -07:00
SangBin Cho	54496d7705	[State Observability API] Support Filtering (#25281 ) This PR adds filtering support. The filtering is done on the API server side (not on the source side). Source-side filtering is more complicated to implement elegantly, and we will handle it in the future (no optimization for alpha APIs). We will also support limited types of columns for each API. The API is as follows: ray list [resources] --filter [key] [value] => keep only the data where key==value. In the future, we can also support more complicated filtering such as !=, And, Or, etc.	2022-06-03 17:17:30 -07:00
shrekris-anyscale	16bdfe6a39	Restore "[Serve] Deploy Serve deployment graphs via REST API" (#25073 ) (#25333 )	2022-06-02 11:06:53 -07:00
SangBin Cho	cb151d5ad6	Revert "Revert "use an agent-id rather than the process PID (#24968 )"… (#25376 )	2022-06-01 16:28:48 -07:00
Simon Mo	61099faa58	[CI] Fix dashboard tests broken due to dep version upgrade (#25357 )	2022-06-01 12:14:49 -07:00
Eric Liang	905258dbc1	Clean up docstyle in python modules and add LINT rule (#25272 )	2022-06-01 11:27:54 -07:00
Eric Liang	517f78e2b8	[minor] Add a job submission hook by env var (#25343 )	2022-06-01 11:15:43 -07:00
SangBin Cho	3385d19cbb	Revert "use an agent-id rather than the process PID (#24968 )" (#25342 ) This reverts commit `02f220b755`. ## Why are these changes needed? Looks like this commit makes `test_ray_shutdown` way more flaky. cc @mattip for further investigation after revert	2022-06-01 05:35:30 -07:00
shrekris-anyscale	7754645c83	Revert "[Serve] Deploy Serve deployment graphs via REST API (#25073 )" (#25330 ) This reverts commit `47709b3300`.	2022-05-31 15:37:55 -07:00
shrekris-anyscale	47709b3300	[Serve] Deploy Serve deployment graphs via REST API (#25073 )	2022-05-31 10:57:08 -07:00
Matti Picus	02f220b755	use an agent-id rather than the process PID (#24968 ) When using ray inside a virtualenv on Windows, python.exe as reported by sys.executable is a PEP 397 launcher to the actual python as reported by os.getpid(): >>> import sys, os, psutil >>> >>> print(sys.executable) C:\temp\issue24361\Scripts\python.exe >>> os.getpid() 2208 >>> child = psutil.Process(2208) >>> child.cmdline() ['C:\\oss\\CPython38\\python.exe'] >>> child.parent().cmdline() ['C:\\temp\\issue24361\\Scripts\\python.exe'] >>> child.parent().pid 6424 When the agent_manager launches the agent process via Process::Process(), it gets the PID of the launcher process (6424), which is what is expected as an ID when registering the agent in the gRPC callback. But inside agent.py, the child process reports the PID via os.getpid(), which is 2208, and this is the wrong PID to register the agent. The solution proposed here is another version of #24905 that creates an int agent_id = rand(); before starting the python process, and passes the agent_id to the process.	2022-05-26 22:10:35 -07:00
mwtian	fa32cb7c40	Revert "[core] Resubscribe GCS in python when GCS restarts. (#24887 )" (#25168 ) This reverts commit `7cf4233858`.	2022-05-24 18:13:40 -07:00
mwtian	f79b826f31	[Dashboard] avoid showing disk info when it is unavailable (#24992 )	2022-05-24 17:13:47 -07:00
Philipp Moritz	323605d169	Support file:// for runtime_env working directories in jobs (#25062 ) This makes it possible to use an NFS file system that is shared on a cluster for runtime_env working directories. Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com> Co-authored-by: Eric Liang <ekhliang@gmail.com>	2022-05-24 16:17:18 -07:00
shrekris-anyscale	8b3451318c	[Serve] Update Serve status formatting and processing (#24839 )	2022-05-24 11:07:41 -07:00
Edward Oakes	65d21b7ae6	[job submission] Handle `env_vars: None` case properly in supervisor runtime_env logic (#25087 )	2022-05-24 11:01:19 -05:00
SangBin Cho	a7e759317b	[State Observability API] Error handling (#24413 ) This improves error handling per https://docs.google.com/document/d/1IeEsJOiurg-zctOcBjY-tQVbsCmURFSnUCTkx_4a7Cw/edit#heading=h.pdzl9cil9e8z (the RPC part). Semantics: If all queries to the source fail, raise a RayStateApiException. If only some queries fail, warnings.warn the partial failure when print_api_stats=True. This is true for the CLI. It is false when used within the Python API or when json / yaml format is required.	2022-05-24 03:56:49 -07:00
Yi Cheng	7cf4233858	[core] Resubscribe GCS in python when GCS restarts. (#24887 ) This is a follow-up PR to https://github.com/ray-project/ray/pull/24813 and https://github.com/ray-project/ray/pull/24628. Unlike the change in the cpp layer, where GCS broadcasts a request to raylet/core_worker and the client side then does the resubscription, in the python layer we detect the failure on the client side. In case of a failure, the protocol is: 1. call subscribe 2. if resubscribing times out, throw an exception, which will crash the system. This is ok because when GCS has been down for longer than expected, we expect the ray cluster to be down. 3. continue to poll once subscribe succeeds. However, there is an extreme case where things might break: the client might miss detecting a failure. This could happen if the long-polling request has returned and the python layer is doing its own work, and GCS restarts and recovers before the next long-polling request is sent. We are not going to handle this case because: 1. GCS usually takes several seconds to come up, and the python layer's work is simply pushing data into a queue (sync version). The async version is only used in the Dashboard, which is not a critical component. 2. pubsub in the python layer is not doing critical work: it handles logs/errors for ray jobs; 3. the dashboard can just restart to fix the issue. A known issue here is that we might miss logs in case of GCS failure, for the following reasons: - py's pubsub only does best-effort publishing. If it fails too many times, it skips publishing the message (messages lost on the producer side) - if a message is pushed to GCS but the worker hasn't resubscribed yet, the pushed message is lost (messages lost on the consumer side) We think this is reasonable and valid behavior given that logs are not defined to be a critical component and we'd like to simplify the design of pubsub in GCS. 
Another thing is `run_functions_on_all_workers`. We plan to stop using it within ray core and deprecate it in the longer term. But it won't cause a problem for the current cases because: 1. It's only set in the driver, and we don't support creating a new driver when GCS is down. 2. When GCS is down, we don't support starting new ray workers. And `run_functions_on_all_workers` is only used when we initialize driver/workers.	2022-05-23 13:06:33 -07:00
Archit Kulkarni	a67c8a0739	[runtime_env] Add temporary URI reference to prevent URI deletion before job starts (#24719 ) Packages are uploaded to the GCS for `runtime_env`. These packages are garbage collected when their refcount becomes zero. The problem is the reference doesn't get incremented until the job starts, which happens after the package is uploaded. It's possible for the package's refcount to go to zero in between the upload and when the job starts, causing the package to be deleted before it's needed by the job. It's likely the cause of https://github.com/ray-project/ray/issues/23423. We can't just increment the refcount at the time of upload, because if the script is killed before the job is started (e.g. via Ctrl-C) then the reference will never be decremented and the package will never be deleted. The solution in this PR is to increment the refcount at the time of upload, but automatically decrement after a configurable timeout (default 30s). This should be enough time for the job to start. When the job starts, it increments the refcount as usual and decrements it when the job finishes or is killed. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-05-23 10:25:04 -05:00
SangBin Cho	ec653e3196	[Nightly test] Move two line downloads to one line. (#25061 ) This fixes the mysterious error where every cluster env build fails when pip uninstall / pip install is written on 2 lines. The root cause will be fixed later.	2022-05-22 00:07:03 -07:00
Edward Oakes	cb7bcbd651	[job submission] Fix address defaulting behavior (#24970 ) Per the discussion in https://github.com/ray-project/ray/issues/24858: - If an address without a port is provided, don't append a port. - Default to `http://localhost:8265` if nothing is provided.	2022-05-20 14:10:36 -05:00
SangBin Cho	b9c30529d8	[Core/Observability 1/N] Add a "running" state to task status (#24651 ) This PR adds 2 more states into TaskStatus enum TaskStatus { // The task is scheduled properly and waiting for execution. // It includes time to deliver the task to the remote worker + queueing time // from the execution side. WAITING_FOR_EXECUTION = 5; // The task that is running. RUNNING = 6; }	2022-05-16 05:39:05 -07:00
Jiajun Yao	628f886af4	Don't show usage stats prompt in dashboard if prompt is disabled (#24700 )	2022-05-12 07:55:28 -07:00
Qing Wang	259661042c	[runtime env] [java] Support jars in runtime env for Java (#24170 ) This PR supports setting the jars for an actor in Ray API. The API looks like: ```java class A { public boolean findClass(String className) { try { Class.forName(className); } catch (ClassNotFoundException e) { return false; } return true; } } RuntimeEnv runtimeEnv = new RuntimeEnv.Builder() .addJars(ImmutableList.of("https://github.com/ray-project/test_packages/raw/main/raw_resources/java-1.0-SNAPSHOT.jar")) .build(); ActorHandle<A> actor1 = Ray.actor(A::new).setRuntimeEnv(runtimeEnv).remote(); boolean ret = actor1.task(A::findClass, "io.testpackages.Foo").remote().get(); System.out.println(ret); // true ```	2022-05-12 09:34:40 +08:00
Jiajun Yao	1daad65568	[Doc] Add doc for usage stats collection (#24522 )	2022-05-10 17:18:49 -07:00
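
The temporary-reference scheme described in commit a67c8a0739 (increment the refcount at upload time, automatically decrement after a timeout, and let the job hold its own reference once it starts) can be illustrated with a small sketch. This is not Ray's implementation: the class `UriRefTable`, its method names, the injectable `clock`, and the example URIs are all hypothetical and exist only to show the idea. Only the 30-second default timeout comes from the commit message.

```python
import time
from collections import defaultdict


class UriRefTable:
    """Hypothetical refcount table for uploaded packages (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock              # injectable so the sketch can be tested
        self._counts = defaultdict(int)  # uri -> current reference count
        self._temp_expiry = []           # (expiry_time, uri) for temporary refs
        self.deleted = []                # uris whose refcount reached zero

    def add_temporary_ref(self, uri, timeout_s=30.0):
        # Called at upload time: protects the package until the job starts.
        self._counts[uri] += 1
        self._temp_expiry.append((self._clock() + timeout_s, uri))

    def add_ref(self, uri):
        # Called when the job starts.
        self._counts[uri] += 1

    def remove_ref(self, uri):
        # Called when the job finishes or is killed, or when a temp ref expires.
        self._counts[uri] -= 1
        if self._counts[uri] <= 0:
            del self._counts[uri]
            self.deleted.append(uri)

    def expire_temporary_refs(self):
        # Drop every temporary reference whose timeout has passed.
        now = self._clock()
        expired = [uri for expiry, uri in self._temp_expiry if expiry <= now]
        self._temp_expiry = [(e, u) for e, u in self._temp_expiry if e > now]
        for uri in expired:
            self.remove_ref(uri)


if __name__ == "__main__":
    fake_now = [0.0]
    table = UriRefTable(clock=lambda: fake_now[0])
    table.add_temporary_ref("gcs://example_pkg.zip")  # upload
    fake_now[0] = 10.0
    table.add_ref("gcs://example_pkg.zip")            # job starts in time
    fake_now[0] = 31.0
    table.expire_temporary_refs()                     # temp ref expires
    print(table.deleted)                              # package still referenced
    table.remove_ref("gcs://example_pkg.zip")         # job finishes
    print(table.deleted)
```

If the script is killed before the job starts, only the temporary reference exists, so the package becomes deletable once the timeout passes instead of leaking forever.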

474 commits