
hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Guyang Song	1949f35901	[runtime env] plugin refactor[4/n]: remove runtime env protobuf (#26522 )	2022-07-15 13:56:12 +08:00
brucez-anyscale	d98a2482de	[Dashboard] Fix test dashboard flaky by catch an expected exception (#26555 )	2022-07-14 20:57:46 -07:00
SangBin Cho	e9f6ffc5a5	[Core][State Observability] Use address arg + print warning if API responds slowly (#26008 ) This PR is doing 2 things. (1) Use api_server_url to address which is consistent to other submission APIs. (2) When the API is not responded timely, it prints a warning every 5 seconds. Below is an example. This is useful when the API is slowly responded (e.g., when there are partial failures). Without this users will see hanging API for 30 seconds, which is a pretty bad UX. (0.12 / 10 seconds) Waiting for the response from the API server address http://127.0.0.1:8265/api/v0/delay/5.	2022-07-14 06:44:07 -07:00
Sven Mika	ab10890e90	Revert "Bump pytest from 5.4.3 to 7.0.1" (breaks lots of RLlib tests for unknown reasons) (#26517 )	2022-07-13 11:19:30 -07:00
Ricky Xu	365ffe21e5	[Core \| State Observability] Implement API Server (Dashboard) HTTP Requests Throttling (#26257 ) This is to limit the max number of HTTP requests the dashboard (API server) will accept before rejecting more requests. This will make sure the observability requests do not overload the downstream systems (raylet/gcs) when delegating too many concurrent state observability requests to the cluster.	2022-07-13 09:05:26 -07:00
Riatre	2cdb76789e	Bump pytest from 5.4.3 to 7.0.1 (#26334 ) See #23676 for context. This is another attempt at that as I figured out what's going wrong in `bazel test`. Supersedes #24828. Now that there are Python 3.10 wheels for Ray 1.13 and this is no longer a blocker for supporting Python 3.10, I still want to make `bazel test //python/ray/tests/...` work for developing in a 3.10 env, and make it easier to add Python 3.10 tests to CI in future. The change contains three commits with rather descriptive commit message, which I repeat here: Pass deps to py_test in py_test_module_list Bazel macro py_test_module_list takes a `deps` argument, but completely ignores it instead of passes it to `native.py_test`. Fixing that as we are going to use deps of py_test_module_list in BUILD in later changes. cpp/BUILD.bazel depends on the broken behaviour: it deps-on a cc_library from a py_test, which isn't working, see upstream issue: https://github.com/bazelbuild/bazel/issues/701. This is fixed by simply removing the (non-working) deps. Depend on conftest and data files in Python tests BUILD files Bazel requires that all the files used in a test run should be represented in the transitive dependencies specified for the test target. For py_test, it means srcs, deps and data. Bazel enforces this constraint by creating a "runfiles" directory, symbolic links files in the dependency closure and run the test in the "runfiles" directory, so that the test shouldn't see files not in the dependency graph. Unfortunately, the constraint does not apply for a large number of Python tests, due to pytest (>=3.9.0, <6.0) resolving these symbolic links during test collection and effectively "breaks out" of the runfiles tree. pytest >= 6.0 introduces a breaking change and removed the symbolic link resolving behaviour, see pytest pull request https://github.com/pytest-dev/pytest/pull/6523 for more context. 
Currently, we are underspecifying dependencies in a lot of BUILD files and thus blocking us from updating to newer pytest (for Python 3.10 support). This change hopefully fixes all of them, and at least those in CI, by adding data or source dependencies (mostly for conftest.py-s) where needed. Bump pytest version from 5.4.3 to 7.0.1 We want at least pytest 6.2.5 for Python 3.10 support, but not past 7.1.0 since it drops Python 3.6 support (which Ray still supports), thus the version constraint is set to <7.1. Updating pytest, combined with earlier BUILD fixes, changed the ground truth of a few error message based unit test, these tests are updated to reflect the change. There are also two small drive-by changes for making test_traceback and test_cli pass under Python 3.10. These are discovered while debugging CI failures (on earlier Python) with a Python 3.10 install locally. Expect more such issues when adding Python 3.10 to CI.	2022-07-12 21:14:35 -07:00
brucez-anyscale	57258335bd	[Serve] Fix test_cli flakiness (#26471 )	2022-07-12 17:57:08 -07:00
Alan Guo	7ad3a247bf	[Dashboard] [Frontend] Add workers to the main node tab in the New Dashboard UI (#26274 ) The old dashboard UI was much easier at seeing all the work across all workers because workers were shown along side nodes in the main nodes page. This change brings the same functionality to the new Dashboard UI. Some changes in this PR: Factor out the NodeRow into its own component and into its own file. Introduce WorkerRow which shows information about a worker Updates the heading of the table column because the column will show different data depending on if its a node row or a worker row. Makes sure we're rounding percentages to a single decimal place. Logs button for worker row will go to the logs page and filter out just the log files related to that worker. Update the api for fetching nodes into fetching nodes + workers. fix bug where object store memory was not showing the total size but instead the remaining size	2022-07-12 16:28:08 -07:00
Yi Cheng	a68c02a15d	[dashboard][2/2] Add endpoints to dashboard and dashboard_agent for liveness check of raylet and gcs (#26408 ) ## Why are these changes needed? As in this https://github.com/ray-project/ray/pull/26405 we added the health check for gcs and raylets. This PR expose them in the endpoint in dashboard and dashboard agent. For dashboard, we added `http://host:port/api/gcs_healthz` and it'll send RPC to GCS directly to see whether the GCS is alive or not. For agent, we added `http://host:port/api/local_raylet_healthz` and it'll send RPC to GCS to check whether raylet is alive or not. We think raylet is live if - GCS is dead - GCS is alive but GCS think the raylet is dead If GCS is dead for more than X seconds (60 by default), raylet will just crash itself, so KubeRay can still catch it.	2022-07-09 13:09:48 -07:00
Nikita Vemuri	56716a1c1b	[dashboard] Add `RAY_CLUSTER_ACTIVITY_HOOK` to `/api/component_activities` (#26297 ) Add external hook to /api/component_activities endpoint in dashboard snapshot router Change is_active field of RayActivityResponse to take an enum RayActivityStatus instead of bool. This is a backward incompatible change, but should be ok because [dashboard] Add component_activities API #25996 wasn't included in any branch cuts. RayActivityResponse now supports informing when there was an error getting the activity observation and the reason.	2022-07-08 10:51:59 -07:00
SangBin Cho	2dd5fdfdf1	[Usage stats] Add tags & number of nodes to the report. (#25852 ) This PR adds the RAY_EXTRA_USAGE_TAGS to add additional tag metadata + number of nodes to the report.	2022-07-07 08:31:04 -07:00
brucez-anyscale	f76d7b23f2	Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336 )	2022-07-06 19:37:30 -07:00
Yi Cheng	12d147ff1f	Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent (#26107 )" (#26333 ) This reverts commit `84166ccb04`.	2022-07-06 13:30:33 -07:00
brucez-anyscale	84166ccb04	[Dashboard][Serve] Move Serve related endpoints to dashboard agent (#26107 ) In Ray 2.0, we want to achieve api server HA. Originally serve endpoints are in head node. This pr moves serve endpoints to dashboard agents, so they will be HA due to multiple replica of dashboard agent.	2022-07-06 10:58:00 -07:00
xwjiang2010	d0dfbe09e3	[tune] fix `set_tune_experiment` (#26298 )	2022-07-05 15:04:51 -07:00
shrekris-anyscale	010a3566e6	[Serve] Allow and remove trailing slashes in Ray submission address (#26093 )	2022-06-30 16:04:53 -07:00
Nikita Vemuri	8fc3409676	[dashboard] Add `component_activities` API (#25996 ) Add /api/component_activities to the dashboard snapshot router which returns whether various Ray components are considered active This currently only contains a response entry for drivers, but will add entries for other components on request as followups	2022-06-30 13:39:01 -07:00
shrekris-anyscale	6e800cc2df	[Serve] Disable `test_serve_head.py` on OSX (#26178 ) `test_serve_head.py` has been very flaky recently on OSX, so this change disables it there.	2022-06-29 11:21:53 -07:00
SangBin Cho	8837a4593f	[State Observability] Truncate data when there are too many entries to return (#26124 ) ## Why are these changes needed? This PR adds data truncation when there are more than N number of entries. The policy is as follow; By default, we return 100 entries at max. Users can adjust this value, but we won't allow to increase more than 10K. By default, all internal RPCs truncate data if it's > 10K. For distributed sources, we query each source with 10K limit and we apply limit again at the end. ## Related issue number Closes https://github.com/ray-project/ray/issues/25984#issue-1279280673 Part of https://github.com/ray-project/ray/issues/25718#issue-1268968400	2022-06-28 18:33:57 -07:00
SangBin Cho	def02bd4c9	Revert Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds" #26162 (#26163 ) * Revert "Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds (#26080)" (#26162)" This reverts commit `3017128d5e`.	2022-06-28 16:07:32 -07:00
Stephanie Wang	3017128d5e	Revert "[Observability] Fix --follow lost connection when it is used for > 30 seconds (#26080 )" (#26162 ) This reverts commit `2d58bd5a50`.	2022-06-28 10:04:58 -07:00
SangBin Cho	68336abf13	[State Observability] Support --detail flag. (#26071 ) ## Why are these changes needed? This PR adds --detail flag to the list APIs.	2022-06-28 07:56:44 -07:00
SangBin Cho	2d58bd5a50	[Observability] Fix --follow lost connection when it is used for > 30 seconds (#26080 ) ## Why are these changes needed? This PR fixes the issue where --follow lost connection when it is used for > 30 seconds because the gRPC timeout is configured to be 30 seconds, and we don't reset it when --follow is set. This fixes the issue by setting timeout=None when keepalive==True ## Related issue number Closes https://github.com/ray-project/ray/issues/25721	2022-06-28 05:48:25 -07:00
SangBin Cho	4b957e99b5	[State Observability] != predicate for filtering. (#26079 ) ## Why are these changes needed? This PR implements `!=` predicate for filtering. As a result of this PR, two APIs are changed. ``` --filter key value -> --filter "key=val" or ---filter "key!=val" list_actors(filters=[(key, val), (key2, val2)]) -> list_actors(filters=[(key, "=", val), (key2, "=", val2)]) ```	2022-06-28 05:42:19 -07:00
Guyang Song	58bfad84d3	[runtime env] plugin refactor[1/n] (#26077 )	2022-06-28 14:09:05 +08:00
Ricky Xu	44daf3ecd7	[Core][State Observability] Get API using List endpoints + filtering on ids (#25894 ) ## Why are these changes needed? This is a first implementation of GET APIs for nodes actors placement groups workers tasks objects E.g. # CLI (dev) ➜ ray git:(ricky/obs-get) ray get nodes cab26304d105caa6f2100908f7b461ef9ed244984ec30b4b46f953f9 --- node_id: cab26304d105caa6f2100908f7b461ef9ed244984ec30b4b46f953f9 node_ip: 172.31.47.143 node_name: 172.31.47.143 resources_total: CPU: 8.0 memory: 16700517582.0 node:172.31.47.143: 1.0 object_store_memory: 8350258790.0 state: ALIVE # Python from ray.experimental.state.api import get_node from ray.experimental.state.common import NodeState node :NodeState = get_node(<id>) print(node) We currently do not support getting specific resources by id for 'jobs' and 'runtime-envs' jobs: it is not exposing id to be queried easily yet runtime envs: it doesn't have an id associated. TODO: it uses list endpoints + filtering as for now, future iterations will implement GET-specific endpoints and interaction with raylet/GCS with point query APIs. Unit testing for state_manager for GET endpoints when implemented. Getting jobs by id	2022-06-27 17:14:29 -07:00
Ricky Xu	3d8ca6cf0f	[Core][cli][usability] ray stop prints errors during graceful shutdown (#25686 ) Why are these changes needed? This is to address false alarms on subprocesses exiting when killed by ray stop with SIGTERM. What has been changed? Added signal handlers for some of the subprocesses: dashboard (head) log monitor ray client server Changed the --block semantics and prompt messages. Related issue number Closes #25518	2022-06-27 08:14:59 -07:00
Dmitri Gekhtman	1055eadde0	[Dashboard] Fix dashboard RAM and CPU with cgroups2 (#25710 ) Closes #25283. The dashboard shows inaccurate memory and cpu data when run inside of a docker container, in particular when using cgroups v2. This PR fixes that.	2022-06-26 14:01:26 -07:00
SangBin Cho	6552e096e6	[State Observability] Summary APIs (#25672 ) Task/actor/object summary Tasks: Group by the func name. In the future, we will also allow to group by task_group. Actors: Group by actor class name. In the future, we will also allow to group by actor_group. Object: Group by callsite. In the future, we will allow to group by reference type or task state.	2022-06-22 06:21:50 -07:00
Eric Liang	43aa2299e6	[api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695 ) Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.	2022-06-21 15:13:29 -07:00
Archit Kulkarni	565e366529	[runtime env] Use async internal kv in package download and plugins (#25788 ) Uses the async KV API for downloading in the runtime env agent. This avoids the complexity of running the runtime env creation functions in a separate thread. Some functions are still sync, including the working_dir/py_modules upload, installing wheels, and possibly others.	2022-06-21 15:02:36 -07:00
shrekris-anyscale	3d6a5450c9	[Serve] Stop Ray in test_serve_head.py fixture (#25893 )	2022-06-21 11:28:07 -07:00
shrekris-anyscale	ad12f0cd02	[Serve] Deprecate outdated REST API settings (#25932 )	2022-06-21 11:06:45 -07:00
Guyang Song	d1d5fe61c2	[Dashboard][Frontend] Worker table enhancement (#25934 )	2022-06-21 14:09:48 +08:00
SangBin Cho	411b1d8d2d	[State Observability] Return list instead of dict (#25888 ) I’d like to propose a bit changes to the API. Currently we are returning the dict of ID -> value mapping when the list API is returned. But I am thinking to change this to a list because the sort will become ineffective if we return the dictionary. So, it’s ideal we use the list to keep the order (it’s important for deterministic order) Also, for some APIs, each entry doesn’t have a unique id. For example, list objects will have duplicated object IDs from their entries, which is not working with dict return type (e.g., there can be more than 1 Object ID entry if the object is locally referenced & borrowed by task/pinned in memory) Also, users can easily build dict index on their own if it is necessary.	2022-06-20 22:49:29 -07:00
Guyang Song	e13cc4088a	[Dashboard] Don't sort node list by default (#25884 )	2022-06-20 11:35:12 +08:00
Archit Kulkarni	85be093a84	[runtime env] Make all plugins return a `List` of URIs (#25825 ) Followup from #24622. This is another step towards pluggability for runtime_env. Previously some plugin classes had `get_uri` which returned a single URI, while others had `get_uris` which returned a list. This PR makes all plugins use `get_uris`, which simplifies the code overall. Most of the lines in the diff just come from the new `format.sh` which sorts the imports.	2022-06-17 14:13:44 -05:00
Simon Mo	e560bce3a4	[Serve] bind to 0.0.0.0 in serve_head (#25862 )	2022-06-16 11:45:11 -07:00
Archit Kulkarni	23030dbcaa	[runtime env] Hide URI cache behind class (#24622 ) Followup PR to https://github.com/ray-project/ray/pull/20273. - Hides cache logic behind a class. - Adds "name" field to runtime env plugin class and makes existing conda, pip, working_dir, and py_modules inherit from the plugin class. Future work will unify the codepath for these "base plugins" with the codepath for third-party plugins; currently these are different, and URI support is missing for third-party plugins.	2022-06-15 16:14:06 -05:00
shrekris-anyscale	a371756b3c	[Serve] Update Serve CLI and REST API behavior to use new config (#25691 )	2022-06-14 19:01:51 -07:00
clarng	badf444eda	Respect import order for psutil and setproctitle (#25780 ) Sort imports in a way that preserves the ordering requirements. This PR is needed for any file changes that imports psutil or setproctitle.	2022-06-14 17:44:41 -07:00
Ricky Xu	b1d0b12b4e	[Core \ State Observability] Use Submission client (#25557 ) ## Why are these changes needed? This is to refactor the interaction of state cli to API server from a hard-coded request workflow to `SubmissionClient` based. See #24956 for more details. ## Summary - Created a `StateApiClient` that inherits from the `SubmissionClient` and refactor various listing commands into class methods. ## Related issue number Closes #24956 Closes #25578	2022-06-13 17:11:19 -07:00
shrekris-anyscale	3278763dd7	[Serve] Start all Serve actors in the `"serve"` namespace only (#25575 )	2022-06-13 10:31:28 -07:00
Simon Mo	feb8c29063	Revert "Revert "Revert "use an agent-id rather than the process PID (#24968 )"… (#25376 )" (#25669 ) This reverts commit `cb151d5ad6`.	2022-06-13 09:22:52 -07:00
SangBin Cho	856bea31fb	[State Observability] Ray log CLI / API (#25481 ) This PR implements the basic log APIs. For the better APIs (like higher level APIs like ray logs actors), it will be implemented after the internal API review is done. # If there's only 1 match, print a file content. Otherwise, print all files that match glob. ray logs [glob_filter] --node-id=[head node by default] Args: --tail: Tail the last X lines --follow: Follow the new logs --actor-id: The actor id --pid --node-ip: For worker logs --node-id: The node id of the log --interval: When --follow is specified, logs are printed with this interval. (should we remove it?)	2022-06-13 05:52:57 -07:00
mwtian	65d7a610ab	[Core] Push message to driver when a Raylet dies (#25516 ) Currently when Raylets die, it is hard to figure out: if a Raylet died at all in a cluster. Usually we have to check on nodes where a number of workers died and see if the Raylet has died as well. reason of Raylet's death. With this PR, if a Raylet dies from a reason other than SIGTERM, the dashboard agent will report the failure along with last 20 lines of the Raylet log.	2022-06-09 05:54:34 -07:00
shrekris-anyscale	f3c2bd6718	[Serve] Make REST API deployments inherit top-level runtime_env (#25502 )	2022-06-08 15:58:00 -07:00
Archit Kulkarni	6d2806f951	[Jobs] [Test] Add integration tests to cover runtime_env inheritance with working_dir and with Tune (#25562 ) The current inheritance behavior for runtime_envs enables the following workflow for Jobs: A working_dir can be set in the Jobs API, and then inside the driver script, if a new per-task runtime_env is defined, it will automatically inherit the driver's working_dir. There is an ongoing discussion about the best approach for runtime_env inheritance going forward: https://github.com/ray-project/ray/issues/25484, in which we noted that there were no tests covering this behavior. This PR adds integration tests for the above behavior. If we ultimately decide to abandon the current inheritance behavior and instead have child runtime envs completely overwrite the parent runtime env, this test will fail, reminding us to do the following: - Update the internal runtime_env usage in Ray Tune to use the `ray.get_runtime_context().runtime_env.update` API - Update the documentation for Ray Jobs telling users to use `ray.get_runtime_context().runtime_env.update` and update this test	2022-06-08 13:54:06 -07:00
mwtian	1ce0ab7b7c	[Core] Export additional metrics for workers and Raylet memory (#25418 ) Add visibility into the following to help Ray users and developers debug performance and OOM issues: Raylet memory usage broken down by USS vs remaining RSS. Total workers' count, CPU percentage usage, and memory usage.	2022-06-06 10:58:14 -07:00
SangBin Cho	00e3fd75f3	[State Observability] Ray log alpha API (#24964 ) This is the PR to implement ray log to the server side. The PR is continued from #24068. The PR supports two endpoints; /api/v0/logs # list logs of the node id filtered by the given glob. /api/v0/logs/{[file \| stream]}?filename&pid&actor_id&task_id&interval&lines # Stream the requested file log. The filename can be inferred by pid/actor_id/task_id Some tests need to be re-written, I will do it soon. As a follow-up after this PR, there will be 2 PRs. PR to add actual CLI PR to remove in-memory cached logs and do on-demand query for actor/worker logs	2022-06-04 05:10:23 -07:00
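
The filter and truncation behaviour described in #26079 and #26124 above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Ray's implementation: `apply_filters`, `list_entries`, and the sample actor dictionaries are hypothetical names. Only the `(key, "=", value)` tuple form, the default of 100 entries, and the 10K ceiling come from the commit messages. Applying the filters before the limit is also an assumption of this sketch.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Limits taken from the commit message of #26124.
DEFAULT_LIMIT = 100
MAX_LIMIT = 10_000

# A filter is (key, predicate, value), the tuple form introduced in #26079.
Filter = Tuple[str, str, Any]


def apply_filters(entries: Iterable[Dict[str, Any]],
                  filters: Iterable[Filter]) -> List[Dict[str, Any]]:
    """Keep only the entries that satisfy every (key, predicate, value) filter."""
    filters = list(filters)
    for _, predicate, _ in filters:
        if predicate not in ("=", "!="):
            raise ValueError(f"unsupported predicate: {predicate!r}")

    def matches(entry: Dict[str, Any]) -> bool:
        for key, predicate, value in filters:
            equal = entry.get(key) == value
            # "=" requires equality, "!=" requires inequality.
            if equal != (predicate == "="):
                return False
        return True

    return [entry for entry in entries if matches(entry)]


def list_entries(entries: Iterable[Dict[str, Any]],
                 filters: Iterable[Filter] = (),
                 limit: int = DEFAULT_LIMIT) -> List[Dict[str, Any]]:
    """Filter the entries, then truncate the result to at most `limit` items."""
    if limit > MAX_LIMIT:
        raise ValueError(f"limit must not exceed {MAX_LIMIT}")
    # Returning a list keeps the input order, as argued in #25888.
    return apply_filters(entries, filters)[:limit]


# Hypothetical sample data for illustration only.
actors = [
    {"actor_id": "a1", "state": "ALIVE"},
    {"actor_id": "a2", "state": "DEAD"},
    {"actor_id": "a3", "state": "ALIVE"},
]
alive = list_entries(actors, filters=[("state", "=", "ALIVE")])
not_dead_first = list_entries(actors, filters=[("state", "!=", "DEAD")], limit=1)
```

Because the result is a list, a caller who needs lookup by ID can build an index with a dict comprehension such as `{a["actor_id"]: a for a in alive}`.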


498 commits