hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Guyang Song	902243fb03	[runtime env] support raylet sharing fate with agent (#22382 ) - Remove the agent restart feature. - Raylet shares fate with agent to make the failover logic easier. Refer to issue https://github.com/ray-project/ray/issues/21695#issuecomment-1032161528	2022-02-21 18:16:21 +08:00
Guyang Song	57a94aae12	[runtime env][bugfix] Fix runtime env retry (#22495 ) - Bug: `error_message` is not cleared when the retry succeeds. This bug led to runtime env creation failing. - Add test case for this.	2022-02-18 17:09:06 -08:00
Archit Kulkarni	df581c584a	[Job] [Dashboard] Add Job Submission data to cluster snapshot (#22225 ) The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection). In the new Job Submission protocol, a Job just specifies an entrypoint which can be any shell command. As such a Job can have zero or multiple Ray drivers. This means we should add a new snapshot entry corresponding to new jobs. We'll leave the old snapshot in place for legacy jobs. - Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID. It wasn't working before. - This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot. For backwards compatibility, the `status` and `message` fields are preserved.	2022-02-18 09:54:37 -06:00
Archit Kulkarni	63a5eb492d	Revert "[serve] Add basic REST API to dashboard (#22257 )" (#22414 ) This reverts commit `f37f35c5da`.	2022-02-15 21:47:50 -06:00
Edward Oakes	f37f35c5da	[serve] Add basic REST API to dashboard (#22257 )	2022-02-15 15:36:58 -06:00
Jialing He	192f9de421	[runtime env] Introduce async Manager.create (#22311 )	2022-02-14 16:26:47 -06:00
Liu Bao	824453dd17	[runtime env] Create virtualenv for pip runtime env. (#21801 )	2022-02-10 12:25:18 -06:00
Edward Oakes	5df2a0a6c6	[jobs] Add test condition that job runs w/o CPUs available on head node (#22260 )	2022-02-10 10:23:02 -06:00
Archit Kulkarni	50e2bef9d0	[Jobs] Hide `dashboard` from Job Submission import path (#22223 ) For public SDK APIs, change the import path from ```python from ray.dashboard.modules.job.common import JobStatus, JobStatusInfo from ray.dashboard.modules.job.sdk import JobSubmissionClient ``` to ```python from ray.job_submission import JobStatus, JobSubmissionClient ``` `JobStatus`, `JobStatusInfo` and `JobSubmissionClient` were the only names referenced in the SDK doc so far, but we can add more later as they appear.	2022-02-09 13:55:32 -06:00
SangBin Cho	20ab9188c6	[Ray Usage Stats] Record cluster metadata + Refactoring. (#22170 ) This is the first PR to implement usage stats on Ray. Please refer to the file `usage_lib.py` for more details. The full specification is here: https://docs.google.com/document/d/1ZT-l9YbGHh-iWRUC91jS-ssQ5Qe2UQ43Lsoc1edCalc/edit#heading=h.17dss3b9evbj. You can see the full PR for phase 1 here: https://github.com/rkooo567/ray/pull/108/files. This PR does some basic refactoring and adds cluster metadata to GCS instead of the version numbers. After this PR, we will add code to enable usage reporting "off by default".	2022-02-08 22:12:36 -08:00
Nikita Vemuri	d19aaf0fd3	[jobs] Add unit test for `parse_cluster_info` (#22205 ) Add unit test to check addresses of various formats are correctly passed to `get_job_submission_client_cluster_info`.	2022-02-08 11:22:28 -06:00
Edward Oakes	8806b2d5c4	[jobs] Monitor jobs in the background to avoid requiring clients to poll (#22180 )	2022-02-07 15:25:25 -06:00
Jiao	a692e7d05e	[jobs] Fix restarting local ray cluster with http ray address broke local job submission (#21938 ) As titled. There is a corner case on a user's laptop where the user might have left RAY_ADDRESS set to an http address but restarted the local ray cluster. In this case we will try to do job submission with an http-prefixed address. Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Jiao Dong <jiaodong@anyscale.com>	2022-02-04 17:51:43 -06:00
SangBin Cho	ea4079465d	[Runtime Env] Support runtime env error message for actors (#22109 )	2022-02-04 15:32:02 -06:00
Nikita Vemuri	d9dc388082	[jobs] Support ray client format of connection string address for external module (#22116 ) Ray client currently supports connection strings for external modules of the format `"other_module://"`, however `ray job` commands don't support this format because trailing `/` is removed. Update so `ray job` commands also support this format.	2022-02-04 13:35:10 -06:00
SangBin Cho	d7fc7d2e9d	[Runtime Env] Plumbing runtime env failure error message to the exception: Task [1/3] (#22032 ) This is the PR to write better runtime env exceptions. After 3 PRs are merged, we can entirely turn off the runtime env logs streamed to drivers. The first PR only handles task exceptions. TODO - [x] Task (this PR) - [ ] Actor - [ ] Turn off runtime env logs & improve error msgs	2022-02-03 16:47:04 -08:00
Archit Kulkarni	78f882dbbc	[runtime env] Local uri caching for working_dir, py_modules and conda (#20273 ) Previously, local files corresponding to runtime env URIs were eagerly garbage collected as soon as there were no more references to them. In this PR, we store this data in a cache instead, so when the reference count for a URI drops to zero, instead of deleting it we simply mark it as unused in the cache. When the cache exceeds its size limit (default 10 GB) it will delete unused URIs until the cache is back under the size limit or there are no more unused URIs. Design doc: https://docs.google.com/document/d/1x1JAHg7c0ewcOYwhhclbuW0B0UC7l92WFkF4Su0T-dk/edit - Adds unit tests for caching and integration tests for working_dir caching	2022-02-02 14:53:03 -06:00
Edward Oakes	e85bbfb338	[jobs] Enable default port in `http://` addresses (#22014 ) Closes https://github.com/ray-project/ray/issues/22012	2022-02-02 14:34:34 -06:00
Edward Oakes	8bbc5b936a	[jobs] Use `subprocess.list2cmdline` to properly handle quotes in CLI entrypoints (#22011 )	2022-02-02 14:33:57 -06:00
SangBin Cho	3566cfd279	[Dashboard] Enable dashboard in the minimal ray installation (#21896 ) This is the last PR to enable dashboard in the minimal ray installation. See https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit# for more details.	2022-01-31 22:34:40 -08:00
Balaji Veeramani	7f1bacc7dc	[CI] Format Python code with Black (#21975 ) See #21316 and #21311 for the motivation behind these changes.	2022-01-29 18:41:57 -08:00
Yi Cheng	7d2237bc9f	[dashboard] Remove unused fields in dashboard actor table for better memory footprint (#21919 )	2022-01-26 22:48:17 -08:00
SangBin Cho	e62c0052a0	[Dashboard] Agent in minimal ray installation (#21817 ) This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation. Note that this PR requires introducing "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them in the minimal installation makes things much easier. Please see below for the reasoning.	2022-01-26 04:03:54 -08:00
Alex Wu	7a45f60dbc	[autoscaler] Fix ray.autoscaler.sdk import issue (#21795 ) This PR moves the sdk to its own folder, then includes everything in `import ray.autoscaler.sdk` in ray's import path. Note: that there were circular dependencies in naively doing this because the ray core now uses constants that were defined in the autoscaler for internal kv operations (and the autoscaler similarly calls into the ray core). The solution was to move those internal kv keys into ray core constants so the imports flow (more) one way. Co-authored-by: Alex Wu <alex@anyscale.com>	2022-01-25 14:43:24 -08:00
SangBin Cho	2010f13175	Fix dashboard test bug (#21742 ) Currently `wait_until_succeeded_without_exception` is used in the dashboard tests, and it returns True/False. Unfortunately, a lot of code doesn't assert on the return value of this method (which means things are not actually tested).	2022-01-24 11:38:51 -06:00
SangBin Cho	1ae14ec513	[Dashboard] Make dashboard / agent work in minimal ray installation 1/3. (#21774 ) This is the doc that explains how to achieve this: https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit?usp=sharing The fully working e2e prototype is here (it passes all tests): `cdad913883` This PR is pure refactoring. Basically it moves some of the util functions that require optional_deps to `optional_utils` so that optional deps' util functions are not used in the minimal installation. The steps are shown in this screenshot: https://user-images.githubusercontent.com/18510752/150528494-c3cdedf4-3a66-4557-b540-61436b1dbab6.png	2022-01-23 21:11:32 -08:00
Jiao	5d382cfeb3	[nit] remove decorator in test_cli.py (#21792 ) For full context, see https://github.com/ray-project/ray/issues/21791. pytest works in "some" environments for this test and on CI master, but this decorator is still unnecessary and was introduced by mistake. So just remove it and see what happens with the original issue.	2022-01-23 06:05:05 -08:00
mwtian	e8ce01c525	[Dashboard] offload blocking work to a thread pool (#21762 ) Currently, GCS KV client only has blocking API. Calling them from dashboard event loop can block other operations for many seconds, leading to failures such as taking too long (> 2min) to submit a job and making nightly tests fail (#21699). This PR offloads the blocking work to a separate thread. Implementing async GCS KV API will be done in the future.	2022-01-21 17:55:11 -08:00
mwtian	f18a8bd87f	[Dashboard] turn a noisy `info` log into `debug` (#21746 ) Currently, dashboard log contains many repeated entries like `Received a log for 172.31.47.219 and autoscaler` which is too noisy.	2022-01-20 22:08:23 -08:00
Archit Kulkarni	f058a1d342	[Jobs] Stream logs during job instead of only at the end (#21659 ) Closes https://github.com/ray-project/ray/issues/21517	2022-01-20 15:21:07 -06:00
mwtian	a4581e58ee	[Pubsub] improve error handling for GCS AIO subscribers in dashboard (#21712 ) - Tolerate GRPC deadline exceeded and transient failures in Python GCS AIO subscribers, which becomes consistent with Python GCS synchronous subscribers. - Tolerate any exception in dashboard for subscribing to logs and error info, which becomes consistent with how dashboard handles GRPC errors for obtaining node stats.	2022-01-20 07:04:54 -08:00
Shantanu	ae60548ef3	Silence "cut: write error: Broken pipe" log spew (#21686 ) On machines without GPUs, this can run subprocesses that spew to stderr. Then with log_to_driver=True, we get log spew from every single raylet. To avoid this, disable the GPU usage check on certain errors. Resolves #14305 Co-authored-by: hauntsaninja <>	2022-01-19 23:01:10 -08:00
Yao Yuan	422d20e945	[Dashboard] Fix NPE when there is no GPU on the node (#21650 ) There is an NPE bug that causes browser crash when no GPU on the node. We can add a condition to fix it.	2022-01-18 08:12:49 -08:00
Yi Cheng	6dccfbffa9	Revert "Revert "[gcs] turn on grpc pubsub by default"" (#21585 ) Reverts ray-project/ray#21584 and turns the flag off	2022-01-13 16:12:03 -08:00
Yi Cheng	bc696212d2	Revert "[gcs] turn on grpc pubsub by default" (#21584 ) test-reconnect seems flaky. Reverts ray-project/ray#21513	2022-01-13 12:34:02 -08:00
Yi Cheng	6194783312	[gcs] turn on grpc pubsub by default (#21513 ) Turn on grpc pubsub by default. This PR also fixes several tests that were failing before. Co-authored-by: Mingwei Tian <mwtian@anyscale.com>	2022-01-12 22:13:03 -08:00
mwtian	45cddef2d3	[GCS] disable tests related to GCS restarting in GCS pubsub mode (#21534 ) `test_failure_2.py::test_gcs_server_failiure_report` and `test_gcs_fault_tolerance.py::test_gcs_server_restart_during_actor_creation` cannot pass in GCS pubsub mode with the existing logic. Disable these tests in GCS pubsub mode and add comment about how we may fix them. Also, suppress exceptions when sync subscribers are disconnected from GCS. I can push changes in this PR to #21513 as well.	2022-01-11 14:14:05 -08:00
mwtian	70db5c5592	[GCS][Bootstrap n/n] Do not start Redis in GCS bootstrapping mode (#21232 ) After this change in GCS bootstrapping mode, Redis no longer starts and `address` is treated as the GCS address of the Ray cluster. Co-authored-by: Yi Cheng <chengyidna@gmail.com> Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>	2022-01-04 23:06:44 -08:00
mwtian	8cc268096c	[GCS][Bootstrap 3/n] Refactor to support GCS bootstrap (#21295 ) This PR refactors several components to support switching to GCS address bootstrapping later: - Treat address from `ray.init()` and `ray` CLI as bootstrap address instead of assuming it is Redis address. - Ray client servers support `--address` flag instead of `--redis-address`. - A few other miscellaneous cleanup. Also, add a test for starting non-head node with `ray start`.	2022-01-03 23:52:12 -08:00
mwtian	20ca1d85c2	[GCS][Bootstrap 2/n] Fix tests to enable using GCS address for bootstrapping (#21288 ) This PR contains most of the fixes @iycheng made in #21232, to make tests pass with GCS bootstrapping by supporting both Redis and GCS address as the bootstrap address. The main change is to use address_info["address"] to obtain the bootstrap address to pass to ray.init(), instead of using address_info["redis_address"]. In a subsequent PR, address_info["address"] will return the Redis or GCS address depending on whether using GCS to bootstrap.	2021-12-29 19:25:51 -07:00
Yi Cheng	09421a4ca6	[2/gcs] Bootstrap dashboard for gcs ha (#21179 ) This is part of the gcs ha project. This PR tries to bootstrap the dashboard with the gcs address instead of redis. Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>	2021-12-21 16:58:03 -08:00
iasoon	1c93beb490	[serve] use true nulls in snapshot (#21062 )	2021-12-20 16:07:09 -08:00
mwtian	06ec07057c	Revert "[Core] Unrevert #21115 , fix auto address env (#21158 )" (#21189 ) This reverts commit `968f08607b`. It is breaking e2e tests where worker nodes cannot start. e.g. ``` Traceback (most recent call last): File "/home/ray/anaconda3/bin/ray", line 8, in <module> sys.exit(main()) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1961, in main return cli() File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1128, in __call__ return self.main(*args, **kwargs) File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1053, in main rv = self.invoke(ctx) File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1659, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1395, in invoke return ctx.invoke(self.callback, **ctx.params) File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 754, in invoke return __callback(*args, **kwargs) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper return f(*args, **kwargs) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 733, in start address_ip, password=redis_password) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 593, in create_redis_client _, redis_ip_address, redis_port = validate_bootstrap_address(redis_address) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 494, in validate_bootstrap_address raise ValueError("Malformed address. Expected '<host>:<port>'.") ValueError: Malformed address. Expected '<host>:<port>'. ```	2021-12-20 00:22:12 -08:00
Clark Zinzow	968f08607b	[Core] Unrevert #21115 , fix auto address env (#21158 ) This PR unreverts #21115, fixing the handling of an `"auto"` address in the `RAY_ADDRESS` environment variable. Co-authored-by: Mingwei Tian <mwtian@anyscale.com>	2021-12-18 07:45:00 -08:00
Chen Shen	d99f699e3d	Revert "[Core][GCS] Use `port` and `address` flags to configure GCS server / client in GCS bootstrapping mode (#21115 )" (#21157 ) This reverts commit `0e7c0b491b`.	2021-12-17 11:48:40 -08:00
mwtian	0e7c0b491b	[Core][GCS] Use `port` and `address` flags to configure GCS server / client in GCS bootstrapping mode (#21115 ) This change adds support for parsing `--address` as bootstrap address, and treating `--port` as GCS port, when using GCS for bootstrapping. Not launching Redis in GCS bootstrapping mode, and using GCS to fetch initial cluster information, will be implemented in a subsequent change. Also made some cleanups.	2021-12-16 15:11:05 -08:00
Jiao	ed34434131	[Jobs] Add log streaming for jobs (#20976 ) Current logs API simply returns a str to unblock development and integration. We should add proper log streaming for better UX and external job manager integration. Co-authored-by: Sven Mika <sven@anyscale.io> Co-authored-by: sven1977 <svenmika1977@gmail.com> Co-authored-by: Ed Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com> Co-authored-by: Jiao Dong <jiaodong@anyscale.com>	2021-12-14 17:01:53 -08:00
Edward Oakes	10947c83b3	[runtime_env] Make pip installs incremental (#20341 ) Uses a direct `pip install` instead of creating a conda env to make pip installs incremental to the cluster environment. Separates the handling of `pip` and `conda` dependencies. The new `pip` approach still works if only the base Ray is installed on the cluster and the user specifies libraries like "ray[serve]" in the `pip` field. The mechanism is as follows: - We don't actually want to reinstall ray via pip, since this could lead to version mismatch issues. Instead, we want to use the Ray that's already installed in the cluster. - So if "ray" was included by the user in the pip list, remove it - If a library "ray[serve]" or "ray[tune, rllib]" was included in the pip list, remove it and replace it by its dependencies (e.g. "uvicorn", "requests", ..) Co-authored-by: architkulkarni <arkulkar@gmail.com> Co-authored-by: architkulkarni <architkulkarni@users.noreply.github.com>	2021-12-14 15:55:18 -08:00
iasoon	33059cff3d	[serve] support not exposing deployments over http (#21042 )	2021-12-13 09:43:55 -08:00
mwtian	6871a72a5c	[Core][Dashboard Pubsub 3/n] Migrate pubsub usages in dashboard to GCS pubsub (#20860 ) Add support for Ray pubsub in dashboard. https://github.com/ray-project/ray/pull/20954 is the prerequisite, and contains more complete change under src/.	2021-12-10 14:36:57 -08:00
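
The caching behaviour described in commit `78f882dbbc` (keep local files for runtime env URIs after their reference count drops to zero, and delete unused URIs only once the cache exceeds its size limit) can be illustrated with a minimal sketch. This is not Ray's actual implementation: the class name `URICacheSketch`, its method names, the `delete_fn` callback, and the oldest-unused-first eviction order are all assumptions made for illustration. The default limit is written as 10 GiB to approximate the "10 GB" mentioned in the commit message.

```python
from typing import Callable, Dict, List


class URICacheSketch:
    """Hypothetical size-limited cache of local runtime env URIs (not Ray's real class)."""

    def __init__(
        self,
        delete_fn: Callable[[str], None],
        max_total_size_bytes: int = 10 * 1024**3,  # assumed stand-in for "10 GB"
    ) -> None:
        self.delete_fn = delete_fn  # called to remove a URI's local files
        self.max_total_size_bytes = max_total_size_bytes
        self.sizes: Dict[str, int] = {}  # URI -> size of its local files in bytes
        self.unused: List[str] = []  # URIs with zero references, oldest first
        self.total = 0  # total bytes currently tracked

    def add(self, uri: str, size_bytes: int) -> None:
        """Register a URI as in use; a known URI is simply marked as used again."""
        if uri in self.sizes:
            self.mark_used(uri)
            return
        self.sizes[uri] = size_bytes
        self.total += size_bytes
        self._evict()

    def mark_used(self, uri: str) -> None:
        """A new reference appeared, so the URI must not be evicted."""
        if uri in self.unused:
            self.unused.remove(uri)

    def mark_unused(self, uri: str) -> None:
        """Reference count dropped to zero: keep the files, but allow eviction."""
        if uri in self.sizes and uri not in self.unused:
            self.unused.append(uri)
        self._evict()

    def _evict(self) -> None:
        """Delete unused URIs until under the limit or no unused URIs remain."""
        while self.total > self.max_total_size_bytes and self.unused:
            uri = self.unused.pop(0)
            self.total -= self.sizes.pop(uri)
            self.delete_fn(uri)


deleted: List[str] = []
cache = URICacheSketch(deleted.append, max_total_size_bytes=10)
cache.add("a", 6)
cache.add("b", 6)        # over the limit, but nothing is unused yet
cache.mark_unused("a")   # "a" is now evictable and the cache is over the limit
cache.mark_unused("b")   # back under the limit, so "b" stays cached
print(deleted)
```

Note that URIs still in use are never deleted in this sketch, even when the cache is over its limit, which matches the "or there are no more unused URIs" condition in the commit message.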

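Commit `e8ce01c525` describes offloading blocking GCS KV calls from the dashboard event loop to a separate thread. The general technique can be sketched with the standard library alone. The function `blocking_kv_get` below is a hypothetical stand-in for a blocking client call, not a Ray API, and the heartbeat task exists only to show that the event loop keeps running while the blocking call is in flight.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def blocking_kv_get(key: str) -> str:
    """Hypothetical stand-in for a blocking KV client call (not a Ray API)."""
    time.sleep(0.05)  # simulate a slow, blocking round trip
    return f"value-for-{key}"


async def kv_get_async(executor: ThreadPoolExecutor, key: str) -> str:
    """Run the blocking call on a worker thread so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, blocking_kv_get, key)


async def main() -> Tuple[str, int]:
    ticks: List[int] = []

    async def heartbeat() -> None:
        # Keeps ticking only if the event loop is not blocked.
        while True:
            await asyncio.sleep(0.01)
            ticks.append(1)

    with ThreadPoolExecutor(max_workers=1) as executor:
        hb = asyncio.create_task(heartbeat())
        value = await kv_get_async(executor, "job-1")
        hb.cancel()
    return value, len(ticks)


value, tick_count = asyncio.run(main())
print(value, tick_count)
```

If `blocking_kv_get` were called directly inside a coroutine, the heartbeat would not tick at all until the call returned, which is the kind of stall the commit message attributes to calling the blocking API from the dashboard event loop.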
446 commits