hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 18:41:40 -05:00

Author	SHA1	Message	Date
Archit Kulkarni	a67c8a0739	[runtime_env] Add temporary URI reference to prevent URI deletion before job starts (#24719 ) Packages are uploaded to the GCS for `runtime_env`. These packages are garbage collected when their refcount becomes zero. The problem is the reference doesn't get incremented until the job starts, which happens after the package is uploaded. It's possible for the package's refcount to go to zero in between the upload and when the job starts, causing the package to be deleted before it's needed by the job. It's likely the cause of https://github.com/ray-project/ray/issues/23423. We can't just increment the refcount at the time of upload, because if the script is killed before the job is started (e.g. via Ctrl-C) then the reference will never be decremented and the package will never be deleted. The solution in this PR is to increment the refcount at the time of upload, but automatically decrement after a configurable timeout (default 30s). This should be enough time for the job to start. When the job starts, it increments the refcount as usual and decrements it when the job finishes or is killed. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-05-23 10:25:04 -05:00
SangBin Cho	73ed67e9e6	[State API] State api limit + Removing unnecessary modules (#24098 ) This PR does two things: (1) moves all routes into the same module, state_head.py; (2) adds support for a limit feature.	2022-04-22 15:59:46 -07:00
SangBin Cho	1c3329fa38	Revert "Revert "[State Observability] Basic functionality for centralized data (#23744)" (#23918)" (#23933 ) This reverts commit `fb14e82`.	2022-04-18 21:15:43 -07:00
Amog Kamsetty	fb14e82242	Revert "[State Observability] Basic functionality for centralized data (#23744 )" (#23918 ) This reverts commit `51a4a1a802`, which was breaking the tune multinode tests and kuberay:test_autoscaling_e2e.	2022-04-14 14:28:42 -07:00
SangBin Cho	51a4a1a802	[State Observability] Basic functionality for centralized data (#23744 ) Supports listing actor/pg/job/node/workers. Design doc: https://docs.google.com/document/d/1IeEsJOiurg-zctOcBjY-tQVbsCmURFSnUCTkx_4a7Cw/edit#heading=h.9ub9e6yvu9p2 Note that this PR doesn't return any output except ids. I will add the rest in follow-up PRs.	2022-04-14 07:33:18 -07:00
Archit Kulkarni	1752f17c6d	[Job submission] Add `list_jobs` API (#22679 ) Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-03-01 21:27:09 -06:00
shrekris-anyscale	a9ede4e499	[serve] Add REST API (#22578 ) This change adds the GET, PUT, and DELETE commands for Serve’s REST API. The dashboard receives these commands and issues corresponding requests to the Serve controller.	2022-02-24 10:00:26 -06:00
Edward Oakes	58e5f0140d	[jobs] Rename JobData -> JobInfo (#22499 ) `JobData` could be confused with the actual output data of a job, `JobInfo` makes it more clear that this is status information + metadata.	2022-02-22 16:18:16 -06:00
Archit Kulkarni	df581c584a	[Job] [Dashboard] Add Job Submission data to cluster snapshot (#22225 ) The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection). In the new Job Submission protocol, a Job just specifies an entrypoint which can be any shell command. As such a Job can have zero or multiple Ray drivers. This means we should add a new snapshot entry corresponding to new jobs. We'll leave the old snapshot in place for legacy jobs. - Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID. It wasn't working before. - This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot. For backwards compatibility, the `status` and `message` fields are preserved.	2022-02-18 09:54:37 -06:00
Archit Kulkarni	63a5eb492d	Revert "[serve] Add basic REST API to dashboard (#22257 )" (#22414 ) This reverts commit `f37f35c5da`.	2022-02-15 21:47:50 -06:00
Edward Oakes	f37f35c5da	[serve] Add basic REST API to dashboard (#22257 )	2022-02-15 15:36:58 -06:00
Balaji Veeramani	7f1bacc7dc	[CI] Format Python code with Black (#21975 ) See #21316 and #21311 for the motivation behind these changes.	2022-01-29 18:41:57 -08:00
SangBin Cho	e62c0052a0	[Dashboard] Agent in minimal ray installation (#21817 ) This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with the minimal ray installation. Note that this PR requires adding "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them in the minimal installation makes things much easier. Please see below for the reasoning.	2022-01-26 04:03:54 -08:00
SangBin Cho	1ae14ec513	[Dashboard] Make dashboard / agent work in minimal ray installation 1/3. (#21774 ) This is the doc that explains how to achieve this: https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit?usp=sharing The fully working e2e prototype is here (it passes all tests): `cdad913883` This PR is pure refactoring. Basically, it moves some of the util functions that require optional_deps to `optional_utils` so that optional deps' util functions are not used in the minimal installation. The steps are shown in this screenshot: https://user-images.githubusercontent.com/18510752/150528494-c3cdedf4-3a66-4557-b540-61436b1dbab6.png	2022-01-23 21:11:32 -08:00
Yi Cheng	09421a4ca6	[2/gcs] Bootstrap dashboard for gcs ha (#21179 ) This is part of the gcs ha project. This PR tries to bootstrap the dashboard with the gcs address instead of redis. Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>	2021-12-21 16:58:03 -08:00
Jiao	ed34434131	[Jobs] Add log streaming for jobs (#20976 ) Current logs API simply returns a str to unblock development and integration. We should add proper log streaming for better UX and external job manager integration. Co-authored-by: Sven Mika <sven@anyscale.io> Co-authored-by: sven1977 <svenmika1977@gmail.com> Co-authored-by: Ed Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com> Co-authored-by: Jiao Dong <jiaodong@anyscale.com>	2021-12-14 17:01:53 -08:00
Jiao	5ce79d0a46	[jobs] Fix job server's ray init(to use redis address rather than auto (#20705 ) * [job submission] Use specific redis_address and redis_password instead of "auto" (#20687) Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Jiao Dong <jiaodong@anyscale.com>	2021-11-24 15:38:26 -08:00
SangBin Cho	cedd8806f7	Revert "[job submission] Use specific redis_address and redis_passwor… (#20699 ) The test breaks the master branch	2021-11-24 05:37:15 -08:00
Edward Oakes	66b4939184	[job submission] Use specific redis_address and redis_password instead of "auto" (#20687 )	2021-11-23 23:25:36 -06:00
Edward Oakes	39b2c3927c	[jobs] Add /api/version endpoint (#20622 )	2021-11-22 15:11:04 -06:00
Edward Oakes	d26c9e67e8	[job submission] Add a `message` to the JobStatus to return more detailed errors (#20491 )	2021-11-18 10:15:23 -06:00
Edward Oakes	48bc1af2da	[job submission] Remove DOES_NOT_EXIST status (#20354 )	2021-11-15 16:57:32 -08:00
Edward Oakes	6c3bad52b6	[job submission] Better validation + tests for input types, refactor API (#20332 )	2021-11-13 22:54:01 -08:00
Edward Oakes	81f036d078	[job submission] Move job_manager to dashboard module, common parts to common.py (#20209 )	2021-11-10 14:14:55 -08:00
Edward Oakes	5475bb054c	[job submission] Redirect stdout + stderr to a single log file (#20208 )	2021-11-09 22:34:12 -08:00
Edward Oakes	50f2cf8a74	[job submission] Allow passing job_id, return DOES_NOT_EXIST when applicable (#20164 )	2021-11-08 23:10:27 -08:00
Jiao	9ef75b27ac	[Job Submission] Add stop API to http & sdk, with better status code + stacktrace (#20094 )	2021-11-06 12:37:54 -05:00
Edward Oakes	65161fe9b4	[job submission] Move HTTP routes to /api/jobs prefix (#19995 )	2021-11-04 17:45:25 -05:00
Jiao	6cfb52ff1d	[job submission] Add stop API + subprocess cleanup (#19860 )	2021-11-04 13:59:47 -05:00
Edward Oakes	f8a6cad0b7	[job submission] SDK prototype w/ dynamic working_dir uploads (#19843 )	2021-11-02 16:01:54 -05:00
Edward Oakes	bf23a31017	[job submission] Always generate and return job_id (#19851 )	2021-10-29 09:09:54 -05:00
Edward Oakes	42ac906313	[job submission] Support passing metadata to the JobConfig (#19845 )	2021-10-28 16:40:03 -05:00
Jiao	e53fecfbd5	[jobs] Initial http jobs server on head node (#19657 )	2021-10-23 12:48:16 -05:00
Oscar Knagg	5a05e89267	[Core] Add TLS/SSL support to gRPC channels (#18631 )	2021-10-20 22:39:11 -07:00
Edward Oakes	7736cdd91d	[dashboard] Rename "new_dashboard" -> "dashboard" (#18214 )	2021-09-15 11:17:15 -05:00
Clark Zinzow	d958457d07	[Core] Second pass at privatizing APIs. (#17885 ) * gcs_utils * resource_spec * profiling * ray_perf and ray_cluster_perf * test_utils	2021-08-18 20:56:33 -07:00
fyrestone	56c309416e	[Job submission] Basic job submission structure (#15103 )	2021-05-12 15:08:20 +08:00
fyrestone	4853aa96cb	[Dashboard] Fix missing actor pid (#13229 )	2021-01-13 16:45:12 +08:00
fyrestone	6a54897577	Job module without submission (#13081 ) Co-authored-by: 刘宝 <po.lb@antfin.com>	2020-12-31 11:12:17 +08:00

39 commits
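
The temporary-reference scheme described in commit a67c8a0739 (take a reference at upload time, then drop it automatically after a timeout so an abandoned upload is still garbage collected) can be sketched roughly as follows. This is a toy in-memory illustration, not Ray's implementation: the `PackageStore` class, its method names, and the `pkg://` URIs are all hypothetical, and only the 30-second default comes from the commit message.

```python
import threading


class PackageStore:
    """Toy in-memory package store with reference counting.

    Hypothetical illustration only; this is not Ray's actual code.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._refcounts = {}  # uri -> current reference count
        self.packages = {}    # uri -> package bytes

    def upload(self, uri, data, temp_ref_timeout_s=30.0):
        """Store a package and hold a temporary reference to it."""
        with self._lock:
            self.packages[uri] = data
            self._refcounts[uri] = self._refcounts.get(uri, 0) + 1
        # Drop the temporary reference after the timeout, so the package
        # is still collected if the uploading script is killed before
        # any job starts.
        timer = threading.Timer(temp_ref_timeout_s, self.decrement, args=(uri,))
        timer.daemon = True
        timer.start()

    def increment(self, uri):
        """Take a reference, e.g. when a job that needs the package starts."""
        with self._lock:
            if uri not in self.packages:
                raise KeyError(f"package {uri!r} has already been deleted")
            self._refcounts[uri] += 1

    def decrement(self, uri):
        """Release a reference; delete the package when none remain."""
        with self._lock:
            count = self._refcounts.get(uri, 0) - 1
            if count <= 0:
                self._refcounts.pop(uri, None)
                self.packages.pop(uri, None)
            else:
                self._refcounts[uri] = count
```

In this sketch, a job that calls `increment` before the timer fires keeps the package alive past the timeout, and the package is removed only when the job later calls `decrement`. If no job ever starts, the timer's `decrement` brings the count to zero and the package is deleted.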