hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-10 13:26:39 -04:00

Author	SHA1	Message	Date
Alan Guo	5d6bc5360d	Fix the jobs tab in the beta dashboard and fill it with data from both "submission" jobs and "driver" jobs (#25902) ## Why are these changes needed? - Fixes the jobs tab in the new dashboard. Previously it didn't load. - Combines the old job concept, "driver jobs", and the new job submission concept into a single concept called "jobs". The Jobs tab shows information about both kinds of jobs. - Updates all job APIs: They now return both submission jobs and driver jobs. They also contain additional data in the response, including "id", "job_id", "submission_id", and "driver". They also accept either job_id or submission_id as input. - Job ID is the same as the "ray core job id" concept. It is in the form of "0100000" and is the primary id to represent jobs. - Submission ID is an ID that is generated for each ray job submission. It is in the form of "raysubmit_12345...". It is a secondary id that can be used if a client needs to provide a self-generated id, or if the job id doesn't exist (ex: if the submission job doesn't create a ray driver). This PR has 2 deprecations: - The `submit_job` sdk now accepts a new kwarg `submission_id`. `job_id` is deprecated. - The `ray job submit` CLI now accepts `--submission-id`. `--job-id` is deprecated. This PR has 4 backwards incompatible changes: - The `list_jobs` sdk now returns a list instead of a dictionary. - The `ray job list` CLI now prints a list instead of a dictionary. - The `/api/jobs` endpoint returns a list instead of a dictionary. - The `POST api/jobs` endpoint (submit job) now returns a json with a `submission_id` field instead of `job_id`.	2022-07-27 02:39:52 -07:00
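A client that has to work on both sides of the list-versus-dictionary change described in 5d6bc5360d could normalize the response before using it. The following is a minimal sketch in plain Python, not Ray code: `normalize_jobs` and `find_job` are hypothetical helper names, and it assumes the older dictionary was keyed by what is now called the submission ID.

```python
from typing import Any, Dict, List, Optional, Union

JobRecord = Dict[str, Any]


def normalize_jobs(
    payload: Union[Dict[str, JobRecord], List[JobRecord]]
) -> List[JobRecord]:
    """Return job records as a list, whatever shape the response had."""
    if isinstance(payload, dict):
        # Older shape: a dictionary of records keyed by ID.
        # Assumption: that key corresponds to today's submission ID.
        return [{"submission_id": key, **info} for key, info in payload.items()]
    # Newer shape: already a list of records; copy each one defensively.
    return [dict(item) for item in payload]


def find_job(jobs: List[JobRecord], job_or_submission_id: str) -> Optional[JobRecord]:
    """Look a job up by either its job_id or its submission_id."""
    for job in jobs:
        if job_or_submission_id in (job.get("job_id"), job.get("submission_id")):
            return job
    return None
```

Accepting either identifier in `find_job` mirrors the commit's statement that the job APIs take a job_id or a submission_id as input.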
Alan Guo	e8222ff600	[dashboard] Update cluster_activities endpoint to use pydantic. (#26609 ) Update cluster_activities endpoint to use pydantic so we have better data validation. Make timestamp a required field. Add pydantic to ray[default] requirements	2022-07-25 10:54:22 -07:00
Stephanie Wang	55a0f7bb2d	[core] ray.init defaults to an existing Ray instance if there is one (#26678 ) ray.init() will currently start a new Ray instance even if one is already existing, which is very confusing if you are a new user trying to go from local development to a cluster. This PR changes it so that, when no address is specified, we first try to find an existing Ray cluster that was created through `ray start`. If none is found, we will start a new one. This makes two changes to the ray.init() resolution order: 1. When `ray start` is called, the started cluster address was already written to a file called `/tmp/ray/ray_current_cluster`. For ray.init() and ray.init(address="auto"), we will first check this local file for an existing cluster address. The file is deleted on `ray stop`. If the file is empty, autodetect any running cluster (legacy behavior) if address="auto", or we will start a new local Ray instance if address=None. 2. When ray.init(address="local") is called, we will create a new local Ray instance, even if one is already existing. This behavior seems to be necessary mainly for `ray.client` use cases. This also surfaces the logs about which Ray instance we are connecting to. Previously these were hidden because we didn't set up the log until after connecting to Ray. So now Ray will log one of the following messages during ray.init: ``` (Connecting to existing Ray cluster at address: <IP>...) ...connection... (Started a local Ray cluster.\| Connected to Ray Cluster.)( View the dashboard at <URL>) ``` Note that this changes the dashboard URL to be printed with `ray.init()` instead of when the dashboard is first started. Co-authored-by: Eric Liang <ekhliang@gmail.com>	2022-07-23 11:27:22 -07:00
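The resolution order described in 55a0f7bb2d can be modeled as a small decision function. This is a simplified sketch, not Ray's implementation: `resolve_init_address` is a hypothetical name, the contents of `/tmp/ray/ray_current_cluster` are passed in as a plain string, legacy autodetection is represented by a caller-supplied callback, and raising `ConnectionError` when `address="auto"` finds nothing is an assumption the commit message does not spell out.

```python
from typing import Callable, Optional, Tuple

Decision = Tuple[str, Optional[str]]


def resolve_init_address(
    address: Optional[str],
    cluster_file_address: Optional[str],
    autodetect: Callable[[], Optional[str]],
) -> Decision:
    """Decide whether to connect to a cluster or start a new local instance."""
    if address == "local":
        # Always start a new local instance, even if a cluster exists.
        return ("start_new", None)
    if address is None or address == "auto":
        if cluster_file_address:
            # The file written by `ray start` takes priority.
            return ("connect", cluster_file_address)
        if address == "auto":
            # Legacy behavior: look for any running cluster.
            found = autodetect()
            if found is None:
                # Assumption: fail rather than silently start a new instance.
                raise ConnectionError("No running Ray cluster found.")
            return ("connect", found)
        # address=None and no recorded cluster: start a new local instance.
        return ("start_new", None)
    # Any other value is treated as an explicit cluster address.
    return ("connect", address)
```

For example, `resolve_init_address(None, "10.0.0.1:6379", lambda: None)` models calling `ray.init()` on a machine where `ray start` has already recorded a cluster address.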
Eric Liang	43aa2299e6	[api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695 ) Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.	2022-06-21 15:13:29 -07:00
Edward Oakes	65d21b7ae6	[job submission] Handle `env_vars: None` case properly in supervisor runtime_env logic (#25087 )	2022-05-24 11:01:19 -05:00
Edward Oakes	4c1f27118a	[job submission] Don't set CUDA_VISIBLE_DEVICES in job driver (#24546 ) Currently job drivers cannot use GPUs due to `CUDA_VISIBLE_DEVICES` being set (no resource request for job driver's supervisor actor). This is a regression from `ray submit`. This is a temporary workaround -- in the future we should support a resource request for the job supervisor actor.	2022-05-10 11:43:04 -05:00
Archit Kulkarni	27e7c284ee	[Jobs] Change jobs start_time end_time from seconds to ms for consistency (#24123 ) In the snapshot, all timestamps are given in ms except for Jobs: ``` wget -q -O - http://127.0.0.1:8265/api/snapshot { "result":true, "msg":"hello", "data":{ "snapshot":{ "jobs":{ "01000000":{ "status":null, "statusMessage":null, "isDead":false, "startTime":1650315791249, "endTime":0, "config":{ "namespace":"_ray_internal_dashboard", "metadata":{ }, "runtimeEnv":{ } } } }, "jobSubmission":{ "raysubmit9Bsej1Rtxqqetxup":{ "status":"SUCCEEDED", "message":"Job finished successfully.", "errorType":null, "startTime":1650315925, "endTime":1650315926, "metadata":{ "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4" }, "runtimeEnv":{ "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "entrypoint":"ls" }, "raysubmitEibragqkyg16Hpcj":{ "status":"SUCCEEDED", "message":"Job finished successfully.", "errorType":null, "startTime":1650316039, "endTime":1650316041, "metadata":{ "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4" }, "runtimeEnv":{ "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "entrypoint":"echo hi" }, "raysubmitSh1U7Grdsbqrf6Je":{ "status":"SUCCEEDED", "message":"Job finished successfully.", "errorType":null, "startTime":1650316354, "endTime":1650316355, "metadata":{ "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4" }, "runtimeEnv":{ "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "entrypoint":"echo hi" } }, "actors":{ "8c8e28e642ba2cfd0457d45e01000000":{ "jobId":"01000000", "state":"DEAD", "name":"_ray_internal_job_actor_raysubmit_9BSeJ1rTXQqEtXuP", "namespace":"_ray_internal_dashboard", "runtimeEnv":{ "uris":{ "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "startTime":1650315926620, "endTime":1650315927499, "isDetached":true, "resources":{ "node:172.31.73.39":0.001 }, "actorClass":"JobSupervisor", "currentWorkerId":"9628b5eb54e98353601413845fbca0a8c4e5379d1469ce95f3dfbace", 
"currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7", "ipAddress":"172.31.73.39", "port":10003, "metadata":{ } }, "a7fd8354567129910c44298401000000":{ "jobId":"01000000", "state":"DEAD", "name":"_ray_internal_job_actor_raysubmit_sh1u7grDsBQRf6je", "namespace":"_ray_internal_dashboard", "runtimeEnv":{ "uris":{ "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "startTime":1650316355718, "endTime":1650316356620, "isDetached":true, "resources":{ "node:172.31.73.39":0.001 }, "actorClass":"JobSupervisor", "currentWorkerId":"f07fd7a393898bf7d9027a5de0b0f566bb64ae80c0fcbcc107185505", "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7", "ipAddress":"172.31.73.39", "port":10005, "metadata":{ } }, "19ca9ad190f47bae963592d601000000":{ "jobId":"01000000", "state":"DEAD", "name":"_ray_internal_job_actor_raysubmit_eibRAGqKyG16HpCj", "namespace":"_ray_internal_dashboard", "runtimeEnv":{ "uris":{ "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip" }, "startTime":1650316041089, "endTime":1650316041978, "isDetached":true, "resources":{ "node:172.31.73.39":0.001 }, "actorClass":"JobSupervisor", "currentWorkerId":"50b8e7e9a6981fe0270afd7f6387bc93788356822c9a664c2988f5ba", "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7", "ipAddress":"172.31.73.39", "port":10004, "metadata":{ } } }, "deployments":{ }, "sessionName":"session_2022-04-18_13-49-44_814862_139", "rayVersion":"1.12.0", "rayCommit":"f18fc31c7562990955556899090f8e8656b48d2d" } } } ``` This PR fixes the inconsistency by changing Jobs start/end timestamps to ms.	2022-04-26 08:37:41 -07:00
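The unit mismatch that 27e7c284ee fixes is visible in the snapshot above: the driver job reports `"startTime":1650315791249` (milliseconds) while the submission jobs report values such as `"startTime":1650315925` (seconds). A consumer of older snapshots might convert the submission entries as in this sketch; `job_times_to_ms` is a hypothetical helper, not part of Ray.

```python
from typing import Any, Dict


def job_times_to_ms(job: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a job record with second-based timestamps in ms."""
    converted = dict(job)
    for field in ("startTime", "endTime"):
        value = converted.get(field)
        if value is not None:
            # Seconds since the epoch -> milliseconds since the epoch.
            converted[field] = int(value * 1000)
    return converted
```

Applied to the first submission entry, `startTime` 1650315925 becomes 1650315925000, which is directly comparable with the actor timestamps in the same snapshot.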
Archit Kulkarni	77090144a2	[jobs] Add `entrypoint` field to JobInfo (#23253 )	2022-03-16 22:02:22 -05:00
Archit Kulkarni	1752f17c6d	[Job submission] Add `list_jobs` API (#22679 ) Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-03-01 21:27:09 -06:00
Edward Oakes	58e5f0140d	[jobs] Rename JobData -> JobInfo (#22499 ) `JobData` could be confused with the actual output data of a job, `JobInfo` makes it more clear that this is status information + metadata.	2022-02-22 16:18:16 -06:00
Archit Kulkarni	df581c584a	[Job] [Dashboard] Add Job Submission data to cluster snapshot (#22225 ) The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection). In the new Job Submission protocol, a Job just specifies an entrypoint which can be any shell command. As such a Job can have zero or multiple Ray drivers. This means we should add a new snapshot entry corresponding to new jobs. We'll leave the old snapshot in place for legacy jobs. - Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID. It wasn't working before. - This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot. For backwards compatibility, the `status` and `message` fields are preserved.	2022-02-18 09:54:37 -06:00
Archit Kulkarni	50e2bef9d0	[Jobs] Hide `dashboard` from Job Submission import path (#22223 ) For public SDK APIs, change the import path from ```python from ray.dashboard.modules.job.common import JobStatus, JobStatusInfo from ray.dashboard.modules.job.sdk import JobSubmissionClient ``` to ```python from ray.job_submission import JobStatus, JobSubmissionClient ``` `JobStatus`, `JobStatusInfo` and `JobSubmissionClient` were the only names referenced in the SDK doc so far, but we can add more later as they appear.	2022-02-09 13:55:32 -06:00
Edward Oakes	8806b2d5c4	[jobs] Monitor jobs in the background to avoid requiring clients to poll (#22180 )	2022-02-07 15:25:25 -06:00
Jiao	a692e7d05e	[jobs] Fix restarting local ray cluster with http ray address broke local job submission (#21938) As titled. There is a corner case on a user's laptop where the user might have left RAY_ADDRESS set to an http address but restarted the local Ray cluster. In this case we would try to do job submission with an http-prefixed address. Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com> Co-authored-by: Jiao Dong <jiaodong@anyscale.com>	2022-02-04 17:51:43 -06:00
Balaji Veeramani	7f1bacc7dc	[CI] Format Python code with Black (#21975 ) See #21316 and #21311 for the motivation behind these changes.	2022-01-29 18:41:57 -08:00
Archit Kulkarni	f058a1d342	[Jobs] Stream logs during job instead of only at the end (#21659 ) Closes https://github.com/ray-project/ray/issues/21517	2022-01-20 15:21:07 -06:00
mwtian	8cc268096c	[GCS][Bootstrap 3/n] Refactor to support GCS bootstrap (#21295) This PR refactors several components to support switching to GCS address bootstrapping later: - Treat address from `ray.init()` and `ray` CLI as bootstrap address instead of assuming it is Redis address. - Ray client servers support `--address` flag instead of `--redis-address`. - A few other miscellaneous cleanups. Also, add a test for starting non-head node with `ray start`.	2022-01-03 23:52:12 -08:00
mwtian	20ca1d85c2	[GCS][Bootstrap 2/n] Fix tests to enable using GCS address for bootstrapping (#21288 ) This PR contains most of the fixes @iycheng made in #21232, to make tests pass with GCS bootstrapping by supporting both Redis and GCS address as the bootstrap address. The main change is to use address_info["address"] to obtain the bootstrap address to pass to ray.init(), instead of using address_info["redis_address"]. In a subsequent PR, address_info["address"] will return the Redis or GCS address depending on whether using GCS to bootstrap.	2021-12-29 19:25:51 -07:00
mwtian	06ec07057c	Revert "[Core] Unrevert #21115 , fix auto address env (#21158 )" (#21189 ) This reverts commit `968f08607b`. It is breaking e2e tests where worker nodes cannot start. e.g. ``` Traceback (most recent call last): File "/home/ray/anaconda3/bin/ray", line 8, in <module> sys.exit(main()) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1961, in main return cli() File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1128, in __call__ return self.main(args, kwargs) File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1053, in main rv = self.invoke(ctx) File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1659, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1395, in invoke return ctx.invoke(self.callback, ctx.params) File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 754, in invoke return __callback(args, *kwargs) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper return f(args, **kwargs) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 733, in start address_ip, password=redis_password) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 593, in create_redis_client _, redis_ip_address, redis_port = validate_bootstrap_address(redis_address) File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 494, in validate_bootstrap_address raise ValueError("Malformed address. Expected '<host>:<port>'.") ValueError: Malformed address. Expected '<host>:<port>'. ```	2021-12-20 00:22:12 -08:00
Clark Zinzow	968f08607b	[Core] Unrevert #21115 , fix auto address env (#21158 ) This PR unreverts #21115, fixing the handling of an `"auto"` address in the `RAY_ADDRESS` environment variable. Co-authored-by: Mingwei Tian <mwtian@anyscale.com>	2021-12-18 07:45:00 -08:00
Chen Shen	d99f699e3d	Revert "[Core][GCS] Use `port` and `address` flags to configure GCS server / client in GCS bootstrapping mode (#21115 )" (#21157 ) This reverts commit `0e7c0b491b`.	2021-12-17 11:48:40 -08:00
mwtian	0e7c0b491b	[Core][GCS] Use `port` and `address` flags to configure GCS server / client in GCS bootstrapping mode (#21115 ) This change adds support for parsing `--address` as bootstrap address, and treating `--port` as GCS port, when using GCS for bootstrapping. Not launching Redis in GCS bootstrapping mode, and using GCS to fetch initial cluster information, will be implemented in a subsequent change. Also made some cleanups.	2021-12-16 15:11:05 -08:00
Jiao	ed34434131	[Jobs] Add log streaming for jobs (#20976 ) Current logs API simply returns a str to unblock development and integration. We should add proper log streaming for better UX and external job manager integration. Co-authored-by: Sven Mika <sven@anyscale.io> Co-authored-by: sven1977 <svenmika1977@gmail.com> Co-authored-by: Ed Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com> Co-authored-by: Jiao Dong <jiaodong@anyscale.com>	2021-12-14 17:01:53 -08:00
Edward Oakes	d26c9e67e8	[job submission] Add a `message` to the JobStatus to return more detailed errors (#20491 )	2021-11-18 10:15:23 -06:00
Edward Oakes	eae523159f	[job submission] Prefix job ID with `raysubmit_` and pass `job_name` metadata (#20490 )	2021-11-17 21:48:22 -06:00
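The `raysubmit_` prefix introduced in eae523159f makes submission-generated IDs easy to tell apart from core job IDs such as "01000000". A rough sketch of generating and recognizing such IDs follows. The function names are hypothetical, and the 16-character alphanumeric suffix is an assumption inferred from IDs like `raysubmit_9BSeJ1rTXQqEtXuP` elsewhere in this log, not Ray's actual generator.

```python
import secrets
import string

SUBMISSION_PREFIX = "raysubmit_"


def generate_submission_id(length: int = 16) -> str:
    """Build an ID of the form raysubmit_<random alphanumeric suffix>."""
    # Assumption: letters and digits only; the real alphabet may differ.
    alphabet = string.ascii_letters + string.digits
    suffix = "".join(secrets.choice(alphabet) for _ in range(length))
    return SUBMISSION_PREFIX + suffix


def is_submission_id(value: str) -> bool:
    """True if the value carries the submission prefix."""
    return value.startswith(SUBMISSION_PREFIX)
```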
Edward Oakes	48bc1af2da	[job submission] Remove DOES_NOT_EXIST status (#20354 )	2021-11-15 16:57:32 -08:00
Edward Oakes	81f036d078	[job submission] Move job_manager to dashboard module, common parts to common.py (#20209 )	2021-11-10 14:14:55 -08:00

27 commits