Commit graph

104 commits

Author SHA1 Message Date
brucez-anyscale
f76d7b23f2
Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00
Yi Cheng
12d147ff1f
Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent (#26107)" (#26333)
This reverts commit 84166ccb04.
2022-07-06 13:30:33 -07:00
brucez-anyscale
84166ccb04
[Dashboard][Serve] Move Serve related endpoints to dashboard agent (#26107)
In Ray 2.0, we want to achieve api server HA.
Originally serve endpoints are in head node.
This pr moves serve endpoints to dashboard agents, so they will be HA due to multiple replica of dashboard agent.
2022-07-06 10:58:00 -07:00
Eric Liang
43aa2299e6
[api] Annotate as public / move ray-core APIs to _private and add enforcement rule (#25695)
Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.
2022-06-21 15:13:29 -07:00
Archit Kulkarni
6d2806f951
[Jobs] [Test] Add integration tests to cover runtime_env inheritance with working_dir and with Tune (#25562)
The current inheritance behavior for runtime_envs enables the following workflow for Jobs:  A working_dir can be set in the Jobs API, and then inside the driver script, if a new per-task runtime_env is defined, it will automatically inherit the driver's working_dir.

There is an ongoing discussion about the best approach for runtime_env inheritance going forward: https://github.com/ray-project/ray/issues/25484, in which we noted that there were no tests covering this behavior.

This PR adds integration tests for the above behavior. If we ultimately decide to abandon the current inheritance behavior and instead have child runtime envs completely overwrite the parent runtime env, this test will fail, reminding us to do the following:

- Update the internal runtime_env usage in Ray Tune to use the `ray.get_runtime_context().runtime_env.update` API
- Update the documentation for Ray Jobs telling users to use `ray.get_runtime_context().runtime_env.update` and update this test
2022-06-08 13:54:06 -07:00
Eric Liang
517f78e2b8
[minor] Add a job submission hook by env var (#25343) 2022-06-01 11:15:43 -07:00
Philipp Moritz
323605d169
Support file:// for runtime_env working directories in jobs (#25062)
This makes it possible to use an NFS file system that is shared on a cluster for runtime_env working directories.

Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-05-24 16:17:18 -07:00
Edward Oakes
65d21b7ae6
[job submission] Handle env_vars: None case properly in supervisor runtime_env logic (#25087) 2022-05-24 11:01:19 -05:00
Archit Kulkarni
a67c8a0739
[runtime_env] Add temporary URI reference to prevent URI deletion before job starts (#24719)
Packages are uploaded to the GCS for `runtime_env`.  These packages are garbage collected when their refcount becomes zero.

The problem is the reference doesn't get incremented until the job starts, which happens after the package is uploaded.  It's possible for the package's refcount to go to zero in between the upload and when the job starts, causing the package to be deleted before it's needed by the job.  It's likely the cause of https://github.com/ray-project/ray/issues/23423.

We can't just increment the refcount at the time of upload, because if the script is killed before the job is started (e.g. via Ctrl-C) then the reference will never be decremented and the package will never be deleted.

The solution in this PR is to increment the refcount at the time of upload, but automatically decrement after a configurable timeout (default 30s).  This should be enough time for the job to start.  When the job starts, it increments the refcount as usual and decrements it when the job finishes or is killed.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-05-23 10:25:04 -05:00
Edward Oakes
cb7bcbd651
[job submission] Fix address defaulting behavior (#24970)
Per the discussion in https://github.com/ray-project/ray/issues/24858:

- If an address without a port is provided, don't append a port.
- Default to `http://localhost:8265` if nothing is provided.
2022-05-20 14:10:36 -05:00
Edward Oakes
4c1f27118a
[job submission] Don't set CUDA_VISIBLE_DEVICES in job driver (#24546)
Currently job drivers cannot use GPUs due to `CUDA_VISIBLE_DEVICES` being set (no resource request for job driver's supervisor actor). This is a regression from `ray submit`.

This is a temporary workaround -- in the future we should support a resource request for the job supervisor actor.
2022-05-10 11:43:04 -05:00
Philipp Moritz
27917f570d
[runtime_env] Extend runtime_env hook to also cover jobs (#24328)
This extends https://github.com/ray-project/ray/pull/24036 to also cover job submission.

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-04-30 09:15:51 -07:00
Archit Kulkarni
12b9383d52
[Jobs] Reenable test_backwards_compatibility using Ray 1.12 (#24124)
Closes https://github.com/ray-project/ray/issues/23258
2022-04-26 13:53:51 -05:00
Archit Kulkarni
27e7c284ee
[Jobs] Change jobs start_time end_time from seconds to ms for consistency (#24123)
In the snapshot, all timestamps are given in ms except for Jobs:

```
wget -q -O - http://127.0.0.1:8265/api/snapshot

{
   "result":true,
   "msg":"hello",
   "data":{
      "snapshot":{
         "jobs":{
            "01000000":{
               "status":null,
               "statusMessage":null,
               "isDead":false,
               "startTime":1650315791249,
               "endTime":0,
               "config":{
                  "namespace":"_ray_internal_dashboard",
                  "metadata":{
                     
                  },
                  "runtimeEnv":{
                     
                  }
               }
            }
         },
         "jobSubmission":{
            "raysubmit9Bsej1Rtxqqetxup":{
               "status":"SUCCEEDED",
               "message":"Job finished successfully.",
               "errorType":null,
               "startTime":1650315925,
               "endTime":1650315926,
               "metadata":{
                  "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4"
               },
               "runtimeEnv":{
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "entrypoint":"ls"
            },
            "raysubmitEibragqkyg16Hpcj":{
               "status":"SUCCEEDED",
               "message":"Job finished successfully.",
               "errorType":null,
               "startTime":1650316039,
               "endTime":1650316041,
               "metadata":{
                  "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4"
               },
               "runtimeEnv":{
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "entrypoint":"echo hi"
            },
            "raysubmitSh1U7Grdsbqrf6Je":{
               "status":"SUCCEEDED",
               "message":"Job finished successfully.",
               "errorType":null,
               "startTime":1650316354,
               "endTime":1650316355,
               "metadata":{
                  "creatorId":"usr_f6tgCaaFBJC6tZz1ZVzzAVf4"
               },
               "runtimeEnv":{
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "entrypoint":"echo hi"
            }
         },
         "actors":{
            "8c8e28e642ba2cfd0457d45e01000000":{
               "jobId":"01000000",
               "state":"DEAD",
               "name":"_ray_internal_job_actor_raysubmit_9BSeJ1rTXQqEtXuP",
               "namespace":"_ray_internal_dashboard",
               "runtimeEnv":{
                  "uris":{
                     "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
                  },
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "startTime":1650315926620,
               "endTime":1650315927499,
               "isDetached":true,
               "resources":{
                  "node:172.31.73.39":0.001
               },
               "actorClass":"JobSupervisor",
               "currentWorkerId":"9628b5eb54e98353601413845fbca0a8c4e5379d1469ce95f3dfbace",
               "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7",
               "ipAddress":"172.31.73.39",
               "port":10003,
               "metadata":{
                  
               }
            },
            "a7fd8354567129910c44298401000000":{
               "jobId":"01000000",
               "state":"DEAD",
               "name":"_ray_internal_job_actor_raysubmit_sh1u7grDsBQRf6je",
               "namespace":"_ray_internal_dashboard",
               "runtimeEnv":{
                  "uris":{
                     "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
                  },
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "startTime":1650316355718,
               "endTime":1650316356620,
               "isDetached":true,
               "resources":{
                  "node:172.31.73.39":0.001
               },
               "actorClass":"JobSupervisor",
               "currentWorkerId":"f07fd7a393898bf7d9027a5de0b0f566bb64ae80c0fcbcc107185505",
               "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7",
               "ipAddress":"172.31.73.39",
               "port":10005,
               "metadata":{
                  
               }
            },
            "19ca9ad190f47bae963592d601000000":{
               "jobId":"01000000",
               "state":"DEAD",
               "name":"_ray_internal_job_actor_raysubmit_eibRAGqKyG16HpCj",
               "namespace":"_ray_internal_dashboard",
               "runtimeEnv":{
                  "uris":{
                     "workingDirUri":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
                  },
                  "workingDir":"gcs://_ray_pkg_6068c19fb3b8530f.zip"
               },
               "startTime":1650316041089,
               "endTime":1650316041978,
               "isDetached":true,
               "resources":{
                  "node:172.31.73.39":0.001
               },
               "actorClass":"JobSupervisor",
               "currentWorkerId":"50b8e7e9a6981fe0270afd7f6387bc93788356822c9a664c2988f5ba",
               "currentRayletId":"61ab3958258c82266b222f4691a53e71b6315e312408a21cb3350bc7",
               "ipAddress":"172.31.73.39",
               "port":10004,
               "metadata":{
                  
               }
            }
         },
         "deployments":{
            
         },
         "sessionName":"session_2022-04-18_13-49-44_814862_139",
         "rayVersion":"1.12.0",
         "rayCommit":"f18fc31c7562990955556899090f8e8656b48d2d"
      }
   }
}
```

 This PR fixes the inconsistency by changing Jobs start/end timestamps to ms.
2022-04-26 08:37:41 -07:00
SangBin Cho
73ed67e9e6
[State API] State api limit + Removing unnecessary modules (#24098)
This PR does

Move all routes into the same module, state_head.py
Support a limit feature.
2022-04-22 15:59:46 -07:00
Chu Xiangyang
6f74040b15
[Job] Fix typo in job sdk docstring (#23940) 2022-04-20 12:30:32 -05:00
SangBin Cho
1c3329fa38
Revert "Revert "[State Observability] Basic functionality for central… (#23933)
…ized data (#23744)" (#23918)"

This reverts commit fb14e82.
2022-04-18 21:15:43 -07:00
Amog Kamsetty
fb14e82242
Revert "[State Observability] Basic functionality for centralized data (#23744)" (#23918)
This reverts commit 51a4a1a802.

breaking tune multinode tests and kuberay:test_autoscaling_e2e
2022-04-14 14:28:42 -07:00
SangBin Cho
51a4a1a802
[State Observability] Basic functionality for centralized data (#23744)
Support listing actor/pg/job/node/workers

Design doc: https://docs.google.com/document/d/1IeEsJOiurg-zctOcBjY-tQVbsCmURFSnUCTkx_4a7Cw/edit#heading=h.9ub9e6yvu9p2

Note that this PR doesn't contain any output except ids. I will update them in the follow-up PRs.
2022-04-14 07:33:18 -07:00
Max Pumperla
60054995e6
[docs] fix doctests and activate CI (#23418) 2022-03-24 17:04:02 -07:00
shrekris-anyscale
75b7465ba4
[serve] Reject Ray client addresses when submitting via Dashboard (#23339)
Some commands in the Serve CLI use Ray client and some commands ping the Ray dashboard; however, all commands read `RAY_ADDRESS` to get the address. This change raises a nice exception if the user accidentally passes a Ray client address as the Ray Dashboard address.
2022-03-21 11:17:51 -05:00
Archit Kulkarni
76bb5396c7
[Doc] [jobs] Add links to Job Submission and improve doc (#23209)
- Adds links to Job Submission from existing library tutorials where `ray submit` is used.  When Jobs becomes GA, we should fully replace the uses of `ray submit` with Ray job submission and ensure this is tested.
- Adds docstrings for the Jobs SDK, which automatically show up in the API reference
- Improve the Job Submission main page
- Add a "Deployment Guide" landing page explaining when to use Ray Client vs Ray Jobs

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-03-18 12:52:13 -05:00
Archit Kulkarni
77090144a2
[jobs] Add entrypoint field to JobInfo (#23253) 2022-03-16 22:02:22 -05:00
Archit Kulkarni
8707eb6288
[runtime env] Support .whl files in py_modules (#22368)
The `py_modules` field of runtime_env supports uploading local Python modules for use on the Ray cluster.  One gap in this is if the local Python module is in the form of a wheel (`.whl` file.)  This PR adds the missing support for uploading and installing the `.whl` file.
2022-03-16 16:37:10 -05:00
Tomas Babej
7a1d10a3d0
[Job submission] Set headers when establishing websocket (#23111) 2022-03-15 16:20:44 -05:00
Archit Kulkarni
e8496374e2
[Jobs] Test job submit with no specified ray address (#23119) 2022-03-14 13:44:06 -05:00
Jialing He
39a6c054d3
[runtime env][feature] introduce pip_check_enable and pip_version (#22826) 2022-03-14 23:41:19 +08:00
Archit Kulkarni
52a722ffe7
[jobs] Make local pip/conda requirements files work with jobs (#22849) 2022-03-10 15:15:16 -06:00
Archit Kulkarni
c78bd809ce
[job submission] Support local py_modules in jobs (#22843) 2022-03-10 11:42:25 -06:00
shrekris-anyscale
bc82e2d5c4
[serve] Restore "[serve] Support working_dir in serve run (#22760)" (#22971) 2022-03-09 21:31:23 -08:00
Kai Fricke
15601ed79b
Revert "[serve] Support working_dir in serve run (#22760)" (#22956)
This reverts commit ab2741d64b.

The PR breaks ray job submission for anyscale:// URLs
2022-03-09 17:04:46 +00:00
shrekris-anyscale
ab2741d64b
[serve] Support working_dir in serve run (#22760)
#22714 added `serve run` to the Serve CLI. This change allows the user to specify a local or remote `working_dir` in `serve run`.
2022-03-08 13:18:41 -06:00
Archit Kulkarni
1752f17c6d
[Job submission] Add list_jobs API (#22679)
Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-03-01 21:27:09 -06:00
Archit Kulkarni
85657b1377
[Doc] [Jobs] add CLI and SDK reference to docs (#22680) 2022-02-28 17:57:46 -06:00
shrekris-anyscale
a9ede4e499
[serve] Add REST API (#22578)
This change adds the GET, PUT, and DELETE commands for Serve’s REST API. The dashboard receives these commands and issues corresponding requests to the Serve controller.
2022-02-24 10:00:26 -06:00
Edward Oakes
58e5f0140d
[jobs] Rename JobData -> JobInfo (#22499)
`JobData` could be confused with the actual output data of a job, `JobInfo` makes it more clear that this is status information + metadata.
2022-02-22 16:18:16 -06:00
Archit Kulkarni
df581c584a
[Job] [Dashboard] Add Job Submission data to cluster snapshot (#22225)
The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection).  

In the new Job Submission protocol, a Job just specifies an entrypoint which can be any shell command.  As such a Job can have zero or multiple Ray drivers.  This means we should add a new snapshot entry corresponding to new jobs.  We'll leave the old snapshot in place for legacy jobs.

- Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID.  It wasn't working before.

- This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot.  For backwards compatibility, the `status` and `message` fields are preserved.
2022-02-18 09:54:37 -06:00
Archit Kulkarni
63a5eb492d
Revert "[serve] Add basic REST API to dashboard (#22257)" (#22414)
This reverts commit f37f35c5da.
2022-02-15 21:47:50 -06:00
Edward Oakes
f37f35c5da
[serve] Add basic REST API to dashboard (#22257) 2022-02-15 15:36:58 -06:00
Edward Oakes
5df2a0a6c6
[jobs] Add test condition that job runs w/o CPUs available on head node (#22260) 2022-02-10 10:23:02 -06:00
Archit Kulkarni
50e2bef9d0
[Jobs] Hide dashboard from Job Submission import path (#22223)
For public SDK APIs, change the import path from 

```python
from ray.dashboard.modules.job.common import JobStatus, JobStatusInfo
from ray.dashboard.modules.job.sdk import JobSubmissionClient
```

to 
```python
from ray.job_submission import JobStatus, JobSubmissionClient
```

`JobStatus`, `JobStatusInfo` and `JobSubmissionClient` were the only names referenced in the SDK doc so far, but we can add more later as they appear.
2022-02-09 13:55:32 -06:00
Nikita Vemuri
d19aaf0fd3
[jobs] Add unit test for parse_cluster_info (#22205)
Add unit test to check addresses of various formats are correctly passed to `get_job_submission_client_cluster_info`.
2022-02-08 11:22:28 -06:00
Edward Oakes
8806b2d5c4
[jobs] Monitor jobs in the background to avoid requiring clients to poll (#22180) 2022-02-07 15:25:25 -06:00
Jiao
a692e7d05e
[jobs] Fix restarting local ray cluster with http ray address broke local job submission (#21938)
As titled. We have a corner case on user laptop where user might left RAY_ADDRESS as http address but restarted local ray cluster. In this case we will try to do job submission with an http prefixed address.

Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
2022-02-04 17:51:43 -06:00
Nikita Vemuri
d9dc388082
[jobs] Support ray client format of connection string address for external module (#22116)
Ray client currently supports connection strings for external modules of the format `"other_module://"`, however `ray job` commands don't support this format because trailing `/` is removed. Update so `ray job` commands also support this format.
2022-02-04 13:35:10 -06:00
SangBin Cho
d7fc7d2e9d
[Runtime Env] Plumbing runtime env failure error message to the exception: Task [1/3] (#22032)
This is the PR to write better runtime env exception. After 3 PRs are merged, we can entirely turn off the runtime env logs streamed to drivers.

The first PR only handles tasks exception.

TODO
- [x] Task (this PR)
- [ ] Actor
- [ ] Turn of runtime env logs & improve error msgs
2022-02-03 16:47:04 -08:00
Edward Oakes
e85bbfb338
[jobs] Enable default port in http:// addresses (#22014)
Closes https://github.com/ray-project/ray/issues/22012
2022-02-02 14:34:34 -06:00
Edward Oakes
8bbc5b936a
[jobs] Use subprocess.list2cmdline to properly handle quotes in CLI entrypoints (#22011) 2022-02-02 14:33:57 -06:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
SangBin Cho
e62c0052a0
[Dashboard] Agent in minimal ray installation (#21817)
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation.

Note that this PR requires to introduce "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them to minimal makes thing very easy. Please see the below for the reasoning.
2022-01-26 04:03:54 -08:00