Commit graph

71 commits

Author SHA1 Message Date
Archit Kulkarni
6d2806f951
[Jobs] [Test] Add integration tests to cover runtime_env inheritance with working_dir and with Tune (#25562)
The current inheritance behavior for runtime_envs enables the following workflow for Jobs:  A working_dir can be set in the Jobs API, and then inside the driver script, if a new per-task runtime_env is defined, it will automatically inherit the driver's working_dir.

There is an ongoing discussion about the best approach for runtime_env inheritance going forward: https://github.com/ray-project/ray/issues/25484, in which we noted that there were no tests covering this behavior.

This PR adds integration tests for the above behavior. If we ultimately decide to abandon the current inheritance behavior and instead have child runtime envs completely overwrite the parent runtime env, this test will fail, reminding us to do the following:

- Update the internal runtime_env usage in Ray Tune to use the `ray.get_runtime_context().runtime_env.update` API
- Update the documentation for Ray Jobs telling users to use `ray.get_runtime_context().runtime_env.update` and update this test
2022-06-08 13:54:06 -07:00
Eric Liang
517f78e2b8
[minor] Add a job submission hook by env var (#25343) 2022-06-01 11:15:43 -07:00
Philipp Moritz
323605d169
Support file:// for runtime_env working directories in jobs (#25062)
This makes it possible to use an NFS file system that is shared on a cluster for runtime_env working directories.

Co-authored-by: shrekris-anyscale <92341594+shrekris-anyscale@users.noreply.github.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-05-24 16:17:18 -07:00
Edward Oakes
65d21b7ae6
[job submission] Handle env_vars: None case properly in supervisor runtime_env logic (#25087) 2022-05-24 11:01:19 -05:00
Archit Kulkarni
a67c8a0739
[runtime_env] Add temporary URI reference to prevent URI deletion before job starts (#24719)
Packages are uploaded to the GCS for `runtime_env`.  These packages are garbage collected when their refcount becomes zero.

The problem is the reference doesn't get incremented until the job starts, which happens after the package is uploaded.  It's possible for the package's refcount to go to zero in between the upload and when the job starts, causing the package to be deleted before it's needed by the job.  It's likely the cause of https://github.com/ray-project/ray/issues/23423.

We can't just increment the refcount at the time of upload, because if the script is killed before the job is started (e.g. via Ctrl-C) then the reference will never be decremented and the package will never be deleted.

The solution in this PR is to increment the refcount at the time of upload, but automatically decrement after a configurable timeout (default 30s).  This should be enough time for the job to start.  When the job starts, it increments the refcount as usual and decrements it when the job finishes or is killed.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-05-23 10:25:04 -05:00
Edward Oakes
cb7bcbd651
[job submission] Fix address defaulting behavior (#24970)
Per the discussion in https://github.com/ray-project/ray/issues/24858:

- If an address without a port is provided, don't append a port.
- Default to `http://localhost:8265` if nothing is provided.
2022-05-20 14:10:36 -05:00
Edward Oakes
4c1f27118a
[job submission] Don't set CUDA_VISIBLE_DEVICES in job driver (#24546)
Currently job drivers cannot use GPUs due to `CUDA_VISIBLE_DEVICES` being set (no resource request for job driver's supervisor actor). This is a regression from `ray submit`.

This is a temporary workaround -- in the future we should support a resource request for the job supervisor actor.
2022-05-10 11:43:04 -05:00
Philipp Moritz
27917f570d
[runtime_env] Extend runtime_env hook to also cover jobs (#24328)
This extends https://github.com/ray-project/ray/pull/24036 to also cover job submission.

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2022-04-30 09:15:51 -07:00
Archit Kulkarni
12b9383d52
[Jobs] Reenable test_backwards_compatibility using Ray 1.12 (#24124)
Closes https://github.com/ray-project/ray/issues/23258
2022-04-26 13:53:51 -05:00
shrekris-anyscale
75b7465ba4
[serve] Reject Ray client addresses when submitting via Dashboard (#23339)
Some commands in the Serve CLI use Ray client and some commands ping the Ray dashboard; however, all commands read `RAY_ADDRESS` to get the address. This change raises a nice exception if the user accidentally passes a Ray client address as the Ray Dashboard address.
2022-03-21 11:17:51 -05:00
Archit Kulkarni
77090144a2
[jobs] Add entrypoint field to JobInfo (#23253) 2022-03-16 22:02:22 -05:00
Archit Kulkarni
8707eb6288
[runtime env] Support .whl files in py_modules (#22368)
The `py_modules` field of runtime_env supports uploading local Python modules for use on the Ray cluster.  One gap in this is if the local Python module is in the form of a wheel (`.whl` file.)  This PR adds the missing support for uploading and installing the `.whl` file.
2022-03-16 16:37:10 -05:00
Archit Kulkarni
e8496374e2
[Jobs] Test job submit with no specified ray address (#23119) 2022-03-14 13:44:06 -05:00
Jialing He
39a6c054d3
[runtime env][feature] introduce pip_check_enable and pip_version (#22826) 2022-03-14 23:41:19 +08:00
Archit Kulkarni
52a722ffe7
[jobs] Make local pip/conda requirements files work with jobs (#22849) 2022-03-10 15:15:16 -06:00
Archit Kulkarni
c78bd809ce
[job submission] Support local py_modules in jobs (#22843) 2022-03-10 11:42:25 -06:00
shrekris-anyscale
bc82e2d5c4
[serve] Restore "[serve] Support working_dir in serve run (#22760)" (#22971) 2022-03-09 21:31:23 -08:00
Kai Fricke
15601ed79b
Revert "[serve] Support working_dir in serve run (#22760)" (#22956)
This reverts commit ab2741d64b.

The PR breaks ray job submission for anyscale:// URLs
2022-03-09 17:04:46 +00:00
shrekris-anyscale
ab2741d64b
[serve] Support working_dir in serve run (#22760)
#22714 added `serve run` to the Serve CLI. This change allows the user to specify a local or remote `working_dir` in `serve run`.
2022-03-08 13:18:41 -06:00
Archit Kulkarni
1752f17c6d
[Job submission] Add list_jobs API (#22679)
Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-03-01 21:27:09 -06:00
Edward Oakes
58e5f0140d
[jobs] Rename JobData -> JobInfo (#22499)
`JobData` could be confused with the actual output data of a job, `JobInfo` makes it more clear that this is status information + metadata.
2022-02-22 16:18:16 -06:00
Archit Kulkarni
df581c584a
[Job] [Dashboard] Add Job Submission data to cluster snapshot (#22225)
The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection).  

In the new Job Submission protocol, a Job just specifies an entrypoint which can be any shell command.  As such a Job can have zero or multiple Ray drivers.  This means we should add a new snapshot entry corresponding to new jobs.  We'll leave the old snapshot in place for legacy jobs.

- Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID.  It wasn't working before.

- This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot.  For backwards compatibility, the `status` and `message` fields are preserved.
2022-02-18 09:54:37 -06:00
Edward Oakes
5df2a0a6c6
[jobs] Add test condition that job runs w/o CPUs available on head node (#22260) 2022-02-10 10:23:02 -06:00
Archit Kulkarni
50e2bef9d0
[Jobs] Hide dashboard from Job Submission import path (#22223)
For public SDK APIs, change the import path from 

```python
from ray.dashboard.modules.job.common import JobStatus, JobStatusInfo
from ray.dashboard.modules.job.sdk import JobSubmissionClient
```

to 
```python
from ray.job_submission import JobStatus, JobSubmissionClient
```

`JobStatus`, `JobStatusInfo` and `JobSubmissionClient` were the only names referenced in the SDK doc so far, but we can add more later as they appear.
2022-02-09 13:55:32 -06:00
Nikita Vemuri
d19aaf0fd3
[jobs] Add unit test for parse_cluster_info (#22205)
Add unit test to check addresses of various formats are correctly passed to `get_job_submission_client_cluster_info`.
2022-02-08 11:22:28 -06:00
Edward Oakes
8806b2d5c4
[jobs] Monitor jobs in the background to avoid requiring clients to poll (#22180) 2022-02-07 15:25:25 -06:00
Jiao
a692e7d05e
[jobs] Fix restarting local ray cluster with http ray address broke local job submission (#21938)
As titled. We have a corner case on user laptop where user might left RAY_ADDRESS as http address but restarted local ray cluster. In this case we will try to do job submission with an http prefixed address.

Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
2022-02-04 17:51:43 -06:00
SangBin Cho
d7fc7d2e9d
[Runtime Env] Plumbing runtime env failure error message to the exception: Task [1/3] (#22032)
This is the PR to write better runtime env exception. After 3 PRs are merged, we can entirely turn off the runtime env logs streamed to drivers.

The first PR only handles tasks exception.

TODO
- [x] Task (this PR)
- [ ] Actor
- [ ] Turn of runtime env logs & improve error msgs
2022-02-03 16:47:04 -08:00
Edward Oakes
e85bbfb338
[jobs] Enable default port in http:// addresses (#22014)
Closes https://github.com/ray-project/ray/issues/22012
2022-02-02 14:34:34 -06:00
Edward Oakes
8bbc5b936a
[jobs] Use subprocess.list2cmdline to properly handle quotes in CLI entrypoints (#22011) 2022-02-02 14:33:57 -06:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
SangBin Cho
e62c0052a0
[Dashboard] Agent in minimal ray installation (#21817)
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation.

Note that this PR requires to introduce "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them to minimal makes thing very easy. Please see the below for the reasoning.
2022-01-26 04:03:54 -08:00
Jiao
5d382cfeb3
[nit] remove decorator in test_cli.py (#21792)
Full context see https://github.com/ray-project/ray/issues/21791

pytest work for "some" environments for this test and on CI master, but this decorator is still unnecessary and was introduced by mistake. So just remove it and see what happens with the original issue.
2022-01-23 06:05:05 -08:00
Archit Kulkarni
f058a1d342
[Jobs] Stream logs during job instead of only at the end (#21659)
Closes https://github.com/ray-project/ray/issues/21517
2022-01-20 15:21:07 -06:00
Yi Cheng
6dccfbffa9
Revert "Revert "[gcs] turn on grpc pubsub by default"" (#21585)
Reverts ray-project/ray#21584 and turn the flag off
2022-01-13 16:12:03 -08:00
Yi Cheng
bc696212d2
Revert "[gcs] turn on grpc pubsub by default" (#21584)
test-reconnect seems flaky.
Reverts ray-project/ray#21513
2022-01-13 12:34:02 -08:00
Yi Cheng
6194783312
[gcs] turn on grpc pubsub by default (#21513)
Turn on grpc pubsub by default.  This PR also fixed several tests which are failed before.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2022-01-12 22:13:03 -08:00
Jiao
ed34434131
[Jobs] Add log streaming for jobs (#20976)
Current logs API simply returns a str to unblock development and integration. We should add proper log streaming for better UX and external job manager integration.

Co-authored-by: Sven Mika <sven@anyscale.io>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
Co-authored-by: Ed Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
2021-12-14 17:01:53 -08:00
Jiao
1e67bdfcec
[jobs] Add headers field to JobSubmissionClient and apply to all requests (#20663) 2021-12-03 18:44:30 -06:00
Jiao
5ce79d0a46
[jobs] Fix job server's ray init(to use redis address rather than auto (#20705)
* [job submission] Use specific redis_address and redis_password instead of "auto" (#20687)

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
2021-11-24 15:38:26 -08:00
SangBin Cho
cedd8806f7
Revert "[job submission] Use specific redis_address and redis_passwor… (#20699)
The test breaks the master branch
2021-11-24 05:37:15 -08:00
Edward Oakes
66b4939184
[job submission] Use specific redis_address and redis_password instead of "auto" (#20687) 2021-11-23 23:25:36 -06:00
Edward Oakes
39b2c3927c
[jobs] Add /api/version endpoint (#20622) 2021-11-22 15:11:04 -06:00
Edward Oakes
d26c9e67e8
[job submission] Add a message to the JobStatus to return more detailed errors (#20491) 2021-11-18 10:15:23 -06:00
Edward Oakes
eae523159f
[job submission] Prefix job ID with raysubmit_ and pass job_name metadata (#20490) 2021-11-17 21:48:22 -06:00
Edward Oakes
2d5d499f67
[job submission] Support specifying runtime_env to job submission CLI (#20339) 2021-11-14 13:52:47 -08:00
shrekris-anyscale
c0aeb4a236
[runtime_env] Support working_dir and py_modules from HTTPS and Google Cloud Storage (#20280) 2021-11-14 02:16:45 -08:00
Edward Oakes
6c3bad52b6
[job submission] Better validation + tests for input types, refactor API (#20332) 2021-11-13 22:54:01 -08:00
Edward Oakes
07add6f7f2
Revert "Revert "[job submission] Use ray.init format addresses for Jo… (#20328) 2021-11-13 16:24:02 -08:00
Eric Liang
567e955810
Revert "[job submission] Use ray.init format addresses for JobSubmissionClient (#20245)" (#20314)
This reverts commit adc15a0fb0.
2021-11-12 21:11:24 -08:00