Commit graph

272 commits

Author SHA1 Message Date
shrekris-anyscale
ab2741d64b
[serve] Support working_dir in serve run (#22760)
#22714 added `serve run` to the Serve CLI. This change allows the user to specify a local or remote `working_dir` in `serve run`.
2022-03-08 13:18:41 -06:00
shrekris-anyscale
521298e093
[serve] Make route prefix the deployment name by default (#22840)
The REST API's schema default denies HTTP access to deployments when `route_prefix` is omitted. This doesn't match `@serve.deployment`'s behavior, which make `route_prefix` the deployment's name when omitted.

This change matches the schema's behavior to the decorator. When `route_prefix` is omitted from the config, the deployment's `route_prefix` defaults to its name. When the `route_prefix` is specified as `null`, the deployment won't have HTTP access.

This change also fixes a bug in Serve where when a deployment is updated from a non-`None` `route_prefix` to a `None` `route_prefix`, its `route_prefix` does not change. This bug meant that a deployment available over HTTP would continue to be available at the same route even when deployed again with `route_prefix=None`.
2022-03-06 20:03:31 -06:00
Yi Cheng
11bbf00338
[dashboard] Remove redis in dashboard (#22788)
As we are turning redisless ray by default, dashboard doesn't need to talk with redis anymore. Instead it should talk with gcs and gcs can talk with redis.
2022-03-04 12:32:17 -08:00
Archit Kulkarni
1752f17c6d
[Job submission] Add list_jobs API (#22679)
Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-03-01 21:27:09 -06:00
Edward Oakes
2a09561edf
[serve] Enable REST API tests with main clause (#22706)
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
2022-03-01 11:21:22 -06:00
shrekris-anyscale
49ee443231
[serve] Add Serve CLI commands for REST API (#22648) 2022-02-28 20:45:46 -06:00
Archit Kulkarni
85657b1377
[Doc] [Jobs] add CLI and SDK reference to docs (#22680) 2022-02-28 17:57:46 -06:00
Jialing He
aa1885ae2a
[runtime env] Make plugin setup process that has not been refactor run in threads. (#22588)
I recently realized that during a runtime_env creation process, a plugin/manager that is very slow to setup may block the creation of other runtime_env, so I make plugin/manager setup run in threads.

[The refactor of `PipManager`](https://github.com/ray-project/ray/pull/22381) is about to be completed, so I ignore it in this PR.
2022-02-28 17:33:13 +08:00
Jialing He
98a69cbd90
[runtime env][strong-typed API] Combine ParsedRuntimeEnv and RuntimeEnv into ray.runtime.RuntimeEnv (#22522)
Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime.RuntimeEnv`, details: #21495

- The `new RuntimeEnv` includes all external interfaces of `ParsedRuntimeEnv` and `old RuntimeEnv`.
- The `new RuntimeEnv` will be exposed directly to the user.
- example:
```python
runtime_env = ray.runtime_env.RuntimeEnv(working_dir="s3://workding_dir.zip", 
        pip=["requests"],
        java_jars=["s3://jar1.zip"],
        java_jvm_options=["-Dxxx=xxx"])
```
2022-02-28 16:18:10 +08:00
shrekris-anyscale
8548affdc2
Increase test_failed_job_status timeout in test_job_submission (#22643)
`test_job_submission` has become [flakey](https://flakey-tests.ray.io/) due to timeout. This change increases the timeout in `test_failed_job_status` from 10 to 25 seconds.
2022-02-25 10:08:55 -08:00
shrekris-anyscale
e85540a1a2
[serve] Expose deployment statuses in REST API (#22611) 2022-02-25 08:41:07 -06:00
shrekris-anyscale
a9ede4e499
[serve] Add REST API (#22578)
This change adds the GET, PUT, and DELETE commands for Serve’s REST API. The dashboard receives these commands and issues corresponding requests to the Serve controller.
2022-02-24 10:00:26 -06:00
shrekris-anyscale
40fa56f40c
[serve] Add JSON schemas for REST API (#22547) 2022-02-22 21:36:42 -06:00
SangBin Cho
36a31cb6fd
[Usage Stats] Implement usage stats report "Turned off by default". (#22249)
This is the second PR to implement usage stats on Ray. Please refer to the file usage_lib.py for more details.

The full specification is here https://docs.google.com/document/d/1ZT-l9YbGHh-iWRUC91jS-ssQ5Qe2UQ43Lsoc1edCalc/edit#heading=h.17dss3b9evbj.

This adds a dashboard module to enable usage stats. **Usage stats report is turned off by default** after this PR. We can control the report (enablement, report period, and URL. Note that URL is strictly for testing) using the env variable.  

## NOTE
This requires us to add `requests` to the default library. `requests` must be okay to be included because
1. it is extremely lightweight. It is implemented only with built-in libs.
2. It is really stable. The project basically claims they are "deprecated", meaning no new features will be added there.

cc @edoakes @richardliaw for the approval

For the HTTP request, I was alternatively considered httpx, but it was not as lightweight as `requests`. So I decided to implement async requests using the thread pool.
2022-02-22 15:32:02 -08:00
Edward Oakes
58e5f0140d
[jobs] Rename JobData -> JobInfo (#22499)
`JobData` could be confused with the actual output data of a job, `JobInfo` makes it more clear that this is status information + metadata.
2022-02-22 16:18:16 -06:00
Guyang Song
57a94aae12
[runtime env][bugfix] Fix runtime env retry (#22495)
- Bug: `error_message` is not cleared when the retry succeeds. This bug lead to runtime env creation failing.
- Add test case for this.
2022-02-18 17:09:06 -08:00
Archit Kulkarni
df581c584a
[Job] [Dashboard] Add Job Submission data to cluster snapshot (#22225)
The existing Job info in the cluster snapshot uses the old definition of Job, which is a single Ray driver (a single `ray.init()` connection).  

In the new Job Submission protocol, a Job just specifies an entrypoint which can be any shell command.  As such a Job can have zero or multiple Ray drivers.  This means we should add a new snapshot entry corresponding to new jobs.  We'll leave the old snapshot in place for legacy jobs.

- Also fixes `get_all_jobs` by using the appropriate KV namespace, and stripping the job key KV prefix from the job ID.  It wasn't working before.

- This PR also unifies the datatype used by the GET jobs/ endpoint to be the same as the one used by the new jobs cluster snapshot.  For backwards compatibility, the `status` and `message` fields are preserved.
2022-02-18 09:54:37 -06:00
Archit Kulkarni
63a5eb492d
Revert "[serve] Add basic REST API to dashboard (#22257)" (#22414)
This reverts commit f37f35c5da.
2022-02-15 21:47:50 -06:00
Edward Oakes
f37f35c5da
[serve] Add basic REST API to dashboard (#22257) 2022-02-15 15:36:58 -06:00
Jialing He
192f9de421
[runtime env] Introduce async Manager.create (#22311) 2022-02-14 16:26:47 -06:00
Liu Bao
824453dd17
[runtime env] Create virtualenv for pip runtime env. (#21801) 2022-02-10 12:25:18 -06:00
Edward Oakes
5df2a0a6c6
[jobs] Add test condition that job runs w/o CPUs available on head node (#22260) 2022-02-10 10:23:02 -06:00
Archit Kulkarni
50e2bef9d0
[Jobs] Hide dashboard from Job Submission import path (#22223)
For public SDK APIs, change the import path from 

```python
from ray.dashboard.modules.job.common import JobStatus, JobStatusInfo
from ray.dashboard.modules.job.sdk import JobSubmissionClient
```

to 
```python
from ray.job_submission import JobStatus, JobSubmissionClient
```

`JobStatus`, `JobStatusInfo` and `JobSubmissionClient` were the only names referenced in the SDK doc so far, but we can add more later as they appear.
2022-02-09 13:55:32 -06:00
Nikita Vemuri
d19aaf0fd3
[jobs] Add unit test for parse_cluster_info (#22205)
Add unit test to check addresses of various formats are correctly passed to `get_job_submission_client_cluster_info`.
2022-02-08 11:22:28 -06:00
Edward Oakes
8806b2d5c4
[jobs] Monitor jobs in the background to avoid requiring clients to poll (#22180) 2022-02-07 15:25:25 -06:00
Jiao
a692e7d05e
[jobs] Fix restarting local ray cluster with http ray address broke local job submission (#21938)
As titled. We have a corner case on user laptop where user might left RAY_ADDRESS as http address but restarted local ray cluster. In this case we will try to do job submission with an http prefixed address.

Co-authored-by: Archit Kulkarni <architkulkarni@users.noreply.github.com>
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
2022-02-04 17:51:43 -06:00
SangBin Cho
ea4079465d
[Runtime Env] Support runtime env error message for actors (#22109) 2022-02-04 15:32:02 -06:00
Nikita Vemuri
d9dc388082
[jobs] Support ray client format of connection string address for external module (#22116)
Ray client currently supports connection strings for external modules of the format `"other_module://"`, however `ray job` commands don't support this format because trailing `/` is removed. Update so `ray job` commands also support this format.
2022-02-04 13:35:10 -06:00
SangBin Cho
d7fc7d2e9d
[Runtime Env] Plumbing runtime env failure error message to the exception: Task [1/3] (#22032)
This is the PR to write better runtime env exception. After 3 PRs are merged, we can entirely turn off the runtime env logs streamed to drivers.

The first PR only handles tasks exception.

TODO
- [x] Task (this PR)
- [ ] Actor
- [ ] Turn of runtime env logs & improve error msgs
2022-02-03 16:47:04 -08:00
Archit Kulkarni
78f882dbbc
[runtime env] Local uri caching for working_dir, py_modules and conda (#20273)
Previously, local files corresponding to runtime env URIs were eagerly garbage collected as soon as there were no more references to them.  In this PR, we store this data in a cache instead, so when the reference count for a URI drops to zero, instead of deleting it we simple mark it as unused in the cache.  When the cache exceeds its size limit (default 10 GB) it will delete unused URIs until the cache is back under the size limit or there are no more unused URIs.

Design doc: https://docs.google.com/document/d/1x1JAHg7c0ewcOYwhhclbuW0B0UC7l92WFkF4Su0T-dk/edit

- Adds unit tests for caching and integration tests for working_dir caching
2022-02-02 14:53:03 -06:00
Edward Oakes
e85bbfb338
[jobs] Enable default port in http:// addresses (#22014)
Closes https://github.com/ray-project/ray/issues/22012
2022-02-02 14:34:34 -06:00
Edward Oakes
8bbc5b936a
[jobs] Use subprocess.list2cmdline to properly handle quotes in CLI entrypoints (#22011) 2022-02-02 14:33:57 -06:00
SangBin Cho
3566cfd279
[Dashboard] Enable dashboard in the minimal ray installation (#21896)
This is the last PR to enable dashboard in the minimal ray installation.

Look https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit# for more details;
2022-01-31 22:34:40 -08:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Yi Cheng
7d2237bc9f
[dashboard] Remove unused fields in dashboard actor table for better memory footprint (#21919) 2022-01-26 22:48:17 -08:00
SangBin Cho
e62c0052a0
[Dashboard] Agent in minimal ray installation (#21817)
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation.

Note that this PR requires to introduce "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them to minimal makes thing very easy. Please see the below for the reasoning.
2022-01-26 04:03:54 -08:00
Alex Wu
7a45f60dbc
[autoscaler] Fix ray.autoscaler.sdk import issue (#21795)
This PR moves the sdk to its own folder, then includes everything in `import ray.autoscaler.sdk` in ray's import path. 

Note: that there were circular dependencies in naively doing this because the ray core now uses constants that were defined in the autoscaler for internal kv operations (and the autoscaler similarly calls into the ray core). The solution was to move those internal kv keys into ray core constants so the imports flow (more) one way.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-01-25 14:43:24 -08:00
SangBin Cho
2010f13175
Fix dashboard test bug (#21742)
Currently `wait_until_succeeded_without_exception` is used in the dashboard, and it returns True/False. Unfortunately, there are lots of code that doesn't assert on this method (which means things are not actually tested).
2022-01-24 11:38:51 -06:00
SangBin Cho
1ae14ec513
[Dashboard] Make dashboard / agent work in minimal ray installation 1/3. (#21774)
This is the doc that explains how to achieve this: https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit?usp=sharing

The fully working e2e prototype is here (it passes all tests): cdad913883

This PR is pure refactoring. Basically it moves some of util functions that require optional_deps to `optional_utils` so that optional deps' util functions are not used in the minimal installation. Look below to see the steps. 

<img width="693" alt="Screen Shot 2022-01-21 at 4 38 44 AM" src="https://user-images.githubusercontent.com/18510752/150528494-c3cdedf4-3a66-4557-b540-61436b1dbab6.png">
2022-01-23 21:11:32 -08:00
Jiao
5d382cfeb3
[nit] remove decorator in test_cli.py (#21792)
Full context see https://github.com/ray-project/ray/issues/21791

pytest work for "some" environments for this test and on CI master, but this decorator is still unnecessary and was introduced by mistake. So just remove it and see what happens with the original issue.
2022-01-23 06:05:05 -08:00
mwtian
e8ce01c525
[Dashboard] offload blocking work to a thread pool (#21762)
Currently, GCS KV client only has blocking API. Calling them from dashboard event loop can block other operations for many seconds, leading to failures such as taking too long (> 2min) to submit a job and making nightly tests fail (#21699). This PR offloads the blocking work to a separate thread. Implementing async GCS KV API will be done in the future.
2022-01-21 17:55:11 -08:00
mwtian
f18a8bd87f
[Dashboard] turn a noisy info log into debug (#21746)
Currently, dashboard log contains many repeated entries like `Received a log for 172.31.47.219 and autoscaler` which is too noisy.
2022-01-20 22:08:23 -08:00
Archit Kulkarni
f058a1d342
[Jobs] Stream logs during job instead of only at the end (#21659)
Closes https://github.com/ray-project/ray/issues/21517
2022-01-20 15:21:07 -06:00
mwtian
a4581e58ee
[Pubsub] improve error handling for GCS AIO subscribers in dashboard (#21712)
- Tolerate GRPC deadline exceeded and transient failures in Python GCS AIO subscribers, which becomes consistent with Python GCS synchronous subscribers.
- Tolerate any exception in dashboard for subscribing to logs and error info, which becomes consistent with how dashboard handles GRPC errors for obtaining node stats.
2022-01-20 07:04:54 -08:00
Shantanu
ae60548ef3
Silence "cut: write error: Broken pipe" log spew (#21686)
On machines without GPUs, this can run subprocesses that spew to
stderr. Then with log_to_driver=True, we get log spew from every
single raylet. To avoid this, disable the GPU usage check on
certain errors.

Resolves #14305

Co-authored-by: hauntsaninja <>
2022-01-19 23:01:10 -08:00
Yi Cheng
6dccfbffa9
Revert "Revert "[gcs] turn on grpc pubsub by default"" (#21585)
Reverts ray-project/ray#21584 and turn the flag off
2022-01-13 16:12:03 -08:00
Yi Cheng
bc696212d2
Revert "[gcs] turn on grpc pubsub by default" (#21584)
test-reconnect seems flaky.
Reverts ray-project/ray#21513
2022-01-13 12:34:02 -08:00
Yi Cheng
6194783312
[gcs] turn on grpc pubsub by default (#21513)
Turn on grpc pubsub by default.  This PR also fixed several tests which are failed before.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2022-01-12 22:13:03 -08:00
mwtian
70db5c5592
[GCS][Bootstrap n/n] Do not start Redis in GCS bootstrapping mode (#21232)
After this change in GCS bootstrapping mode, Redis no longer starts and `address` is treated as the GCS address of the Ray cluster.

Co-authored-by: Yi Cheng <chengyidna@gmail.com>
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
2022-01-04 23:06:44 -08:00
mwtian
8cc268096c
[GCS][Bootstrap 3/n] Refactor to support GCS bootstrap (#21295)
This PR refactors several components to support switching to GCS address bootstrapping later:
- Treat address from `ray.init()` and `ray` CLI as bootstrap address instead of assuming it is Redis address.
- Ray client servers support `--address` flag instead of `--redis-address`.
- A few other miscellaneous cleanup.

Also, add a test for starting non-head node with `ray start`.
2022-01-03 23:52:12 -08:00