Commit graph

277 commits

Author SHA1 Message Date
SangBin Cho
3566cfd279
[Dashboard] Enable dashboard in the minimal ray installation (#21896)
This is the last PR to enable dashboard in the minimal ray installation.

Look https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit# for more details;
2022-01-31 22:34:40 -08:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Yi Cheng
7d2237bc9f
[dashboard] Remove unused fields in dashboard actor table for better memory footprint (#21919) 2022-01-26 22:48:17 -08:00
SangBin Cho
e62c0052a0
[Dashboard] Agent in minimal ray installation (#21817)
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation.

Note that this PR requires to introduce "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them to minimal makes thing very easy. Please see the below for the reasoning.
2022-01-26 04:03:54 -08:00
Alex Wu
7a45f60dbc
[autoscaler] Fix ray.autoscaler.sdk import issue (#21795)
This PR moves the sdk to its own folder, then includes everything in `import ray.autoscaler.sdk` in ray's import path. 

Note: that there were circular dependencies in naively doing this because the ray core now uses constants that were defined in the autoscaler for internal kv operations (and the autoscaler similarly calls into the ray core). The solution was to move those internal kv keys into ray core constants so the imports flow (more) one way.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-01-25 14:43:24 -08:00
SangBin Cho
2010f13175
Fix dashboard test bug (#21742)
Currently `wait_until_succeeded_without_exception` is used in the dashboard, and it returns True/False. Unfortunately, there are lots of code that doesn't assert on this method (which means things are not actually tested).
2022-01-24 11:38:51 -06:00
SangBin Cho
1ae14ec513
[Dashboard] Make dashboard / agent work in minimal ray installation 1/3. (#21774)
This is the doc that explains how to achieve this: https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit?usp=sharing

The fully working e2e prototype is here (it passes all tests): cdad913883

This PR is pure refactoring. Basically it moves some of util functions that require optional_deps to `optional_utils` so that optional deps' util functions are not used in the minimal installation. Look below to see the steps. 

<img width="693" alt="Screen Shot 2022-01-21 at 4 38 44 AM" src="https://user-images.githubusercontent.com/18510752/150528494-c3cdedf4-3a66-4557-b540-61436b1dbab6.png">
2022-01-23 21:11:32 -08:00
Jiao
5d382cfeb3
[nit] remove decorator in test_cli.py (#21792)
Full context see https://github.com/ray-project/ray/issues/21791

pytest work for "some" environments for this test and on CI master, but this decorator is still unnecessary and was introduced by mistake. So just remove it and see what happens with the original issue.
2022-01-23 06:05:05 -08:00
mwtian
e8ce01c525
[Dashboard] offload blocking work to a thread pool (#21762)
Currently, GCS KV client only has blocking API. Calling them from dashboard event loop can block other operations for many seconds, leading to failures such as taking too long (> 2min) to submit a job and making nightly tests fail (#21699). This PR offloads the blocking work to a separate thread. Implementing async GCS KV API will be done in the future.
2022-01-21 17:55:11 -08:00
mwtian
f18a8bd87f
[Dashboard] turn a noisy info log into debug (#21746)
Currently, dashboard log contains many repeated entries like `Received a log for 172.31.47.219 and autoscaler` which is too noisy.
2022-01-20 22:08:23 -08:00
Archit Kulkarni
f058a1d342
[Jobs] Stream logs during job instead of only at the end (#21659)
Closes https://github.com/ray-project/ray/issues/21517
2022-01-20 15:21:07 -06:00
mwtian
a4581e58ee
[Pubsub] improve error handling for GCS AIO subscribers in dashboard (#21712)
- Tolerate GRPC deadline exceeded and transient failures in Python GCS AIO subscribers, which becomes consistent with Python GCS synchronous subscribers.
- Tolerate any exception in dashboard for subscribing to logs and error info, which becomes consistent with how dashboard handles GRPC errors for obtaining node stats.
2022-01-20 07:04:54 -08:00
Shantanu
ae60548ef3
Silence "cut: write error: Broken pipe" log spew (#21686)
On machines without GPUs, this can run subprocesses that spew to
stderr. Then with log_to_driver=True, we get log spew from every
single raylet. To avoid this, disable the GPU usage check on
certain errors.

Resolves #14305

Co-authored-by: hauntsaninja <>
2022-01-19 23:01:10 -08:00
Yao Yuan
422d20e945
[Dashboard] Fix NPE when there is no GPU on the node (#21650)
There is an NPE bug that causes browser crash when no GPU on the node.
We can add a condition to fix it.
2022-01-18 08:12:49 -08:00
Yi Cheng
6dccfbffa9
Revert "Revert "[gcs] turn on grpc pubsub by default"" (#21585)
Reverts ray-project/ray#21584 and turn the flag off
2022-01-13 16:12:03 -08:00
Yi Cheng
bc696212d2
Revert "[gcs] turn on grpc pubsub by default" (#21584)
test-reconnect seems flaky.
Reverts ray-project/ray#21513
2022-01-13 12:34:02 -08:00
Yi Cheng
6194783312
[gcs] turn on grpc pubsub by default (#21513)
Turn on grpc pubsub by default.  This PR also fixed several tests which are failed before.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2022-01-12 22:13:03 -08:00
mwtian
45cddef2d3
[GCS] disable tests related to GCS restarting in GCS pubsub mode (#21534)
`test_failure_2.py::test_gcs_server_failiure_report` and `test_gcs_fault_tolerance.py::test_gcs_server_restart_during_actor_creation` cannot pass in GCS pubsub mode with the existing logic. Disable these tests in GCS pubsub mode and add comment about how we may fix them.

Also, suppress exceptions when sync subscribers are disconnected from GCS.

I can push changes in this PR to #21513 as well.
2022-01-11 14:14:05 -08:00
mwtian
70db5c5592
[GCS][Bootstrap n/n] Do not start Redis in GCS bootstrapping mode (#21232)
After this change in GCS bootstrapping mode, Redis no longer starts and `address` is treated as the GCS address of the Ray cluster.

Co-authored-by: Yi Cheng <chengyidna@gmail.com>
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
2022-01-04 23:06:44 -08:00
mwtian
8cc268096c
[GCS][Bootstrap 3/n] Refactor to support GCS bootstrap (#21295)
This PR refactors several components to support switching to GCS address bootstrapping later:
- Treat address from `ray.init()` and `ray` CLI as bootstrap address instead of assuming it is Redis address.
- Ray client servers support `--address` flag instead of `--redis-address`.
- A few other miscellaneous cleanup.

Also, add a test for starting non-head node with `ray start`.
2022-01-03 23:52:12 -08:00
mwtian
20ca1d85c2
[GCS][Bootstrap 2/n] Fix tests to enable using GCS address for bootstrapping (#21288)
This PR contains most of the fixes @iycheng made in #21232, to make tests pass with GCS bootstrapping by supporting both Redis and GCS address as the bootstrap address. The main change is to use address_info["address"] to obtain the bootstrap address to pass to ray.init(), instead of using address_info["redis_address"]. In a subsequent PR, address_info["address"] will return the Redis or GCS address depending on whether using GCS to bootstrap.
2021-12-29 19:25:51 -07:00
Yi Cheng
09421a4ca6
[2/gcs] Bootstrap dashboard for gcs ha (#21179)
This is part of gcs ha project. This PR try to bootstrap dashboard with gcs address instead of redis.

Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
2021-12-21 16:58:03 -08:00
iasoon
1c93beb490
[serve] use true nulls in snapshot (#21062) 2021-12-20 16:07:09 -08:00
mwtian
06ec07057c
Revert "[Core] Unrevert #21115, fix auto address env (#21158)" (#21189)
This reverts commit 968f08607b.

It is breaking e2e tests where worker nodes cannot start. e.g.

```
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1961, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
    return f(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 733, in start
    address_ip, password=redis_password)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 593, in create_redis_client
    _, redis_ip_address, redis_port = validate_bootstrap_address(redis_address)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 494, in validate_bootstrap_address
    raise ValueError("Malformed address. Expected '<host>:<port>'.")
ValueError: Malformed address. Expected '<host>:<port>'.
```
2021-12-20 00:22:12 -08:00
Clark Zinzow
968f08607b
[Core] Unrevert #21115, fix auto address env (#21158)
This PR unreverts #21115, fixing the handling of an `"auto"` address in the `RAY_ADDRESS` environment variable.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2021-12-18 07:45:00 -08:00
Chen Shen
d99f699e3d
Revert "[Core][GCS] Use port and address flags to configure GCS server / client in GCS bootstrapping mode (#21115)" (#21157)
This reverts commit 0e7c0b491b.
2021-12-17 11:48:40 -08:00
mwtian
0e7c0b491b
[Core][GCS] Use port and address flags to configure GCS server / client in GCS bootstrapping mode (#21115)
This change adds support for parsing `--address` as bootstrap address, and treating `--port` as GCS port, when using GCS for bootstrapping.

Not launching Redis in GCS bootstrapping mode, and using GCS to fetch initial cluster information, will be implemented in a subsequent change.

Also made some cleanups.
2021-12-16 15:11:05 -08:00
Jiao
ed34434131
[Jobs] Add log streaming for jobs (#20976)
Current logs API simply returns a str to unblock development and integration. We should add proper log streaming for better UX and external job manager integration.

Co-authored-by: Sven Mika <sven@anyscale.io>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
Co-authored-by: Ed Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
2021-12-14 17:01:53 -08:00
Edward Oakes
10947c83b3
[runtime_env] Make pip installs incremental (#20341)
Uses a direct `pip install` instead of creating a conda env to make pip installs incremental to the cluster environment.

Separates the handling of `pip` and `conda` dependencies.

The new `pip` approach still works if only the base Ray is installed on the cluster and the user specifies libraries like "ray[serve]" in the `pip` field.  The mechanism is as follows:
- We don't actually want to reinstall ray via pip, since this could lead to version mismatch issues.  Instead, we want to use the Ray that's already installed in the cluster.
- So if "ray" was included by the user in the pip list, remove it
- If a library "ray[serve]" or "ray[tune, rllib]" was included in the pip list, remove it and replace it by its dependencies (e.g. "uvicorn", "requests", ..)

Co-authored-by: architkulkarni <arkulkar@gmail.com>
Co-authored-by: architkulkarni <architkulkarni@users.noreply.github.com>
2021-12-14 15:55:18 -08:00
iasoon
33059cff3d
[serve] support not exposing deployments over http (#21042) 2021-12-13 09:43:55 -08:00
mwtian
6871a72a5c
[Core][Dashboard Pubsub 3/n] Migrate pubsub usages in dashboard to GCS pubsub (#20860)
Add support for Ray pubsub in dashboard. https://github.com/ray-project/ray/pull/20954 is the prerequisite, and contains more complete change under src/.
2021-12-10 14:36:57 -08:00
mwtian
d751a242d8
[Core][Dashboard Pubsub 2/n] Add resource reporter and actor to Python GCS pubsub (#21001)
Dashboard contains resource reporter and actor subscribers. Dashboard agent has resource report publisher. So GCS pubsub needs to support these channel types.

Also refactor GCS AIO subscribers to have each subscriber per channel. This matches the API of GCS sync subscribers, and make subscribing with multiple channels easier.
2021-12-09 23:10:10 -08:00
Jiao
1e67bdfcec
[jobs] Add headers field to JobSubmissionClient and apply to all requests (#20663) 2021-12-03 18:44:30 -06:00
mwtian
95c26eec26
[Core][Pubsub][Logging 2/n] Use GCS pubsub for logs (#20492)
Using Ray pubsub for publishing and subscribing logs via GCS, from Python worker, log importer, dashboard and unit tests.

This change is guarded behind the RAY_gcs_grpc_based_pubsub feature flag.
2021-11-30 12:00:44 -08:00
Jiao
5ce79d0a46
[jobs] Fix job server's ray init(to use redis address rather than auto (#20705)
* [job submission] Use specific redis_address and redis_password instead of "auto" (#20687)

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
2021-11-24 15:38:26 -08:00
Guyang Song
53630ee03b
Revert "Revert "[runtime env] redefine runtime env to protobuf"" and fix windows compiling (#20692)
- Fix windows compiling and revert https://github.com/ray-project/ray/pull/20641
- Seems the pr https://github.com/ray-project/ray/pull/20670 can solve the windows compiling issue.
2021-11-24 09:01:01 -08:00
SangBin Cho
cedd8806f7
Revert "[job submission] Use specific redis_address and redis_passwor… (#20699)
The test breaks the master branch
2021-11-24 05:37:15 -08:00
Edward Oakes
66b4939184
[job submission] Use specific redis_address and redis_password instead of "auto" (#20687) 2021-11-23 23:25:36 -06:00
Alex Wu
9388d28233
Revert "[runtime env] redefine runtime env to protobuf" (#20641)
Reverts #19511

Breaks windows compilation
2021-11-22 13:11:30 -08:00
Edward Oakes
39b2c3927c
[jobs] Add /api/version endpoint (#20622) 2021-11-22 15:11:04 -06:00
Guyang Song
ad56b9b432
[runtime env] redefine runtime env to protobuf (#19511) 2021-11-20 16:54:42 +08:00
Jiao
1a00964902
[job submission] Fix job sdk's lazy import of requests that led to minimal build failure (#20577) 2021-11-19 17:04:22 -06:00
mwtian
da79f24e8c
[Core][Pubsub] Refactor to prepare for migrating logging to Ray pubsub (#20560)
## Why are these changes needed?
Publisher and subscriber for logs, in driver, dashboard and tests are refactored to make it easier to support using Ray pubsub for logs. Actual support of Ray pubsub for logs will be added later in #20492.

This PR does not intend to introduce any behavior change.

## Related issue number
2021-11-19 12:28:37 -08:00
Edward Oakes
d26c9e67e8
[job submission] Add a message to the JobStatus to return more detailed errors (#20491) 2021-11-18 10:15:23 -06:00
Edward Oakes
eae523159f
[job submission] Prefix job ID with raysubmit_ and pass job_name metadata (#20490) 2021-11-17 21:48:22 -06:00
Antoni Baum
20fc9f907d
[CI] Fix tune dashboard, increase timeout for test_commands (#20453) 2021-11-16 17:52:17 -08:00
Simon Mo
32a4f48aa2
[CI] Don't test tune dashboard (#20452) 2021-11-16 15:07:56 -08:00
Yi Cheng
a4e187c0e7
[gcs] Update function table to use internal kv (#20152)
## Why are these changes needed?
This is a part of redis removal. This PR remove redis kv in function table. 
rpush related code is not updated in this PR.

## Related issue number
2021-11-15 23:34:41 -08:00
Edward Oakes
48bc1af2da
[job submission] Remove DOES_NOT_EXIST status (#20354) 2021-11-15 16:57:32 -08:00
Simon Mo
72ae22e82b
[CI] Fix frontend build issue (#20375) 2021-11-15 10:12:43 -08:00