Commit graph

10784 commits

Author SHA1 Message Date
Qing Wang
2df27a5f87
[Java] Support ActorLifetime (#21074)
We add an enum class ActorLifetime to indicate the lifetime of an actor. In this PR, we also add the necessary API to create an actor with a specified lifetime.
Currently, it has 2 values: detached and default.
2021-12-23 19:48:56 +08:00
Qing Wang
e653d47533
[Java] Shade some widely used dependencies in bazel_jar_jar rule. (#21237)
These dependencies are widely used:
- com.google.common
- com.google.protobuf
- com.google.thirdparty

So we need to shade them to avoid conflicts with jars introduced by users.

In this PR, we introduce a `bazel_jar_jar` rule for doing this and also shade them in the Maven pom files.
2021-12-23 16:54:31 +08:00
Jiajun Yao
60388b2834
Round robin during spread scheduling (#19968)
2021-12-22 20:27:34 -08:00
SangBin Cho
99693096d6
[gRPC] Improve blocking call Placement group (#21130)
Use Sync methods with timeout for placement group RPCs
2021-12-22 17:21:56 -08:00
Yi Cheng
11ab412db1
[4/gcs] Bootstrap global accessor from gcs (#21195)
This is part of the Redis removal. This PR enables the global accessor to start from GCS.

Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2021-12-22 01:27:25 -08:00
Gagandeep Singh
92bf609a08
Unskip tests in `test_basic_3.py` (#20433)
2021-12-22 00:09:32 -08:00
Yi Cheng
0c786b1109
[3/gcs] Bootstrap log monitor and monitor from gcs (#21194)
This is part of the Redis removal. This PR enables the log monitor and monitor to bootstrap from GCS.

Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2021-12-21 23:15:55 -08:00
Simon Mo
cfe0897d05
[CI] Migrate Windows tests to Buildkite (#21227)
2021-12-21 20:16:34 -08:00
Sidhartha Parhi
5d6409fe2e
[Train] Remove run_dir param from BackendExecutor (#21231)
The run_dir argument in ray.train.backend.BackendExecutor.start_training isn't used, but it causes the following error: if your host computer and job cluster use different operating systems, you get a pathlib error because, for example, you can't instantiate a pathlib.WindowsPath on a Linux system.

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-12-21 19:54:43 -08:00
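The pathlib limitation mentioned in #21231 can be illustrated without Ray. The following is a minimal sketch, not code from the PR: `try_foreign_concrete_path` and `parse_foreign_pure_path` are hypothetical helper names, and the example path is made up. It shows that the concrete path class of the other OS family cannot be instantiated, while the corresponding pure class can.

```python
import os
import pathlib


def try_foreign_concrete_path(raw: str) -> str:
    """Try to build the concrete path class that belongs to the other OS family.

    Hypothetical helper for illustration only.
    """
    # On Windows the "foreign" class is PosixPath; elsewhere it is WindowsPath.
    foreign_cls = pathlib.PosixPath if os.name == "nt" else pathlib.WindowsPath
    try:
        foreign_cls(raw)
    except NotImplementedError:
        # pathlib refuses to instantiate a concrete path for another OS.
        return "unsupported"
    return "ok"


def parse_foreign_pure_path(raw: str) -> pathlib.PurePath:
    """Pure paths only manipulate strings, so they work on any OS."""
    foreign_cls = (
        pathlib.PurePosixPath if os.name == "nt" else pathlib.PureWindowsPath
    )
    return foreign_cls(raw)


if __name__ == "__main__":
    print(try_foreign_concrete_path("C:/runs/exp1"))
    print(parse_foreign_pure_path("C:/runs/exp1").name)
```

Removing the unused argument, as the PR does, avoids the problem entirely; pure paths are only an alternative when a path from another OS really has to be carried around.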
Amog Kamsetty
57db4640ca
[Train] [Tune] Refactor MLflow (#20802)
Pulls out Tune's MLflow logging logic to a shared MLflow util.
Adds an MLflow logger callback to Ray Train

Closes #20642
2021-12-21 17:17:52 -08:00
Yi Cheng
09421a4ca6
[2/gcs] Bootstrap dashboard for gcs ha (#21179)
This is part of the GCS HA project. This PR tries to bootstrap the dashboard with the GCS address instead of Redis.

Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
2021-12-21 16:58:03 -08:00
Eric Liang
1db03862a7
Isolate function exports by job in separate queues (#20882)
2021-12-21 16:19:00 -08:00
Jiajun Yao
7d861a2c58
[Test] Add ray wheel sanity check (#21223)
2021-12-21 14:24:02 -08:00
Gagandeep Singh
5dc0f90ada
[Windows] Unskipped tests in test_standalone.py (#21213)
2021-12-21 11:37:23 -08:00
Yi Cheng
f62faca04c
[1/gcs] gcs ha bootstrap for raylet (#21174)
This is part of #21129

This PR tries to cover the cpp/ray part of the bootstrap. Some updates there:

- remove the unused function/tests
- some API updates

Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
2021-12-21 08:50:42 -08:00
SangBin Cho
5d3042ed9d
[Internal Observability] Record Raylet Gauge (#21049)
* Revert "[Please revert] Remove new metrics temporarily"

This reverts commit baf7846daa3d1dad50dbedac19b7afbae3e197fc.

* Addressed code review.

* [Please revert] Revert plasma stats for the next PR

* improve grammar

* Addressed code review v1.

* Addressed code review.

* Add code owner.

* Fix tests.

* Add code owner to metric_defs.cc
2021-12-21 00:34:48 -08:00
Sven Mika
62dbf26394
[RLlib] POC: Run PGTrainer w/o the distr. exec API (Trainer's new training_iteration method). (#20984)
2021-12-21 08:39:05 +01:00
Dmitri Gekhtman
c9cf912a15
[autoscaler] Pass on provider.internal_ip() exceptions during scale down (#21204)
Treats failures of provider.internal_ip during node drain as non-fatal.
For example, if a node is deleted by a third party between the time it's scheduled for termination and drained, there will now be no error on GCP.

Closes #21151
2021-12-20 22:23:17 -08:00
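The behavior described in #21204 (a failed `internal_ip` lookup during node drain is logged and skipped instead of aborting scale down) can be sketched as follows. This is an illustration under stated assumptions, not the autoscaler's code: `FakeProvider` and `ips_to_drain` are hypothetical names, and the real provider interface has more methods than the single one mimicked here.

```python
import logging
from typing import Dict, Iterable, List

logger = logging.getLogger(__name__)


class FakeProvider:
    """Hypothetical stand-in for a node provider (not Ray's NodeProvider)."""

    def __init__(self, ips: Dict[str, str]) -> None:
        self._ips = dict(ips)

    def internal_ip(self, node_id: str) -> str:
        # A node deleted by a third party is no longer known to the provider.
        if node_id not in self._ips:
            raise RuntimeError(f"node {node_id} not found")
        return self._ips[node_id]


def ips_to_drain(provider: FakeProvider, node_ids: Iterable[str]) -> List[str]:
    """Collect the IPs of nodes to drain, skipping nodes whose lookup fails."""
    ips = []
    for node_id in node_ids:
        try:
            ips.append(provider.internal_ip(node_id))
        except Exception:
            # Non-fatal: record the failure and continue with the other nodes.
            logger.exception("Could not get IP of node %s; skipping.", node_id)
    return ips


if __name__ == "__main__":
    provider = FakeProvider({"a": "10.0.0.1", "c": "10.0.0.3"})
    # Node "b" was removed externally; the lookup fails but the loop goes on.
    print(ips_to_drain(provider, ["a", "b", "c"]))
```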
qicosmos
d1a27487a3
[C++ Worker] fix uninit ray runtime instance (#21125)
With some compilers, the static ray runtime in the ray runtime holder may be a new, uninitialized instance in a dynamic library,
so we need to initialize the ray runtime holder in the dynamic library to make sure the new instance is valid.
2021-12-21 12:07:59 +08:00
Qing Wang
94251fbcc4
[Core] Fix invalid to specify concurrency group at runtime. (#21191)
This fixes an issue where the concurrency group name of an actor task could not be specified at runtime with the following usage:
```python
a.f2.options(concurrency_group="compute").remote()
```
2021-12-21 10:47:47 +08:00
Linsong Chu
61bbecdb7d
[Workflow]add doc for metadata (#20156)
This PR adds documentation for Workflow Metadata, for which we recently added support in https://github.com/ray-project/ray/pull/19372.

Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
2021-12-20 17:24:07 -08:00
Hankpipi
ae5bb34f60
[Serve Autoscaler] Raise warning if max_concurrent_queries < target_num_ongoing_requests (#21184)
2021-12-20 16:07:19 -08:00
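The check named in the title of #21184 can be sketched in plain Python. This is a hypothetical illustration, not Ray Serve's implementation: the function name `warn_if_queries_cap_too_low` and the warning text are assumptions, and only the two parameter names are taken from the commit title.

```python
import warnings


def warn_if_queries_cap_too_low(
    max_concurrent_queries: int, target_num_ongoing_requests: float
) -> bool:
    """Warn when the per-replica query cap is below the autoscaling target.

    Hypothetical helper. Returns True if a warning was issued.
    """
    if max_concurrent_queries < target_num_ongoing_requests:
        warnings.warn(
            f"max_concurrent_queries ({max_concurrent_queries}) is lower than "
            f"target_num_ongoing_requests ({target_num_ongoing_requests}).",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False


if __name__ == "__main__":
    warn_if_queries_cap_too_low(5, 10)   # emits a UserWarning
    warn_if_queries_cap_too_low(10, 10)  # silent
```

A warning rather than an exception keeps existing configurations working while still flagging the mismatch.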
iasoon
1c93beb490
[serve] use true nulls in snapshot (#21062)
2021-12-20 16:07:09 -08:00
SangBin Cho
44320aba3b
[Nightly Test] Fix broken scalability test #21201
I added a memory monitor to the scalability tests. This broke the tests because creating a memory monitor requires node resources (so that it is scheduled on the head node), and that broke the "resource leak" check. Ideally, this resource leak check should be more robust, but I fixed the issue in an easier way for now. In the near future, the memory monitor will become a fixture, and in that case we should fix the resource leak function code.
2021-12-20 14:58:39 -08:00
architkulkarni
5cc1308c66
[runtime env] [doc] [test] Add docs and tests for RAY_runtime_env_skip_local_gc environment variable (#21163)
2021-12-20 10:34:59 -08:00
SangBin Cho
5959669a70
[Core] Remove task table. (#21188)
Remove task table that's not used anymore.
2021-12-20 06:22:01 -08:00
architkulkarni
5b6bf534a0
[Java] Fix typo projetct->project in XML file (#21060)
2021-12-20 20:21:35 +08:00
Qing Wang
bd502e8bd5
[Java] Remove out of date comment. (#21073)
The semantics of the `setName` API have changed, but the comment was not updated. This PR fixes it.
2021-12-20 20:07:59 +08:00
DK.Pino
33a45e55df
Revert "Revert "[Placement Group] Make placement group prepare resource rpc r… (#21144)" (#21152)
* Revert "Revert "[Placement Group] Make placement group prepare resource rpc r… (#21144)"

This reverts commit 02465a6792.

* fix flakey ut
2021-12-20 00:32:42 -08:00
mwtian
06ec07057c
Revert "[Core] Unrevert #21115, fix auto address env (#21158)" (#21189)
This reverts commit 968f08607b.

It breaks e2e tests: worker nodes cannot start, e.g.:

```
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1961, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py", line 808, in wrapper
    return f(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/scripts/scripts.py", line 733, in start
    address_ip, password=redis_password)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 593, in create_redis_client
    _, redis_ip_address, redis_port = validate_bootstrap_address(redis_address)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/services.py", line 494, in validate_bootstrap_address
    raise ValueError("Malformed address. Expected '<host>:<port>'.")
ValueError: Malformed address. Expected '<host>:<port>'.
```
2021-12-20 00:22:12 -08:00
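The `ValueError` at the end of the traceback in #21189 comes from a check that the bootstrap address has the form `<host>:<port>`. The sketch below shows what such a check can look like; it is a hypothetical stand-in, not the body of `validate_bootstrap_address` in `ray/_private/services.py`. The function name `parse_host_port` is an assumption, and `auto` is used only as an example of a string without a port.

```python
from typing import Tuple


def parse_host_port(address: str) -> Tuple[str, int]:
    """Split '<host>:<port>' into its parts, rejecting anything else.

    Hypothetical helper for illustration only.
    """
    # Split on the last colon so the port is always the final component.
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError("Malformed address. Expected '<host>:<port>'.")
    return host, int(port)


if __name__ == "__main__":
    print(parse_host_port("127.0.0.1:6379"))
    try:
        parse_host_port("auto")
    except ValueError as exc:
        print(exc)
```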
Guyang Song
2a9d9726d6
[doc] add doc for container runtime env (#21131)
2021-12-20 14:13:05 +08:00
architkulkarni
774163f9c9
[Java] Bump log4j 2.16.0 -> 2.17.0 (#21176)
Resolves [CVE-2021-45105](https://github.com/advisories/GHSA-p6xc-xr62-6r2g).
2021-12-20 10:27:24 +08:00
Oliver Mannion
8d9e0fca61
fix: data not exported (#20887)
* fix: data not exported

* empty commit
2021-12-18 22:33:34 -08:00
architkulkarni
2489b17634
[release] Uninstall old ray in all release test app configs to fix commit mismatch error (#21175)
* uninstall old ray in all release test app configs

* add instruction to e2e.py docstring
2021-12-18 16:58:49 -08:00
Clark Zinzow
968f08607b
[Core] Unrevert #21115, fix auto address env (#21158)
This PR unreverts #21115, fixing the handling of an `"auto"` address in the `RAY_ADDRESS` environment variable.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2021-12-18 07:45:00 -08:00
Chen Shen
c9c3f0745a
[Dataset][nighlytest] use latest ray for running test #21148
We are actually using the Ray that comes with the image, which is a very old version of Ray (surprised this actually works).
2021-12-17 23:48:44 -08:00
Jun Gong
c98d4fe2f3
[ci] Change build-wheel-macos-arm64.sh to be executable. (#21164)
So the script can be simply executed. All the other build-wheels-xxx.sh are executable.
2021-12-17 17:23:10 -08:00
architkulkarni
56bd8e58de
[CI] [Release] uninstall Ray before installing new Ray version (#21159)
2021-12-17 16:25:15 -08:00
Clark Zinzow
c3d68fa0c1
[Dask-on-Ray] Add Dask config helper, set task-based shuffle by default. (#21114)
Dask defaults to a disk-based shuffle even though we're using a distributed scheduler, which appears to result in dropped data since the filesystem isn't shared across nodes. Dask Distributed manually sets the shuffle algorithm in the global config to the task-based shuffle, which the Dask-on-Ray scheduler should probably do as well.

This PR adds a Dask config helper, `enable_dask_on_ray`, that sets Dask-on-Ray as the default scheduler along with changing the default shuffle to a task-based shuffle. The shuffle method can still be overridden by the user by manually specifying `df.set_index(shuffle="disk")`.
2021-12-17 13:16:37 -08:00
Chen Shen
d99f699e3d
Revert "[Core][GCS] Use port and address flags to configure GCS server / client in GCS bootstrapping mode (#21115)" (#21157)
This reverts commit 0e7c0b491b.
2021-12-17 11:48:40 -08:00
xwjiang2010
ce81ad21f3
Revert "[tune] Elongate test_trial_scheduler_pbt timeout. (#21120)" (#21155)
2021-12-17 11:32:00 -08:00
Gagandeep Singh
14fc023cb6
Bump timeout value for test_worker_capping.py::test_zero_cpu_scheduling (#21035)
2021-12-17 10:51:54 -08:00
Simon Mo
956774e757
[CI] Disable serve test_standalone on windows again (#21154)
2021-12-17 10:32:27 -08:00
Hankpipi
04ecdee9db
[Serve] Fix serve metrics test (#21140)
2021-12-17 10:23:17 -08:00
shrekris-anyscale
7e15a8199e
[Serve] Reduce test_cluster flakiness by increasing timeout (#21146)
2021-12-17 10:22:56 -08:00
SangBin Cho
02465a6792
Revert "[Placement Group] Make placement group prepare resource rpc r… (#21144)
This PR makes pg_test_2 flaky. cc @clay4444 can you re-merge it?
2021-12-17 00:13:26 -08:00
mwtian
0e7c0b491b
[Core][GCS] Use port and address flags to configure GCS server / client in GCS bootstrapping mode (#21115)
This change adds support for parsing `--address` as bootstrap address, and treating `--port` as GCS port, when using GCS for bootstrapping.

Not launching Redis in GCS bootstrapping mode, and using GCS to fetch initial cluster information, will be implemented in a subsequent change.

Also made some cleanups.
2021-12-16 15:11:05 -08:00
Matti Picus
29965ad325
enable passing serve tests on windows (#21107)
* enable passing serve tests on windows

* move test_handle to 'medium' and enable

* move test_cli to 'medium'
2021-12-16 14:03:11 -08:00
architkulkarni
4dcba1d0f4
[CI] Pin anyscale version to fix release tests (#21138)
2021-12-16 13:15:16 -08:00
Simon Mo
0f0813b7b6
[Serve] Bump test_cli timeout (#21139)
2021-12-16 11:00:22 -08:00