Commit graph

140 commits

Author SHA1 Message Date
Dmitri Gekhtman
bc98afcdf8
Test of KubeRay autoscaler integration (#23365)
This PR adds a test of KubeRay autoscaler integration to the Ray CI.

- Tests scaling with autoscaler.sdk.request_resources
- Tests autoscaler response to RayCluster CR change
2022-03-23 18:18:48 -07:00
shrekris-anyscale
b00977b1b1
[serve] Remove dashboard's dependency on Serve (#23389) 2022-03-21 22:14:41 -07:00
Jialing He
4a83bc3dc2
[runtime env] Support set timeout for runtime env setup (#23082)
Interface example:
```python
@ray.remote(runtime_env=RuntimeEnv(..., config=RuntimeEnvConfig(setup_timeout_s=10))
def f(): pass

@ray.remote(runtime_env={..., "config": {"setup_timeout_s": 10}})
def f(): pass
```

Support set timeout second for timeout of runtime environment creation.

Co-authored-by: 捕牛 <hejialing.hjl@antgroup.com>
2022-03-18 12:52:59 -05:00
Kai Fricke
da140a80e9
[ci/release] Legacy field should be optional (#23326)
#22749 broke release unit tests by not providing a legacy key - that key should be optional because we will b dealing with non-legacy tests soon.
Additionally, for some reason the unit tests pass on buildkite while they fail locally and in the release test pipeline. I'm investigating this now...
2022-03-18 11:34:05 +00:00
Simon Mo
78d6ed7029
[Serve] [CI] Split Serve tests into multiple shards (#23145) 2022-03-15 16:32:30 -07:00
mwtian
6eb805b357
[CI] remove GCS-Ray CI tests (#23149)
* remove redis ci tests

* remove mac
2022-03-14 18:18:59 -07:00
Simon Mo
3d3218d153
[CI] Add K8s Builder Step (#22035) 2022-02-24 13:11:38 -08:00
Yi Cheng
e3051ebf67
[ci] Fix grpcio 1.44 break test_output (#22494)
This PR limit grpc to be <= 1.42. This will fix testoutput.
2022-02-22 13:59:25 -08:00
Simon Mo
3e7511e84f
[CI] Disable privileged test (#22484) 2022-02-17 15:34:02 -08:00
Kai Fricke
331b71ea8d
[ci/release] Refactor release test e2e into package (#22351)
Adds a unit-tested and restructured ray_release package for running release tests.

Relevant changes in behavior:

Per default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can a) specify a different commit hash, b) a wheels URL (which we will also wait for to be available) or c) specify a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches old default behavior).

The main subpackages are:

    Cluster manager: Creates cluster envs/computes, starts cluster, terminates cluster
    Command runner: Runs commands, e.g. as client command or sdk command
    File manager: Uploads/downloads files to/from session
    Reporter: Reports results (e.g. to database)

Much of the code base is unit tested, but there are probably some pieces missing.

Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_
Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023
2022-02-16 17:35:02 +00:00
matthewdeng
2c204a755b
[train] add minimal installation test suite (#22300)
Adding a minimal test suite to catch any regressions from accidentally adding backend imports (e.g. `torch`, `tensorflow`, `horovod`) to the main import path.

**Example:** If I'm running Ray Train with `tensorflow`, I should not be required to have `torch` installed.
2022-02-11 10:09:00 -08:00
SangBin Cho
20ab9188c6
[Ray Usage Stats] Record cluster metadata + Refactoring. (#22170)
This is the first PR to implement usage stats on Ray. Please refer to the file `usage_lib.py` for more details.

The full specification is here https://docs.google.com/document/d/1ZT-l9YbGHh-iWRUC91jS-ssQ5Qe2UQ43Lsoc1edCalc/edit#heading=h.17dss3b9evbj.

You can see the full PR for phase 1 from here; https://github.com/rkooo567/ray/pull/108/files.

The PR is doing some basic refactoring + adding cluster metadata to GCS instead of the version numbers. 

After this PR, we will add code to enable usage report "off by default".
2022-02-08 22:12:36 -08:00
SangBin Cho
3566cfd279
[Dashboard] Enable dashboard in the minimal ray installation (#21896)
This is the last PR to enable dashboard in the minimal ray installation.

Look https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit# for more details;
2022-01-31 22:34:40 -08:00
SangBin Cho
e62c0052a0
[Dashboard] Agent in minimal ray installation (#21817)
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation.

Note that this PR requires to introduce "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them to minimal makes thing very easy. Please see the below for the reasoning.
2022-01-26 04:03:54 -08:00
Lingxuan Zuo
ec62d7f510
[Streaming]Farewell : remove all of streaming related from ray repo. (#21770)
New repo url is https://github.com/ray-project/mobius

Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>
2022-01-23 17:53:41 +08:00
shrekris-anyscale
75b3080834
[Serve] Serve Autoscaling Release tests (#21208) 2022-01-21 12:08:25 -08:00
Yi Cheng
3c63a8410d
[gcs/ha] Fix java related error when enable redisless ray (#21692)
This PR enables ray java to be able to run without redis. It also fixes java related tests and updated the pipeline.
2022-01-20 13:56:25 -08:00
Yi Cheng
82103bf7c1
[gcs/ha] Fix cpp tests related to redis removal (#21628)
This PR fixed cpp tests and also make ray cpp able to pass.
2022-01-19 01:26:34 -08:00
mwtian
4faf3e1e31
[GCS] reenable test_client_reconnect.py for GCS HA builds (#21589)
In test_client_reconnect.py, each test case starts a Ray cluster via client server's default_connect_handler(). The Ray cluster shuts down implicitly when the start_middleman_server() ended and Python GC'es the client server. After turning on GCS pubsub, the time when client server is GC'ed changes. Sometimes the Ray cluster from a previous test cases stays alive after the next test case starts and shuts down later, leading to test failures due to lost data or crashes (race during worker shutdown, will be investigated separately).

This PR makes sure each test case shuts down its Ray cluster.
2022-01-17 23:08:47 -08:00
Yi Cheng
87d852fc28
[gcs/ha] Fix some tests failed in HA mode (#21587)
This PR fixed and reenabled tests in HA mode

- //python/ray/tests:test_healthcheck
- //python/ray/tests:test_autoscaler_drain_node_api 
- //python/ray/tests:test_ray_debugger
2022-01-16 21:53:14 -08:00
Yi Cheng
6dccfbffa9
Revert "Revert "[gcs] turn on grpc pubsub by default"" (#21585)
Reverts ray-project/ray#21584 and turn the flag off
2022-01-13 16:12:03 -08:00
mwtian
30968a9358
[GCS] support external Redis in GCS bootstrapping mode (#21436)
External Redis should still be supported with GCS bootstrapping, to avoid breaking users.
In GCS mode, some logic are removed for external Redis:
- Printing external Redis addresses to terminal: hard to implement across `ray start`, `ray.init()` and Ray cluster util.
- Starting local Redis if external Redis is unavailable: failing loudly here seems more appropriate.

Also, re-enable a few tests which restarts GCS in GCS bootstrapping mode, by using external Redis for KV storage.
2022-01-13 16:01:11 -08:00
Yi Cheng
bc696212d2
Revert "[gcs] turn on grpc pubsub by default" (#21584)
test-reconnect seems flaky.
Reverts ray-project/ray#21513
2022-01-13 12:34:02 -08:00
Yi Cheng
6194783312
[gcs] turn on grpc pubsub by default (#21513)
Turn on grpc pubsub by default.  This PR also fixed several tests which are failed before.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2022-01-12 22:13:03 -08:00
mwtian
0e5de61c18
remove unnecessary test filter (#21510)
(Comment from the PR:)
If a GRPC call exceeds timeout, the calls is cancelled at client side but server may still reply to it, leading to missed messages and test failures. Using a sequence number to ensure no message is dropped can be the long term solution,
but its complexity and the fact the Ray subscribers do not use deadline in production makes it less preferred.
Therefore, a simpler workaround is used instead: a different subscriber is used for each get_error_message() call.

Also, re-enable some additional tests in GCS HA mode.
2022-01-11 10:17:03 -08:00
Yi Cheng
65598b3bb0
[gcs] Re-enable release tests with GCS HA (#21511)
Re-enable release tests with GCS HA mode.
2022-01-10 16:35:57 -08:00
Yi Cheng
4ab059eaa1
[gcs] Fix the server standalone tests in HA mode (#21480)
CoreWorker hangs there before exiting if gcs exits first due to in correct ordering of destruction. This PR fixed this. It'll stop gcs client first and then job the thread.
2022-01-07 22:54:50 -08:00
mwtian
24da654d90
[Test] Shard "Small & Large" tests (#21351) 2022-01-05 10:49:14 -08:00
mwtian
70db5c5592
[GCS][Bootstrap n/n] Do not start Redis in GCS bootstrapping mode (#21232)
After this change in GCS bootstrapping mode, Redis no longer starts and `address` is treated as the GCS address of the Ray cluster.

Co-authored-by: Yi Cheng <chengyidna@gmail.com>
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
2022-01-04 23:06:44 -08:00
Gagandeep Singh
92bf609a08
Unskip tests in `test_basic_3.py` (#20433) 2021-12-22 00:09:32 -08:00
Yi Cheng
09421a4ca6
[2/gcs] Bootstrap dashboard for gcs ha (#21179)
This is part of gcs ha project. This PR try to bootstrap dashboard with gcs address instead of redis.

Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
2021-12-21 16:58:03 -08:00
Yi Cheng
f62faca04c
[1/gcs] gcs ha bootstrap for raylet (#21174)
This is part of #21129

This PR tries to cover the cpp/ray part of the bootstrap, some updates there:

remove the unused function/tests
some API updates

Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
2021-12-21 08:50:42 -08:00
Eric Liang
6f93ea437e
Remove the flaky test tag (#21006) 2021-12-11 01:03:17 -08:00
mwtian
b9bcd6215a
Disable two tests that are very flaky in GCS HA build (#21012)
`//python/ray/tests:test_client_reconnect` seems to only flake under GCS HA build. The client server starts to shutdown under injected failures, unlike the behavior without GCS KV or pubsub.

`//python/ray/tests:test_multi_node_3` seems to flake more often under GCS HA build, although it is still flaky without GCS HA feature flags. It seems raylet termination did not notify other processes properly.

Disable these two tests before they are fixed.
2021-12-10 17:08:25 -08:00
mwtian
6871a72a5c
[Core][Dashboard Pubsub 3/n] Migrate pubsub usages in dashboard to GCS pubsub (#20860)
Add support for Ray pubsub in dashboard. https://github.com/ray-project/ray/pull/20954 is the prerequisite, and contains more complete change under src/.
2021-12-10 14:36:57 -08:00
Kai Fricke
97ec2a03b6
[ci/buildkite] Add ml pipeline to speed up ML/RLLib tests (#20895)
ML tests will be built in a separate bootstrap step installing all required dependencies.
2021-12-09 21:14:10 +00:00
Yi Cheng
442b1025cd
[1/gcs-mem-kv] Memory mode for internal kv (#20881)
This is part work of redis removal. In this PR we introduced a new mode for internal kv, memory mode.
There are two ways to address this:
- Update store client and use store client in internal kv
- Add memory table into internal kv directly.

The former one actually is a better choice since it put everything related to storage into a lowerlevel. But it's pretty hard to do this now, since internal kv use hset/hget and redis store client use set/get, so the data will not be compatible and it'll be a brake change.

So the easier way to do this is 2) and it's what this PR doing.

Next: use the flag for store client
2021-12-08 10:40:35 -08:00
matthewdeng
0de105d42f
[train] update Trainer._is_tune_enabled to work when Tune is not installed (#20767) 2021-11-29 20:08:51 -08:00
Simon Mo
ca90c63483
[Serve] Add serve failure test to CI (#20392) 2021-11-16 08:12:08 -08:00
mwtian
a39fd74674
disable //python/ray/tests:test_autoscaler_drain_node_api in HA GCS build (#20296) 2021-11-12 15:47:42 -08:00
xwjiang2010
ce8504b0b2
[CI] Rebalance Tune tests a bit. (#20263) 2021-11-12 15:30:18 +00:00
chenk008
74fa267c72
Enable worker in container CI test (#20174) 2021-11-11 16:11:06 -08:00
mwtian
0330852baf
[Core][Pubsub] Implement Python GCS publisher and subscriber (#20111)
## Why are these changes needed?
This change adds Python publisher and subscriber in `gcs_utils.py`, and GRPC handler on GCS for publishing iva GCS. Error info is migrated to use the GCS-based pubsub, if feature flag `RAY_gcs_grpc_based_pubsub=true`.

Also, add a `--gcs-address` flag to some Python processes. It is not set anywhere yet, but will be set aftering Redis-less bootstrapping work.

Unit tests are added for the Python publisher and subscriber. Migrated error info publishers and subscribers are tested with existing unit tests, e.g. tests calling `ray._private.test_utils.get_error_message()` to ensure error info is published.

GCS based pubsub has gaps in handling deadline, cancelled requests and GCS restarts. So 3 more unit tests are disabled in the `HA GCS` mode. They will be addressed in a separate change.

## Related issue number
2021-11-11 14:59:57 -08:00
xwjiang2010
883fbd003c
[CI; Tune] Split Tune tests and examples (#20210)
* Split Tune tests and examples part 1 into tests and examples separate.

* fix typo.

* fix typo.

* Add docs.
2021-11-11 10:50:51 +01:00
Sven Mika
ebd56b57db
[RLlib; documentation] "RLlib in 60sec" overhaul. (#20215) 2021-11-10 22:20:06 +01:00
Sven Mika
50c30f89c6
[Tune; RLlib] Move Tune tests that use RLlib into separate buildkite job. (#20016) 2021-11-04 20:40:57 +01:00
Yi Cheng
65d3054a09
[build] fix the wrong flag for gcs ha test (#20052)
## Why are these changes needed?
It should be `RAY_gcs_grpc_based_pubsub` instead of `Ray_gcs_grpc_based_pubsub`

## Related issue number
2021-11-04 09:59:11 -07:00
Sven Mika
4cb23d1c95
[Tune; Testing] Revert to 3.7 (undone by accident by previous PR); + some minor comment cleanups. (#20031) 2021-11-04 10:58:34 +01:00
mwtian
f83195a1e1
[Build] Add GCS HA builds (#20008)
## Why are these changes needed?
Add builds for Python tests with GCS pubsub enabled.

## Related issue number
2021-11-03 11:58:16 -07:00
Avnish Narayan
026bf01071
[RLlib] Upgrade gym version to 0.21 and deprecate pendulum-v0. (#19535)
* Fix QMix, SAC, and MADDPA too.

* Unpin gym and deprecate pendulum v0

Many tests in rllib depended on pendulum v0,
however in gym 0.21, pendulum v0 was deprecated
in favor of pendulum v1. This may change reward
thresholds, so will have to potentially rerun
all of the pendulum v1 benchmarks, or use another
environment in favor. The same applies to frozen
lake v0 and frozen lake v1

Lastly, all of the RLlib tests and have
been moved to python 3.7

* Add gym installation based on python version.

Pin python<= 3.6 to gym 0.19 due to install
issues with atari roms in gym 0.20

* Reformatting

* Fixing tests

* Move atari-py install conditional to req.txt

* migrate to new ale install method

* Fix QMix, SAC, and MADDPA too.

* Unpin gym and deprecate pendulum v0

Many tests in rllib depended on pendulum v0,
however in gym 0.21, pendulum v0 was deprecated
in favor of pendulum v1. This may change reward
thresholds, so will have to potentially rerun
all of the pendulum v1 benchmarks, or use another
environment in favor. The same applies to frozen
lake v0 and frozen lake v1

Lastly, all of the RLlib tests and have
been moved to python 3.7
* Add gym installation based on python version.

Pin python<= 3.6 to gym 0.19 due to install
issues with atari roms in gym 0.20

Move atari-py install conditional to req.txt

migrate to new ale install method

Make parametric_actions_cartpole return float32 actions/obs

Adding type conversions if obs/actions don't match space

Add utils to make elements match gym space dtypes

Co-authored-by: Jun Gong <jungong@anyscale.com>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-11-03 16:24:00 +01:00