Commit graph

162 commits

Author SHA1 Message Date
Yi Cheng
818bb78542
[ci] Stop syncer staging tests (#26273)
The tests has been running for 1-2 months, and the overall observation is that it's not very useful to catch the actual regression. Basically, we didn't notice any regression. Stop this test for now to save some resources.
2022-07-03 11:17:10 -07:00
Jiao
f6735f90c7
[Ray DAG] Move dag project folder out of experimental (#25532) 2022-06-16 19:15:39 -07:00
Amog Kamsetty
1316a2d05e
[AIR/Train] Move ray.air.train to ray.train (#25570) 2022-06-08 21:34:18 -07:00
Yi Cheng
acf210fcac
[flakey] Skip ray_syncer_test for ubsan. (#25477)
From the message:
```
[       OK ] SyncerTest.TestMToN (13132 ms)
[----------] 5 tests from SyncerTest (43175 ms total)

[----------] Global test environment tear-down
[==========] 8 tests from 2 test suites ran. (43176 ms total)
[  PASSED  ] 8 tests.
external/com_github_grpc_grpc/src/core/lib/iomgr/ev_posix.cc:314:19: runtime error: member access within null pointer of type 'const struct grpc_event_engine_vtable'
```

This can only be reproduced by running with Bazel test so far. With gdb, it won't be reproduced. It seems like some issue with the grpc maybe the reactor API. 
Given that the ASAN test, which is supposed to catch the issue, runs well, and a considerable time has been spent investigating this one but no progress, skip this test for now.
2022-06-04 23:06:57 -07:00
Yi Cheng
cb1f08a3c1
[core] Basic end-2-end multi-node tests for GCS HA in CI. (#25114)
In this PR we simulate the case where serve can continue to function even when GCS is down and the reconfig continue to work once GCS is back.

To make it close to the real-world case, the docker is used for isolation:

It starts a head node (0 cpus) and a worker node
It tried the basic function and make sure it's working
It kills GCS and make sure everything is working.
It starts GCS and make sure reconfig continues to work.
This is the basic cases for serve HA. We'll add more once we get better integrations.
2022-06-02 02:41:38 +00:00
Yi Cheng
8ec558dcb9
[core] Reenable GCS test with redis as backend. (#23506)
Since ray supports Redis as a storage backend, we should ensure the code path with Redis as storage is still being covered e2e.

The tests don't run for a while after we switch to memory mode by default. This PR tries to fix this and make it run with every commit.

In the future, if we support more and more storage backends, this should be revised to be more efficient and selective. But now I think the cost should be ok.

This PR is part of GCS HA testing-related work.
2022-05-19 21:46:55 -07:00
mwtian
502c3e132d
Revert "[Core] allow using grpcio > 1.44.0 (#23722)" (#24935)
This reverts commit b02029b29f.
2022-05-18 18:16:39 -07:00
SangBin Cho
fb60d68bbb
[WIP] Run minimal tests against all supported python version (#24830)
Run minimal CI tests to all Python versions.
2022-05-18 09:42:26 -07:00
Chen Shen
1325cf7876
[python3.10] Build py310 images (#24859)
Build python 3.10 images so we can run release tests.
2022-05-18 08:48:20 -07:00
Yi Cheng
68384ec745
[ci] Add flag for staging tests and disable the unstable one. (#24745)
This PR tries to add a prefix for the staging ci test. This is useful to separate staging tests from stable tests in https://flakey-tests.ray.io/
2022-05-13 13:48:14 -07:00
Yi Cheng
a7d552ca25
[ci] Fix syncer staging tests error (#24681)
The staging tests failed due to using the wrong file. This PR fixed it.

https://buildkite.com/ray-project/ray-builders-branch/builds/7458#d6c28480-4c99-4a69-908c-9b0b5af9ce1f
2022-05-10 23:23:50 -07:00
Yi Cheng
6c60dbb242
[scheduler][6] Integrate ray with syncer. (#23660)
The new syncer comes with the feature of long-polling and versioning. This PR integrates it with ray.
2022-05-10 13:12:22 -07:00
Kai Yang
4a999777fa
[Core] Allow accepting gRPC HTTP proxy via env variable (#23526) 2022-05-10 11:30:46 +08:00
mwtian
b02029b29f
[Core] allow using grpcio > 1.44.0 (#23722) 2022-05-04 19:06:11 -07:00
Dmitri Gekhtman
d68c1ecaf9
[kuberay] Test Ray client and update autoscaler image (#24195)
This PR adds KubeRay e2e testing for Ray client and updates the suggested autoscaler image to one running the merge commit of PR #23883 .
2022-04-27 18:02:12 -07:00
Kai Fricke
b86d420a3c
[ci] Only upload wheels to S3 once (#24072)
Currently all jobs that build wheels put them into the artifacts directory and upload them. This leads to the wheels being overwritten on S3 multiple times. This is not a huge problem as ingress is free, but in order to have a single point of reference, it might be beneficial to limit the wheels uploading to a single Buildkite job. Recently, this has led to interference with stale artifact directories.

The downside here is that if the "Wheels & Jars" build fails randomly, the wheels will not be available on S3 - previously they've been also uploaded by several other jobs.
2022-04-25 21:19:11 +01:00
Dmitri Gekhtman
8c5fe44542
[KubeRay] Fix autoscaling with GPUs and custom resources, with e2e tests (#23883)
- Closes #23874 by fixing a typo ("num_gpus" -> "num-gpus").
- Adds end-to-end test logic confirming the fix.
- Adds end-to-end test logic confirming autoscaling with custom resources works.
- Slightly refines developer instructions.
- Deflakes test logic a bit by allowing for the event that the head pod changes its identity as the Ray cluster starts up.
2022-04-21 14:54:37 -07:00
Yi Cheng
04611edf5a
[scheduler] Update syncer API and add reconnect feature. (#23929)
This PR focuses on updating syncer-related code and comments from this #23660 to reduce the code size.

Update Snapshot/Update -> CreateSyncMessage/ConsumeSyncMessage
Make ray syncer test work even when we add more components in the protobuf
Make ray syncer able to reconnect to a new node.
2022-04-20 14:31:24 -07:00
Kai Fricke
65d9a410f7
[ci] Clean up ci/ directory (refactor ci/travis) (#23866)
Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.

Details:

- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
2022-04-13 18:11:30 +01:00
Kai Fricke
afd287eb93
[ci] linkcheck should soft fail (#23559)
Linkcheck failures should not break the build.
2022-03-29 10:57:03 -07:00
Eric Liang
990b0ec934
Move linkcheck into a separate CI build
Why are these changes needed?
Linkcheck is inherently flaky, so separate it from the normal LINT build which is never flaky. This also separates the verbose linkcheck logs, making it easier to read the LINT output.
2022-03-29 01:08:53 -07:00
Yi Cheng
7de751dbab
[1][core][cleanup] remove enable gcs bootstrap in cpp. (#23518)
This PR remove enable_gcs_bootstrap flag in cpp.
2022-03-28 21:37:24 -07:00
Dmitri Gekhtman
bc98afcdf8
Test of KubeRay autoscaler integration (#23365)
This PR adds a test of KubeRay autoscaler integration to the Ray CI.

- Tests scaling with autoscaler.sdk.request_resources
- Tests autoscaler response to RayCluster CR change
2022-03-23 18:18:48 -07:00
shrekris-anyscale
b00977b1b1
[serve] Remove dashboard's dependency on Serve (#23389) 2022-03-21 22:14:41 -07:00
Jialing He
4a83bc3dc2
[runtime env] Support set timeout for runtime env setup (#23082)
Interface example:
```python
@ray.remote(runtime_env=RuntimeEnv(..., config=RuntimeEnvConfig(setup_timeout_s=10))
def f(): pass

@ray.remote(runtime_env={..., "config": {"setup_timeout_s": 10}})
def f(): pass
```

Support set timeout second for timeout of runtime environment creation.

Co-authored-by: 捕牛 <hejialing.hjl@antgroup.com>
2022-03-18 12:52:59 -05:00
Kai Fricke
da140a80e9
[ci/release] Legacy field should be optional (#23326)
#22749 broke release unit tests by not providing a legacy key - that key should be optional because we will b dealing with non-legacy tests soon.
Additionally, for some reason the unit tests pass on buildkite while they fail locally and in the release test pipeline. I'm investigating this now...
2022-03-18 11:34:05 +00:00
Simon Mo
78d6ed7029
[Serve] [CI] Split Serve tests into multiple shards (#23145) 2022-03-15 16:32:30 -07:00
mwtian
6eb805b357
[CI] remove GCS-Ray CI tests (#23149)
* remove redis ci tests

* remove mac
2022-03-14 18:18:59 -07:00
Simon Mo
3d3218d153
[CI] Add K8s Builder Step (#22035) 2022-02-24 13:11:38 -08:00
Yi Cheng
e3051ebf67
[ci] Fix grpcio 1.44 break test_output (#22494)
This PR limit grpc to be <= 1.42. This will fix testoutput.
2022-02-22 13:59:25 -08:00
Simon Mo
3e7511e84f
[CI] Disable privileged test (#22484) 2022-02-17 15:34:02 -08:00
Kai Fricke
331b71ea8d
[ci/release] Refactor release test e2e into package (#22351)
Adds a unit-tested and restructured ray_release package for running release tests.

Relevant changes in behavior:

Per default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can a) specify a different commit hash, b) a wheels URL (which we will also wait for to be available) or c) specify a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches old default behavior).

The main subpackages are:

    Cluster manager: Creates cluster envs/computes, starts cluster, terminates cluster
    Command runner: Runs commands, e.g. as client command or sdk command
    File manager: Uploads/downloads files to/from session
    Reporter: Reports results (e.g. to database)

Much of the code base is unit tested, but there are probably some pieces missing.

Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_
Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023
2022-02-16 17:35:02 +00:00
matthewdeng
2c204a755b
[train] add minimal installation test suite (#22300)
Adding a minimal test suite to catch any regressions from accidentally adding backend imports (e.g. `torch`, `tensorflow`, `horovod`) to the main import path.

**Example:** If I'm running Ray Train with `tensorflow`, I should not be required to have `torch` installed.
2022-02-11 10:09:00 -08:00
SangBin Cho
20ab9188c6
[Ray Usage Stats] Record cluster metadata + Refactoring. (#22170)
This is the first PR to implement usage stats on Ray. Please refer to the file `usage_lib.py` for more details.

The full specification is here https://docs.google.com/document/d/1ZT-l9YbGHh-iWRUC91jS-ssQ5Qe2UQ43Lsoc1edCalc/edit#heading=h.17dss3b9evbj.

You can see the full PR for phase 1 from here; https://github.com/rkooo567/ray/pull/108/files.

The PR is doing some basic refactoring + adding cluster metadata to GCS instead of the version numbers. 

After this PR, we will add code to enable usage report "off by default".
2022-02-08 22:12:36 -08:00
SangBin Cho
3566cfd279
[Dashboard] Enable dashboard in the minimal ray installation (#21896)
This is the last PR to enable dashboard in the minimal ray installation.

Look https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit# for more details;
2022-01-31 22:34:40 -08:00
SangBin Cho
e62c0052a0
[Dashboard] Agent in minimal ray installation (#21817)
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation.

Note that this PR requires to introduce "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them to minimal makes thing very easy. Please see the below for the reasoning.
2022-01-26 04:03:54 -08:00
Lingxuan Zuo
ec62d7f510
[Streaming]Farewell : remove all of streaming related from ray repo. (#21770)
New repo url is https://github.com/ray-project/mobius

Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>
2022-01-23 17:53:41 +08:00
shrekris-anyscale
75b3080834
[Serve] Serve Autoscaling Release tests (#21208) 2022-01-21 12:08:25 -08:00
Yi Cheng
3c63a8410d
[gcs/ha] Fix java related error when enable redisless ray (#21692)
This PR enables ray java to be able to run without redis. It also fixes java related tests and updated the pipeline.
2022-01-20 13:56:25 -08:00
Yi Cheng
82103bf7c1
[gcs/ha] Fix cpp tests related to redis removal (#21628)
This PR fixed cpp tests and also make ray cpp able to pass.
2022-01-19 01:26:34 -08:00
mwtian
4faf3e1e31
[GCS] reenable test_client_reconnect.py for GCS HA builds (#21589)
In test_client_reconnect.py, each test case starts a Ray cluster via client server's default_connect_handler(). The Ray cluster shuts down implicitly when the start_middleman_server() ended and Python GC'es the client server. After turning on GCS pubsub, the time when client server is GC'ed changes. Sometimes the Ray cluster from a previous test cases stays alive after the next test case starts and shuts down later, leading to test failures due to lost data or crashes (race during worker shutdown, will be investigated separately).

This PR makes sure each test case shuts down its Ray cluster.
2022-01-17 23:08:47 -08:00
Yi Cheng
87d852fc28
[gcs/ha] Fix some tests failed in HA mode (#21587)
This PR fixed and reenabled tests in HA mode

- //python/ray/tests:test_healthcheck
- //python/ray/tests:test_autoscaler_drain_node_api 
- //python/ray/tests:test_ray_debugger
2022-01-16 21:53:14 -08:00
Yi Cheng
6dccfbffa9
Revert "Revert "[gcs] turn on grpc pubsub by default"" (#21585)
Reverts ray-project/ray#21584 and turn the flag off
2022-01-13 16:12:03 -08:00
mwtian
30968a9358
[GCS] support external Redis in GCS bootstrapping mode (#21436)
External Redis should still be supported with GCS bootstrapping, to avoid breaking users.
In GCS mode, some logic are removed for external Redis:
- Printing external Redis addresses to terminal: hard to implement across `ray start`, `ray.init()` and Ray cluster util.
- Starting local Redis if external Redis is unavailable: failing loudly here seems more appropriate.

Also, re-enable a few tests which restarts GCS in GCS bootstrapping mode, by using external Redis for KV storage.
2022-01-13 16:01:11 -08:00
Yi Cheng
bc696212d2
Revert "[gcs] turn on grpc pubsub by default" (#21584)
test-reconnect seems flaky.
Reverts ray-project/ray#21513
2022-01-13 12:34:02 -08:00
Yi Cheng
6194783312
[gcs] turn on grpc pubsub by default (#21513)
Turn on grpc pubsub by default.  This PR also fixed several tests which are failed before.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2022-01-12 22:13:03 -08:00
mwtian
0e5de61c18
remove unnecessary test filter (#21510)
(Comment from the PR:)
If a GRPC call exceeds timeout, the calls is cancelled at client side but server may still reply to it, leading to missed messages and test failures. Using a sequence number to ensure no message is dropped can be the long term solution,
but its complexity and the fact the Ray subscribers do not use deadline in production makes it less preferred.
Therefore, a simpler workaround is used instead: a different subscriber is used for each get_error_message() call.

Also, re-enable some additional tests in GCS HA mode.
2022-01-11 10:17:03 -08:00
Yi Cheng
65598b3bb0
[gcs] Re-enable release tests with GCS HA (#21511)
Re-enable release tests with GCS HA mode.
2022-01-10 16:35:57 -08:00
Yi Cheng
4ab059eaa1
[gcs] Fix the server standalone tests in HA mode (#21480)
CoreWorker hangs there before exiting if gcs exits first due to in correct ordering of destruction. This PR fixed this. It'll stop gcs client first and then job the thread.
2022-01-07 22:54:50 -08:00
mwtian
24da654d90
[Test] Shard "Small & Large" tests (#21351) 2022-01-05 10:49:14 -08:00