555 commits

Author SHA1 Message Date
SangBin Cho
c0f8de9c3c
[Nightly tests] Run benchmark tests on k8s as well (#23100)
Run benchmark tests on k8s as well.

Note that until k8s cluster stability is confirmed, we will run the same tests twice, once on AWS and once on k8s. Once all benchmark tests look stable, we will start the full migration.
2022-03-11 19:40:37 -08:00
SangBin Cho
97383e4c1b
[Nightly test] Fix a broken nightly test due to the wrong config (#23097) 2022-03-11 16:47:06 -08:00
SangBin Cho
2b38fe89e2
[Nightly tests] Migrate rest of core tests (#23085)
Migrate the rest of the core tests.
2022-03-11 10:41:14 -08:00
Kai Fricke
04ea180dfb
[ci/release] Add "tiny" concurrency group, change limits (#23065)
E.g. long running tests run on small clusters (often 8 CPUs) but block other jobs for a long time. We should thus add more granularity to the concurrency groups.
Additionally, limits have been slightly adjusted to make more sense (e.g. 8 GPUs are now small-gpu, 9+ GPUs large-gpu, instead of 7 for small-gpu and 8 for large-gpu).
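The grouping described above can be sketched as a small lookup. This is only an illustration, not the release package's code: the GPU cut-off (up to 8 GPUs is small-gpu, 9 or more is large-gpu) comes from the message above, while the function name `infer_concurrency_group` and every CPU threshold in `CPU_GROUPS` are assumptions made up for the example.

```python
from typing import List, Tuple

# Hypothetical CPU cut-offs (inclusive upper bound, group name).
# Only the group names come from the commit messages; the numbers are assumed.
CPU_GROUPS: List[Tuple[int, str]] = [(8, "tiny"), (32, "small"), (128, "medium")]


def infer_concurrency_group(cpus: int, gpus: int = 0) -> str:
    """Pick a concurrency group from the resources a test cluster uses."""
    if gpus > 0:
        # From the commit message: 8 GPUs are small-gpu, 9+ GPUs are large-gpu.
        return "small-gpu" if gpus <= 8 else "large-gpu"
    for upper_bound, group in CPU_GROUPS:
        if cpus <= upper_bound:
            return group
    # Anything above the last assumed cut-off falls into the largest CPU group.
    return "large"


print(infer_concurrency_group(cpus=8))
print(infer_concurrency_group(cpus=64, gpus=8))
print(infer_concurrency_group(cpus=64, gpus=9))
```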
2022-03-11 10:19:38 -08:00
Kai Fricke
a8bed94ed6
[ci/release] Always use full cluster address (#23067)
Not using the full cluster address is deprecated and breaks Job usage for uploads/downloads: https://buildkite.com/ray-project/release-tests-branch/builds/135#2a03e47b-6a9a-42ff-9346-905725eb8d09
2022-03-11 16:31:21 +00:00
SangBin Cho
965d609627
[Nightly test] Fix a minor syntax issue for core nightly tests (#23069)
Add frequency to smoke tests
Remove unnecessary alerts
2022-03-11 04:58:40 -08:00
Kai Fricke
5b2d58674b
[ci/release] Migrate horovod tests (#22951)
Migrating horovod tests to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/125
2022-03-11 09:53:29 +00:00
SangBin Cho
ebac18d163
[Nightly test] Support Job based file manager + runner (#22860)
This PR supports the job-based file manager and runner. It will be the backbone of k8s migration.

The PR handles edge cases that originally existed in the old e2e.py job-based runners.
2022-03-10 15:03:50 -08:00
SangBin Cho
92b50ff5da
Migrate multi nightly tests (#23005) 2022-03-11 01:32:10 +09:00
shrekris-anyscale
1100c98222
[serve] Implement Serve Application object (#22917)
The concept of a Serve Application, a data structure containing all information needed to deploy Serve on a Ray cluster, has surfaced during recent design discussions. This change introduces a formal Application data structure and refactors existing code to use it.
2022-03-10 10:28:29 -06:00
SangBin Cho
d192ec30fd
[Nightly Tests] Readjust the concurrency limit. (#23002)
This PR reduces the concurrency limit. Based on a back-of-the-envelope calculation, the current concurrency limit can easily exceed the service quota.

Given large == 2048 vCPUs, it will use about 20K vCPUs, which is slightly larger than the limit.
2022-03-10 07:19:38 -08:00
SangBin Cho
4fa294ca49
[Nightly tests] Stop running broken tests (#22993) 2022-03-10 06:59:51 -08:00
SangBin Cho
e88abe4c8e
[Nightly tests] migrated most of daily tests (#22960)
* migrated most of daily tests

* Addressed code review.
2022-03-10 05:49:16 -08:00
Kai Fricke
007cf03d7a
[ci/release] Migrate RLLib tests (#22967)
Migrate to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/111
2022-03-10 10:26:03 +00:00
Kai Fricke
fee4065daf
[ci/release] Migrate SGD tests (#22966)
Migrate to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/110
2022-03-10 10:23:50 +00:00
Kai Fricke
614dc6b511
[ci/release] Migrate Serve tests (#22965)
Migrate to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/109
2022-03-10 10:23:25 +00:00
Kai Fricke
ccda1555cc
[ci/release] Migrate Runtime Env tests (#22963)
Migrating to new release test package.

https://buildkite.com/ray-project/release-tests-branch/builds/108
2022-03-10 10:22:57 +00:00
kyle-chen-uber
592656ca28
[horovod] remove deprecated slot concept, use worker instead (#22708)
Horovod updated the attributes of DistributedTrainableCreator and args to create Horovod RayExecutor.
horovod/horovod@a729ba7

The major issue is that Horovod deprecated the "slot" concept in favor of "worker", which is more consistent with the generic Ray worker. This is currently blocking Uber DL trainers from using Ray Tune.

This commit updates the Horovod RayExecutor init args.

Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-03-10 08:16:42 +00:00
Kai Fricke
18d535f290
[ci/release] Migrate LightGBM tests (#22952)
Note that LightGBM release tests were previously not enabled.
https://buildkite.com/ray-project/release-tests-branch/builds/113
https://buildkite.com/ray-project/release-tests-branch/builds/114

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-03-10 08:14:31 +00:00
Edward Oakes
22e698d0ff
[serve][release tests] Add smoke test to CI for remaining tests (#22962) 2022-03-09 23:36:32 -06:00
Stephanie Wang
1b45582e43
[tests] Enable chaos testing for Dask-on-Ray (#22927)
Turns on failures for Dask-on-Ray chaos tests.
2022-03-09 18:08:41 -05:00
Edward Oakes
135cd121b9
[release tests] Fix minor bug in multi-deployment serve test (#22961) 2022-03-09 14:37:27 -06:00
Kai Fricke
ca87c37c61
[ci/release] Fix result output in Buildkite pipeline run (#22946)
The new buildkite pipeline prints out faulty results due to a confusion of -ge/-gt and -le/-lt in the retry script. This is a cosmetic error (so behavior was still correct) that is resolved with this PR.
2022-03-09 17:29:31 +00:00
Edward Oakes
2cac49e4b0
[serve][release tests] Mark long-running failure test as non-stable (#22922) 2022-03-09 09:42:47 -06:00
Kai Fricke
ac654dbb9d
[ci/release] Fix schema validation for single tests / add stable field (#22947)
Builds currently fail with schema validation errors since #22901 was merged, because the stable field was mistakenly not added to the schema beforehand.
2022-03-09 15:22:49 +00:00
Kai Fricke
cac9d30909
[ci/release] Add schema validation for release test config (#22919)
To avoid breakage like in #22905, this PR adds schema validation to the release test package.
In a follow-up PR, we'll likely switch this to use pydantic instead.
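As a rough illustration of what such a check does, here is a hand-rolled validator in plain Python. It is not the release package's schema: the `SCHEMA` contents, the `OPTIONAL` set and the function name `validate_test` are assumptions for the example, and only the `frequency` and `stable` field names appear in the surrounding commit messages.

```python
from typing import Any, Dict, List

# Hypothetical schema: field name -> expected Python type.
SCHEMA: Dict[str, type] = {"name": str, "frequency": str, "stable": bool}
# Fields that may be omitted (assumed for this sketch).
OPTIONAL = {"stable"}


def validate_test(test: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable schema errors (empty if valid)."""
    errors: List[str] = []
    for field, expected in SCHEMA.items():
        if field not in test:
            if field not in OPTIONAL:
                errors.append(f"missing required field: {field}")
        elif not isinstance(test[field], expected):
            actual = type(test[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    # Reject fields the schema does not know about, so typos surface early.
    for field in test:
        if field not in SCHEMA:
            errors.append(f"unknown field: {field}")
    return errors


print(validate_test({"name": "example_test", "frequency": "nightly"}))
print(validate_test({"name": "example_test"}))
```

Running the check before the pipeline is generated means a test entry with a missing field fails fast instead of breaking the scheduled build.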
2022-03-09 09:50:51 +00:00
Edward Oakes
aa907987bf
[serve][release tests] Use m5.8xlarge instance types for 1k replica tests (#22918) 2022-03-08 21:34:01 -06:00
SangBin Cho
549527687f
Migrate scalability tests (#22901)
This PR migrates scalability tests to the new infra.

I had to copy the benchmarks folder to the release folder to make it work. I will remove some unnecessary files (e.g., benchmark.yaml or the wait_for_cluster file). Alternatively, we can support a path other than /release in the tool, but I think this way is cleaner. I am open to suggestions, though. cc @krfricke
2022-03-08 17:22:41 -08:00
Kai Fricke
c57abb693b
[ci/release] Add frequency to core nightly test (#22905)
Breaks the scheduled build: https://buildkite.com/ray-project/release-tests-branch/builds/82#3994f5e1-6da3-4c70-8c30-bdcfb1fec851

We should enforce schema validation soon.
2022-03-08 17:44:20 +00:00
SangBin Cho
0137fc8e23
[Tests] Add microbenchmark to the new infra test (#22861)
Verified it works. It also addresses the frequency comments from the previous PR
2022-03-08 05:58:49 -08:00
Stephanie Wang
cb218d03b9
[core] Enable lineage reconstruction by default (#22816)
Enables lineage reconstruction, which allows automatic recovery of task outputs, by default.

Also adds an info message to the driver whenever objects need to be reconstructed (not including recursive reconstruction).
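The idea behind lineage reconstruction can be shown with a toy object store in plain Python. This is a conceptual sketch only and does not use or mirror Ray's implementation: `ToyObjectStore` and all of its methods are hypothetical names. The store remembers which function and arguments produced each object, so a lost value can be recomputed on demand.

```python
from typing import Any, Callable, Dict, List, Tuple


class ToyObjectStore:
    """Toy store that recomputes lost objects from their recorded lineage."""

    def __init__(self) -> None:
        self._values: Dict[int, Any] = {}
        # Lineage: object id -> (function, ids of the argument objects).
        self._lineage: Dict[int, Tuple[Callable[..., Any], Tuple[int, ...]]] = {}
        self._next_id = 0
        self.reconstructed: List[int] = []

    def submit(self, fn: Callable[..., Any], *arg_ids: int) -> int:
        """Run a task, store its output, and record how it was produced."""
        oid = self._next_id
        self._next_id += 1
        self._lineage[oid] = (fn, arg_ids)
        self._values[oid] = fn(*(self.get(a) for a in arg_ids))
        return oid

    def lose(self, oid: int) -> None:
        """Simulate losing an object, e.g. because a node died."""
        self._values.pop(oid, None)

    def get(self, oid: int) -> Any:
        if oid not in self._values:
            # Re-run the task; lost arguments are rebuilt recursively.
            fn, arg_ids = self._lineage[oid]
            self.reconstructed.append(oid)
            self._values[oid] = fn(*(self.get(a) for a in arg_ids))
        return self._values[oid]


store = ToyObjectStore()
a = store.submit(lambda: 2)
b = store.submit(lambda x: x * 10, a)
store.lose(a)
store.lose(b)
print(store.get(b), store.reconstructed)
```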
2022-03-07 17:40:30 -05:00
SangBin Cho
529911ee78
[Nightly tests] Add missing patches (#22862)
These changes were added to the old e2e.py but not to the new infra.
2022-03-07 19:48:43 +00:00
Jiajun Yao
1b5efb588e
[Release Test] Change release test db reporter report_time to report_timestamp_ms (#22844)
This makes timestamps easier to sort and compare, and avoids timezone issues.
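A minimal sketch of why an integer millisecond timestamp sidesteps timezone problems. Only the field name `report_timestamp_ms` comes from the commit title; the helper `to_timestamp_ms` and the sample datetimes are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_timestamp_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to integer milliseconds since the epoch."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


# The same instant written in two different UTC offsets.
pst = datetime(2022, 3, 7, 4, 54, 19, tzinfo=timezone(timedelta(hours=-8)))
utc = datetime(2022, 3, 7, 12, 54, 19, tzinfo=timezone.utc)

record_a = {"report_timestamp_ms": to_timestamp_ms(pst)}
record_b = {"report_timestamp_ms": to_timestamp_ms(utc)}

# Both records carry the same integer, so sorting and comparing are unambiguous.
print(record_a["report_timestamp_ms"] == record_b["report_timestamp_ms"])
```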
2022-03-07 04:54:19 -08:00
SangBin Cho
9d0148dbbe
[Test] Migrate the first test to the new infra (#22770)
This migrates the simplest nightly test to the new infra. I will also explore the k8s migration with this test.
2022-03-06 18:24:54 -08:00
Jiajun Yao
23f2862067
[Release Test] Send release test result to db pipeline for new test infra (#22813)
* Send release test result to db pipeline for new test infra

* address comment
2022-03-05 07:34:40 +09:00
xwjiang2010
ee7a458762
[release test] fix horovod release test. (#22781)
horovod_user_test_master is failing with the recent horovod release ([link](https://buildkite.com/ray-project/periodic-ci/builds/2960#61dabda8-eea0-4b7b-93bf-9e341926d3fd)).
The error message is:
```
AttributeError: Can't get attribute '_ExecutorDriver' on <module 'horovod.ray.runner' from '/home/ray/anaconda3/lib/python3.7/site-packages/horovod/ray/runner.py'>
```
The horovod test is set up so that it has a "driver" (a.k.a. client) part, which is the code that runs in a Buildkite agent, and a "cluster" (a.k.a. server) part, which runs in an Anyscale cluster. The driver's dependencies are specified by `release/ml_user_tests/horovod/driver_setup_master.sh`, while the cluster's dependencies are specified by `release/horovod_tests/app_config_master.yaml`.

The two communicate via the Anyscale client.
The error message above indicates that the client's horovod version has _ExecutorDriver in runner.py while the server's does not. This is caused by a version mismatch between the two files above. This PR makes both horovod dependencies point to horovod master.
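This failure mode can be reproduced without horovod. The sketch below is an illustration under assumptions: `toy_runner` is a made-up stand-in module for `horovod.ray.runner`. An object is pickled while the class exists (the driver's newer version) and unpickled after the class has been removed (the cluster's older version).

```python
import pickle
import sys
import types

# Hypothetical stand-in for the runner module; not the real horovod module.
runner = types.ModuleType("toy_runner")
sys.modules["toy_runner"] = runner

# "Driver" side: this version of the module defines _ExecutorDriver.
runner._ExecutorDriver = type("_ExecutorDriver", (), {"__module__": "toy_runner"})
payload = pickle.dumps(runner._ExecutorDriver())

# "Cluster" side: simulate a version of the module that lacks the class.
del runner._ExecutorDriver

error = None
try:
    pickle.loads(payload)
except AttributeError as exc:
    # pickle stores only the module and class name, so the lookup fails here.
    error = exc
    print(f"{type(exc).__name__}: {exc}")
```

Because pickle records a reference to the class rather than its code, both sides must have a module version that defines it, which is why pinning both dependency files to the same horovod version resolves the error.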
2022-03-03 08:24:26 -08:00
Clark Zinzow
fa44ec82f3
Add Parquet metadata resolution nightly test to test set. (#22787) 2022-03-02 14:56:00 -08:00
Kai Fricke
7425fa6212
[ci/release] Add support for concurrency groups (#22728)
This PR adds concurrency groups to Buildkite release test runs with new release test package. Five concurrency groups are defined (large-gpu, small-gpu, large, medium, small). If not specified manually, concurrency groups are inferred from used cluster resources.

Example pipeline: https://buildkite.com/ray-project/release-tests-branch/builds/55#09109eac-d22e-43bc-889e-078cfb037373 (click on Artifacts --> pipeline.json)
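For orientation, a generated step in such a pipeline might look roughly like the output of the sketch below. `concurrency` and `concurrency_group` are standard Buildkite command step attributes; the limits in `CONCURRENCY_LIMITS`, the `build_step` helper, the test name and the placeholder command are all assumptions, not values from the release package.

```python
import json
from typing import Any, Dict

# Hypothetical per-group limits; the real values are not given in the message.
CONCURRENCY_LIMITS: Dict[str, int] = {
    "large-gpu": 2,
    "small-gpu": 8,
    "large": 4,
    "medium": 16,
    "small": 32,
}


def build_step(test_name: str, group: str) -> Dict[str, Any]:
    """Build one Buildkite command step limited by its concurrency group."""
    return {
        "label": test_name,
        "command": f"echo 'run {test_name}'",  # placeholder command
        # Buildkite runs at most `concurrency` jobs per `concurrency_group`.
        "concurrency_group": group,
        "concurrency": CONCURRENCY_LIMITS[group],
    }


pipeline = {"steps": [build_step("example_gpu_test", "small-gpu")]}
print(json.dumps(pipeline, indent=2))
```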
2022-03-02 16:35:54 +01:00
Jiajun Yao
04a1a19f6b
[Release Test] Send release test result to db pipeline (#22667)
Send release test result to db pipeline
Add perf metrics for microbenchmark so that we can alert on them
2022-03-02 06:19:31 -08:00
Kai Fricke
d06c3ffd6f
[release] Migrate Tune + XGBoost tests to new infrastructure (#22705)
Migrate XGBoost and Tune tests to new release testing infrastructure.

https://buildkite.com/ray-project/release-tests-branch/builds/50
2022-03-01 08:10:06 +01:00
SangBin Cho
2c1184592e
mark threaded actor test unstable (#22696) 2022-02-28 15:25:14 -08:00
Clark Zinzow
cf3577f0ee
[Datasets] Patch Parquet file fragment serialization to prevent metadata fetching. (#22665) 2022-02-28 15:15:30 -08:00
Chen Shen
7e90700521
[Dataset][nightly-test] promote data ingestion test to stable #22702 2022-02-28 14:00:18 -08:00
Kai Fricke
3695408a85
[release] Fix special cases in release test package (e.g. smoke test) (#22442)
Fixing special cases (e.g. smoke tests, long running tests) in the release test package infrastructure. Prepare migration of Tune and XGBoost tests.
2022-02-28 21:05:01 +01:00
SangBin Cho
1cedb1b6e4
[Test] Increase timeout for microbenchmark (#22655) 2022-02-25 17:29:12 -08:00
Sven Mika
7b687e6cd8
[RLlib] SlateQ: Add a hard-task learning test to weekly regression suite. (#22544) 2022-02-25 21:58:16 +01:00
Archit Kulkarni
31332f8930
[serve] [release tests] Add health check grace period for 1k deployment (#22651) 2022-02-25 12:13:44 -06:00
Archit Kulkarni
1165f99b0b
[CI] disable Serve microbenchmark k8s (#22631) 2022-02-24 16:50:06 -08:00
Yi Cheng
de76d86bcb
[nightly] Stop GCS HA related nightly test (#22636)
Since we've already turned it on on master, we should stop these tests for now.
2022-02-24 16:40:08 -08:00
Jun Gong
99b7be5e22
[rllib] Fix impala long running test (#22619)
fix impala long running test.
Bandits is the first agent that requires torch import at registration time.
2022-02-24 09:03:55 -08:00