Commit graph

498 commits

Author SHA1 Message Date
SangBin Cho
ebac18d163
[Nightly test] Support Job based file manager + runner (#22860)
This PR adds support for the job-based file manager and runner, which will be the backbone of the k8s migration.

The PR handles edge cases that originally existed in the old e2e.py job-based runners.
2022-03-10 15:03:50 -08:00
SangBin Cho
92b50ff5da
Migrate multi nightly tests (#23005) 2022-03-11 01:32:10 +09:00
shrekris-anyscale
1100c98222
[serve] Implement Serve Application object (#22917)
The concept of a Serve Application, a data structure containing all information needed to deploy Serve on a Ray cluster, has surfaced during recent design discussions. This change introduces a formal Application data structure and refactors existing code to use it.
2022-03-10 10:28:29 -06:00
SangBin Cho
d192ec30fd
[Nightly Tests] Readjust the concurrency limit. (#23002)
This PR reduces the concurrency limit. Based on a back-of-the-envelope calculation, the current concurrency limit can easily exceed the service quota.

Given large == 2048 vCPUs, it will use about 20K vCPUs, which is slightly larger than the limit.
2022-03-10 07:19:38 -08:00
SangBin Cho
4fa294ca49
[Nightly tests] Stop running broken tests (#22993) 2022-03-10 06:59:51 -08:00
SangBin Cho
e88abe4c8e
[Nightly tests] migrated most of daily tests (#22960)
* migrated most of daily tests

* Addressed code review.
2022-03-10 05:49:16 -08:00
Kai Fricke
007cf03d7a
[ci/release] Migrate RLLib tests (#22967)
Migrate to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/111
2022-03-10 10:26:03 +00:00
Kai Fricke
fee4065daf
[ci/release] Migrate SGD tests (#22966)
Migrate to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/110
2022-03-10 10:23:50 +00:00
Kai Fricke
614dc6b511
[ci/release] Migrate Serve tests (#22965)
Migrate to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/109
2022-03-10 10:23:25 +00:00
Kai Fricke
ccda1555cc
[ci/release] Migrate Runtime Env tests (#22963)
Migrating to new release test package.

https://buildkite.com/ray-project/release-tests-branch/builds/108
2022-03-10 10:22:57 +00:00
kyle-chen-uber
592656ca28
[horovod] remove deprecated slot concept, use worker instead (#22708)
Horovod updated the attributes of DistributedTrainableCreator and args to create Horovod RayExecutor.
horovod/horovod@a729ba7

The main issue is that Horovod deprecated the "slot" concept in favor of "worker", which is more consistent with the generic Ray worker. This currently blocks Uber DL trainers from using Ray Tune.

This commit updates the Horovod RayExecutor init args.

Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-03-10 08:16:42 +00:00
Kai Fricke
18d535f290
[ci/release] Migrate LightGBM tests (#22952)
Note that LightGBM release tests were previously not enabled.
https://buildkite.com/ray-project/release-tests-branch/builds/113
https://buildkite.com/ray-project/release-tests-branch/builds/114

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-03-10 08:14:31 +00:00
Edward Oakes
22e698d0ff
[serve][release tests] Add smoke test to CI for remaining tests (#22962) 2022-03-09 23:36:32 -06:00
Stephanie Wang
1b45582e43
[tests] Enable chaos testing for Dask-on-Ray (#22927)
Turns on failures for Dask-on-Ray chaos tests.
2022-03-09 18:08:41 -05:00
Edward Oakes
135cd121b9
[release tests] Fix minor bug in multi-deployment serve test (#22961) 2022-03-09 14:37:27 -06:00
Kai Fricke
ca87c37c61
[ci/release] Fix result output in Buildkite pipeline run (#22946)
The new buildkite pipeline prints out faulty results due to a confusion of -ge/-gt and -le/-lt in the retry script. This is a cosmetic error (so behavior was still correct) that is resolved with this PR.
2022-03-09 17:29:31 +00:00
Edward Oakes
2cac49e4b0
[serve][release tests] Mark long-running failure test as non-stable (#22922) 2022-03-09 09:42:47 -06:00
Kai Fricke
ac654dbb9d
[ci/release] Fix schema validation for single tests / add stable field (#22947)
Builds currently fail with schema validation errors since #22901 was merged, because the stable field had not been added to the schema.
2022-03-09 15:22:49 +00:00
Kai Fricke
cac9d30909
[ci/release] Add schema validation for release test config (#22919)
To avoid breakage like in #22905, this PR adds schema validation to the release test package.
In a follow-up PR, we'll likely switch this to use pydantic instead.
2022-03-09 09:50:51 +00:00
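The schema validation described in #22919 could look roughly like the sketch below. This is only an illustration in plain Python: the `SCHEMA` table, the `validate_test` helper, and the field types are assumptions, not the actual release test package code. Only the `frequency` and `stable` field names appear in this log.

```python
from typing import Any, Dict, List

# Hypothetical schema: field name -> (expected type, required?).
# Illustrative only; the real release test schema is not shown in this log.
SCHEMA = {
    "name": (str, True),
    "frequency": (str, True),
    "stable": (bool, False),
}


def validate_test(test: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable schema errors (empty if valid)."""
    errors = []
    for field, (expected_type, required) in SCHEMA.items():
        if field not in test:
            if required:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(test[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(test[field]).__name__}"
            )
    # Reject fields the schema does not know about, so that a new field
    # must be added to the schema before tests can use it.
    for field in test:
        if field not in SCHEMA:
            errors.append(f"unknown field: {field}")
    return errors
```

Rejecting unknown fields is a design choice in this sketch: it is what makes a config fail when a field is used before it has been added to the schema.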
Edward Oakes
aa907987bf
[serve][release tests] Use m5.8xlarge instance types for 1k replica tests (#22918) 2022-03-08 21:34:01 -06:00
SangBin Cho
549527687f
Migrate scalability tests (#22901)
This PR migrates scalability tests to the new infra.

I had to copy the benchmarks folder to the release folder to make it work. I will remove some unnecessary files (e.g., benchmark.yaml or the wait_for_cluster file). Alternatively, we can support a path other than /release in the tool, but I think this way is cleaner. I am open to suggestions, though. cc @krfricke
2022-03-08 17:22:41 -08:00
Kai Fricke
c57abb693b
[ci/release] Add frequency to core nightly test (#22905)
Breaks the scheduled build: https://buildkite.com/ray-project/release-tests-branch/builds/82#3994f5e1-6da3-4c70-8c30-bdcfb1fec851

We should enforce schema validation soon.
2022-03-08 17:44:20 +00:00
SangBin Cho
0137fc8e23
[Tests] Add microbenchmark to the new infra test (#22861)
Verified that it works. It also addresses the frequency comments from the previous PR.
2022-03-08 05:58:49 -08:00
Stephanie Wang
cb218d03b9
[core] Enable lineage reconstruction by default (#22816)
Enables lineage reconstruction, which allows automatic recovery of task outputs, by default.

Also adds an info message to the driver whenever objects need to be reconstructed (not including recursive reconstruction).
2022-03-07 17:40:30 -05:00
SangBin Cho
529911ee78
[Nightly tests] Add missing patches (#22862)
These changes were added to the old e2e.py but not to the new infra.
2022-03-07 19:48:43 +00:00
Jiajun Yao
1b5efb588e
[Release Test] Change release test db reporter report_time to report_timestamp_ms (#22844)
This makes timestamps easier to sort and compare, and avoids timezone issues.
2022-03-07 04:54:19 -08:00
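The rationale given in #22844 can be illustrated with a short sketch: an integer count of milliseconds since the Unix epoch carries no timezone and sorts with plain integer comparison. The helper names and the record layout below are assumptions for illustration; only the `report_timestamp_ms` field name comes from the commit title.

```python
import time
from typing import Dict, List


def current_report_timestamp_ms() -> int:
    """Current time as integer milliseconds since the Unix epoch (UTC)."""
    return int(time.time() * 1000)


def latest_report(reports: List[Dict]) -> Dict:
    """Pick the newest report by comparing integer timestamps directly."""
    return max(reports, key=lambda r: r["report_timestamp_ms"])
```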
SangBin Cho
9d0148dbbe
[Test] Migrate the first test to the new infra (#22770)
This migrates the simplest nightly test to the new infra. I will also explore the k8s migration with this test.
2022-03-06 18:24:54 -08:00
Jiajun Yao
23f2862067
[Release Test] Send release test result to db pipeline for new test infra (#22813)
* Send release test result to db pipeline for new test infra

* address comment
2022-03-05 07:34:40 +09:00
xwjiang2010
ee7a458762
[release test] fix horovod release test. (#22781)
horovod_user_test_master is failing with the recent horovod release ([link](https://buildkite.com/ray-project/periodic-ci/builds/2960#61dabda8-eea0-4b7b-93bf-9e341926d3fd)).
The error message says:
```
AttributeError: Can't get attribute '_ExecutorDriver' on <module 'horovod.ray.runner' from '/home/ray/anaconda3/lib/python3.7/site-packages/horovod/ray/runner.py'>
```
The horovod test is set up so that it has a "driver" (a.k.a. client) part, which is the code that runs in a buildkite agent, and a "cluster" (a.k.a. server) part, which runs in an Anyscale cluster. The driver's dependencies are specified by `release/ml_user_tests/horovod/driver_setup_master.sh`, while the cluster's dependencies are specified by `release/horovod_tests/app_config_master.yaml`.

The two communicate via the Anyscale client.
The error message above indicates that the client's horovod version has _ExecutorDriver in runner.py while the server's does not. This is caused by a version mismatch between the two files above. This PR makes both horovod dependencies point to horovod master.
2022-03-03 08:24:26 -08:00
Clark Zinzow
fa44ec82f3
Add Parquet metadata resolution nightly test to test set. (#22787) 2022-03-02 14:56:00 -08:00
Kai Fricke
7425fa6212
[ci/release] Add support for concurrency groups (#22728)
This PR adds concurrency groups to Buildkite release test runs with new release test package. Five concurrency groups are defined (large-gpu, small-gpu, large, medium, small). If not specified manually, concurrency groups are inferred from used cluster resources.

Example pipeline: https://buildkite.com/ray-project/release-tests-branch/builds/55#09109eac-d22e-43bc-889e-078cfb037373 (click on Artifacts --> pipeline.json)
2022-03-02 16:35:54 +01:00
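The inference of concurrency groups described in #22728 might be sketched as follows. The five group names come from the commit message; the function name, its parameters, and all numeric thresholds are hypothetical and chosen only to make the example concrete.

```python
from typing import Optional

# Group names as listed in the commit message.
GROUPS = ("large-gpu", "small-gpu", "large", "medium", "small")


def infer_concurrency_group(
    num_cpus: int, num_gpus: int = 0, override: Optional[str] = None
) -> str:
    """Return the manually specified group, or infer one from resources.

    All thresholds below are hypothetical placeholders.
    """
    if override is not None:
        if override not in GROUPS:
            raise ValueError(f"unknown concurrency group: {override}")
        return override
    if num_gpus > 0:
        return "large-gpu" if num_gpus >= 8 else "small-gpu"
    if num_cpus >= 1024:
        return "large"
    if num_cpus >= 128:
        return "medium"
    return "small"
```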
Jiajun Yao
04a1a19f6b
[Release Test] Send release test result to db pipeline (#22667)
Send release test result to db pipeline
Add perf metrics for microbenchmark so that we can alert on them
2022-03-02 06:19:31 -08:00
Kai Fricke
d06c3ffd6f
[release] Migrate Tune + XGBoost tests to new infrastructure (#22705)
Migrate XGBoost and Tune tests to new release testing infrastructure.

https://buildkite.com/ray-project/release-tests-branch/builds/50
2022-03-01 08:10:06 +01:00
SangBin Cho
2c1184592e
mark threaded actor test unstable (#22696) 2022-02-28 15:25:14 -08:00
Clark Zinzow
cf3577f0ee
[Datasets] Patch Parquet file fragment serialization to prevent metadata fetching. (#22665) 2022-02-28 15:15:30 -08:00
Chen Shen
7e90700521
[Dataset][nightly-test] promote data ingestion test to stable #22702 2022-02-28 14:00:18 -08:00
Kai Fricke
3695408a85
[release] Fix special cases in release test package (e.g. smoke test) (#22442)
Fixing special cases (e.g. smoke tests, long running tests) in the release test package infrastructure. Prepare migration of Tune and XGBoost tests.
2022-02-28 21:05:01 +01:00
SangBin Cho
1cedb1b6e4
[Test] Increase timeout for microbenchmark (#22655) 2022-02-25 17:29:12 -08:00
Sven Mika
7b687e6cd8
[RLlib] SlateQ: Add a hard-task learning test to weekly regression suite. (#22544) 2022-02-25 21:58:16 +01:00
Archit Kulkarni
31332f8930
[serve] [release tests] Add health check grace period for 1k deployment (#22651) 2022-02-25 12:13:44 -06:00
Archit Kulkarni
1165f99b0b
[CI] disable Serve microbenchmark k8s (#22631) 2022-02-24 16:50:06 -08:00
Yi Cheng
de76d86bcb
[nightly] Stop GCS HA related nightly test (#22636)
Since we've already turned it on on master, we should stop these tests for now.
2022-02-24 16:40:08 -08:00
Jun Gong
99b7be5e22
[rllib] Fix impala long running test (#22619)
Fixes the impala long-running test.
Bandits is the first agent that requires a torch import at registration time.
2022-02-24 09:03:55 -08:00
SangBin Cho
5e847f7e09
[Usage Stats] Usage stats only enabled on nightly test infra (#22591)
This PR **enables the usage stats only on the release test infrastructure** (large scale tests Ray runs on a daily basis in a private infra). Note it is still disabled by default in Ray.
2022-02-23 22:11:48 -08:00
Eric Liang
e15a419028
Enable stage fusion by default for dataset pipelines (#22476)
This PR enables stage fusion for dataset pipelines. This also requires:
1. Removing the num_cpus=0.5 default for the read stage, to enable fusion of the read stage.
2. Removing spread_resource_prefix (not supported for now).
2022-02-23 17:34:05 -08:00
Max Pumperla
29d94a2211
[docs] sphinx gallery removal, migrate to ipynb (#22467) 2022-02-19 01:19:07 -08:00
Jiajun Yao
baa14d695a
Round robin during spread scheduling (#21303)
- Separate spread scheduling and default hybrid scheduling (i.e. SpreadScheduling != HybridScheduling(threshold=0)): they are already separated in the API layer and they have different end goals, so it makes sense to separate their implementations and evolve them independently.
- Simple round robin for spread scheduling: this is just a starting implementation, can be optimized later.
- Prefer not to spill back tasks that are waiting for args since the pull is already in progress.
2022-02-18 15:05:35 -08:00
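The simple round robin mentioned in #21303 can be pictured with the toy sketch below. It is not Ray's scheduler, which is not shown in this log; the class name, the CPU-only resource model, and the method signature are assumptions made for illustration.

```python
from typing import Dict, List, Optional


class RoundRobinSpreadScheduler:
    """Toy round-robin node picker (hypothetical, not Ray's implementation)."""

    def __init__(self, node_ids: List[str]):
        self._node_ids = list(node_ids)
        self._next_index = 0  # where the next search starts

    def schedule(
        self, available_cpus: Dict[str, float], required_cpus: float
    ) -> Optional[str]:
        """Return the next feasible node after the last pick, or None."""
        n = len(self._node_ids)
        for offset in range(n):
            idx = (self._next_index + offset) % n
            node = self._node_ids[idx]
            if available_cpus.get(node, 0.0) >= required_cpus:
                # Advance past the chosen node so consecutive tasks spread out.
                self._next_index = (idx + 1) % n
                return node
        return None
```

Starting each search just past the previously chosen node is what spreads consecutive tasks across nodes instead of packing them onto the first feasible one.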
Stephanie Wang
03a5589591
[core] Enable lineage reconstruction in CI (#21519)
Enables lineage reconstruction in all CI and release tests.
2022-02-18 11:04:20 -08:00
Chen Shen
17f589a05d
[Dataset][nightly-test] use 2 instead of 15 windows for 1.5TB data ingestion #22479 2022-02-17 15:20:39 -08:00
mwtian
05dd72101b
[Release 1.11.0] Release logs for 1.11.0rc1 (#22443)
This is the release log for 1.11.0rc1, with GCS-Ray enabled. The diff is against 1.11.0rc0, without GCS-Ray.
2022-02-16 17:03:49 -08:00