Commit graph

76 commits

Author SHA1 Message Date
Archit Kulkarni
1165f99b0b
[CI] disable Serve microbenchmark k8s (#22631) 2022-02-24 16:50:06 -08:00
Yi Cheng
de76d86bcb
[nightly] Stop GCS HA related nightly test (#22636)
Since we've already turned it on on master, we should stop these tests for now.
2022-02-24 16:40:08 -08:00
Chen Shen
17f589a05d
[Dataset][nightly-test] use 2 instead of 15 windows for 1.5TB data ingestion #22479 2022-02-17 15:20:39 -08:00
Archit Kulkarni
da57012cbc
Add comment to periodic CI pipeline to update release process doc when updating test suites (#22037)
This PR adds a comment to build_pipeline.py reminding anyone who makes changes to the test suites to also update the release process doc if necessary.

This is an action item from the Ray 1.10.0 release retrospective.
2022-02-11 11:14:24 -06:00
Jiajun Yao
56c7b74072
Delete nightly shuffle_data_loader (#22185) 2022-02-07 15:23:34 -08:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Yi Cheng
570f67798a
[nightly] Move scheduling tests into one suite (#21959)
For future convenience, we are moving scheduling-related tests into one suite for easier monitoring and benchmarking.
2022-01-28 13:32:34 -08:00
Yi Cheng
3560211ab5
[nightly] Temporarily stop the two scheduling pipelines until they have a good setup. (#21922)
Right now these two tests always time out. We are disabling them for now and will re-enable them with good parameters after solid testing.
2022-01-26 20:15:59 -08:00
SangBin Cho
7d4287a6ab
[Test] Move long running tests to run everyday (#21813)
Long running tests are cheap and low overhead (they use a small number of nodes). We should promote them to run every day so we can catch regressions quickly.
2022-01-24 15:10:27 -08:00
SangBin Cho
babc03edf2
Add a threaded actor k8s test (#21739)
Add threaded actor flaky test to k8s.
2022-01-23 20:12:57 -08:00
shrekris-anyscale
75b3080834
[Serve] Serve Autoscaling Release tests (#21208) 2022-01-21 12:08:25 -08:00
Yi Cheng
90093769df
[nightly] Add more many tasks tests (#21727)
This PR adds four tests for many tasks:

- many short tasks sent from a single node
- many short tasks sent from multiple nodes
- many long tasks sent from multiple nodes
- many long tasks sent from a single node

TODO: migrate many nodes actor tests to this one.

The scheduling envelope should contain:

- (tasks): scheduling_test_many_xx_tasks_yy_nodes
- (actors): many_nodes_actor_test (to be combined with this one)
- (shuffle): pipelined_ingestion_1500_gb_15_windows
- (shuffle): dask_on_ray_1tb_sort
2022-01-20 14:52:26 -08:00
SangBin Cho
02af73a571
[Test] First core nightly test migration to k8s (#21698)
The first migration of tests into k8s. We are adopting a conservative approach (migrating slowly while we keep the existing test suites). Once things are confirmed to be stable, we will migrate faster.
2022-01-19 13:31:49 -08:00
Kai Fricke
4ef0c6c434
[tune/release] Demote xgboost_sweep to weekly testing (#21704)
XGBoost functionality is tested daily in the xgboost release test suite. The expensive XGBoost sweep test can thus be run weekly.
2022-01-19 09:15:04 -08:00
Kai Fricke
0e9e8824e4
[ci/release] use s3 sync (#21626)
Previous changes failed because of a) permission errors and b) unzip being unavailable on remote nodes. Instead we are using tar gzip archives now.

This reverts commit 42bcab27e8.
2022-01-15 17:53:19 -08:00
Kai Fricke
42bcab27e8
Revert "[Release Test] Opt-in tests to use K8s based cloud. (#21583)" (#21605)
This reverts commit 0d5fbcc7bb.
2022-01-14 11:46:52 -08:00
Simon Mo
0d5fbcc7bb
[Release Test] Opt-in tests to use K8s based cloud. (#21583) 2022-01-13 17:20:36 -08:00
Yi Cheng
4e0de0053d
[nightly] Add staging nightly test for gcs ha (#21004)
This PR adds four staging nightly tests for gcs:
- many_actors
- many_tasks
- many_pgs
- many_nodes

These are benchmark tests that are highly related to gcs ha.

To make it easier to add tests, this PR also changes e2e.py slightly to include testing flags in the app config.
2021-12-09 23:07:23 -08:00
Chen Shen
d0e79a36f9
[chaos-test] chaos test pipeline ingestion (#20929)
Since it has been passing in my test runs, I'll land it and mark it as unstable.
2021-12-09 13:43:00 -08:00
Chen Shen
aca954e8dd
[dataset][cuj2] add another single node ingestion example (#20754)
* add runner

* fix bugs

* add configs

* add time
2021-12-07 02:50:17 -08:00
Kai Fricke
6b683ec8dc
[ci] Retry release tests on infra error (#20478)
This PR introduces proper exit codes for release tests. These are used to restart a certain set of infrastructure related failures automatically.
2021-12-02 10:34:40 -08:00
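The retry behaviour described in #20478 can be sketched roughly as follows. This is an illustrative sketch only, not the release tooling's code: the exit code values, `INFRA_ERROR_CODES`, and `run_with_retry` are hypothetical names invented here, since the log does not state the actual codes.

```python
import subprocess

# Hypothetical exit codes that signal an infrastructure failure.
# The real values used by the release tests are not given in this log.
INFRA_ERROR_CODES = frozenset({10, 11})


def run_with_retry(cmd, max_retries=3, retry_codes=INFRA_ERROR_CODES):
    """Run `cmd` and rerun it only when it exits with an infra error code.

    Returns a tuple (final_exit_code, number_of_attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        code = subprocess.run(cmd).returncode
        # Stop on success, on a non-infra failure, or once the
        # retry budget (max_retries reruns after the first try) is spent.
        if code not in retry_codes or attempts > max_retries:
            return code, attempts
```

Under these assumptions, a test failure (any exit code outside `retry_codes`) is reported immediately, while an infra failure is rerun up to `max_retries` times.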
Chen Shen
6d17fe5fc5
[cuj2] merge latest change to cuj2 (groupby based filtering) and add a debug mode. (#20742)
This PR does two things:

- merges the latest groupby based filtering into CUJ2
- adds a debug mode so that we only run a dummy trainer to measure data processing performance.
2021-11-29 19:10:17 -08:00
SangBin Cho
97b4490401
[Nightly Test] Readjust nightly test schedule (#20717)
- Removing scale_to logic from object store. We don't need to scale during tests, which will disambiguate infra failures vs app failures.
- Run microbenchmark in core nightly, meaning it will run even more often
- Run weekly scalability tests daily instead. (They are not too expensive).
- Run some core daily tests separately to avoid infra failures.
2021-11-26 06:59:16 -08:00
Kai Fricke
7446269ac9
[tune/rllib] Fix tune cloud tests for function and rllib trainables (#20536)
Fixes some race conditions and softens some constraints around checkpoint numbers.
2021-11-24 09:29:12 +00:00
Chen Shen
107aef89a8
[CUJ2] add nightly tests for running 500GB ray train (#20195)
* add

* update cluster env

* fix build

Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
2021-11-21 20:04:45 -08:00
Kai Fricke
05d21497db
[rllib/tune] Fix durable trainable in trainer template, add release test (#20422) 2021-11-16 20:52:42 +00:00
SangBin Cho
5ec63ccc5f
[Regression test] Placement group long running test (#20251)
Why are these changes needed?
In the past, there was a regression where placement group creation got slower over time. I believe the issue is fixed in master, but this PR verifies that it is actually fixed.

This PR adds a long running test for the placement group. The test has 2 purposes:

- Make sure placement group creation / removal doesn't get slower over time. The test measures the P50 creation time over the first 20 iterations and then runs for many more iterations. After all iterations, it checks that the P50 creation time is not too slow compared to the initial round.
- Make sure placement group removal / creation works consistently for a long time without an issue.

Q: Should we make it a real long running test? (that runs for a day?)
2021-11-16 04:21:18 -08:00
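The P50 comparison described in #20251 could look roughly like the sketch below. It is not the test's actual code: the function names, the `max_ratio` threshold, and the choice to compare against the last 20 iterations are assumptions made here for illustration. Only the 20-iteration initial window comes from the commit message.

```python
import statistics
import time


def time_calls(fn, iterations):
    """Call `fn` repeatedly and return the duration of each call in seconds."""
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def p50_not_regressed(durations, window=20, max_ratio=2.0):
    """Compare the median of the last `window` durations with the first `window`.

    `max_ratio` is a hypothetical tolerance: the final P50 may be at most
    this many times the initial P50.
    """
    if len(durations) < 2 * window:
        raise ValueError("need at least two full windows of measurements")
    initial_p50 = statistics.median(durations[:window])
    final_p50 = statistics.median(durations[-window:])
    return final_p50 <= initial_p50 * max_ratio
```

In a real test, `fn` would create and remove a placement group; here it is left as a parameter so the comparison logic can be exercised on its own.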
Kai Fricke
4300039d01
[ci/release] Display commit hash in buildkite overview (#20323) 2021-11-15 10:09:04 +00:00
SangBin Cho
6cc493079b
[Core] Add Placement group performance test (#20218)
* in progress

* ip

* Fix issues

* done

* Address code review.
2021-11-14 09:17:54 +09:00
SangBin Cho
b2acfd6ff4
[Test] Change the frequency of many nodes actor test (#20232) 2021-11-10 21:12:22 -08:00
Simon Mo
215f47bc53
[CI] Move Serve nightly tests to a separate suite (#20194)
So we can run them via separate cronjobs
2021-11-09 13:22:50 -08:00
SangBin Cho
90fd38c64a
[Test] Large scale threaded actor workload (#20105)
* Done

* Addressed code review.

* lint

* Update release/nightly_tests/stress_tests/test_threaded_actors.py

Co-authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com>
2021-11-09 02:28:48 -08:00
SangBin Cho
5c4fb4dc91
[Core] Chaos testing nightly (#20059)
* Done initial stage.

* lint

* .

* Finished.

* Fix lint
2021-11-08 21:57:53 -08:00
gjoliver
d8a61f801f
[RLlib] Create a set of performance benchmark tests to run nightly. (#19945)
* Create a core set of algorithms tests to run nightly.

* Run release tests under tf, tf2, and torch frameworks.

* Fix

* Add eager_tracing option for tf2 framework.

* make sure core tests can run in parallel.

* cql

* Report progress while running nightly/weekly tests.

* Include SAC in nightly lineup.

* Revert changes to learning_tests

* rebrand to performance test.

* update build_pipeline.py with new performance_tests name.

* Record stats.

* bug fix, need to populate experiments dict.

* Alphabetize yaml files.

* Allow specifying frameworks. And do not run tf2 by default.

* remove some debugging code.

* fix

* Undo testing changes.

* Do not run CQL regression for now.

* LINT.

Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-11-08 18:15:13 +01:00
Yi Cheng
6a6cc434ba
[nightly] Remove grpc staging test since nightly is stable (#20119) 2021-11-05 21:36:58 -07:00
Amog Kamsetty
3408b60d2b
[Release] Refactor User Tests (#20028)
* wip

* add directory

* wip

* try again

* Revert "try again"

This reverts commit 82d33ccea6f92848df025e019b87df73cea49e5d.

* finish

* formatting

* fix merge

* fix path

* chmod

* check

* sudo

* wip

* update

* fix horovod

* try

* typo

* reduce num workers
2021-11-05 17:28:37 -07:00
gjoliver
2c1fa459d4
[RLlib] Add an RLlib Tune experiment to UserTest suite. (#19807)
* Add an RLlib Tune experiment to UserTest suite.

* Add ray.init()

* Move example script to example/tune/, so it can be imported as module.

* add __init__.py so our new module will get included in python wheel.

* Add block device to RLlib test instances.

* Reduce disk size a little bit.

* Add metrics reporting

* Allow max of 5 workers to accommodate all the worker tasks.

* revert disk size change.

* Minor updates

* Trigger build

* set max num workers

* Add a compute cfg for autoscaled cpu and gpu nodes.

* use 1gpu instance.

* install tblib for debugging worker crashes.

* Manually upgrade to pytorch 1.9.0

* -y

* torch=1.9.0

* install torch on driver

* bump timeout

* Write a more informational result dict.

* Revert changes to compute config files that are not used.

* add smoke test

* update

* reduce timeout

* Reduce the # of env per worker to 1.

* Small fix for getting trial_states

* Trigger build

* simplify result dict

* lint

* more lint

* fix smoke test

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-11-03 17:04:27 -07:00
Kai Fricke
f96078687f
[xgboost/release] Xgboost/connect gpu test (#19838)
* [xgboost/release] Add GPU connect user test

* Use scaling cluster

* typo

* Increase xgboost placement group timeout

* Much higher timeout

* Move os environment timeout

* Move os environ

* [dev] install xgboost-ray from master

* GPU xgboost master

* Remove master install after new xgboost release

* Install latest

* Add master test
2021-11-02 08:40:48 -07:00
Amog Kamsetty
3a52187da8
[Release/Lightning] Add Ray lightning user test (#19812)
* wip

* wip

* add ray lightning test

* fix

* update

* merge and add

* fix

* fix

* rename

* autoscale

* add tblib

* gloo backend

* typo

* upgrade torch

* latest and master
2021-11-01 18:29:48 -07:00
Amog Kamsetty
474e44f7e0
[Release/Horovod] Add user test for Horovod (#19661)
* infra

* wip

* add test

* typo

* typo

* update

* rename

* fix

* full path

* formatting

* reorder

* update

* update

* Update release/horovod_tests/workloads/horovod_user_test.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* bump num_workers

* update installs

* try

* add pip_packages

* min_workers

* fix

* bump pg timeout

* Fix symlink

* fix

* fix

* cmake

* fix

* pin filelock

* final

* update

* fix

* Update release/horovod_tests/workloads/horovod_user_test.py

* fix

* fix

* separate compute template

* test latest and master

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-11-01 18:28:07 -07:00
matthewdeng
e1e4a45b8d
[train] add simple Ray Train release tests (#19817)
* [train] add simple Ray Train release tests

* simplify tests

* update

* driver requirements

* move to test

* remove connect

* fix

* fix

* fix torch

* gpu

* add assert

* remove assert

* use gloo backend

* fix

* finish

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-11-01 18:25:19 -07:00
architkulkarni
702bffe072
[runtime env] [test] Enable runtime env nightly test with working_dir reconnection (#19906) 2021-10-31 10:48:48 -05:00
Kai Fricke
fa0158abe5
[tune] Cloud checkpointing release tests (#19638) 2021-10-29 12:12:01 +02:00
Simon Mo
3e038aebb2
[CI] Allow release tests infra to accept buildkite artifacts (#19803) 2021-10-27 13:04:01 -07:00
Yi Cheng
abec07700a
[nightly] Adding more tests related to grpc broadcasting to staging mode (#19779)
## Why are these changes needed?
We are concerned that grpc based broadcasting might have a negative impact on pg related workloads. This test ensures it runs well before merging.

## Related issue number
#19438
2021-10-27 10:46:13 -07:00
Amog Kamsetty
6e61ca623d
[CI] Infra for "user" tests (#19662) 2021-10-26 08:47:22 +01:00
Yi Cheng
7a7b356899
[Nightly test] add test for grpc broadcasting (#19579) 2021-10-21 07:01:41 -07:00
Yi Cheng
7a9cedfc5c
[nightly] Add grpc based broadcasting into nightly test for decision_tree (#19531)
* dbg

* up

* check

* up

* up

* put grpc based one into nightly test

* up
2021-10-19 19:59:39 -07:00
Yi Cheng
f47f69d31e
[nightly] Add decision_tree_autoscaling_20_runs to nightly test 2021-10-18 11:19:40 -07:00
Kai Fricke
6c6639a0d7
[ci/release] hotfix for undefined local variable (#19460) 2021-10-18 11:28:33 +01:00