
587 commits

Author SHA1 Message Date
Simon Mo
0d5fbcc7bb
[Release Test] Opt-in tests to use K8s based cloud. (#21583) 2022-01-13 17:20:36 -08:00
Jun Gong
83955a9407
[RLlib] Extend CQL perf test to 1hr. (#21449) 2022-01-07 11:35:16 +01:00
Jiajun Yao
76b91efd9b
Fix wrong many_nodes_actor_test app config (#21404)
RAY_GCS_ACTOR_SCHEDULING_ENABLED is wrong; it should be RAY_gcs_actor_scheduling_enabled. Since GCS-based actor scheduling is not enabled yet, I just removed this flag.
2022-01-05 11:52:13 -08:00
Kai Fricke
aa35045b6f
[ci/release] Update to recent anyscale API changes (#21149)
Recent changes in the anyscale API rendered the current e2e script incompatible. This PR resolves these subtle API changes.
2022-01-04 11:21:47 +00:00
Chen Shen
704404d408
[BigDataTraining] Fix test script introduced by API change (#21347)
* fix

* fix test failure

* Update release/nightly_tests/dataset/ray_sgd_training.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2022-01-03 12:14:36 -08:00
Antoni Baum
7ce22b72ed
[datasets] Expand to_torch's functionality (#21117)
Expands the `to_torch` method for Datasets with:
* An ability to choose to output a list/dict of feature tensors instead of just one (through setting `feature_columns` to be a list of lists or a dict of lists)
* An ability to choose whether the label should be unsqueezed or not
* An ability to pass `None` as the label (for prediction).

Furthermore, this changes how the `feature_column_dtypes` argument works. Previously, it took a list of dtypes for each feature. However, as the tensor was concatenated in the end, only one dtype mattered (the biggest one). Now, this argument expects a single dtype which will be applied to the features tensor (or a list/dict if `feature_columns` is a list of list/dict of lists).

Unit tests for all cases are included.

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2022-01-03 09:03:50 -08:00
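The output shapes described in #21117 (one feature tensor, a list of tensors, or a dict of tensors, plus an optional unsqueezed or absent label) can be illustrated with a dependency-free sketch. This is not Ray's implementation: the helper names `select`, `group_features`, and `labels` are hypothetical, and plain nested lists stand in for tensors.

```python
from typing import Dict, List, Optional, Sequence, Union

Row = Dict[str, float]
Columns = Union[List[str], List[List[str]], Dict[str, List[str]]]


def select(rows: Sequence[Row], columns: List[str]) -> List[List[float]]:
    """Build one row-major feature matrix from the named columns."""
    return [[row[c] for c in columns] for row in rows]


def group_features(rows: Sequence[Row], feature_columns: Columns):
    """Return one matrix, a list of matrices, or a dict of matrices,
    mirroring the shape of `feature_columns` (hypothetical helper)."""
    if isinstance(feature_columns, dict):
        return {name: select(rows, cols) for name, cols in feature_columns.items()}
    if feature_columns and isinstance(feature_columns[0], list):
        return [select(rows, cols) for cols in feature_columns]
    return select(rows, feature_columns)


def labels(rows: Sequence[Row], label_column: Optional[str], unsqueeze: bool = True):
    """Return None for prediction, else labels as (N, 1) or flat (N,)."""
    if label_column is None:
        return None
    values = [row[label_column] for row in rows]
    return [[v] for v in values] if unsqueeze else values


rows = [
    {"a": 1.0, "b": 2.0, "c": 3.0, "y": 0.0},
    {"a": 4.0, "b": 5.0, "c": 6.0, "y": 1.0},
]
single = group_features(rows, ["a", "b"])
as_list = group_features(rows, [["a"], ["b", "c"]])
as_dict = group_features(rows, {"left": ["a"], "right": ["b", "c"]})
```

Passing `None` as the label column corresponds to the prediction case mentioned in the commit message, where no label is available.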
Jiajun Yao
9776e21842
Revert "Round robin during spread scheduling (#19968)" (#21293)
This reverts commit 60388b2834.
2021-12-30 10:33:06 +09:00
mwtian
0b3fed5ef3
Revert "[Nightly Test] Add a team column to each test config. (#21198)" (#21289)
This reverts commit b5b11b2d06.
2021-12-30 06:44:51 +09:00
SangBin Cho
b5b11b2d06
[Nightly Test] Add a team column to each test config. (#21198)
Please review **e2e.py and test_suite belonging to your team**! 

This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit#

This PR adds a team name to each test suite.

If the name is not specified, it will be reported as unspecified. 

If you are running a local test, and if the new test suite doesn't have a team name specified, it will raise an exception (in this way, we can avoid missing team names in the future).

Note that we will aggregate all of the test configs into a single file, nightly_test.yaml.
2021-12-27 14:42:41 -08:00
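The team-name rule described in #21198 (report "unspecified" when the name is missing, but raise on local runs) could look roughly like the sketch below. The function name `resolve_team` and the config keys `team` and `name` are assumptions for illustration, not taken from e2e.py.

```python
from typing import Any, Dict

DEFAULT_TEAM = "unspecified"


def resolve_team(test_config: Dict[str, Any], local_run: bool) -> str:
    """Return the owning team for a test config (hypothetical helper).

    Missing names are reported as "unspecified" in automated runs, but
    raise on local runs so new suites cannot be added without a team.
    """
    team = test_config.get("team")
    if team:
        return team
    if local_run:
        name = test_config.get("name", "<unnamed>")
        raise ValueError(f"Test {name!r} has no team name")
    return DEFAULT_TEAM
```

Failing only on local runs keeps the nightly pipeline reporting results while still catching missing names at the point where a new suite is written.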
Akash Patel
cbcd03b779
Upgrade cython to 0.29.26 for py310 (#21244) 2021-12-26 20:26:08 -08:00
Jiajun Yao
60388b2834
Round robin during spread scheduling (#19968) 2021-12-22 20:27:34 -08:00
Jiajun Yao
7d861a2c58
[Test] Add ray wheel sanity check (#21223) 2021-12-21 14:24:02 -08:00
architkulkarni
2489b17634
[release] Uninstall old ray in all release test app configs to fix commit mismatch error (#21175)
* uninstall old ray in all release test app configs

* add instruction to e2e.py docstring
2021-12-18 16:58:49 -08:00
Chen Shen
c9c3f0745a
[Dataset][nightlytest] use latest ray for running test #21148
We are actually using the Ray that comes with the image, which is a very old version of Ray. (Surprised this actually works.)
2021-12-17 23:48:44 -08:00
architkulkarni
56bd8e58de
[CI] [Release] uninstall Ray before installing new Ray version (#21159) 2021-12-17 16:25:15 -08:00
architkulkarni
4dcba1d0f4
[CI] Pin anyscale version to fix release tests (#21138) 2021-12-16 13:15:16 -08:00
Chen Shen
80eb00f525
[Chaos] fix dataset chaos test #21113 2021-12-15 20:13:38 -08:00
Clark Zinzow
ec06a1f65e
[CUJ#2] Update nightly test for CUJ#2 #21064 2021-12-15 13:19:59 -08:00
Jun Gong
767f78eaf8
[RLlib] Always attach latest eval metrics. (#21011) 2021-12-15 11:42:53 +01:00
SangBin Cho
1c1430ff5c
Add memory monitor to scalability tests. (#21102)
This adds memory monitoring to scalability envelope tests so that we can compare the peak memory usage for both nonHA & HA.

NOTE: the current way of adding the memory monitor is not great, and we should implement a fixture to support this better, but that's not in progress yet.
2021-12-15 01:31:38 -08:00
Chen Shen
3c426ed7b5
[nightly-test] fix dataset nightly test reporting #21061 2021-12-14 00:05:40 -08:00
Kai Fricke
b58f839534
[ci/release] Remove hard numpy removal from app configs (#21005) 2021-12-13 15:22:02 +00:00
xwjiang2010
46d2f2c160
[release test] Update torch_tune_serve test to be compatible with new TrialCheckpoint class. (#21010) 2021-12-10 17:26:15 +00:00
Yi Cheng
4e0de0053d
[nightly] Add staging nightly test for gcs ha (#21004)
This PR adds four staging nightly tests for GCS:
- many_actors
- many_tasks
- many_pgs
- many_nodes

These are benchmark tests that are highly related to GCS HA.

To make it easier to add tests, this PR also changes e2e.py slightly to include testing flags in the app config.
2021-12-09 23:07:23 -08:00
Chen Shen
d0e79a36f9
[chaos-test] chaos test pipeline ingestion (#20929)
Since it has been passing in my test runs, I'll land it and mark it as unstable.
2021-12-09 13:43:00 -08:00
Chen Shen
6a274dfd76
[CI][Chaos-test] chaos test now can set max-nodes-to-kill #20962 2021-12-09 13:41:46 -08:00
Chen Shen
aca954e8dd
[dataset][cuj2] add another single node ingestion example (#20754)
* add runner

* fix bugs

* add configs

* add time
2021-12-07 02:50:17 -08:00
Chen Shen
a628182cf5
[nightly-test] update cuj2 to reflect latest change #20889
We fixed the groupby issue in CUJ2; this syncs the change into the nightly test. This test doesn't need to use a GPU at all. It returns soon after data ingestion finishes.
2021-12-06 09:59:21 -08:00
Kai Fricke
b3a9d4d87d
[ci/release] Remove quotation marks from pip installs (#20638)
Quotation marks were needed in Anyscale app configs to avoid install errors when # characters were used, e.g. in URLs.
Since this has been fixed on the Anyscale side, we can get rid of these.
2021-12-05 17:57:08 -08:00
xwjiang2010
368da1742b
[tune] Enforce one future at a time for any given trial at any given time. (#20783)
Also enforce that buffered training is disabled (instead of allowing the user to override this) when checkpoint_at_end is used.
2021-12-03 08:14:12 -08:00
Kai Fricke
6b683ec8dc
[ci] Retry release tests on infra error (#20478)
This PR introduces proper exit codes for release tests. These are used to automatically restart tests that hit a certain set of infrastructure-related failures.
2021-12-02 10:34:40 -08:00
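The retry-on-infra-error idea in #20478 can be sketched as follows. The specific exit code values, the `ExitCode` names, and `run_with_retries` are hypothetical; the commit message only states that exit codes exist and that some infrastructure failures are retried.

```python
import enum
from typing import Callable


class ExitCode(enum.IntEnum):
    # Hypothetical values; the real release-test codes are not given in the log.
    SUCCESS = 0
    TEST_FAILURE = 1
    INFRA_ERROR = 10
    INFRA_TIMEOUT = 11


# Only infrastructure-related failures are worth retrying automatically.
RETRYABLE = {ExitCode.INFRA_ERROR, ExitCode.INFRA_TIMEOUT}


def run_with_retries(run_once: Callable[[], int], max_retries: int = 2) -> int:
    """Run a test, re-running it while it exits with a retryable code."""
    code = run_once()
    for _ in range(max_retries):
        if code not in RETRYABLE:
            break
        code = run_once()
    return code
```

Keeping genuine test failures out of the retryable set matters: retrying them would hide flaky or broken tests behind eventual passes.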
Yi Cheng
b25a757c91
[release] update release log for 1.9.0 release (#20781)
Update 1.9.0 release log.
2021-11-29 22:20:37 -08:00
Chen Shen
6d17fe5fc5
[cuj2] merge latest change to cuj2 (groupby based filtering) and add a debug mode. (#20742)
This PR does two things:

* Merge the latest groupby-based filtering into CUJ2.
* Add a debug mode so we only run a dummy trainer to measure data processing performance.
2021-11-29 19:10:17 -08:00
Amog Kamsetty
99ed623371
[Release] Use NCCL backend for release tests (#20677)
* use nccl for release tests

* link issue
2021-11-29 12:42:13 -08:00
Alex Wu
d7b14ad9b8
[release][m1] Update sanity check python versions for M1 mac (#20730)
This is a minor update to our release sanity check script so that it runs out of the box on M1. Since M1s only support python 3.8 and 3.9, we shouldn't try to install python 3.6 or 3.7.
2021-11-29 11:38:38 -08:00
SangBin Cho
6649f078e5
[Internal Observability] Move debug_state.txt to the log dir + support gcs_server debug state (#20722)
- Move debug_state.txt to the log directory. This will help us find debug_state.txt from the dashboard.
- Add debug_state_gcs.txt, which displays the GCS debug state. GCS also dumps its debug state to the file every 10 seconds.
- Periodic printing of the debug state now happens every 1 minute, because every 10 seconds is usually very spammy.
2021-11-28 20:42:37 -08:00
SangBin Cho
6fc6ebb43e
Promote some tests stable. (#20740)
Mark staging tests that pass 10+ times in a row as stable tests.
2021-11-28 18:43:39 -08:00
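The promotion rule in #20740 (10 or more consecutive passes) is simple to express in code. This is a sketch only: `trailing_passes` and `should_promote` are hypothetical names, and the log does not say how the promotion was actually decided or applied.

```python
from typing import Sequence


def trailing_passes(results: Sequence[bool]) -> int:
    """Count consecutive passes at the end of a run history (oldest first)."""
    count = 0
    for passed in reversed(results):
        if not passed:
            break
        count += 1
    return count


def should_promote(results: Sequence[bool], threshold: int = 10) -> bool:
    """A staging test qualifies once its latest `threshold` runs all passed."""
    return trailing_passes(results) >= threshold
```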
Amog Kamsetty
ac843a957c
[Release] Use large instance type for long running impala test (#20691)
* add

* update
2021-11-26 11:42:41 -08:00
SangBin Cho
97b4490401
[Nightly Test] Readjust nightly test schedule (#20717)
- Removing scale_to logic from object store. We don't need to scale during tests, and removing it will help distinguish infra failures from app failures.
- Run microbenchmark in core nightly, meaning it will run even more often
- Run weekly scalability tests daily instead. (They are not too expensive).
- Run some core daily tests separately to avoid infra failures.
2021-11-26 06:59:16 -08:00
SangBin Cho
cd7a32f1a5
[Nightly test] Chaos test fixture (#20277)
This PR mostly implements a "fixture" for nightly tests. Note that the current fixture implementation is not that great, and we can probably improve this in the future after refactoring e2e.py.
2021-11-24 17:13:29 -08:00
Alex Wu
63969c9a5c
[nightly-tests][dataset] Use actor compute model for GPU inference (#20689)
Fix nightly tests to avoid OOM.
2021-11-24 11:03:23 -08:00
Antoni Baum
a8d7897a56
[CI] Modify remote wrapper in XGBoost-Ray client test (#20544)
Instead of wrapping the whole training run in a remote call, we only query the files on the node in a remote call. XGBoost-Ray is then started from the local node.
2021-11-24 10:27:17 +00:00
Kai Fricke
7446269ac9
[tune/rllib] Fix tune cloud tests for function and rllib trainables (#20536)
Fixes some race conditions and softens some constraints around checkpoint numbers.
2021-11-24 09:29:12 +00:00
SangBin Cho
ca092fd032
[Nightly test] Fix broken pg long running test master (#20674)
* Fixed.

* Fix trial
2021-11-23 21:24:00 -08:00
Yi Cheng
b6b4d4cf57
[test] Update base image for nightly testing (#20680)
`base_image: "anyscale/ray-ml:pinned-nightly-py37"` no longer exists, which fails a lot of nightly tests; change it to `base_image: "anyscale/ray-ml:nightly-py37-gpu"`.
2021-11-23 11:06:44 -08:00
Chen Shen
107aef89a8
[CUJ2] add nightly tests for running 500GB ray train (#20195)
* add

* update cluster env

* fix build

Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
2021-11-21 20:04:45 -08:00
Alex Wu
24f27203ba
[hotfix] Fix inference nightly test by upgrading numpy (#20546)
The ray-ml image depends on numpy ~=1.19.2 via the tensorflow==2.6 requirement. Unfortunately that's incompatible with Dataset (see the comment on #20258).

This PR upgrades the numpy dependency only for the nightly test.
2021-11-19 08:15:23 -08:00
Alex Wu
a811b2b6d7
[hotfix] Fix stress_test_many_tasks cluster environment (#20519)
This should fix the long running release tests that are failing to build their app configs.

It seems like pip install ray[all] now downgrades the Ray version. It's unclear why, but most likely a dependency has pinned the Ray version now. This PR explicitly installs the version of Ray that we want after the pip install ray[all] to fix the problem.
2021-11-18 11:51:46 -08:00
Amog Kamsetty
3f1092fb3d
[Release] Revert impala app config (#20397) 2021-11-18 11:24:22 -08:00
Simon Mo
d7f208dea4
[Release] Make e2e.py link clickable on buildkite (#20436)
Adds log formatting to output clickable links to buildkite console logs
2021-11-18 12:45:59 +00:00
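For context on #20436: Buildkite documents an ANSI escape sequence (OSC 1339) that renders as an inline link in job logs. The sketch below formats such a sequence. Whether e2e.py uses exactly this mechanism is an assumption, and `buildkite_link` is a hypothetical helper name.

```python
def buildkite_link(url: str, label: str) -> str:
    """Format a URL as a Buildkite inline-link escape sequence.

    Assumes Buildkite's documented OSC 1339 form:
    ESC ] 1339 ; url=<url> ; content=<label> BEL
    """
    return f"\033]1339;url={url};content={label}\a"


# Example: emit a link to a (placeholder) session page in the job log.
print(buildkite_link("https://example.com/session/123", "View session"))
```

Terminals other than the Buildkite log viewer will generally not render this sequence as a link, so a plain-URL fallback may be useful outside CI.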