81 commits

Author SHA1 Message Date
Chen Shen
a628182cf5
[nightly-test] update cuj2 to reflect latest change #20889
We fixed the groupby issue in CUJ2; this syncs the change into the nightly test. The test doesn't need a GPU at all; it returns soon after data ingestion finishes.
2021-12-06 09:59:21 -08:00
Chen Shen
6d17fe5fc5
[cuj2] merge latest change to cuj2 (groupby based filtering) and add a debug mode. (#20742)
This PR does two things:

- Merge the latest groupby-based filtering into CUJ2.
- Add a debug mode that runs only a dummy trainer, to measure data processing performance.
2021-11-29 19:10:17 -08:00
SangBin Cho
6fc6ebb43e
Promote some tests to stable. (#20740)
Mark staging tests that pass 10+ times in a row as stable tests.
2021-11-28 18:43:39 -08:00
SangBin Cho
cd7a32f1a5
[Nightly test] Chaos test fixture (#20277)
This PR mostly implements a "fixture" for nightly tests. Note that the current fixture implementation is not great; we can probably improve it in the future after refactoring e2e.py.
2021-11-24 17:13:29 -08:00
Alex Wu
63969c9a5c
[nightly-tests][dataset] Use actor compute model for GPU inference (#20689)
## Why are these changes needed?
Fix nightly tests to avoid OOM.

## Checks
2021-11-24 11:03:23 -08:00
SangBin Cho
ca092fd032
[Nightly test] Fix broken pg long running test master (#20674)
* Fixed.

* Fix trial
2021-11-23 21:24:00 -08:00
Chen Shen
107aef89a8
[CUJ2] add nightly tests for running 500GB ray train (#20195)
* add

* update cluster env

* fix build

Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
2021-11-21 20:04:45 -08:00
Alex Wu
24f27203ba
[hotfix] Fix inference nightly test by upgrading numpy (#20546)
The ray-ml image depends on numpy ~=1.19.2 via the tensorflow==2.6 requirement. Unfortunately, that's incompatible with Dataset (see the comment on #20258).

This PR upgrades the numpy dependency only for the nightly test.
2021-11-19 08:15:23 -08:00
Amog Kamsetty
9796ae56d5
[Train][Data] Change usages of iter_datasets to iter_epochs (#20487) 2021-11-17 18:05:51 -08:00
SangBin Cho
5ec63ccc5f
[Regression test] Placement group long running test (#20251)
Why are these changes needed?
In the past, there was a regression in which placement group creation time got slower over time. I believe the issue is fixed in master, but this PR verifies that it is actually fixed.

This PR adds a long-running test for the placement group. The test has two purposes:

- Make sure placement group creation / removal doesn't get slower over time. The test measures the P50 creation time over the first 20 iterations, then runs for many more iterations. After all iterations, it checks that the P50 creation time is not too slow compared to the initial round.
- Make sure placement group removal / creation works consistently for a long time without issues.

Q: Should we make it a real long-running test (one that runs for a day)?
2021-11-16 04:21:18 -08:00
SangBin Cho
a4f72c6606
[nightly] Fix pg stress test (#20362)
## Why are these changes needed?

This was mistakenly added to the nightly. Fixing it. 

## Related issue number
2021-11-15 00:17:18 -08:00
SangBin Cho
6cc493079b
[Core] Add Placement group performance test (#20218)
* in progress

* ip

* Fix issues

* done

* Address code review.
2021-11-14 09:17:54 +09:00
SangBin Cho
9fd8c6648c
[Test] Fix newly added nightly tests, threaded actor + chaos testing (#20220)
* Fix nightly tests

* done

* done
2021-11-11 05:01:19 -08:00
Amog Kamsetty
18dcf1ac25
[Release] Use nightly Docker images (#20001)
* use nightly

* switch ml cpu to ray cpu

* fix

* add pytest

* add more pytest

* add constraint

* add tensorflow

* fix merge conflict

* add tblib

* fix

* add back uninstall
2021-11-10 18:00:16 -08:00
SangBin Cho
90fd38c64a
[Test] Large scale threaded actor workload (#20105)
* Done

* Addressed code review.

* lint

* Update release/nightly_tests/stress_tests/test_threaded_actors.py

Co-authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com>
2021-11-09 02:28:48 -08:00
SangBin Cho
5c4fb4dc91
[Core] Chaos testing nightly (#20059)
* Done initial stage.

* lint

* .

* Finished.

* Fix lint
2021-11-08 21:57:53 -08:00
Yi Cheng
6a6cc434ba
[nightly] Remove grpc staging test since nightly is stable #20119 (#20119) 2021-11-05 21:36:58 -07:00
Yi Cheng
04f60c998e
[nightly] Fix pytest missing in nightly test (#20076)
## Why are these changes needed?
In the nightly test we see:
```
Command returned non-success status: 1; Command logs:
Traceback (most recent call last):
  File "dask_on_ray/large_scale_test.py", line 17, in <module>
    from ray._private.test_utils import monitor_memory_usage
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/test_utils.py", line 18, in <module>
    import pytest
ModuleNotFoundError: No module named 'pytest'
```
This PR fixes this error.

## Related issue number
2021-11-04 13:38:05 -07:00
Lixin Wei
1fe9f3372e
[Nightly Test] Remove duplicate printing code (#19874)
## Why are these changes needed?

Remove duplicate printing code
2021-10-29 10:19:19 -07:00
Yi Cheng
abec07700a
[nightly] Adding more tests related to grpc broadcasting to staging mode (#19779)
## Why are these changes needed?
We are concerned that gRPC-based broadcasting might negatively impact placement group (pg) related workloads. This test ensures it runs well before merging.

## Related issue number
#19438
2021-10-27 10:46:13 -07:00
SangBin Cho
ecd5a622ef
[Tests] Add a memory usage on dask on ray tests (#19674) 2021-10-25 14:58:26 -07:00
Yi Cheng
7a7b356899
[Nightly test] add test for grpc broadcasting (#19579) 2021-10-21 07:01:41 -07:00
Yi Cheng
01b899dafb
[nightly] Fix broken test due to bad syntax #19536 (#19536) 2021-10-19 21:43:46 -07:00
Yi Cheng
7a9cedfc5c
[nightly] Add grpc based broadcasting into nightly test for decision_tree (#19531)
* dbg

* up

* check

* up

* up

* put grpc based one into nightly test

* up
2021-10-19 19:59:39 -07:00
Chen Shen
b38ebd368c
[Dataset][nightly-test] spend less money #19488
Reduce the number of epochs and ensure everything runs in the same datacenter.
2021-10-18 18:53:50 -07:00
Kai Fricke
ad94eb03c6
[ci/release] wrap pip github installs in quotation marks to prevent comment errors (#19464) 2021-10-18 18:55:56 +01:00
Chen Shen
9dba5e0ead
[dataset][nightly-test] fix pipeline ingest test (#19437) 2021-10-18 11:31:24 +01:00
Yi Cheng
1dc03cd49d
[nightly] Put many nodes actor test back (#19313)
## Why are these changes needed?
There are two issues fixed in this PR:
- Make sure the wait-for-session step counts alive nodes.
- Upgrade the machine to match what's tested in OSS Ray.

## Related issue number
https://github.com/ray-project/ray/issues/19084
2021-10-13 15:51:12 -07:00
SangBin Cho
dd1c1f9787
[Nightly test] remove env vars from tests (#19221)
When testing, we should minimize unnecessary env vars (it's better to work with the default config). This PR removes the unnecessary env vars that were set.
2021-10-08 06:53:23 -07:00
Clark Zinzow
ca731d7c86
[Datasets] Fix API breakage in Datasets nightly test. 2021-10-07 15:07:19 -07:00
SangBin Cho
22f4ffed08
Disable cpu-only-nodes preferred scheduling that breaks placement groups. (#19129)
* Add a regression test for the short term

* done

* address code review

* lint
2021-10-07 05:34:04 -07:00
Eric Liang
86cbe3e833
[data] Add support for repeating and re-windowing a DatasetPipeline (#19091) 2021-10-06 20:13:43 -07:00
Yi Cheng
1eecb7d80b
up (#19092) 2021-10-04 23:54:31 -07:00
SangBin Cho
55227a15b9
Handle retry to avoid statement timeout exception (#18968) 2021-09-29 23:04:35 -07:00
Yi Cheng
a993f3a262
[nightly] update nightly test for many node test 2021-09-29 17:28:44 -07:00
Dmitri Gekhtman
944309c017
Revert "[nightly] Deflaky nightly test many_nodes_actor_test (#18582)" (#18954)
* Revert "[nightly] Deflaky nightly test many_nodes_actor_test (#18582)"

This reverts commit fc6a739e4b.

* move to large test

Co-authored-by: Yi Cheng <chengyidna@gmail.com>
2021-09-29 11:02:14 -04:00
Chen Shen
62a73f4ce8
[nightly test][event] enable event logs in nightly tests (#18936) 2021-09-28 01:29:26 -07:00
Chen Shen
7c99aae033
[dataset][nightly-test] add pipelined ingestion/training nightly test 2021-09-23 20:39:03 -07:00
Yi Cheng
fc6a739e4b
[nightly] Deflaky nightly test many_nodes_actor_test (#18582) 2021-09-20 22:43:48 -07:00
Kai Fricke
7d1e6d3129
[ci/release] Add sanity check for ray wheels hash to release tests (#18489) 2021-09-10 17:50:31 +01:00
Yi Cheng
23e9af0601
[test] Add x nodes y actors test to nightly tests (#18291) 2021-09-03 18:54:23 -07:00
SangBin Cho
814095add6
Revert "Change instance type for some tests (#18248)" (#18320)
This reverts commit 34026a7bd5.
2021-09-02 17:45:02 -07:00
SangBin Cho
34026a7bd5
Change instance type for some tests (#18248) 2021-08-31 10:10:46 -07:00
SangBin Cho
eab506cc37
[Test] Disable non streaming shuffle 5000 partitions (#18224)
* Disable non streaming shuffle 5000 partitions

* increase timeout for 5000 partition shuffle
2021-08-31 00:28:15 -07:00
SangBin Cho
dfbad8668a
Support better infra failure detection + stable flag (#18202) 2021-08-30 10:51:03 -07:00
SangBin Cho
43da68e657
Fix a nightly dask on ray test (#18060) 2021-08-24 22:15:34 -07:00
Chen Shen
89f988e9cc
add dataset shuffle data loader (#17917) 2021-08-20 11:26:01 -07:00
SangBin Cho
4971e13941
[Build] Asan wheel test (#17685)
* in progress

* ASAN tests.

* d

* in progress

* in progress without the asan wheel

* Support the asan wheel.

* Support the asan wheels

* Not build a binary for asan

* Fix issues

* Remove a wrong build

* Separate out asan wheel build

* Try preparing more deps.

* ip

* Try different version

* done

* d

* Trial

* Another try

* Another try

* skip cpp build to see what happens

* add more des

* ip

* abc

* Try next

* completed

* try

* Try without static libasan

* dbg

* Try static link

* Fix issues

* abc
2021-08-17 10:21:41 -07:00
Eric Liang
ce171f10a1
Remove legacy plasma unlimited and pull manager pinning flag (#17753) 2021-08-11 20:19:12 -07:00
SangBin Cho
a3c5cce834
Add prepare for dask on ray 1tb sort. (#17708) 2021-08-10 16:26:05 -07:00