Commit graph

356 commits

Author SHA1 Message Date
Amog Kamsetty
f164f3a8b5
[Release] Increase Placement Group timeout (#20224) 2021-11-10 13:02:38 -08:00
xwjiang2010
2fbbecf1e4
[release] Define worker node type even if no worker node is needed. (#20223) 2021-11-10 11:19:09 -08:00
matthewdeng
790e22f9ad
[tune] move force_on_current_node to ml_utils (#20211) 2021-11-10 10:21:24 -08:00
Kai Fricke
4e3e213549
[tune] Allow more versatile experiment analysis loading (#20181) 2021-11-10 11:46:27 +00:00
Simon Mo
215f47bc53
[CI] Move Serve nightly tests to a separate suite (#20194)
So we can run them via separate cronjobs
2021-11-09 13:22:50 -08:00
SangBin Cho
90fd38c64a
[Test] Large scale threaded actor workload (#20105)
* Done

* Addressed code review.

* lint

* Update release/nightly_tests/stress_tests/test_threaded_actors.py

Co-authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com>
2021-11-09 02:28:48 -08:00
SangBin Cho
5c4fb4dc91
[Core] Chaos testing nightly (#20059)
* Done initial stage.

* lint

* .

* Finished.

* Fix lint
2021-11-08 21:57:53 -08:00
gjoliver
d8a61f801f
[RLlib] Create a set of performance benchmark tests to run nightly. (#19945)
* Create a core set of algorithms tests to run nightly.

* Run release tests under tf, tf2, and torch frameworks.

* Fix

* Add eager_tracing option for tf2 framework.

* make sure core tests can run in parallel.

* cql

* Report progress while running nightly/weekly tests.

* Include SAC in nightly lineup.

* Revert changes to learning_tests

* rebrand to performance test.

* update build_pipeline.py with new performance_tests name.

* Record stats.

* bug fix, need to populate experiments dict.

* Alphabetize yaml files.

* Allow specifying frameworks. And do not run tf2 by default.

* remove some debugging code.

* fix

* Undo testing changes.

* Do not run CQL regression for now.

* LINT.

Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-11-08 18:15:13 +01:00
xwjiang2010
99826d2ca6
[Release] Increase node memory by 2X in many_ppo test. (#19591) 2021-11-08 08:10:09 +09:00
Jiajun Yao
e110d958a1
Support different s3 url formats (#20133) 2021-11-07 14:58:51 -08:00
Yi Cheng
6a6cc434ba
[nightly] Remove grpc staging test since nightly is stable (#20119) 2021-11-05 21:36:58 -07:00
Amog Kamsetty
3408b60d2b
[Release] Refactor User Tests (#20028)
* wip

* add directory

* wip

* try again

* Revert "try again"

This reverts commit 82d33ccea6f92848df025e019b87df73cea49e5d.

* finish

* formatting

* fix merge

* fix path

* chmod

* check

* sudo

* wip

* update

* fix horovod

* try

* typo

* reduce num workers
2021-11-05 17:28:37 -07:00
gjoliver
1341bb59bf
[RLlib; Release testing] long_running_tests should use RLlib's app_config. (#20095) 2021-11-05 15:18:56 +01:00
Simon Mo
4d583da7d5
[Serve] Add verbose log for nightly test only (#20088) 2021-11-04 16:15:22 -07:00
Yi Cheng
04f60c998e
[nightly] Fix pytest missing in nightly test (#20076)
## Why are these changes needed?
In the nightly test we see
```
Command returned non-success status: 1; Command logs:
Traceback (most recent call last):
  File "dask_on_ray/large_scale_test.py", line 17, in <module>
    from ray._private.test_utils import monitor_memory_usage
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/test_utils.py", line 18, in <module>
    import pytest
ModuleNotFoundError: No module named 'pytest'
```
This PR fixes this error.
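
The traceback above comes from a module-level `import pytest` in a shared utility module, which breaks every importer when pytest is not installed. The commit message does not say how the PR fixes it. As a minimal sketch of one common way to avoid this class of failure (not necessarily what this PR does), an optional dependency can be imported on demand. The names `require_optional`, `memory_helper`, and `pytest_only_helper` are hypothetical and are not part of Ray.

```python
import importlib


def require_optional(module_name: str, purpose: str):
    """Import an optional module on demand (hypothetical helper, not Ray's API).

    Raises a RuntimeError with an actionable message if the module is missing.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise RuntimeError(
            f"{module_name!r} is required for {purpose}; "
            f"install it with `pip install {module_name}`"
        ) from exc


def memory_helper() -> str:
    # Hypothetical helper that never touches pytest, so importing this
    # module succeeds even when pytest is not installed.
    return "monitoring"


def pytest_only_helper():
    # pytest is imported only when a test-only helper is actually called.
    return require_optional("pytest", "test-only helpers")
```

With this layout, a workload script that only needs `memory_helper` no longer depends on pytest being present in the cluster image.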
2021-11-04 13:38:05 -07:00
gjoliver
2c1fa459d4
[RLlib] Add an RLlib Tune experiment to UserTest suite. (#19807)
* Add an RLlib Tune experiment to UserTest suite.

* Add ray.init()

* Move example script to example/tune/, so it can be imported as module.

* add __init__.py so our new module will get included in python wheel.

* Add block device to RLlib test instances.

* Reduce disk size a little bit.

* Add metrics reporting

* Allow max of 5 workers to accommodate all the worker tasks.

* revert disk size change.

* Minor updates

* Trigger build

* set max num workers

* Add a compute cfg for autoscaled cpu and gpu nodes.

* use 1gpu instance.

* install tblib for debugging worker crashes.

* Manually upgrade to pytorch 1.9.0

* -y

* torch=1.9.0

* install torch on driver

* bump timeout

* Write a more informational result dict.

* Revert changes to compute config files that are not used.

* add smoke test

* update

* reduce timeout

* Reduce the # of env per worker to 1.

* Small fix for getting trial_states

* Trigger build

* simplify result dict

* lint

* more lint

* fix smoke test

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-11-03 17:04:27 -07:00
Avnish Narayan
026bf01071
[RLlib] Upgrade gym version to 0.21 and deprecate pendulum-v0. (#19535)
* Fix QMix, SAC, and MADDPG too.

* Unpin gym and deprecate pendulum v0

Many tests in RLlib depended on pendulum v0,
but gym 0.21 deprecated pendulum v0 in favor
of pendulum v1. This may change reward
thresholds, so we may have to rerun all of the
pendulum v1 benchmarks or use another
environment instead. The same applies to frozen
lake v0 and frozen lake v1.

Lastly, all of the RLlib tests have
been moved to Python 3.7.

* Add gym installation based on python version.

Pin Python <= 3.6 to gym 0.19 due to install
issues with Atari ROMs in gym 0.20.

* Reformatting

* Fixing tests

* Move atari-py install conditional to req.txt

* migrate to new ale install method

* Make parametric_actions_cartpole return float32 actions/obs

* Add type conversions if obs/actions don't match space

* Add utils to make elements match gym space dtypes

Co-authored-by: Jun Gong <jungong@anyscale.com>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-11-03 16:24:00 +01:00
Amog Kamsetty
f4b425f84c
[Release/Xgboost] Fix master install (#19991) 2021-11-02 13:50:14 -07:00
Kai Fricke
f96078687f
[xgboost/release] Xgboost/connect gpu test (#19838)
* [xgboost/release] Add GPU connect user test

* Use scaling cluster

* typo

* Increase xgboost placement group timeout

* Much higher timeout

* Move os environment timeout

* Move os environ

* [dev] install xgboost-ray from master

* GPU xgboost master

* Remove master install after new xgboost release

* Install latest

* Add master test
2021-11-02 08:40:48 -07:00
Amog Kamsetty
3a52187da8
[Release/Lightning] Add Ray lightning user test (#19812)
* wip

* wip

* add ray lightning test

* fix

* update

* merge and add

* fix

* fix

* rename

* autoscale

* add tblib

* gloo backend

* typo

* upgrade torch

* latest and master
2021-11-01 18:29:48 -07:00
Amog Kamsetty
474e44f7e0
[Release/Horovod] Add user test for Horovod (#19661)
* infra

* wip

* add test

* typo

* typo

* update

* rename

* fix

* full path

* formatting

* reorder

* update

* update

* Update release/horovod_tests/workloads/horovod_user_test.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* bump num_workers

* update installs

* try

* add pip_packages

* min_workers

* fix

* bump pg timeout

* Fix symlink

* fix

* fix

* cmake

* fix

* pin filelock

* final

* update

* fix

* Update release/horovod_tests/workloads/horovod_user_test.py

* fix

* fix

* separate compute template

* test latest and master

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
2021-11-01 18:28:07 -07:00
matthewdeng
e1e4a45b8d
[train] add simple Ray Train release tests (#19817)
* [train] add simple Ray Train release tests

* simplify tests

* update

* driver requirements

* move to test

* remove connect

* fix

* fix

* fix torch

* gpu

* add assert

* remove assert

* use gloo backend

* fix

* finish

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-11-01 18:25:19 -07:00
xwjiang2010
1803ca13b6
Adding release logs for 1.8.0. (#19867) 2021-11-01 10:26:04 -07:00
architkulkarni
702bffe072
[runtime env] [test] Enable runtime env nightly test with working_dir reconnection (#19906) 2021-10-31 10:48:48 -05:00
xwjiang2010
4d293c4cee
Increase horovod_test disk space. (#19917) 2021-10-30 14:41:31 -07:00
Lixin Wei
1fe9f3372e
[Nightly Test] Remove duplicate printing code (#19874)
## Why are these changes needed?

Remove duplicate printing code
2021-10-29 10:19:19 -07:00
Kai Fricke
fa0158abe5
[tune] Cloud checkpointing release tests (#19638) 2021-10-29 12:12:01 +02:00
Kai Fricke
a13f738a10
[ci/release] Fix cloud search query (#19876) 2021-10-29 11:30:34 +02:00
Kai Fricke
564d8551ed
[ci/release] only check alert if test succeeded before (#19857) 2021-10-28 16:09:10 -07:00
Simon Mo
3e038aebb2
[CI] Allow release tests infra to accept buildkite artifacts (#19803) 2021-10-27 13:04:01 -07:00
Yi Cheng
abec07700a
[nightly] Adding more tests related to grpc broadcasting to staging mode (#19779)
## Why are these changes needed?
We are concerned that gRPC-based broadcasting might have a negative impact on placement group (pg) related workloads. This test ensures that it runs well before merging.

## Related issue number
#19438
2021-10-27 10:46:13 -07:00
Jiao
3f628d4f6b
increase long poll timeout and wrk trial cpu resource (#19768) 2021-10-26 21:31:39 -07:00
SangBin Cho
bcd27b708f
[Test] Mark many ppo as unstable (#19769) 2021-10-26 21:27:43 -07:00
xwjiang2010
ab15dfd478
[Tune release test] Set 500G disk space for rllib_tests. (#19730) 2021-10-26 10:12:03 -07:00
Jiao
aaef82920d
[serve] Add periodic timeouts to long poll client to avoid accumulating concurrent tasks in the controller (#19728) 2021-10-26 09:44:00 -05:00
Kai Fricke
98244ad130
[ci/release] Report error to database on alert (#19743) 2021-10-26 10:48:02 +01:00
Kai Fricke
96ddf5b9ac
[ci/release] Choose cloud by name or ID (#19742) 2021-10-26 10:21:54 +01:00
Amog Kamsetty
6e61ca623d
[CI] Infra for "user" tests (#19662) 2021-10-26 08:47:22 +01:00
SangBin Cho
ecd5a622ef
[Tests] Add memory usage monitoring to dask on ray tests (#19674) 2021-10-25 14:58:26 -07:00
architkulkarni
414910b7fc
[test] [runtime env] Add release test with Ray Client and local pip files (#19026) 2021-10-25 11:49:27 -05:00
xwjiang2010
a632cb439f
[Tune] Remove queue_trials. (#19472) 2021-10-22 09:24:54 +01:00
SangBin Cho
9000f41aa6
[Nightly Test] Support memory profiling on Ray + implement memory monitor for nightly tests (#19539)
* random fixes

* Done

* done

* update the doc

* doc lint fix

* .

* .
2021-10-21 07:37:05 -07:00
Yi Cheng
7a7b356899
[Nightly test] add test for grpc broadcasting (#19579) 2021-10-21 07:01:41 -07:00
Kai Fricke
71564040ec
[ci/release] Unwrap after installing pip packages (#19552) 2021-10-20 13:41:16 +01:00
Yi Cheng
01b899dafb
[nightly] Fix broken test due to bad syntax (#19536) 2021-10-19 21:43:46 -07:00
Yi Cheng
7a9cedfc5c
[nightly] Add grpc based broadcasting into nightly test for decision_tree (#19531)
* dbg

* up

* check

* up

* up

* put grpc based one into nightly test

* up
2021-10-19 19:59:39 -07:00
Kai Fricke
3e8587644b
[ci/release] wrap all release test pip github installs in quotation marks (#19521) 2021-10-19 20:55:02 +01:00
Chen Shen
b38ebd368c
[Dataset][nightly-test] spend less money (#19488)
Reduce the epoch and ensure everything runs in the same datacenter.
2021-10-18 18:53:50 -07:00
gjoliver
e9f66cc394
Reduce success criteria for a few learning tests. (#19484) 2021-10-18 15:44:38 -07:00
Jiajun Yao
4d9585773f
[Release] Remove release process doc (#19312) 2021-10-18 11:24:03 -07:00