Commit graph

77 commits

Author SHA1 Message Date
Kai Fricke
6c5229295e
[ci/release] Support running tests with different python versions (#24843)
OSS release tests currently run with a hardcoded Python 3.7 base image. In the future we will want to run tests on different Python versions.
This PR adds support for a new `python` field in the test configuration. The `python` field determines both the base image used in the Buildkite runner docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments.

Note that in Buildkite, we will still only wait for the Python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait perhaps 5-10 more minutes.
2022-05-17 17:03:12 +01:00
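As a rough illustration of the `python` field described in the commit above, the sketch below resolves a test's Python version to a base image tag. The function name, the `example/ray` image prefix, and the tag format are all assumptions made for illustration; they are not taken from the release tooling.

```python
# Hypothetical sketch: map a test config's `python` field to a base image tag.
# The source only states that tests previously ran on a hardcoded 3.7 base.
DEFAULT_PYTHON = "3.7"


def resolve_base_image(test_config: dict, image_prefix: str = "example/ray") -> str:
    """Return an illustrative base image tag for the given test configuration."""
    # Fall back to the previous default when the field is absent.
    version = str(test_config.get("python", DEFAULT_PYTHON))
    major, minor = version.split(".")[:2]
    # The tag format below is invented for this example.
    return f"{image_prefix}:nightly-py{major}{minor}"


if __name__ == "__main__":
    print(resolve_base_image({"name": "some_test"}))
    print(resolve_base_image({"name": "some_test", "python": "3.8"}))
```

The same resolved version would be used for both the runner container and the cluster environment, which is the point the commit message makes about Ray client compatibility.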
Artur Niederfahrenhorst
fb2915d26a
[RLlib] Replay Buffer API and Ape-X. (#24506) 2022-05-17 13:43:49 +02:00
Sven Mika
0cd7bc4054
[RLlib] Re-establish dashboard performance tests. (#24728) 2022-05-16 13:13:49 +02:00
Sven Mika
70d3bfcf9c
[RLlib] Provide more time for APPO Pong release and performance tests. (#24503) 2022-05-05 18:19:38 +02:00
Sven Mika
b48f63113b
[RLlib] SlateQ fixes: Release learning tests wrong yaml structure + TD-error torch issue (#24429) 2022-05-04 13:37:14 +02:00
Sven Mika
f066180ed5
[RLlib] Deprecate timesteps_per_iteration config key (in favor of min_[sample|train]_timesteps_per_reporting). (#24372) 2022-05-02 12:51:14 +02:00
Sven Mika
3052193c9e
[RLlib] Fix CQL getting stuck when deprecated timesteps_per_iteration is used (use min_train_timesteps_per_reporting instead). (#24345)
Fix CQL getting stuck when deprecated timesteps_per_iteration is used (use min_train_timesteps_per_reporting instead).

CQL does not perform sampling timesteps, yet the deprecated timesteps_per_iteration is automatically translated into the new min_sample_timesteps_per_reporting. It should instead be translated (only for CQL and other purely offline RL algos) into min_train_timesteps_per_reporting.

If timesteps_per_iteration is used, CQL never leaves the first iteration, as it thinks it's not done yet (sample timesteps always remain at 0).
2022-04-29 21:02:34 +01:00
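The key translation described in the commit above can be sketched as follows. This is a minimal illustration, not RLlib's implementation: the function name and the `offline_only` flag are hypothetical, and only the three config key names come from the commit message.

```python
# Hypothetical sketch of translating the deprecated config key.
# `offline_only` marks purely offline algorithms such as CQL (an assumption
# about how the caller would distinguish them).
def translate_deprecated_timesteps(config: dict, offline_only: bool) -> dict:
    """Return a copy of `config` with timesteps_per_iteration replaced."""
    config = dict(config)  # do not mutate the caller's dict
    old_value = config.pop("timesteps_per_iteration", None)
    if old_value is None:
        return config
    # Offline algorithms never sample, so a sample-based threshold would
    # never be reached; use the train-based threshold for them instead.
    new_key = (
        "min_train_timesteps_per_reporting"
        if offline_only
        else "min_sample_timesteps_per_reporting"
    )
    # An explicitly set new key takes precedence over the deprecated one.
    config.setdefault(new_key, old_value)
    return config


if __name__ == "__main__":
    print(translate_deprecated_timesteps({"timesteps_per_iteration": 1000}, True))
```

With the sample-based key, an algorithm whose sample count stays at 0 would never satisfy the reporting condition, which matches the stuck first iteration described above.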
Kai Fricke
65d9a410f7
[ci] Clean up ci/ directory (refactor ci/travis) (#23866)
Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.

Details:

- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
2022-04-13 18:11:30 +01:00
Avnish Narayan
fdc6e02c29
[RLlib; testing] Move num_workers to RLlib config (#23750) 2022-04-06 20:06:48 +02:00
Avnish Narayan
161d95c31b
[RLlib] Increase slateq workers to decrease runtime on prod (#23609) 2022-03-30 17:38:21 -07:00
Artur Niederfahrenhorst
9a64bd4e9b
[RLlib] Simple-Q uses training iteration fn (instead of execution_plan); ReplayBuffer API for Simple-Q (#22842) 2022-03-29 14:44:40 +02:00
Sven Mika
22c9c4aa39
[RLlib] Slate-Q +GPU torch bug fix. (#23464) 2022-03-24 17:39:33 +01:00
Avnish Narayan
9040f54060
[RLlib] Pin Gym Everywhere and turn off gpu for recsim tests (#23452) 2022-03-24 09:17:30 +01:00
Avnish Narayan
754bcd16f8
[rllib] Pin gym everywhere (#23384)
This PR pins gym in the app config.yaml files for rllib and tune so that release tests are no longer broken by the new gym version.
2022-03-22 09:44:22 +00:00
Avnish Narayan
6c20e9d898
[RLlib] Change the slateq regression learning test with GPU to use torch only (#23168) 2022-03-16 09:15:59 +01:00
Kai Fricke
8608b64885
[ci/release] Remove old OSS release test infrastructure (#23134)
Now that we've migrated all OSS release tests to the new infrastructure, we can remove old config files and infra scripts.
2022-03-14 15:10:52 +00:00
Sven Mika
7b687e6cd8
[RLlib] SlateQ: Add a hard-task learning test to weekly regression suite. (#22544) 2022-02-25 21:58:16 +01:00
Jun Gong
04dd536987
[Release tests] Disable A3C CI tests on torch for now. Also extend performance_test deadline to 3hrs. (#22426) 2022-02-16 13:06:09 +01:00
Jun Gong
cbd24503b6
[RLlib] Add A3C to RLlib performance regression tests. (#22316) 2022-02-11 21:18:53 +01:00
Sven Mika
04a5c72ea3
Revert "Revert "[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test."" (#18708) 2022-02-10 13:44:22 +01:00
Alex Wu
b122f093c1
Revert "[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test." (#22250)
Reverts ray-project/ray#22126

Breaks rllib:tests/test_io
2022-02-09 09:26:36 -08:00
Sven Mika
ac3e6ab411
[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test. (#22126) 2022-02-08 19:04:13 +01:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
SangBin Cho
b1308b1c8c
[Test Infra] Unrevert team col (#21700)
This fixes the previous problems from team column revert.

This has 2 additional changes:

- The alert handler now receives the team argument; its absence was the root cause of the breakage: https://github.com/ray-project/ray/pull/21289

- Previously, tests without a team column raised an exception; I have weakened the condition so that a warning is logged instead. I will eventually change it back to raising an exception, but for a smoother transition we will only log a warning for a short time.
2022-01-19 13:29:53 -08:00
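The relaxed team-column check described in the commit above might look roughly like the sketch below. The function name, the `strict` flag, and the config layout are assumptions for illustration; the "unspecified" fallback follows the wording of the original team-column commit.

```python
import logging

logger = logging.getLogger(__name__)


def get_team(test_config: dict, strict: bool = False) -> str:
    """Return the team for a test config, or "unspecified" if it is missing.

    Hypothetical sketch: with strict=True a missing team raises, mirroring
    the earlier behavior; with strict=False only a warning is logged.
    """
    team = test_config.get("team")
    if team:
        return team
    name = test_config.get("name", "<unnamed>")
    if strict:
        raise ValueError(f"Test {name!r} has no team specified")
    logger.warning("Test %r has no team specified; reporting as unspecified", name)
    return "unspecified"


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(get_team({"name": "example_test", "team": "ml"}))
    print(get_team({"name": "example_test"}))
```

Switching back to the stricter behavior later would then only require changing the default of `strict`.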
Jun Gong
1315293dd8
[RLlib] Fix offline RL(BC & MARWIL) weekly learning tests. (#21643) 2022-01-18 09:29:01 +01:00
Jun Gong
7517aefe05
[RLlib] Bring back BC and Marwil learning tests. (#21574) 2022-01-14 14:35:32 +01:00
Jun Gong
83955a9407
[RLlib] Extend CQL perf test to 1hr. (#21449) 2022-01-07 11:35:16 +01:00
mwtian
0b3fed5ef3
Revert "[Nightly Test] Add a team column to each test config. (#21198)" (#21289)
This reverts commit b5b11b2d06.
2021-12-30 06:44:51 +09:00
SangBin Cho
b5b11b2d06
[Nightly Test] Add a team column to each test config. (#21198)
Please review **e2e.py and test_suite belonging to your team**! 

This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit#

This PR adds a team name to each test suite.

If the name is not specified, it will be reported as unspecified. 

If you are running a local test, and if the new test suite doesn't have a team name specified, it will raise an exception (in this way, we can avoid missing team names in the future).

Note that we will aggregate all test configs into a single file, nightly_test.yaml.
2021-12-27 14:42:41 -08:00
Jun Gong
767f78eaf8
[RLlib] Always attach latest eval metrics. (#21011) 2021-12-15 11:42:53 +01:00
gjoliver
724a140795
[rllib] Make sure json can serialize result dict (#20439)
We may have fields in the result dict that are or None.
Make sure our results are json serializable.
2021-11-17 10:27:00 -08:00
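One way to make a result dict JSON-serializable, in the spirit of the commit above, is sketched below. The helper name is hypothetical, and the choice to map non-finite floats to null and unknown objects to strings is an assumption, not necessarily what the commit does.

```python
import json
import math


def to_json_safe(value):
    """Recursively convert `value` into something json.dumps can encode."""
    if isinstance(value, dict):
        return {str(k): to_json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(v) for v in value]
    # NaN and +/-inf are not valid JSON; map them to null (an assumption).
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    # Fall back to a string representation for anything else.
    return str(value)


if __name__ == "__main__":
    result = {"reward": float("nan"), "checkpoint": None, "steps": [1, 2.5]}
    # allow_nan=False makes json.dumps fail loudly if a NaN slipped through.
    print(json.dumps(to_json_safe(result), allow_nan=False))
```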
Amog Kamsetty
18dcf1ac25
[Release] Use nightly Docker images (#20001)
* use nightly

* switch ml cpu to ray cpu

* fix

* add pytest

* add more pytest

* add constraint

* add tensorflow

* fix merge conflict

* add tblib

* fix

* add back uninstall
2021-11-10 18:00:16 -08:00
gjoliver
b6b4aaa632
[Release] Fix stress_tests (#20233) 2021-11-10 16:05:46 -08:00
gjoliver
d8a61f801f
[RLlib] Create a set of performance benchmark tests to run nightly. (#19945)
* Create a core set of algorithms tests to run nightly.

* Run release tests under tf, tf2, and torch frameworks.

* Fix

* Add eager_tracing option for tf2 framework.

* make sure core tests can run in parallel.

* cql

* Report progress while running nightly/weekly tests.

* Include SAC in nightly lineup.

* Revert changes to learning_tests

* rebrand to performance test.

* update build_pipeline.py with new performance_tests name.

* Record stats.

* bug fix, need to populate experiments dict.

* Alphabetize yaml files.

* Allow specifying frameworks. And do not run tf2 by default.

* remove some debugging code.

* fix

* Undo testing changes.

* Do not run CQL regression for now.

* LINT.

Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-11-08 18:15:13 +01:00
Amog Kamsetty
3408b60d2b
[Release] Refactor User Tests (#20028)
* wip

* add directory

* wip

* try again

* Revert "try again"

This reverts commit 82d33ccea6f92848df025e019b87df73cea49e5d.

* finish

* formatting

* fix merge

* fix path

* chmod

* check

* sudo

* wip

* update

* fix horovod

* try

* typo

* reduce num workers
2021-11-05 17:28:37 -07:00
gjoliver
2c1fa459d4
[RLlib] Add an RLlib Tune experiment to UserTest suite. (#19807)
* Add an RLlib Tune experiment to UserTest suite.

* Add ray.init()

* Move example script to example/tune/, so it can be imported as module.

* add __init__.py so our new module will get included in python wheel.

* Add block device to RLlib test instances.

* Reduce disk size a little bit.

* Add metrics reporting

* Allow max of 5 workers to accommodate all the worker tasks.

* revert disk size change.

* Minor updates

* Trigger build

* set max num workers

* Add a compute cfg for autoscaled cpu and gpu nodes.

* use 1gpu instance.

* install tblib for debugging worker crashes.

* Manually upgrade to pytorch 1.9.0

* -y

* torch=1.9.0

* install torch on driver

* bump timeout

* Write a more informational result dict.

* Revert changes to compute config files that are not used.

* add smoke test

* update

* reduce timeout

* Reduce the # of env per worker to 1.

* Small fix for getting trial_states

* Trigger build

* simplify result dict

* lint

* more lint

* fix smoke test

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-11-03 17:04:27 -07:00
xwjiang2010
ab15dfd478
[Tune release test] Set 500G disk space for rllib_tests. (#19730) 2021-10-26 10:12:03 -07:00
gjoliver
e9f66cc394
Reduce success criteria for a few learning tests. (#19484) 2021-10-18 15:44:38 -07:00
Sven Mika
5611150b1a
Increase rllib stress tests timeout for smoke test (#18810) 2021-09-22 14:30:42 +01:00
Sven Mika
e6aae61487
[RLlib; testing] Fix bug in stress tests not handling >1 trials per experiment (due to grid-search in IMPALA stress tests). (#18705) 2021-09-20 15:31:57 +02:00
Sven Mika
ba1c489b79
[RLlib Testing] Lower --smoke-test "time_total_s" to make sure it doesn't time out. (#18670) 2021-09-16 18:22:23 +02:00
gjoliver
df32ed35fd
Extend --smoke-test deadlines for learning and stress regression tests. (#18667) 2021-09-16 09:18:39 +01:00
Kai Fricke
c8188ea70e
[ci/rllib] wait for stress test cluster (#18603) 2021-09-14 19:01:22 +01:00
Sven Mika
08c09737fa
[RLlib] Fix R2D2 (torch) multi-GPU issue. (#18550) 2021-09-14 19:58:10 +02:00
gjoliver
2924afa41e
[Release] Create soft links for libcusolver.so.10 as a temporary fix. (#18562)
Co-authored-by: Jun Gong <jungong@anyscale.com>
2021-09-13 14:37:12 -07:00
Kai Fricke
7d1e6d3129
[ci/release] Add sanity check for ray wheels hash to release tests (#18489) 2021-09-10 17:50:31 +01:00
Simon Mo
6d24214085
[Release] Make sure to uninstall ray for rllib_tests (#18448) 2021-09-08 23:29:40 +01:00
gjoliver
50cdf551ce
[RLlib] Fix test name typo. (#18423)
Co-authored-by: Jun Gong <jungong@mbpro.local>
2021-09-08 23:30:37 +02:00
Sven Mika
cabaa3b3c6
[RLlib Testing] Add A3C/APPO/BC/DDPPO/MARWIL/CQL/ES/ARS/TD3 to weekly learning tests. (#18381) 2021-09-07 11:48:41 +02:00
Sven Mika
5292b70fc6
[RLlib] Add multi-GPU attention net tests to nightly test suite (+ R2D2 tests for LSTM and attention nets). (#18368) 2021-09-06 17:48:05 +02:00