89 commits

Author SHA1 Message Date
Steven Morad
259429bdc3
Bump gym dep to 0.24 (#26190)
Co-authored-by: Steven Morad <smorad@anyscale.com>
Co-authored-by: Avnish <avnishnarayan@gmail.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
2022-07-22 12:37:16 -07:00
Avnish Narayan
82395c4646
[RLlib] Put learning test into own folders (#26862)
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
2022-07-22 11:20:47 -07:00
Avnish Narayan
5433c11650
[RLlib] Pin gym to 0.23.1 (#26752) 2022-07-20 11:49:01 -07:00
Artur Niederfahrenhorst
4ce9686d94
[RLlib] Fixes MARWIL release tests (#26586) 2022-07-15 11:13:15 -07:00
Jun Gong
c026374acb
[RLlib] Fix the 2 failing RLlib release tests. (#25603) 2022-06-14 14:51:08 +02:00
Sven Mika
7c39aa5fac
[RLlib] Trainer.training_iteration -> Trainer.training_step; Iterations vs reportings: Clarification of terms. (#25076) 2022-06-10 17:09:18 +02:00
Rohan Potdar
a9d8da0100
[RLlib]: Doubly Robust Off-Policy Evaluation. (#25056) 2022-06-07 12:52:19 +02:00
Kai Fricke
1ed8bd0345
[release/xgboost/lightgbm] Fix app config dependency install overwriting ray (#25307)
This line:

```
pip3 install -U --force-reinstall xgboost xgboost_ray lightgbm_ray petastorm
```

also re-installs the dependencies of these packages, and the `--force-reinstall` means we overwrite existing ones. This leads us to re-install the latest ray release, overwriting the wheels to be tested:

```
[INFO] 5/31/2022, 12:12:16 AM: Successfully installed ... ray-1.12.1 ...
[INFO] 5/31/2022, 12:12:17 AM: * Executed RUN pip3 install -U --force-reinstall xgboost xgboost_ray petastorm  (ff6ae9f9)
```

Instead, we should use `--no-deps` to avoid re-installing dependencies. Also, the wheels sanity check is moved to after installing additional packages in order to catch these errors earlier.
2022-05-31 13:46:17 +02:00
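The `--no-deps` fix described in 1ed8bd0345 can be sketched in Python as follows. This is only an illustration: the helper name `build_pip_install_cmd` is hypothetical and is not taken from the Ray release tooling. The flags `-U`, `--force-reinstall`, and `--no-deps` are standard pip options.

```python
import sys
from typing import List


def build_pip_install_cmd(packages: List[str], no_deps: bool = True) -> List[str]:
    """Build a pip command that force-reinstalls `packages`.

    With no_deps=True, pip does not touch the packages' dependencies, so an
    already installed ray wheel is not replaced by the latest release.
    (Hypothetical helper, for illustration only.)
    """
    cmd = [sys.executable, "-m", "pip", "install", "-U", "--force-reinstall"]
    if no_deps:
        cmd.append("--no-deps")
    return cmd + list(packages)


# Print the command instead of running it.
cmd = build_pip_install_cmd(["xgboost", "xgboost_ray", "lightgbm_ray", "petastorm"])
print(" ".join(cmd[1:]))
```

Building the command as a list keeps the flag handling explicit and avoids shell quoting issues if it is later passed to `subprocess.run`.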
Rohan Potdar
ab81c8e9ca
[RLlib]: Rename input_evaluation to off_policy_estimation_methods. (#25107) 2022-05-27 13:14:54 +02:00
Sven Mika
09886d7ab8
[RLlib] Upgrade gym 0.23 (#24171) 2022-05-23 08:18:44 +02:00
Steven Morad
501d932449
[RLlib] SAC, RNNSAC, and CQL TrainerConfig objects (#25059) 2022-05-22 19:58:47 +02:00
SangBin Cho
ec653e3196
[Nightly test] Move two line downloads to one line. (#25061)
Fixes a mysterious error where every cluster env build fails when pip uninstall / pip install are written on two separate lines. The root cause will be fixed later.
2022-05-22 00:07:03 -07:00
Kai Fricke
6c5229295e
[ci/release] Support running tests with different python versions (#24843)
OSS release tests currently run with a hardcoded Python 3.7 base. In the future we will want to run tests on different Python versions.
This PR adds support for a new `python` field in the test configuration. The python field will determine both the base image used in the Buildkite runner docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments. 

Note that in Buildkite, we will still only wait for the Python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait perhaps 5-10 more minutes.
2022-05-17 17:03:12 +01:00
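As a rough sketch of the idea in 6c5229295e, a single `python` field can drive both base images. Everything except the field name `python` and the 3.7 default is an assumption: the function names and the `example/...` image names are invented for illustration and do not reflect the actual release tooling.

```python
from typing import Any, Dict

# The commit says tests previously ran on a hardcoded Python 3.7 base.
DEFAULT_PYTHON = "3.7"


def python_tag(test_config: Dict[str, Any]) -> str:
    """Return a tag such as 'py38' from the test's optional `python` field."""
    version = str(test_config.get("python", DEFAULT_PYTHON))
    major, minor = version.split(".")[:2]
    return f"py{major}{minor}"


def base_images(test_config: Dict[str, Any]) -> Dict[str, str]:
    """Derive both base images from the same Python tag.

    The image name patterns are made up for this example.
    """
    tag = python_tag(test_config)
    return {
        "buildkite_runner": f"example/runner:{tag}",
        "cluster_env": f"example/cluster-env:{tag}",
    }


print(base_images({"name": "some_test", "python": "3.8"}))
```

If such a field lives in YAML, quoting the value is the safer choice, since an unquoted `3.10` is parsed as the float `3.1`.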
Artur Niederfahrenhorst
fb2915d26a
[RLlib] Replay Buffer API and Ape-X. (#24506) 2022-05-17 13:43:49 +02:00
Sven Mika
0cd7bc4054
[RLlib] Re-establish dashboard performance tests. (#24728) 2022-05-16 13:13:49 +02:00
Sven Mika
70d3bfcf9c
[RLlib] Provide more time for APPO Pong release and performance tests. (#24503) 2022-05-05 18:19:38 +02:00
Sven Mika
b48f63113b
[RLlib] SlateQ fixes: Release learning tests wrong yaml structure + TD-error torch issue (#24429) 2022-05-04 13:37:14 +02:00
Sven Mika
f066180ed5
[RLlib] Deprecate timesteps_per_iteration config key (in favor of min_[sample|train]_timesteps_per_reporting). (#24372) 2022-05-02 12:51:14 +02:00
Sven Mika
3052193c9e
[RLlib] Fix CQL getting stuck when deprecated timesteps_per_iteration is used (use min_train_timesteps_per_reporting instead). (#24345)
Fix CQL getting stuck when deprecated timesteps_per_iteration is used (use min_train_timesteps_per_reporting instead).

CQL does not perform sampling timesteps and the deprecated timesteps_per_iteration is automatically translated into the new min_sample_timesteps_per_reporting, but should be translated (only for CQL and other purely offline RL algos) into min_train_timesteps_per_reporting.

If timesteps_per_iteration is used, CQL never leaves the first iteration, as it thinks it's not done yet (sample timesteps always remain at 0).
2022-04-29 21:02:34 +01:00
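The translation rule described in 3052193c9e can be sketched as below. The three config keys come from the commit message; the function name and the `offline_only` flag are hypothetical, and this is not the actual RLlib implementation.

```python
from typing import Any, Dict


def translate_deprecated_timesteps(
    config: Dict[str, Any], offline_only: bool
) -> Dict[str, Any]:
    """Return a copy of `config` with `timesteps_per_iteration` translated.

    Purely offline algorithms (such as CQL) never sample, so the value must
    map to the *train* key; otherwise the sample-timestep count stays at 0
    and the first iteration never finishes.
    """
    new_config = dict(config)
    value = new_config.pop("timesteps_per_iteration", None)
    if value is None:
        return new_config  # nothing deprecated to translate
    key = (
        "min_train_timesteps_per_reporting"
        if offline_only
        else "min_sample_timesteps_per_reporting"
    )
    # Do not override a value the user already set explicitly.
    new_config.setdefault(key, value)
    return new_config


print(translate_deprecated_timesteps({"timesteps_per_iteration": 1000}, offline_only=True))
```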
Kai Fricke
65d9a410f7
[ci] Clean up ci/ directory (refactor ci/travis) (#23866)
Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.

Details:

- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
2022-04-13 18:11:30 +01:00
Avnish Narayan
fdc6e02c29
[RLlib; testing] Move num_workers to RLlib config (#23750) 2022-04-06 20:06:48 +02:00
Avnish Narayan
161d95c31b
[RLlib] Increase slateq workers to decrease runtime on prod (#23609) 2022-03-30 17:38:21 -07:00
Artur Niederfahrenhorst
9a64bd4e9b
[RLlib] Simple-Q uses training iteration fn (instead of execution_plan); ReplayBuffer API for Simple-Q (#22842) 2022-03-29 14:44:40 +02:00
Sven Mika
22c9c4aa39
[RLlib] Slate-Q +GPU torch bug fix. (#23464) 2022-03-24 17:39:33 +01:00
Avnish Narayan
9040f54060
[RLlib] Pin Gym Everywhere and turn off gpu for recsim tests (#23452) 2022-03-24 09:17:30 +01:00
Avnish Narayan
754bcd16f8
[rllib] Pin gym everywhere (#23384)
This PR pins gym in the app config.yaml files for RLlib and Tune so that release tests are no longer broken by the new gym version.
2022-03-22 09:44:22 +00:00
Avnish Narayan
6c20e9d898
[RLlib] Change the slateq regression learning test with GPU to use torch only (#23168) 2022-03-16 09:15:59 +01:00
Kai Fricke
8608b64885
[ci/release] Remove old OSS release test infrastructure (#23134)
Now that we've migrated all OSS release tests to the new infrastructure, we can remove old config files and infra scripts.
2022-03-14 15:10:52 +00:00
Sven Mika
7b687e6cd8
[RLlib] SlateQ: Add a hard-task learning test to weekly regression suite. (#22544) 2022-02-25 21:58:16 +01:00
Jun Gong
04dd536987
[Release tests] Disable A3C CI tests on torch for now. Also extend performance_test deadline to 3hrs. (#22426) 2022-02-16 13:06:09 +01:00
Jun Gong
cbd24503b6
[RLlib] Add A3C to RLlib performance regression tests. (#22316) 2022-02-11 21:18:53 +01:00
Sven Mika
04a5c72ea3
Revert "Revert "[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test."" (#18708) 2022-02-10 13:44:22 +01:00
Alex Wu
b122f093c1
Revert "[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test." (#22250)
Reverts ray-project/ray#22126

Breaks rllib:tests/test_io
2022-02-09 09:26:36 -08:00
Sven Mika
ac3e6ab411
[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test. (#22126) 2022-02-08 19:04:13 +01:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
SangBin Cho
b1308b1c8c
[Test Infra] Unrevert team col (#21700)
This fixes the previous problems from team column revert.

This has 2 additional changes:

- The alert handler now receives the team argument; its absence was the root cause of the breakage (https://github.com/ray-project/ray/pull/21289).

- Previously, tests without a team column raised an exception. The condition is now weaker (a warning is logged). It will eventually raise an exception again, but for a smoother transition it only logs a warning for a short time.
2022-01-19 13:29:53 -08:00
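The weakened check from b1308b1c8c might look roughly like the sketch below. The function name `get_team`, the `strict` flag, and the config keys `team` and `name` are assumptions made for this example; the `"unspecified"` fallback follows the wording of b5b11b2d06.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def get_team(test_config: Dict[str, Any], strict: bool = False) -> str:
    """Return the test's team, tolerating a missing team column.

    strict=False logs a warning and falls back to "unspecified";
    strict=True restores the earlier behavior of raising an exception.
    """
    team = test_config.get("team")
    if team:
        return team
    name = test_config.get("name", "<unnamed>")
    if strict:
        raise ValueError(f"Test {name!r} has no team specified.")
    logger.warning("Test %r has no team specified.", name)
    return "unspecified"


print(get_team({"name": "rllib_learning_test", "team": "ml"}))
```

Keeping the strict path behind a flag makes it easy to switch back to raising once all test configs carry a team.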
Jun Gong
1315293dd8
[RLlib] Fix offline RL(BC & MARWIL) weekly learning tests. (#21643) 2022-01-18 09:29:01 +01:00
Jun Gong
7517aefe05
[RLlib] Bring back BC and Marwil learning tests. (#21574) 2022-01-14 14:35:32 +01:00
Jun Gong
83955a9407
[RLlib] Extend CQL perf test to 1hr. (#21449) 2022-01-07 11:35:16 +01:00
mwtian
0b3fed5ef3
Revert "[Nightly Test] Add a team column to each test config. (#21198)" (#21289)
This reverts commit b5b11b2d06.
2021-12-30 06:44:51 +09:00
SangBin Cho
b5b11b2d06
[Nightly Test] Add a team column to each test config. (#21198)
Please review **e2e.py and test_suite belonging to your team**! 

This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit#

This PR adds a team name to each test suite.

If the name is not specified, it will be reported as unspecified. 

If you are running a local test, and if the new test suite doesn't have a team name specified, it will raise an exception (in this way, we can avoid missing team names in the future).

Note that we will aggregate all test configs into a single file, nightly_test.yaml.
2021-12-27 14:42:41 -08:00
Jun Gong
767f78eaf8
[RLlib] Always attach latest eval metrics. (#21011) 2021-12-15 11:42:53 +01:00
gjoliver
724a140795
[rllib] Make sure json can serialize result dict (#20439)
We may have fields in the result dict that are or None.
Make sure our results are json serializable.
2021-11-17 10:27:00 -08:00
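The message of 724a140795 does not say how the result dict is made serializable. The sketch below shows one possible approach and is not the change made in #20439: the helper name `to_json_safe` and the choice to map non-finite floats to `null` are assumptions.

```python
import json
import math
from typing import Any


def to_json_safe(value: Any) -> Any:
    """Recursively convert `value` into something json.dumps accepts as strict JSON."""
    if isinstance(value, dict):
        return {str(k): to_json_safe(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(v) for v in value]
    if isinstance(value, float) and not math.isfinite(value):
        return None  # NaN and infinity are not valid JSON numbers
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    return str(value)  # last-resort fallback for any other type


result = {"episode_reward_mean": float("nan"), "evaluation": None, "iters": 3}
# allow_nan=False makes json.dumps fail loudly if a NaN slips through.
print(json.dumps(to_json_safe(result), allow_nan=False))
```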
Amog Kamsetty
18dcf1ac25
[Release] Use nightly Docker images (#20001)
* use nightly

* switch ml cpu to ray cpu

* fix

* add pytest

* add more pytest

* add constraint

* add tensorflow

* fix merge conflict

* add tblib

* fix

* add back uninstall
2021-11-10 18:00:16 -08:00
gjoliver
b6b4aaa632
[Release] Fix stress_tests (#20233) 2021-11-10 16:05:46 -08:00
gjoliver
d8a61f801f
[RLlib] Create a set of performance benchmark tests to run nightly. (#19945)
* Create a core set of algorithms tests to run nightly.

* Run release tests under tf, tf2, and torch frameworks.

* Fix

* Add eager_tracing option for tf2 framework.

* make sure core tests can run in parallel.

* cql

* Report progress while running nightly/weekly tests.

* Include SAC in nightly lineup.

* Revert changes to learning_tests

* rebrand to performance test.

* update build_pipeline.py with new performance_tests name.

* Record stats.

* bug fix, need to populate experiments dict.

* Alphabetize yaml files.

* Allow specifying frameworks. And do not run tf2 by default.

* remove some debugging code.

* fix

* Undo testing changes.

* Do not run CQL regression for now.

* LINT.

Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-11-08 18:15:13 +01:00
Amog Kamsetty
3408b60d2b
[Release] Refactor User Tests (#20028)
* wip

* add directory

* wip

* try again

* Revert "try again"

This reverts commit 82d33ccea6f92848df025e019b87df73cea49e5d.

* finish

* formatting

* fix merge

* fix path

* chmod

* check

* sudo

* wip

* update

* fix horovod

* try

* typo

* reduce num workers
2021-11-05 17:28:37 -07:00
gjoliver
2c1fa459d4
[RLlib] Add an RLlib Tune experiment to UserTest suite. (#19807)
* Add an RLlib Tune experiment to UserTest suite.

* Add ray.init()

* Move example script to example/tune/, so it can be imported as module.

* add __init__.py so our new module will get included in python wheel.

* Add block device to RLlib test instances.

* Reduce disk size a little bit.

* Add metrics reporting

* Allow max of 5 workers to accommodate all the worker tasks.

* revert disk size change.

* Minor updates

* Trigger build

* set max num workers

* Add a compute cfg for autoscaled cpu and gpu nodes.

* use 1gpu instance.

* install tblib for debugging worker crashes.

* Manually upgrade to pytorch 1.9.0

* -y

* torch=1.9.0

* install torch on driver

* bump timeout

* Write a more informational result dict.

* Revert changes to compute config files that are not used.

* add smoke test

* update

* reduce timeout

* Reduce the # of env per worker to 1.

* Small fix for getting trial_states

* Trigger build

* simplify result dict

* lint

* more lint

* fix smoke test

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-11-03 17:04:27 -07:00
xwjiang2010
ab15dfd478
[Tune release test] Set 500G disk space for rllib_tests. (#19730) 2021-10-26 10:12:03 -07:00
gjoliver
e9f66cc394
Reduce success criteria for a few learning tests. (#19484) 2021-10-18 15:44:38 -07:00