OSS release tests currently run on a hardcoded Python 3.7 base. In the future, we will want to run tests on different Python versions.
This PR adds support for a new `python` field in the test configuration. The `python` field determines both the base image used in the Buildkite runner Docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments.
Note that in Buildkite, we will still only wait for the Python 3.7 base image before kicking off tests. That is acceptable: we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and then kick off a 3.8 test, that runner will only wait for perhaps 5-10 more minutes.
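A minimal sketch of how a runner could resolve base images from the new field; the helper name and the image tag patterns here are illustrative assumptions, not the actual tooling:

```python
# Illustrative sketch only: shows how a test's "python" field could drive
# image selection. The image tag patterns below are assumptions.
DEFAULT_PYTHON = "3.7"


def resolve_base_images(test_config: dict) -> dict:
    python_version = test_config.get("python", DEFAULT_PYTHON)
    suffix = "py" + python_version.replace(".", "")
    return {
        # Image for the Buildkite runner container (Ray client compatibility).
        "runner_image": f"rayproject/ray:nightly-{suffix}",
        # Base image for the Anyscale cluster environment.
        "cluster_env_image": f"anyscale/ray:nightly-{suffix}",
    }
```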
Fix CQL getting stuck when deprecated timesteps_per_iteration is used (use min_train_timesteps_per_reporting instead).
CQL does not collect sample timesteps, yet the deprecated timesteps_per_iteration is automatically translated into the new min_sample_timesteps_per_reporting. It should instead be translated (only for CQL and other purely offline RL algorithms) into min_train_timesteps_per_reporting.
If timesteps_per_iteration is set, CQL never leaves the first iteration because it thinks it is not done yet (sample timesteps always remain at 0).
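A rough sketch of the intended translation, assuming a helper along these lines in the config deprecation path (only the config key names come from this PR; the function itself is hypothetical):

```python
# Hypothetical sketch of the translation described above; the actual
# deprecation handling in RLlib lives elsewhere and may look different.
def translate_timesteps_per_iteration(config: dict, purely_offline: bool) -> None:
    if "timesteps_per_iteration" not in config:
        return
    value = config.pop("timesteps_per_iteration")
    if purely_offline:
        # CQL and other purely offline algos never collect sample timesteps,
        # so gating iterations on sampled timesteps would never complete.
        config.setdefault("min_train_timesteps_per_reporting", value)
    else:
        config.setdefault("min_sample_timesteps_per_reporting", value)
```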
Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.
Details:
- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
This fixes the previous problems from the team column revert.
It includes two additional changes:
- The alert handler now receives the team argument; its absence was the root cause of the breakage in https://github.com/ray-project/ray/pull/21289
- Previously, tests without a team column raised an exception; I made the condition weaker (warning logs). I will eventually change it back to raise an exception, but for a smoother transition, we will only log a warning for a short time (a sketch of the relaxed check follows below).
Please review **e2e.py and test_suite belonging to your team**!
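A minimal sketch of the relaxed check, assuming a small helper in the release tooling (the function name and log wording are illustrative, not the actual implementation):

```python
import logging

logger = logging.getLogger(__name__)


# Illustrative only: the real check lives in the release tooling (e.g. e2e.py).
def get_team(test: dict) -> str:
    team = test.get("team")
    if not team:
        # Warn for now; this becomes an exception after the transition period.
        logger.warning("Test %s has no team specified.", test.get("name", "<unknown>"))
        return "unspecified"
    return team
```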
This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit#
This PR adds a team name to each test suite.
If the name is not specified, it will be reported as "unspecified".
If you are running a local test and the new test suite doesn't have a team name specified, it will raise an exception (this way, we can avoid missing team names in the future).
Note that we will aggregate all of the test configs into a single file, nightly_test.yaml.
* use nightly
* switch ml cpu to ray cpu
* fix
* add pytest
* add more pytest
* add constraint
* add tensorflow
* fix merge conflict
* add tblib
* fix
* add back uninstall
* Create a core set of algorithms tests to run nightly.
* Run release tests under tf, tf2, and torch frameworks.
* Fix
* Add eager_tracing option for tf2 framework.
* make sure core tests can run in parallel.
* cql
* Report progress while running nightly/weekly tests.
* Include SAC in nightly lineup.
* Revert changes to learning_tests
* rebrand to performance test.
* update build_pipeline.py with new performance_tests name.
* Record stats.
* bug fix, need to populate experiments dict.
* Alphabetize yaml files.
* Allow specifying frameworks. And do not run tf2 by default.
* remove some debugging code.
* fix
* Undo testing changes.
* Do not run CQL regression for now.
* LINT.
Co-authored-by: sven1977 <svenmika1977@gmail.com>
* Add an RLlib Tune experiment to UserTest suite.
* Add ray.init()
* Move example script to example/tune/, so it can be imported as module.
* add __init__.py so our new module will get included in python wheel.
* Add block device to RLlib test instances.
* Reduce disk size a little bit.
* Add metrics reporting
* Allow max of 5 workers to accommodate all the worker tasks.
* revert disk size change.
* Minor updates
* Trigger build
* set max num workers
* Add a compute cfg for autoscaled cpu and gpu nodes.
* use 1gpu instance.
* install tblib for debugging worker crashes.
* Manually upgrade to pytorch 1.9.0
* -y
* torch=1.9.0
* install torch on driver
* bump timeout
* Write a more informational result dict.
* Revert changes to compute config files that are not used.
* add smoke test
* update
* reduce timeout
* Reduce the # of env per worker to 1.
* Small fix for getting trial_states
* Trigger build
* simplify result dict
* lint
* more lint
* fix smoke test
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>