Why are these changes needed?
Also:
Add validation to make sure multi-gpu and micro-batch is not used together.
Update A2C learning test to hit the microbatching branch.
Minor comment updates.
This line:
```
pip3 install -U --force-reinstall xgboost xgboost_ray lightgbm_ray petastorm
```
also re-installs the dependencies of these packages, and the `--force-reinstall` means we overwrite existing ones. This leads us to re-install the latest ray release, overwriting the wheels to be tested:
```
[INFO] 5/31/2022, 12:12:16 AM: Successfully installed ... ray-1.12.1 ...
[INFO] 5/31/2022, 12:12:17 AM: * Executed RUN pip3 install -U --force-reinstall xgboost xgboost_ray petastorm (ff6ae9f9)
```
Instead, we should use `--no-deps` to avoid re-installing dependencies. Also, the wheels sanity check is moved to after installing additional packages in order to catch these errors earlier.
It fixes the mysterious error when all cluster env build is failing when pip uninstall / pip install is written in 2 lines. The root cause will be fixed later
OSS release tests currently run with hardcoded Python 3.7 base. In the future we will want to run tests on different python versions.
This PR adds support for a new `python` field in the test configuration. The python field will determine both the base image used in the Buildkite runner docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments.
Note that in Buildkite, we will still only wait for the python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait maybe for 5-10 more minutes.
Fix CQL getting stuck when deprecated timesteps_per_iteration is used (use min_train_timesteps_per_reporting instead).
CQL does not perform sampling timesteps and the deprecated timesteps_per_iteration is automatically translated into the new min_sample_timesteps_per_reporting, but should be translated (only for CQL and other purely offline RL algos) into min_train_timesteps_per_reporting.
If timesteps_per_iteration, CQL lever leaves the first iteration as it thinks it's not done yet (sample timesteps always remain at 0).
Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.
Details:
- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
This fixes the previous problems from team column revert.
This has 2 additional changes;
alert handler receives the team argument, which was the root cause of breakage; https://github.com/ray-project/ray/pull/21289
Previously, tests without a team column were raising an exception, but I made the condition weaker (warning logs). I will eventually change it to raise an exception, but for smoother transition, we will log warning instead for a short time
Please review **e2e.py and test_suite belonging to your team**!
This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit#
This PR adds a team name to each test suite.
If the name is not specified, it will be reported as unspecified.
If you are running a local test, and if the new test suite doesn't have a team name specified, it will raise an exception (in this way, we can avoid missing team names in the future).
Note that we will aggregate all of test config into a single file, nightly_test.yaml.