This line:
```
pip3 install -U --force-reinstall xgboost xgboost_ray lightgbm_ray petastorm
```
also re-installs the dependencies of these packages, and the `--force-reinstall` means we overwrite existing ones. This leads us to re-install the latest ray release, overwriting the wheels to be tested:
```
[INFO] 5/31/2022, 12:12:16 AM: Successfully installed ... ray-1.12.1 ...
[INFO] 5/31/2022, 12:12:17 AM: * Executed RUN pip3 install -U --force-reinstall xgboost xgboost_ray petastorm (ff6ae9f9)
```
Instead, we should use `--no-deps` to avoid re-installing dependencies. Also, the wheels sanity check is moved to after installing additional packages in order to catch these errors earlier.
Many release tests are currently failing for cuda version incompatibilities. Pinning the base image to 1.12.1 seems to resolve the problem for the time being.
It fixes the mysterious error when all cluster env build is failing when pip uninstall / pip install is written in 2 lines. The root cause will be fixed later
OSS release tests currently run with hardcoded Python 3.7 base. In the future we will want to run tests on different python versions.
This PR adds support for a new `python` field in the test configuration. The python field will determine both the base image used in the Buildkite runner docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments.
Note that in Buildkite, we will still only wait for the python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait maybe for 5-10 more minutes.
In xgboost 1.6, support for older GPU architectures was removed (dmlc/xgboost#7767).
This PR updates the instance types used in our xgboost-ray gpu release tests to use Volta GPUs instead of Kepler GPUs so that xgboost-ray can run successfully with xgboost v1.6.
Closes#24048
Adds a unit-tested and restructured ray_release package for running release tests.
Relevant changes in behavior:
Per default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can a) specify a different commit hash, b) a wheels URL (which we will also wait for to be available) or c) specify a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches old default behavior).
The main subpackages are:
Cluster manager: Creates cluster envs/computes, starts cluster, terminates cluster
Command runner: Runs commands, e.g. as client command or sdk command
File manager: Uploads/downloads files to/from session
Reporter: Reports results (e.g. to database)
Much of the code base is unit tested, but there are probably some pieces missing.
Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_
Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023
This fixes the previous problems from team column revert.
This has 2 additional changes;
alert handler receives the team argument, which was the root cause of breakage; https://github.com/ray-project/ray/pull/21289
Previously, tests without a team column were raising an exception, but I made the condition weaker (warning logs). I will eventually change it to raise an exception, but for smoother transition, we will log warning instead for a short time
Please review **e2e.py and test_suite belonging to your team**!
This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit#
This PR adds a team name to each test suite.
If the name is not specified, it will be reported as unspecified.
If you are running a local test, and if the new test suite doesn't have a team name specified, it will raise an exception (in this way, we can avoid missing team names in the future).
Note that we will aggregate all of test config into a single file, nightly_test.yaml.
Xgboosts train_small timed out because of a CPU borrowing feature related to placement groups. The root bug will be fixed in the coming weeks, but this PR makes the release test consistently pass by requesting 0 CPUs for the remote wrapper script.
* [xgboost] Fix release test app configs
* Revert full app config
* Update base docker image
* Only change cpu base image
* default
* Pin xgboost to 1.5. in cpu tests
* Remove numpy hack
* Revert one line
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
* use nightly
* switch ml cpu to ray cpu
* fix
* add pytest
* add more pytest
* add constraint
* add tensorflow
* fix merge conflict
* add tblib
* fix
* add back uninstall
* [xgboost/release] Add GPU connect user test
* Use scaling cluster
* typo
* Increase xgboost placement group timeout
* Much higher timeout
* Move os environment timeout
* Move os environ
* [dev] install xgboost-ray from master
* GPU xgboost master
* Remove master install after new xgboost release
* Install latest
* Add master test
* prepare for head node
* move command runner interface outside _private
* remove space
* Eric
* flake
* min_workers in multi node type
* fixing edge cases
* eric not idle
* fix target_workers to consider min_workers of node types
* idle timeout
* minor
* minor fix
* test
* lint
* eric v2
* eric 3
* min_workers constraint before bin packing
* Update resource_demand_scheduler.py
* Revert "Update resource_demand_scheduler.py"
This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.
* reducing diff
* make get_nodes_to_launch return a dict
* merge
* weird merge fix
* auto fill instance types for AWS
* Alex/Eric
* Update doc/source/cluster/autoscaling.rst
* merge autofill and input from user
* logger.exception
* make the yaml use the default autofill
* docs Eric
* remove test_autoscaler_yaml from windows tests
* lets try changing the test a bit
* return test
* lets see
* edward
* Limit max launch concurrency
* commenting frac TODO
* move to resource demand scheduler
* use STATUS UP TO DATE
* Eric
* make logger of gc freed refs debug instead of info
* add cluster name to docker mount prefix directory
* grrR
* fix tests
* moving docker directory to sdk
* move the import to prevent circular dependency
* smallf fix
* ian
* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running
* small fix
* deflake test_joblib
* lint
* placement groups bypass
* remove space
* Eric
* first ocmmit
* lint
* exmaple
* documentation
* hmm
* file path fix
* fix test
* some format issue in docs
* modified docs
* joblib strikes again on windows
* add ability to not start autoscaler/monitor
* a
* remove worker_default
* Remove default pod type from operator
* Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types
* deprecate useless fields
Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>
Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>