33 commits

Author SHA1 Message Date
Kai Fricke
1d3c167bfe
[rllib/release] Fix rllib connect test with Tuner() API (#27155)
Currently failing because the Tune framework example does not return fitting results.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-28 11:08:02 +01:00
Amog Kamsetty
862d10c162
[AIR] Remove ML code from ray.util (#27005)
Removes all ML related code from `ray.util`

Removes:
- `ray.util.xgboost`
- `ray.util.lightgbm`
- `ray.util.horovod`
- `ray.util.ray_lightning`

Moves `ray.util.ml_utils` to other locations

Closes #23900

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-07-27 14:24:19 +01:00
Antoni Baum
b9a4f64f32
[AIR/train] Use new Train API (#25735)
Uses the new AIR Train API for examples and tests.

The `Result` object gets a new attribute, `log_dir`, pointing to the Trial's `logdir`, which lets users access TensorBoard logs and artifacts of other loggers.

This PR only deals with "low hanging fruit": tests that need substantial rewriting and the Train user guide are not touched. Those will be updated in follow-up PRs.

Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled.

Requires https://github.com/ray-project/ray/pull/25943 to be merged in first
2022-07-07 12:28:37 -07:00
xwjiang2010
40f9561f78
[ml/release] fix ptl ml user test. (#26365)
Between versions 1 and 2 of [this](https://console.anyscale-staging.com/o/anyscale-internal/configurations/app-config-versions/apt_TsCpJCRjMJDpNFhNgJmyCniS) cluster_env, version 1 fails and version 2 succeeds.

By the way, we really should start thinking about a systematic approach to our Python dependency story:
- between client and server
- more importantly, on the server side, including any conflicts among requirements
- how `pip freeze` results evolve over time
2022-07-07 11:45:46 -07:00
Kai Fricke
e2d8e7a6ae
[ci/release/ml] Run ML release tests on staging (#26168)
This moves all ML release tests to staging.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-06-30 13:24:28 -07:00
matthewdeng
5c6b91d375
[Release] fix Horovod release tests (#25873)
Error message suggests:

Wait timeout after 30 seconds for key(s): 0. You may want to increase the timeout via HOROVOD_GLOO_TIMEOUT_SECONDS

Bumped up to 120 seconds.

Tests run successfully: https://buildkite.com/ray-project/release-tests-pr/builds/6906
2022-06-17 14:52:54 +01:00
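
The timeout bump in 5c6b91d375 can be illustrated with a minimal sketch. The commit does not say where the variable is set (app config, driver script, or elsewhere), so the helper below, `set_gloo_timeout`, is a hypothetical name and only shows the general idea: export `HOROVOD_GLOO_TIMEOUT_SECONDS` before Horovod is initialized.

```python
import os


def set_gloo_timeout(seconds=120, env=None):
    """Set HOROVOD_GLOO_TIMEOUT_SECONDS in the given environment mapping.

    Hypothetical helper: the variable name comes from the error message,
    everything else is an assumption for illustration.
    """
    # Default to the current process environment when no mapping is passed.
    env = os.environ if env is None else env
    # Environment variables are always strings.
    env["HOROVOD_GLOO_TIMEOUT_SECONDS"] = str(seconds)
    return env["HOROVOD_GLOO_TIMEOUT_SECONDS"]


if __name__ == "__main__":
    # Must run before Horovod starts so that worker processes inherit the value.
    print(set_gloo_timeout(120))
```

Passing an explicit `env` mapping keeps the helper easy to test without touching the real process environment.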
SangBin Cho
ec653e3196
[Nightly test] Move two line downloads to one line. (#25061)
Fixes the unexplained error where all cluster env builds fail when `pip uninstall` and `pip install` are written on two separate lines. The root cause will be fixed later.
2022-05-22 00:07:03 -07:00
Kai Fricke
6c5229295e
[ci/release] Support running tests with different python versions (#24843)
OSS release tests currently run with hardcoded Python 3.7 base. In the future we will want to run tests on different python versions. 
This PR adds support for a new `python` field in the test configuration. The python field will determine both the base image used in the Buildkite runner docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments. 

Note that in Buildkite, we will still only wait for the python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait maybe for 5-10 more minutes.
2022-05-17 17:03:12 +01:00
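
As a rough sketch of the `python` field described in 6c5229295e: one value drives both the Buildkite runner image and the cluster environment base image. The function name, the default handling, and the image names below are all hypothetical placeholders, not the names used by the release tooling.

```python
DEFAULT_PYTHON = "3.7"  # the previously hardcoded base, per the commit message


def resolve_base_images(test_config):
    """Derive both base images from the test's `python` field.

    Hypothetical sketch: image names are placeholders.
    """
    # Fall back to the old hardcoded version when the field is absent.
    python = str(test_config.get("python", DEFAULT_PYTHON))
    tag = "py" + python.replace(".", "")
    return {
        "runner_image": f"example/release-runner:{tag}",
        "cluster_env_base": f"example/cluster-base:{tag}",
    }


if __name__ == "__main__":
    print(resolve_base_images({"name": "some_test", "python": "3.8"}))
```

If the configuration is YAML, quoting the version is the safer choice, since an unquoted `3.10` is parsed as the float `3.1`.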
Kai Fricke
de69b0d6d6
[train/release] Fix horovod user test master app config (#24734)
2022-05-14 21:20:45 -07:00
Kai Fricke
8a578c191f
[ci/release] Re-install anyscale package after local env setup (#24373)
The local environment setup of release tests (in client tests) can sometimes update dependencies of the `anyscale` package to an unsupported version. By re-installing the `anyscale` package after local env setup, we make sure that we can connect to the cluster. Note that this may lead to incompatibilities of the test script, however.
2022-05-01 16:51:55 +01:00
Kai Fricke
ac036e4fe8
[ci/release] Print local environment information (#24346)
For debugging client environments, it is helpful to print the installed pip packages.
This PR also adds a fix for the environment of the `ml_user_tune_rllib_connect_test` and reports anyscale import errors verbosely to help debug missing packages.
2022-04-29 21:01:50 +01:00
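
Printing the installed packages, as ac036e4fe8 describes, could look roughly like the following. This is a sketch using the standard library's `importlib.metadata` (Python 3.8+); the commit does not say how the release tooling gathers the list, and the function names here are hypothetical.

```python
from importlib import metadata


def installed_packages():
    """Return sorted `name==version` strings for the current environment."""
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with broken or missing metadata.
        if name:
            packages.add(f"{name}=={dist.version}")
    return sorted(packages, key=str.lower)


def print_environment():
    """Print one package per line, similar in spirit to `pip freeze`."""
    for line in installed_packages():
        print(line)


if __name__ == "__main__":
    print_environment()
```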
Amog Kamsetty
47243ace7c
[Release] Upgrade instance types for xgboost gpu release tests (#24002)
In xgboost 1.6, support for older GPU architectures was removed (dmlc/xgboost#7767).

This PR updates the instance types used in our xgboost-ray gpu release tests to use Volta GPUs instead of Kepler GPUs so that xgboost-ray can run successfully with xgboost v1.6.

Closes #24048
2022-04-20 15:18:22 -07:00
Kai Fricke
8608b64885
[ci/release] Remove old OSS release test infrastructure (#23134)
Now that we've migrated all OSS release tests to the new infrastructure, we can remove old config files and infra scripts.
2022-03-14 15:10:52 +00:00
Kai Fricke
830238cce2
[ci/release] Migrate ML user tests (#22953)
Most recent tests:

https://buildkite.com/ray-project/release-tests-branch/builds/156
https://buildkite.com/ray-project/release-tests-branch/builds/158
2022-03-14 11:50:16 +00:00
xwjiang2010
ee7a458762
[release test] fix horovod release test. (#22781)
`horovod_user_test_master` is failing with the recent horovod release ([link](https://buildkite.com/ray-project/periodic-ci/builds/2960#61dabda8-eea0-4b7b-93bf-9e341926d3fd)).
The error message says:
```
AttributeError: Can't get attribute '_ExecutorDriver' on <module 'horovod.ray.runner' from '/home/ray/anaconda3/lib/python3.7/site-packages/horovod/ray/runner.py'>
```
The horovod test has two parts: the "driver" (a.k.a. client), which is the code that runs in a Buildkite agent, and the "cluster" (a.k.a. server), which runs in an Anyscale cluster. The driver's dependencies are specified by `release/ml_user_tests/horovod/driver_setup_master.sh`, while the cluster's dependencies are specified by `release/horovod_tests/app_config_master.yaml`.

The two communicate via the Anyscale client.
The error above means that the client's horovod version has `_ExecutorDriver` in `runner.py` while the server's does not. This is due to a horovod version mismatch between the two files. This PR makes both files point to horovod master.
2022-03-03 08:24:26 -08:00
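
The driver/cluster mismatch behind ee7a458762 is the kind of problem a simple comparison can surface early. The sketch below assumes the pins from both sides have already been parsed into dictionaries; the function name and the example versions are hypothetical, and parsing the actual setup script or app config is out of scope.

```python
def find_mismatches(driver, cluster):
    """Return packages whose pinned version differs between driver and cluster.

    Each argument maps package name -> pinned version (or git ref).
    The result maps package name -> (driver_pin, cluster_pin).
    """
    mismatches = {}
    for pkg in sorted(set(driver) | set(cluster)):
        driver_pin, cluster_pin = driver.get(pkg), cluster.get(pkg)
        # A package missing on one side shows up with None for that side.
        if driver_pin != cluster_pin:
            mismatches[pkg] = (driver_pin, cluster_pin)
    return mismatches


if __name__ == "__main__":
    # Hypothetical pins, for illustration only.
    print(find_mismatches({"horovod": "master"}, {"horovod": "0.23.0"}))
```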
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
Amog Kamsetty
bd726aab02
[Release] Disable caching for ray_lightning (#21886)
Passing tests: https://buildkite.com/ray-project/periodic-ci/builds/2560#_

Add an echo timestamp to the post build commands of the ray lightning release tests to trigger a cluster env rebuild and get the latest version of ray lightning. Without this, the cluster env is cached, so an outdated version is installed on the cluster that differs from the one on the driver, resulting in the failures below.

Closes #21871
Closes #21863

Also reinstalls the dependencies in the post build commands so old versions are not cached in the Docker images
2022-01-27 17:56:32 -08:00
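
The cache-busting trick in bd726aab02 can be sketched as follows, assuming the post build commands are available as a list of shell strings and that any change to that list yields a new cluster env build. The helper name and the example command are hypothetical.

```python
import time


def with_cache_buster(post_build_cmds, now=None):
    """Prepend an `echo <timestamp>` command so the command list is unique.

    Assumption: the cluster env cache is keyed on the command list, so a
    changing timestamp forces a rebuild.
    """
    stamp = int(time.time()) if now is None else int(now)
    # Return a new list and leave the caller's list untouched.
    return [f"echo {stamp}"] + list(post_build_cmds)


if __name__ == "__main__":
    # Hypothetical post build command, for illustration only.
    print(with_cache_buster(["pip install -U some-package"]))
```

The trade-off is that every run pays for a full rebuild, which is the point here: the latest package version is always installed.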
SangBin Cho
b1308b1c8c
[Test Infra] Unrevert team col (#21700)
This fixes the previous problems from team column revert.

This has two additional changes:

The alert handler now receives the team argument, which was the root cause of the breakage: https://github.com/ray-project/ray/pull/21289

Previously, tests without a team column raised an exception. This condition is now weaker: a warning is logged instead. It will eventually raise an exception again, but for a smoother transition a warning is logged for a short time.
2022-01-19 13:29:53 -08:00
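
The relaxed team check from b1308b1c8c might look like the sketch below. The function name, the `strict` flag, and the exception type are hypothetical; the "unspecified" fallback follows the wording of the original team column commit (b5b11b2d06).

```python
import logging

logger = logging.getLogger(__name__)


def get_team(test_name, test_config, strict=False):
    """Return the test's team, warning (or raising) when it is missing."""
    team = test_config.get("team")
    if team:
        return team
    message = f"Test '{test_name}' has no team specified."
    if strict:
        # The stricter behavior the commit plans to restore later.
        raise ValueError(message)
    # Transitional behavior: log a warning and report the team as unspecified.
    logger.warning(message)
    return "unspecified"


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    print(get_team("some_test", {}))
```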
mwtian
0b3fed5ef3
Revert "[Nightly Test] Add a team column to each test config. (#21198)" (#21289)
This reverts commit b5b11b2d06.
2021-12-30 06:44:51 +09:00
SangBin Cho
b5b11b2d06
[Nightly Test] Add a team column to each test config. (#21198)
Please review **e2e.py and test_suite belonging to your team**! 

This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit#

This PR adds a team name to each test suite.

If the name is not specified, it will be reported as unspecified. 

If you are running a local test, and if the new test suite doesn't have a team name specified, it will raise an exception (in this way, we can avoid missing team names in the future).

Note that we will aggregate all test configs into a single file, nightly_test.yaml.
2021-12-27 14:42:41 -08:00
architkulkarni
2489b17634
[release] Uninstall old ray in all release test app configs to fix commit mismatch error (#21175)
* uninstall old ray in all release test app configs

* add instruction to e2e.py docstring
2021-12-18 16:58:49 -08:00
Kai Fricke
b58f839534
[ci/release] Remove hard numpy removal from app configs (#21005)
2021-12-13 15:22:02 +00:00
Amog Kamsetty
99ed623371
[Release] Use NCCL backend for release tests (#20677)
* use nccl for release tests

* link issue
2021-11-29 12:42:13 -08:00
Antoni Baum
a8d7897a56
[CI] Modify remote wrapper in XGBoost-Ray client test (#20544)
Instead of wrapping the whole training run in a remote call, we only query the files on the node in a remote call. XGBoost-Ray is then started from the local node.
2021-11-24 10:27:17 +00:00
Richard Liaw
1cadd61917
Fix horovod failing tests by pinning down (#20484)
2021-11-17 13:54:25 -08:00
Amog Kamsetty
7e597814aa
[Release] Fix app config for horovod_tests (#20393)
Fixes `horovod_test` weekly test

Closes https://github.com/ray-project/ray/issues/20382
2021-11-16 09:06:42 -08:00
Kai Fricke
91920f1d02
[release/xgboost] xgboost release test fixes via app config (#20325)
* [xgboost] Fix release test app configs

* Revert full app config

* Update base docker image

* Only change cpu base image

* default

* Pin xgboost to 1.5. in cpu tests

* Remove numpy hack

* Revert one line

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-11-15 10:03:21 -08:00
matthewdeng
ed3cbe48f5
[train][xgboost][release] fix ml_user_tests using ray client (#20345)
2021-11-15 15:24:23 +00:00
matthewdeng
e22632dabc
[train] wrap BackendExecutor in ray.remote() (#20123)
* [train] wrap BackendExecutor in ray.remote()

* wip

* fix trainer tests

* move CheckpointManager to Trainer

* [tune] move force_on_current_node to ml_utils

* fix import

* force on head node

* init ray

* split test files

* update example

* move tests to ray client

* address comments

* move comment

* address comments
2021-11-13 15:30:44 -08:00
Amog Kamsetty
4396419a64
[Release] Fix tune_rllib connect test (#20321)
* [Release] Fix tune_rllib connect test

* use canonical app config
2021-11-13 10:11:20 -08:00
Amog Kamsetty
18dcf1ac25
[Release] Use nightly Docker images (#20001)
* use nightly

* switch ml cpu to ray cpu

* fix

* add pytest

* add more pytest

* add constraint

* add tensorflow

* fix merge conflict

* add tblib

* fix

* add back uninstall
2021-11-10 18:00:16 -08:00
Amog Kamsetty
f164f3a8b5
[Release] Increase Placement Group timeout (#20224)
2021-11-10 13:02:38 -08:00
Amog Kamsetty
3408b60d2b
[Release] Refactor User Tests (#20028)
* wip

* add directory

* wip

* try again

* Revert "try again"

This reverts commit 82d33ccea6f92848df025e019b87df73cea49e5d.

* finish

* formatting

* fix merge

* fix path

* chmod

* check

* sudo

* wip

* update

* fix horovod

* try

* typo

* reduce num workers
2021-11-05 17:28:37 -07:00