Commit graph

36 commits

Author SHA1 Message Date
Kai Fricke
331b71ea8d
[ci/release] Refactor release test e2e into package (#22351)
Adds a unit-tested and restructured ray_release package for running release tests.

Relevant changes in behavior:

Per default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can a) specify a different commit hash, b) a wheels URL (which we will also wait for to be available) or c) specify a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches old default behavior).

The main subpackages are:

    Cluster manager: Creates cluster envs/computes, starts cluster, terminates cluster
    Command runner: Runs commands, e.g. as client command or sdk command
    File manager: Uploads/downloads files to/from session
    Reporter: Reports results (e.g. to database)

Much of the code base is unit tested, but there are probably some pieces missing.

Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_
Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023
2022-02-16 17:35:02 +00:00
Balaji Veeramani
7f1bacc7dc
[CI] Format Python code with Black (#21975)
See #21316 and #21311 for the motivation behind these changes.
2022-01-29 18:41:57 -08:00
SangBin Cho
b1308b1c8c
[Test Infra] Unrevert team col (#21700)
This fixes the previous problems from team column revert.

This has 2 additional changes;

alert handler receives the team argument, which was the root cause of breakage; https://github.com/ray-project/ray/pull/21289

Previously, tests without a team column were raising an exception, but I made the condition weaker (warning logs). I will eventually change it to raise an exception, but for smoother transition, we will log warning instead for a short time
2022-01-19 13:29:53 -08:00
mwtian
0b3fed5ef3
Revert "[Nightly Test] Add a team column to each test config. (#21198)" (#21289)
This reverts commit b5b11b2d06.
2021-12-30 06:44:51 +09:00
SangBin Cho
b5b11b2d06
[Nightly Test] Add a team column to each test config. (#21198)
Please review **e2e.py and test_suite belonging to your team**! 

This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit#

This PR adds a team name to each test suite.

If the name is not specified, it will be reported as unspecified. 

If you are running a local test, and if the new test suite doesn't have a team name specified, it will raise an exception (in this way, we can avoid missing team names in the future).

Note that we will aggregate all of test config into a single file, nightly_test.yaml.
2021-12-27 14:42:41 -08:00
SangBin Cho
140a180ebb
[xgboost] Fix flaky train_small test (#20529)
Xgboosts train_small timed out because of a CPU borrowing feature related to placement groups. The root bug will be fixed in the coming weeks, but this PR makes the release test consistently pass by requesting 0 CPUs for the remote wrapper script.
2021-11-18 10:20:08 +00:00
Kai Fricke
91920f1d02
[release/xgboost] xgboost release test fixes via app config (#20325)
* [xgboost] Fix release test app configs

* Revert full app config

* Update base docker image

* Only change cpu base image

* default

* Pin xgboost to 1.5. in cpu tests

* Remove numpy hack

* Revert one line

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-11-15 10:03:21 -08:00
Amog Kamsetty
18dcf1ac25
[Release] Use nightly Docker images (#20001)
* use nightly

* switch ml cpu to ray cpu

* fix

* add pytest

* add more pytest

* add constraint

* add tensorflow

* fix merge conflict

* add tblib

* fix

* add back uninstall
2021-11-10 18:00:16 -08:00
Amog Kamsetty
3408b60d2b
[Release] Refactor User Tests (#20028)
* wip

* add directory

* wip

* try again

* Revert "try again"

This reverts commit 82d33ccea6f92848df025e019b87df73cea49e5d.

* finish

* formatting

* fix merge

* fix path

* chmod

* check

* sudo

* wip

* update

* fix horovod

* try

* typo

* reduce num workers
2021-11-05 17:28:37 -07:00
Amog Kamsetty
f4b425f84c
[Release/Xgboost] Fix master install (#19991) 2021-11-02 13:50:14 -07:00
Kai Fricke
f96078687f
[xgboost/release] Xgboost/connect gpu test (#19838)
* [xgboost/release] Add GPU connect user test

* Use scaling cluster

* typo

* Increase xgboost placement group timeout

* Much higher timeout

* Move os environment timeout

* Move os environ

* [dev] install xgboost-ray from master

* GPU xgboost master

* Remove master install after new xgboost release

* Install latest

* Add master test
2021-11-02 08:40:48 -07:00
Antoni Baum
e9df253f5d
[CI/docs] Remove [default] from xgboost-ray (#19186)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-10-14 16:29:55 +01:00
Kai Fricke
e08d4253cf
[ci/release] Start cluster before connecting via anyscale connect (#18878) 2021-09-24 16:17:06 +01:00
Antoni Baum
eeb67a42cc
pip install xgboost_ray -> xgboost_ray[default] (#18607)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-09-15 14:45:56 +01:00
Kai Fricke
15a83d104d
[ci/release] remove legacy release tests (#18592) 2021-09-15 14:42:58 +01:00
Kai Fricke
b543c0e923
[ci] Do not use anyscale connect for xgboost_tests/train_small (#18569) 2021-09-13 20:38:00 +01:00
Kai Fricke
7d1e6d3129
[ci/release] Add sanity check for ray wheels hash to release tests (#18489) 2021-09-10 17:50:31 +01:00
Antoni Baum
2c0dcec18f
[test] Fix golden notebook tests always failing (#17873) 2021-08-31 17:07:47 +02:00
Antoni Baum
0a1228ef6e
Add configurable autosuspend for connect tests (#17958) 2021-08-20 10:57:41 +02:00
Clark Zinzow
d958457d07
[Core] Second pass at privatizing APIs. (#17885)
* gcs_utils

* resource_spec

* profiling

* ray_perf and ray_cluster_perf

* test_utils
2021-08-18 20:56:33 -07:00
Kai Fricke
8580e450cb
[release] update/unify base images (#17859) 2021-08-16 12:44:25 +02:00
mwtian
7669708237
Create a wait_for_num_nodes() function, and use it in train_small (#16784) 2021-07-01 10:17:53 +01:00
Kai Fricke
ef97bdd407
[release] Fix app config: Install latest releases. Bump xgboost-ray version (#16581) 2021-06-24 12:56:21 +01:00
mwtian
48599aef9e
Roll forward to run train_small in client mode. (#16610) 2021-06-23 08:52:08 +01:00
Kai Fricke
aecc4c8d28
[release] fix sgd base image, microbenchmark timeout, revert xgboost train_small to not use connect (#16532) 2021-06-18 11:40:04 +01:00
Kai Fricke
9352cb781c
[release tests] Fix microbenchmark base image, network overhead cluster wait time, add long running tests (#16355) 2021-06-16 21:37:17 +01:00
mwtian
2f7d535253
[Test] Use Ray client in XGBoost train_small release test (#16319) 2021-06-16 14:39:32 +01:00
Kai Fricke
153a8b8fec
[release] convert tune release tests (#15913) 2021-06-01 11:19:15 -07:00
Kai Fricke
8db2e5c23a
[release] Move xgboost tune small + microbenchmark release test to new release automation (#15619) 2021-05-08 20:38:39 +01:00
Kai Fricke
1d52ab819f
[release] release 1.3.0 results and test updates (#15366)
Convert a number of release tests and add logs for release 1.3.0
2021-05-04 22:10:04 +01:00
Kai Fricke
7364a7a327
[tune] Move Optuna to ask(fixed_distributions) interface (#14731)
Adjusting to changes in Optuna 2.6.0. Old interface was marked as deprecated.
2021-03-22 12:25:37 +01:00
Ian Rodney
eb12033612
[Code Cleanup] Switch to use ray.util.get_node_ip_address() (#14741)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-03-18 13:10:57 -07:00
Kai Fricke
a0f73cf3f7
[xgboost] Update XGBoost release test configs (#13941)
* Update XGBoost release test configs

* Use GPU containers

* Fix elastic check

* Use spot instances for GPU

* Add debugging output

* Fix success check, failure checking, outputs, sync behavior

* Update release checklist, rename mounts
2021-02-17 23:00:49 +01:00
Kai Fricke
1e113d2e6e
[tune/xgboost] Update release test docs (#13880)
* Update release test docs

* Update
2021-02-04 13:10:56 +01:00
Ameer Haj Ali
b7dd7ddb52
deprecate useless fields in the cluster yaml. (#13637)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first ocmmit

* lint

* exmaple

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

* joblib strikes again on windows

* add ability to not start autoscaler/monitor

* a

* remove worker_default

* Remove default pod type from operator

* Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types

* deprecate useless fields

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>
Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2021-01-23 12:06:51 -08:00
Kai Fricke
8804758409
[xgboost] Add XGBoost release tests (#13456)
* Add XGBoost release tests

* Add more xgboost release tests

* Use failure state manager

* Add release test documentation

* Fix wording

* Automate fault tolerance tests
2021-01-20 18:40:23 +01:00