hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Amog Kamsetty	47243ace7c	[Release] Upgrade instance types for xgboost gpu release tests (#24002) In xgboost 1.6, support for older GPU architectures was removed (dmlc/xgboost#7767). This PR updates the instance types used in our xgboost-ray gpu release tests to use Volta GPUs instead of Kepler GPUs so that xgboost-ray can run successfully with xgboost v1.6. Closes #24048	2022-04-20 15:18:22 -07:00
Kai Fricke	8608b64885	[ci/release] Remove old OSS release test infrastructure (#23134) Now that we've migrated all OSS release tests to the new infrastructure, we can remove old config files and infra scripts.	2022-03-14 15:10:52 +00:00
Kai Fricke	18d535f290	[ci/release] Migrate LightGBM tests (#22952) Note that LightGBM release tests were previously not enabled. https://buildkite.com/ray-project/release-tests-branch/builds/113 https://buildkite.com/ray-project/release-tests-branch/builds/114 Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-03-10 08:14:31 +00:00
Kai Fricke	331b71ea8d	[ci/release] Refactor release test e2e into package (#22351) Adds a unit-tested and restructured ray_release package for running release tests. Relevant changes in behavior: Per default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can a) specify a different commit hash, b) a wheels URL (which we will also wait for to be available) or c) specify a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches old default behavior). The main subpackages are: Cluster manager: Creates cluster envs/computes, starts cluster, terminates cluster Command runner: Runs commands, e.g. as client command or sdk command File manager: Uploads/downloads files to/from session Reporter: Reports results (e.g. to database) Much of the code base is unit tested, but there are probably some pieces missing. Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_ Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023	2022-02-16 17:35:02 +00:00
Balaji Veeramani	7f1bacc7dc	[CI] Format Python code with Black (#21975) See #21316 and #21311 for the motivation behind these changes.	2022-01-29 18:41:57 -08:00
SangBin Cho	b1308b1c8c	[Test Infra] Unrevert team col (#21700) This fixes the previous problems from team column revert. This has 2 additional changes; alert handler receives the team argument, which was the root cause of breakage; https://github.com/ray-project/ray/pull/21289 Previously, tests without a team column were raising an exception, but I made the condition weaker (warning logs). I will eventually change it to raise an exception, but for smoother transition, we will log warning instead for a short time	2022-01-19 13:29:53 -08:00
mwtian	0b3fed5ef3	Revert "[Nightly Test] Add a team column to each test config. (#21198)" (#21289) This reverts commit `b5b11b2d06`.	2021-12-30 06:44:51 +09:00
SangBin Cho	b5b11b2d06	[Nightly Test] Add a team column to each test config. (#21198) Please review e2e.py and test_suite belonging to your team! This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit# This PR adds a team name to each test suite. If the name is not specified, it will be reported as unspecified. If you are running a local test, and if the new test suite doesn't have a team name specified, it will raise an exception (in this way, we can avoid missing team names in the future). Note that we will aggregate all of test config into a single file, nightly_test.yaml.	2021-12-27 14:42:41 -08:00
SangBin Cho	140a180ebb	[xgboost] Fix flaky train_small test (#20529) Xgboosts train_small timed out because of a CPU borrowing feature related to placement groups. The root bug will be fixed in the coming weeks, but this PR makes the release test consistently pass by requesting 0 CPUs for the remote wrapper script.	2021-11-18 10:20:08 +00:00
Kai Fricke	91920f1d02	[release/xgboost] xgboost release test fixes via app config (#20325) * [xgboost] Fix release test app configs * Revert full app config * Update base docker image * Only change cpu base image * default * Pin xgboost to 1.5. in cpu tests * Remove numpy hack * Revert one line Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>	2021-11-15 10:03:21 -08:00
Amog Kamsetty	18dcf1ac25	[Release] Use nightly Docker images (#20001) * use nightly * switch ml cpu to ray cpu * fix * add pytest * add more pytest * add constraint * add tensorflow * fix merge conflict * add tblib * fix * add back uninstall	2021-11-10 18:00:16 -08:00
Amog Kamsetty	3408b60d2b	[Release] Refactor User Tests (#20028) * wip * add directory * wip * try again * Revert "try again" This reverts commit 82d33ccea6f92848df025e019b87df73cea49e5d. * finish * formatting * fix merge * fix path * chmod * check * sudo * wip * update * fix horovod * try * typo * reduce num workers	2021-11-05 17:28:37 -07:00
Amog Kamsetty	f4b425f84c	[Release/Xgboost] Fix master install (#19991)	2021-11-02 13:50:14 -07:00
Kai Fricke	f96078687f	[xgboost/release] Xgboost/connect gpu test (#19838) * [xgboost/release] Add GPU connect user test * Use scaling cluster * typo * Increase xgboost placement group timeout * Much higher timeout * Move os environment timeout * Move os environ * [dev] install xgboost-ray from master * GPU xgboost master * Remove master install after new xgboost release * Install latest * Add master test	2021-11-02 08:40:48 -07:00
Antoni Baum	e9df253f5d	[CI/docs] Remove [default] from xgboost-ray (#19186) Co-authored-by: Kai Fricke <kai@anyscale.com>	2021-10-14 16:29:55 +01:00
Kai Fricke	e08d4253cf	[ci/release] Start cluster before connecting via anyscale connect (#18878)	2021-09-24 16:17:06 +01:00
Antoni Baum	eeb67a42cc	pip install xgboost_ray -> xgboost_ray[default] (#18607) Co-authored-by: Kai Fricke <kai@anyscale.com>	2021-09-15 14:45:56 +01:00
Kai Fricke	15a83d104d	[ci/release] remove legacy release tests (#18592)	2021-09-15 14:42:58 +01:00
Kai Fricke	b543c0e923	[ci] Do not use anyscale connect for xgboost_tests/train_small (#18569)	2021-09-13 20:38:00 +01:00
Kai Fricke	7d1e6d3129	[ci/release] Add sanity check for ray wheels hash to release tests (#18489)	2021-09-10 17:50:31 +01:00
Antoni Baum	2c0dcec18f	[test] Fix golden notebook tests always failing (#17873)	2021-08-31 17:07:47 +02:00
Antoni Baum	0a1228ef6e	Add configurable autosuspend for connect tests (#17958)	2021-08-20 10:57:41 +02:00
Clark Zinzow	d958457d07	[Core] Second pass at privatizing APIs. (#17885) * gcs_utils * resource_spec * profiling * ray_perf and ray_cluster_perf * test_utils	2021-08-18 20:56:33 -07:00
Kai Fricke	8580e450cb	[release] update/unify base images (#17859)	2021-08-16 12:44:25 +02:00
mwtian	7669708237	Create a wait_for_num_nodes() function, and use it in `train_small` (#16784)	2021-07-01 10:17:53 +01:00
Kai Fricke	ef97bdd407	[release] Fix app config: Install latest releases. Bump xgboost-ray version (#16581)	2021-06-24 12:56:21 +01:00
mwtian	48599aef9e	Roll forward to run train_small in client mode. (#16610)	2021-06-23 08:52:08 +01:00
Kai Fricke	aecc4c8d28	[release] fix sgd base image, microbenchmark timeout, revert xgboost train_small to not use connect (#16532)	2021-06-18 11:40:04 +01:00
Kai Fricke	9352cb781c	[release tests] Fix microbenchmark base image, network overhead cluster wait time, add long running tests (#16355)	2021-06-16 21:37:17 +01:00
mwtian	2f7d535253	[Test] Use Ray client in XGBoost train_small release test (#16319)	2021-06-16 14:39:32 +01:00
Kai Fricke	153a8b8fec	[release] convert tune release tests (#15913)	2021-06-01 11:19:15 -07:00
Kai Fricke	8db2e5c23a	[release] Move xgboost tune small + microbenchmark release test to new release automation (#15619)	2021-05-08 20:38:39 +01:00
Kai Fricke	1d52ab819f	[release] release 1.3.0 results and test updates (#15366) Convert a number of release tests and add logs for release 1.3.0	2021-05-04 22:10:04 +01:00
Kai Fricke	7364a7a327	[tune] Move Optuna to ask(fixed_distributions) interface (#14731) Adjusting to changes in Optuna 2.6.0. Old interface was marked as deprecated.	2021-03-22 12:25:37 +01:00
Ian Rodney	eb12033612	[Code Cleanup] Switch to use ray.util.get_node_ip_address() (#14741) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2021-03-18 13:10:57 -07:00
Kai Fricke	a0f73cf3f7	[xgboost] Update XGBoost release test configs (#13941) * Update XGBoost release test configs * Use GPU containers * Fix elastic check * Use spot instances for GPU * Add debugging output * Fix success check, failure checking, outputs, sync behavior * Update release checklist, rename mounts	2021-02-17 23:00:49 +01:00
Kai Fricke	1e113d2e6e	[tune/xgboost] Update release test docs (#13880) * Update release test docs * Update	2021-02-04 13:10:56 +01:00
Ameer Haj Ali	b7dd7ddb52	deprecate useless fields in the cluster yaml. (#13637) * prepare for head node * move command runner interface outside _private * remove space * Eric * flake * min_workers in multi node type * fixing edge cases * eric not idle * fix target_workers to consider min_workers of node types * idle timeout * minor * minor fix * test * lint * eric v2 * eric 3 * min_workers constraint before bin packing * Update resource_demand_scheduler.py * Revert "Update resource_demand_scheduler.py" This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5. * reducing diff * make get_nodes_to_launch return a dict * merge * weird merge fix * auto fill instance types for AWS * Alex/Eric * Update doc/source/cluster/autoscaling.rst * merge autofill and input from user * logger.exception * make the yaml use the default autofill * docs Eric * remove test_autoscaler_yaml from windows tests * lets try changing the test a bit * return test * lets see * edward * Limit max launch concurrency * commenting frac TODO * move to resource demand scheduler * use STATUS UP TO DATE * Eric * make logger of gc freed refs debug instead of info * add cluster name to docker mount prefix directory * grrR * fix tests * moving docker directory to sdk * move the import to prevent circular dependency * smallf fix * ian * fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running * small fix * deflake test_joblib * lint * placement groups bypass * remove space * Eric * first ocmmit * lint * exmaple * documentation * hmm * file path fix * fix test * some format issue in docs * modified docs * joblib strikes again on windows * add ability to not start autoscaler/monitor * a * remove worker_default * Remove default pod type from operator * Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types * deprecate useless fields Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan> Co-authored-by: Alex Wu <alex@anyscale.io> Co-authored-by: Alex Wu <itswu.alex@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local> Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal> Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>	2021-01-23 12:06:51 -08:00
Kai Fricke	8804758409	[xgboost] Add XGBoost release tests (#13456) * Add XGBoost release tests * Add more xgboost release tests * Use failure state manager * Add release test documentation * Fix wording * Automate fault tolerance tests	2021-01-20 18:40:23 +01:00

39 commits