hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Kai Fricke	1ed8bd0345	[release/xgboost/lightgbm] Fix app config dependency install overwriting ray (#25307) This line: `pip3 install -U --force-reinstall xgboost xgboost_ray lightgbm_ray petastorm` also re-installs the dependencies of these packages, and the `--force-reinstall` means we overwrite existing ones. This leads us to re-install the latest ray release, overwriting the wheels to be tested: `[INFO] 5/31/2022, 12:12:16 AM: Successfully installed ... ray-1.12.1 ... [INFO] 5/31/2022, 12:12:17 AM: * Executed RUN pip3 install -U --force-reinstall xgboost xgboost_ray petastorm (ff6ae9f9)` Instead, we should use `--no-deps` to avoid re-installing dependencies. Also, the wheels sanity check is moved to after installing additional packages in order to catch these errors earlier.	2022-05-31 13:46:17 +02:00
Kai Fricke	2cf20e5406	[ci/release] Use 1.12.1 as base image in app configs (#25216) Many release tests are currently failing for cuda version incompatibilities. Pinning the base image to 1.12.1 seems to resolve the problem for the time being.	2022-05-26 18:58:20 +02:00
SangBin Cho	ec653e3196	[Nightly test] Move two line downloads to one line. (#25061) It fixes the mysterious error when all cluster env build is failing when pip uninstall / pip install is written in 2 lines. The root cause will be fixed later	2022-05-22 00:07:03 -07:00
Kai Fricke	6c5229295e	[ci/release] Support running tests with different python versions (#24843) OSS release tests currently run with hardcoded Python 3.7 base. In the future we will want to run tests on different python versions. This PR adds support for a new `python` field in the test configuration. The python field will determine both the base image used in the Buildkite runner docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments. Note that in Buildkite, we will still only wait for the python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait maybe for 5-10 more minutes.	2022-05-17 17:03:12 +01:00
Amog Kamsetty	47243ace7c	[Release] Upgrade instance types for xgboost gpu release tests (#24002) In xgboost 1.6, support for older GPU architectures was removed (dmlc/xgboost#7767). This PR updates the instance types used in our xgboost-ray gpu release tests to use Volta GPUs instead of Kepler GPUs so that xgboost-ray can run successfully with xgboost v1.6. Closes #24048	2022-04-20 15:18:22 -07:00
Kai Fricke	8608b64885	[ci/release] Remove old OSS release test infrastructure (#23134) Now that we've migrated all OSS release tests to the new infrastructure, we can remove old config files and infra scripts.	2022-03-14 15:10:52 +00:00
Kai Fricke	18d535f290	[ci/release] Migrate LightGBM tests (#22952) Note that LightGBM release tests were previously not enabled. https://buildkite.com/ray-project/release-tests-branch/builds/113 https://buildkite.com/ray-project/release-tests-branch/builds/114 Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-03-10 08:14:31 +00:00
Kai Fricke	331b71ea8d	[ci/release] Refactor release test e2e into package (#22351) Adds a unit-tested and restructured ray_release package for running release tests. Relevant changes in behavior: Per default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can a) specify a different commit hash, b) a wheels URL (which we will also wait for to be available) or c) specify a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches old default behavior). The main subpackages are: Cluster manager (creates cluster envs/computes, starts cluster, terminates cluster); Command runner (runs commands, e.g. as client command or sdk command); File manager (uploads/downloads files to/from session); Reporter (reports results, e.g. to database). Much of the code base is unit tested, but there are probably some pieces missing. Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_ Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023	2022-02-16 17:35:02 +00:00
Balaji Veeramani	7f1bacc7dc	[CI] Format Python code with Black (#21975) See #21316 and #21311 for the motivation behind these changes.	2022-01-29 18:41:57 -08:00
SangBin Cho	b1308b1c8c	[Test Infra] Unrevert team col (#21700) This fixes the previous problems from team column revert. This has 2 additional changes; alert handler receives the team argument, which was the root cause of breakage; https://github.com/ray-project/ray/pull/21289 Previously, tests without a team column were raising an exception, but I made the condition weaker (warning logs). I will eventually change it to raise an exception, but for smoother transition, we will log warning instead for a short time	2022-01-19 13:29:53 -08:00
mwtian	0b3fed5ef3	Revert "[Nightly Test] Add a team column to each test config. (#21198)" (#21289) This reverts commit `b5b11b2d06`.	2021-12-30 06:44:51 +09:00
SangBin Cho	b5b11b2d06	[Nightly Test] Add a team column to each test config. (#21198) Please review e2e.py and test_suite belonging to your team! This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit# This PR adds a team name to each test suite. If the name is not specified, it will be reported as unspecified. If you are running a local test, and if the new test suite doesn't have a team name specified, it will raise an exception (in this way, we can avoid missing team names in the future). Note that we will aggregate all of test config into a single file, nightly_test.yaml.	2021-12-27 14:42:41 -08:00
SangBin Cho	140a180ebb	[xgboost] Fix flaky train_small test (#20529) Xgboosts train_small timed out because of a CPU borrowing feature related to placement groups. The root bug will be fixed in the coming weeks, but this PR makes the release test consistently pass by requesting 0 CPUs for the remote wrapper script.	2021-11-18 10:20:08 +00:00
Kai Fricke	91920f1d02	[release/xgboost] xgboost release test fixes via app config (#20325) * [xgboost] Fix release test app configs * Revert full app config * Update base docker image * Only change cpu base image * default * Pin xgboost to 1.5. in cpu tests * Remove numpy hack * Revert one line Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>	2021-11-15 10:03:21 -08:00
Amog Kamsetty	18dcf1ac25	[Release] Use nightly Docker images (#20001) * use nightly * switch ml cpu to ray cpu * fix * add pytest * add more pytest * add constraint * add tensorflow * fix merge conflict * add tblib * fix * add back uninstall	2021-11-10 18:00:16 -08:00
Amog Kamsetty	3408b60d2b	[Release] Refactor User Tests (#20028) * wip * add directory * wip * try again * Revert "try again" This reverts commit 82d33ccea6f92848df025e019b87df73cea49e5d. * finish * formatting * fix merge * fix path * chmod * check * sudo * wip * update * fix horovod * try * typo * reduce num workers	2021-11-05 17:28:37 -07:00
Amog Kamsetty	f4b425f84c	[Release/Xgboost] Fix master install (#19991)	2021-11-02 13:50:14 -07:00
Kai Fricke	f96078687f	[xgboost/release] Xgboost/connect gpu test (#19838) * [xgboost/release] Add GPU connect user test * Use scaling cluster * typo * Increase xgboost placement group timeout * Much higher timeout * Move os environment timeout * Move os environ * [dev] install xgboost-ray from master * GPU xgboost master * Remove master install after new xgboost release * Install latest * Add master test	2021-11-02 08:40:48 -07:00
Antoni Baum	e9df253f5d	[CI/docs] Remove [default] from xgboost-ray (#19186) Co-authored-by: Kai Fricke <kai@anyscale.com>	2021-10-14 16:29:55 +01:00
Kai Fricke	e08d4253cf	[ci/release] Start cluster before connecting via anyscale connect (#18878)	2021-09-24 16:17:06 +01:00
Antoni Baum	eeb67a42cc	pip install xgboost_ray -> xgboost_ray[default] (#18607) Co-authored-by: Kai Fricke <kai@anyscale.com>	2021-09-15 14:45:56 +01:00
Kai Fricke	15a83d104d	[ci/release] remove legacy release tests (#18592)	2021-09-15 14:42:58 +01:00
Kai Fricke	b543c0e923	[ci] Do not use anyscale connect for xgboost_tests/train_small (#18569)	2021-09-13 20:38:00 +01:00
Kai Fricke	7d1e6d3129	[ci/release] Add sanity check for ray wheels hash to release tests (#18489)	2021-09-10 17:50:31 +01:00
Antoni Baum	2c0dcec18f	[test] Fix golden notebook tests always failing (#17873)	2021-08-31 17:07:47 +02:00
Antoni Baum	0a1228ef6e	Add configurable autosuspend for connect tests (#17958)	2021-08-20 10:57:41 +02:00
Clark Zinzow	d958457d07	[Core] Second pass at privatizing APIs. (#17885) * gcs_utils * resource_spec * profiling * ray_perf and ray_cluster_perf * test_utils	2021-08-18 20:56:33 -07:00
Kai Fricke	8580e450cb	[release] update/unify base images (#17859)	2021-08-16 12:44:25 +02:00
mwtian	7669708237	Create a wait_for_num_nodes() function, and use it in `train_small` (#16784)	2021-07-01 10:17:53 +01:00
Kai Fricke	ef97bdd407	[release] Fix app config: Install latest releases. Bump xgboost-ray version (#16581)	2021-06-24 12:56:21 +01:00
mwtian	48599aef9e	Roll forward to run train_small in client mode. (#16610)	2021-06-23 08:52:08 +01:00
Kai Fricke	aecc4c8d28	[release] fix sgd base image, microbenchmark timeout, revert xgboost train_small to not use connect (#16532)	2021-06-18 11:40:04 +01:00
Kai Fricke	9352cb781c	[release tests] Fix microbenchmark base image, network overhead cluster wait time, add long running tests (#16355)	2021-06-16 21:37:17 +01:00
mwtian	2f7d535253	[Test] Use Ray client in XGBoost train_small release test (#16319)	2021-06-16 14:39:32 +01:00
Kai Fricke	153a8b8fec	[release] convert tune release tests (#15913)	2021-06-01 11:19:15 -07:00
Kai Fricke	8db2e5c23a	[release] Move xgboost tune small + microbenchmark release test to new release automation (#15619)	2021-05-08 20:38:39 +01:00
Kai Fricke	1d52ab819f	[release] release 1.3.0 results and test updates (#15366) Convert a number of release tests and add logs for release 1.3.0	2021-05-04 22:10:04 +01:00
Kai Fricke	7364a7a327	[tune] Move Optuna to ask(fixed_distributions) interface (#14731) Adjusting to changes in Optuna 2.6.0. Old interface was marked as deprecated.	2021-03-22 12:25:37 +01:00
Ian Rodney	eb12033612	[Code Cleanup] Switch to use ray.util.get_node_ip_address() (#14741) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2021-03-18 13:10:57 -07:00
Kai Fricke	a0f73cf3f7	[xgboost] Update XGBoost release test configs (#13941) * Update XGBoost release test configs * Use GPU containers * Fix elastic check * Use spot instances for GPU * Add debugging output * Fix success check, failure checking, outputs, sync behavior * Update release checklist, rename mounts	2021-02-17 23:00:49 +01:00
Kai Fricke	1e113d2e6e	[tune/xgboost] Update release test docs (#13880) * Update release test docs * Update	2021-02-04 13:10:56 +01:00
Ameer Haj Ali	b7dd7ddb52	deprecate useless fields in the cluster yaml. (#13637) * prepare for head node * move command runner interface outside _private * remove space * Eric * flake * min_workers in multi node type * fixing edge cases * eric not idle * fix target_workers to consider min_workers of node types * idle timeout * minor * minor fix * test * lint * eric v2 * eric 3 * min_workers constraint before bin packing * Update resource_demand_scheduler.py * Revert "Update resource_demand_scheduler.py" This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5. * reducing diff * make get_nodes_to_launch return a dict * merge * weird merge fix * auto fill instance types for AWS * Alex/Eric * Update doc/source/cluster/autoscaling.rst * merge autofill and input from user * logger.exception * make the yaml use the default autofill * docs Eric * remove test_autoscaler_yaml from windows tests * lets try changing the test a bit * return test * lets see * edward * Limit max launch concurrency * commenting frac TODO * move to resource demand scheduler * use STATUS UP TO DATE * Eric * make logger of gc freed refs debug instead of info * add cluster name to docker mount prefix directory * grrR * fix tests * moving docker directory to sdk * move the import to prevent circular dependency * smallf fix * ian * fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running * small fix * deflake test_joblib * lint * placement groups bypass * remove space * Eric * first ocmmit * lint * exmaple * documentation * hmm * file path fix * fix test * some format issue in docs * modified docs * joblib strikes again on windows * add ability to not start autoscaler/monitor * a * remove worker_default * Remove default pod type from operator * Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types * deprecate useless fields Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan> Co-authored-by: Alex Wu <alex@anyscale.io> Co-authored-by: Alex Wu <itswu.alex@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local> Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal> Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>	2021-01-23 12:06:51 -08:00
Kai Fricke	8804758409	[xgboost] Add XGBoost release tests (#13456) * Add XGBoost release tests * Add more xgboost release tests * Use failure state manager * Add release test documentation * Fix wording * Automate fault tolerance tests	2021-01-20 18:40:23 +01:00

43 commits