Removes all ML-related code from `ray.util`
Removes:
- `ray.util.xgboost`
- `ray.util.lightgbm`
- `ray.util.horovod`
- `ray.util.ray_lightning`
Moves `ray.util.ml_utils` to other locations
Closes #23900
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
Uses the new AIR Train API for examples and tests.
The `Result` object gets a new attribute, `log_dir`, pointing to the Trial's `logdir`, allowing users to access TensorBoard logs and artifacts of other loggers.
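A minimal sketch of how the new attribute could be used (AIR alpha-era API; the trainer import and arguments are assumptions and may differ across versions):

```
from ray.train.torch import TorchTrainer

trainer = TorchTrainer(
    train_loop_per_worker=lambda: None,  # placeholder training loop
    scaling_config={"num_workers": 2},   # dict form used by early AIR versions
)
result = trainer.fit()

# New: Result.log_dir points at the Trial's logdir, where TensorBoard
# event files and other logger artifacts live.
print(result.log_dir)
```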
This PR only deals with the "low hanging fruit": tests that need substantial rewriting and the Train user guide are not touched. Those will be updated in follow-up PRs.
Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled.
Requires https://github.com/ray-project/ray/pull/25943 to be merged first.
The error message suggests:
```
Wait timeout after 30 seconds for key(s): 0. You may want to increase the timeout via HOROVOD_GLOO_TIMEOUT_SECONDS
```
Bumped the timeout up to 120 seconds.
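The environment variable named in the error message controls Horovod's Gloo rendezvous timeout, so the bump is a one-liner:

```
import os

# Raise Horovod's Gloo rendezvous timeout from the default 30 s to 120 s.
os.environ["HOROVOD_GLOO_TIMEOUT_SECONDS"] = "120"
```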
Tests run successfully: https://buildkite.com/ray-project/release-tests-pr/builds/6906
This fixes the mysterious error where all cluster env builds fail when `pip uninstall` / `pip install` are written on two separate lines. The root cause will be fixed later.
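A hedged sketch of the workaround pattern (the `post_build_cmds` field name is assumed from Anyscale app configs; the package is illustrative):

```
app_config = {
    "post_build_cmds": [
        # "pip uninstall -y ray",   # two separate lines trigger the build error
        # "pip install -U ray",
        "pip uninstall -y ray && pip install -U ray",  # chained into one line
    ],
}
```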
OSS release tests currently run with a hardcoded Python 3.7 base image. In the future we will want to run tests on different Python versions.
This PR adds support for a new `python` field in the test configuration. The `python` field determines both the base image used in the Buildkite runner Docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments.
Note that in Buildkite, we will still only wait for the Python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will only wait maybe 5-10 more minutes.
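A hedged sketch of what a test entry with the new field could look like (the schema is illustrative, not the exact release-test configuration format):

```
import yaml

test = yaml.safe_load("""
- name: example_release_test
  python: "3.8"  # selects the runner image and the cluster env base image
""")[0]

# Tests without the field keep the current 3.7 default.
python_version = test.get("python", "3.7")
```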
The local environment setup of release tests (in client tests) can sometimes update dependencies of the `anyscale` package to an unsupported version. By re-installing the `anyscale` package after local env setup, we make sure that we can connect to the cluster. Note, however, that this may lead to incompatibilities with the test script.
For debugging client environments, it is helpful to print the installed pip packages.
Additionally, a fix for the environment of the `ml_user_tune_rllib_connect_test` is added, and anyscale import errors are reported verbosely to help debug missing packages.
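A minimal sketch of the three debugging aids described above (the actual tooling code differs; this is illustrative):

```
import subprocess
import sys

# 1. Re-install the anyscale package after local env setup so dependency
#    upgrades cannot break cluster connectivity.
subprocess.check_call([sys.executable, "-m", "pip", "install", "anyscale"])

# 2. Print the installed packages to make client-environment issues visible.
subprocess.check_call([sys.executable, "-m", "pip", "freeze"])

# 3. Report anyscale import errors verbosely to help debug missing packages.
try:
    import anyscale  # noqa: F401
except ImportError as e:
    print(f"Could not import anyscale: {e!r}", file=sys.stderr)
    raise
```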
In xgboost 1.6, support for older GPU architectures was removed (dmlc/xgboost#7767).
This PR updates the instance types used in our xgboost-ray gpu release tests to use Volta GPUs instead of Kepler GPUs so that xgboost-ray can run successfully with xgboost v1.6.
Closes #24048
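For reference, a hedged sketch of the instance-type swap (on AWS, Kepler K80s ship with p2 instances and Volta V100s with p3 instances; the compute-config field names are illustrative):

```
worker_node = {
    # "instance_type": "p2.xlarge",  # K80 (Kepler), unsupported by xgboost >= 1.6
    "instance_type": "p3.2xlarge",   # V100 (Volta)
}
```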
`horovod_user_test_master` is failing with the recent horovod release ([link](https://buildkite.com/ray-project/periodic-ci/builds/2960#61dabda8-eea0-4b7b-93bf-9e341926d3fd)).
The error message says:
```
AttributeError: Can't get attribute '_ExecutorDriver' on <module 'horovod.ray.runner' from '/home/ray/anaconda3/lib/python3.7/site-packages/horovod/ray/runner.py'>
```
The horovod test is set up in such a way that it has a "driver" (a.k.a. client) part, which is the code that runs on a Buildkite agent, and a "cluster" (a.k.a. server) part, which runs in an Anyscale cluster. The driver's dependencies are specified by `release/ml_user_tests/horovod/driver_setup_master.sh`, while the cluster's dependencies are specified by `release/horovod_tests/app_config_master.yaml`.
The two communicate via the Anyscale client.
The above error message is complaining that while the client's horovod has `_ExecutorDriver` in `runner.py`, the server's horovod doesn't. This is due to a version mismatch between the above two files. This PR brings the two horovod dependencies in sync by pointing both at horovod master.
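A hedged sketch of a sanity check that would surface this class of mismatch (the cluster address is hypothetical):

```
import ray
import horovod

ray.init(address="anyscale://example-cluster")  # hypothetical address

@ray.remote
def cluster_horovod_version() -> str:
    import horovod
    return horovod.__version__

# The _ExecutorDriver error above is a symptom of the two sides running
# different horovod builds; the versions should match.
assert horovod.__version__ == ray.get(cluster_horovod_version.remote())
```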
Passing tests: https://buildkite.com/ray-project/periodic-ci/builds/2560#_
Add an echo timestamp to the post-build commands of the ray_lightning release tests to trigger a cluster env rebuild and get the latest version of ray_lightning. Without this, the cluster env gets cached, so an outdated version is installed on the cluster that differs from the one on the driver, resulting in the failures below.
Closes #21871. Closes #21863.
Also reinstalls the dependencies in the post-build commands so old versions are not cached in the Docker images.
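A hedged sketch of the cache-busting pattern (the `post_build_cmds` field name is assumed from Anyscale app configs; the pin is illustrative):

```
from datetime import datetime, timezone

app_config = {
    "post_build_cmds": [
        # A timestamp baked into the config makes its content unique, so the
        # cluster env is rebuilt instead of being served from the cache.
        f"echo {datetime.now(timezone.utc).isoformat()}",
        "pip uninstall -y ray_lightning && pip install -U ray_lightning",
    ],
}
```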
This fixes the previous problems from the team column revert.
It has two additional changes:
- The alert handler now receives the team argument, which was the root cause of the breakage: https://github.com/ray-project/ray/pull/21289
- Previously, tests without a team column raised an exception, but I made the condition weaker (warning logs). I will eventually change it back to raising an exception, but for a smoother transition we will log a warning instead for a short time.
Please review **e2e.py and test_suite belonging to your team**!
This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit#
This PR adds a team name to each test suite.
If the name is not specified, it will be reported as "unspecified".
If you are running a local test and the new test suite doesn't have a team name specified, it will raise an exception (this way, we can avoid missing team names in the future).
Note that we will aggregate all test configs into a single file, `nightly_test.yaml`.
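A hedged sketch of the reporting behavior described above (names are illustrative, not the actual release-tooling code):

```
import logging

logger = logging.getLogger(__name__)

def get_team(test: dict, running_locally: bool) -> str:
    team = test.get("team")
    if team:
        return team
    if running_locally:
        # Local runs fail fast so new test suites cannot omit the team name.
        raise ValueError(f"Test suite {test.get('name')} has no team specified.")
    # In CI, warn during the transition period instead of raising.
    logger.warning("Test suite %s has no team; reporting as 'unspecified'.", test.get("name"))
    return "unspecified"
```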
Instead of wrapping the whole training run in a remote call, we only query the files on the node in a remote call. XGBoost-Ray is then started from the local node.
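A hedged sketch of the new structure (the file pattern and training entry point are placeholders):

```
import ray

@ray.remote
def list_data_files(pattern: str) -> list:
    # Only the file lookup runs on the remote node now.
    import glob
    return sorted(glob.glob(pattern))

files = ray.get(list_data_files.remote("/data/*.parquet"))  # hypothetical path
# train_xgboost(files)  # XGBoost-Ray is then started from the local node
```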
* [xgboost] Fix release test app configs
* Revert full app config
* Update base docker image
* Only change cpu base image
* default
* Pin xgboost to 1.5. in cpu tests
* Remove numpy hack
* Revert one line
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
* use nightly
* switch ml cpu to ray cpu
* fix
* add pytest
* add more pytest
* add constraint
* add tensorflow
* fix merge conflict
* add tblib
* fix
* add back uninstall