Move debug_state.txt to the log directory. This makes it easier to find debug_state.txt from the dashboard. See below.
Add debug_state_gcs.txt, which displays the GCS debug state. The GCS also dumps its debug state to this file every 10 seconds.
Periodic printing of the debug state now happens every 1 minute instead of every 10 seconds, because printing every 10 seconds is usually very spammy.
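For reference, a minimal sketch (assuming Ray's default `/tmp/ray/session_latest/logs` log directory) of reading the two debug state files:

```python
import os

# Assumed default Ray log directory; the actual session path may differ.
LOG_DIR = "/tmp/ray/session_latest/logs"

# debug_state.txt now lives in the log directory; debug_state_gcs.txt is the
# new file the GCS dumps its debug state into every 10 seconds.
for name in ("debug_state.txt", "debug_state_gcs.txt"):
    path = os.path.join(LOG_DIR, name)
    if os.path.exists(path):
        print(f"===== {name} =====")
        with open(path) as f:
            print(f.read())
```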
- Remove the scale_to logic from the object store tests. We don't need to scale during these tests, and removing it helps disambiguate infra failures from app failures.
- Run the microbenchmark in core nightly, meaning it will run even more often.
- Run the weekly scalability tests daily instead (they are not too expensive).
- Run some core daily tests separately to avoid infra failures.
This PR mostly implements a "fixture" for nightly tests. Note that the current fixture implementation is not that great, and we can probably improve it in the future after refactoring e2e.py.
Instead of wrapping the whole training run in a remote call, we only query the files on the node in a remote call. XGBoost-Ray is then started from the local node.
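A rough sketch of the pattern (the helper name and path are illustrative, not the actual test code):

```python
import os

import ray

ray.init(address="auto")

# Illustrative helper: the remote task only inspects files on the node;
# it does not run the training itself.
@ray.remote
def list_files_on_node(path: str):
    return sorted(os.listdir(path))

files = ray.get(list_files_on_node.remote("/tmp"))  # path is a placeholder
print(files)

# XGBoost-Ray training is then started from the local (driver) node,
# e.g. by calling xgboost_ray.train(...) here instead of inside a remote task.
```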
## Why are these changes needed?
`base_image: "anyscale/ray-ml:pinned-nightly-py37"` doesn't exist anymore, which fails a lot of nightly tests. Change it to `base_image: "anyscale/ray-ml:nightly-py37-gpu"`.
The ray-ml image depends on numpy~=1.19.2 via the tensorflow==2.6 requirement. Unfortunately, that's incompatible with Dataset (see #20258 (comment)).
This PR upgrades the numpy dependency only for the nightly test.
This should fix the long-running release tests that are failing to build their app configs.
It seems like `pip install ray[all]` now downgrades the Ray version. It's unclear why, but most likely a dependency now pins the Ray version. This PR explicitly installs the version of Ray that we want after `pip install ray[all]` to fix the problem.
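The ordering of the fix is roughly the following (the version string is a placeholder, not the actual pinned wheel):

```python
import subprocess
import sys

# Placeholder: in the real app config this would be the Ray version the
# release test is supposed to run against.
WANTED_RAY_VERSION = "<desired-ray-version>"

# `pip install ray[all]` may pull in a dependency that pins an older ray,
# silently downgrading it ...
subprocess.check_call([sys.executable, "-m", "pip", "install", "ray[all]"])

# ... so the wanted Ray version is explicitly (re)installed afterwards.
subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "-U", f"ray=={WANTED_RAY_VERSION}"]
)
```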
XGBoost's train_small test timed out because of a CPU borrowing feature related to placement groups. The root bug will be fixed in the coming weeks, but this PR makes the release test pass consistently by requesting 0 CPUs for the remote wrapper script.
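The workaround looks roughly like this (the wrapper name and body are illustrative):

```python
import ray

ray.init(address="auto")

# Request 0 CPUs for the remote wrapper so it doesn't occupy a CPU that the
# placement group created by XGBoost-Ray would otherwise need to borrow.
@ray.remote(num_cpus=0)
def train_wrapper():
    # Placeholder: the real wrapper launches the xgboost train_small workload.
    pass

ray.get(train_wrapper.remote())
```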
## Why are these changes needed?
In the past, there was a regression where placement group creation time got slower over time. I believe the issue is fixed on master, but this PR verifies that it is actually fixed.
This PR adds a long running test for placement groups. The test has two purposes:
1. Make sure placement group creation / removal doesn't get slower over time. The test measures the P50 creation time over the first 20 iterations and then runs for many more iterations. After all iterations, it checks that the P50 creation time is not too slow compared to the initial round (see the sketch after this list).
2. Make sure placement group creation / removal works consistently for a long time without issues.
Q: Should we make it a real long running test (one that runs for a day)?
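A minimal sketch of the timing check (iteration counts and the slowdown threshold are illustrative, not the actual test parameters):

```python
import statistics
import time

import ray

ray.init(address="auto")

def create_and_remove_pg():
    """Create a small placement group, wait until it's ready, then remove it."""
    pg = ray.util.placement_group([{"CPU": 1}], strategy="PACK")
    ray.get(pg.ready())
    ray.util.remove_placement_group(pg)

WARMUP_ITERS = 20     # baseline window from the first iterations
TOTAL_ITERS = 1000    # the real test runs for much longer
MAX_SLOWDOWN = 2.0    # assumed tolerance vs. the baseline P50

durations = []
for _ in range(TOTAL_ITERS):
    start = time.perf_counter()
    create_and_remove_pg()
    durations.append(time.perf_counter() - start)

baseline_p50 = statistics.median(durations[:WARMUP_ITERS])
final_p50 = statistics.median(durations[-WARMUP_ITERS:])
assert final_p50 <= baseline_p50 * MAX_SLOWDOWN, (baseline_p50, final_p50)
```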
## Why are these changes needed?
The .boto files are already added to the base image and ACL'ed to root; adding them again during the app config build causes permission issues.
* [xgboost] Fix release test app configs
* Revert full app config
* Update base docker image
* Only change cpu base image
* default
* Pin xgboost to 1.5 in cpu tests
* Remove numpy hack
* Revert one line
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
* use nightly
* switch ml cpu to ray cpu
* fix
* add pytest
* add more pytest
* add constraint
* add tensorflow
* fix merge conflict
* add tblib
* fix
* add back uninstall