This PR does two things:
- Merge the latest groupby-based filtering into CUJ2.
- Add a debug mode so we run only the dummy trainer, to measure data processing performance (a sketch of the idea follows).
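Here is a minimal sketch of the debug-mode idea, assuming a hypothetical `--debug` flag and `DummyTrainer` class; the actual CUJ2 script likely differs:

```python
import argparse
import time

class DummyTrainer:
    """Consumes data without training, so timing reflects data processing only."""

    def fit(self, shards):
        start = time.perf_counter()
        for shard in shards:
            for _ in shard:  # force the data pipeline to actually produce rows
                pass
        print(f"Data processing took {time.perf_counter() - start:.2f}s")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true",
                        help="run only the dummy trainer to benchmark data processing")
    args = parser.parse_args()

    # Stand-in for the real dataset shards used by the training job.
    shards = [range(1_000_000) for _ in range(4)]
    if args.debug:
        DummyTrainer().fit(shards)
    else:
        raise SystemExit("real training path omitted in this sketch")
```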
- Remove scale_to logic from the object store. We don't need to scale during tests, which will help disambiguate infra failures from app failures.
- Run the microbenchmark in core nightly, meaning it will run even more often.
- Run weekly scalability tests daily instead (they are not too expensive).
- Run some core daily tests separately to avoid infra failures.
## Why are these changes needed?
In the past, there was a regression where placement group creation time got slower over time. I believe the issue is fixed on master, but this PR verifies that it is actually fixed.
This PR adds a long running test for placement groups. The test has two purposes:
- Make sure placement group creation / removal doesn't get slower over time. The test measures the P50 creation time over the first 20 iterations, then runs for many more iterations. After all iterations, it checks that the P50 creation time is not much slower than that of the initial round (see the sketch below).
- Make sure placement group removal / creation works consistently for a long time without issues.
Q: Should we make it a real long-running test (one that runs for a day)?
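The timing check could look roughly like this minimal sketch; the iteration counts and slowdown tolerance are illustrative, not the test's actual values:

```python
import time
from statistics import median

import ray
from ray.util.placement_group import placement_group, remove_placement_group

# Illustrative parameters; the real test's values may differ.
WARMUP_ITERS = 20
TOTAL_ITERS = 2000
SLOWDOWN_TOLERANCE = 2.0  # fail if P50 creation time degrades by more than 2x

ray.init()
latencies = []
for _ in range(TOTAL_ITERS):
    start = time.perf_counter()
    pg = placement_group([{"CPU": 1}], strategy="PACK")
    ray.get(pg.ready())  # block until the placement group is actually created
    latencies.append(time.perf_counter() - start)
    remove_placement_group(pg)

baseline_p50 = median(latencies[:WARMUP_ITERS])
final_p50 = median(latencies[-WARMUP_ITERS:])
assert final_p50 < baseline_p50 * SLOWDOWN_TOLERANCE, (
    f"P50 creation time regressed: {baseline_p50:.3f}s -> {final_p50:.3f}s"
)
```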
* Create a core set of algorithms tests to run nightly.
* Run release tests under tf, tf2, and torch frameworks.
* Fix
* Add eager_tracing option for tf2 framework.
* make sure core tests can run in parallel.
* cql
* Report progress while running nightly/weekly tests.
* Include SAC in nightly lineup.
* Revert changes to learning_tests
* rebrand to performance test.
* update build_pipeline.py with new performance_tests name.
* Record stats.
* bug fix, need to populate experiments dict.
* Alphabetize yaml files.
* Allow specifying frameworks. And do not run tf2 by default.
* remove some debugging code.
* fix
* Undo testing changes.
* Do not run CQL regression for now.
* LINT.
Co-authored-by: sven1977 <svenmika1977@gmail.com>
* Add an RLlib Tune experiment to UserTest suite.
* Add ray.init()
* Move example script to example/tune/, so it can be imported as module.
* add __init__.py so our new module will get included in python wheel.
* Add block device to RLlib test instances.
* Reduce disk size a little bit.
* Add metrics reporting
* Allow max of 5 workers to accommodate all the worker tasks.
* revert disk size change.
* Minor updates
* Trigger build
* set max num workers
* Add a compute cfg for autoscaled cpu and gpu nodes.
* use 1gpu instance.
* install tblib for debugging worker crashes.
* Manually upgrade to PyTorch 1.9.0
* -y
* torch=1.9.0
* install torch on driver
* bump timeout
* Write a more informative result dict.
* Revert changes to compute config files that are not used.
* add smoke test
* update
* reduce timeout
* Reduce the # of envs per worker to 1.
* Small fix for getting trial_states
* Trigger build
* simplify result dict
* lint
* more lint
* fix smoke test
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
* [xgboost/release] Add GPU connect user test
* Use scaling cluster
* typo
* Increase xgboost placement group timeout
* Much higher timeout
* Move os environment timeout
* Move os environ
* [dev] install xgboost-ray from master
* GPU xgboost master
* Remove master install after new xgboost release
* Install latest
* Add master test
## Why are these changes needed?
We are concerned that gRPC-based broadcasting might have a negative impact on placement group related workloads. This test ensures it runs well before merging.
## Related issue number
#19438