hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Kai Fricke	6c5229295e	[ci/release] Support running tests with different python versions (#24843 ) OSS release tests currently run with hardcoded Python 3.7 base. In the future we will want to run tests on different python versions. This PR adds support for a new `python` field in the test configuration. The python field will determine both the base image used in the Buildkite runner docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments. Note that in Buildkite, we will still only wait for the python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait maybe for 5-10 more minutes.	2022-05-17 17:03:12 +01:00
Artur Niederfahrenhorst	fb2915d26a	[RLlib] Replay Buffer API and Ape-X. (#24506 )	2022-05-17 13:43:49 +02:00
Sven Mika	0cd7bc4054	[RLlib] Re-establish dashboard performance tests. (#24728 )	2022-05-16 13:13:49 +02:00
Sven Mika	70d3bfcf9c	[RLlib] Provide more time for APPO Pong release and performance tests. (#24503 )	2022-05-05 18:19:38 +02:00
Sven Mika	b48f63113b	[RLlib] SlateQ fixes: Release learning tests wrong yaml structure + TD-error torch issue (#24429 )	2022-05-04 13:37:14 +02:00
Sven Mika	f066180ed5	[RLlib] Deprecate `timesteps_per_iteration` config key (in favor of `min_[sample|train]_timesteps_per_reporting`). (#24372 )	2022-05-02 12:51:14 +02:00
Sven Mika	3052193c9e	[RLlib] Fix CQL getting stuck when deprecated `timesteps_per_iteration` is used (use `min_train_timesteps_per_reporting` instead). (#24345 ) CQL does not perform sampling timesteps, and the deprecated timesteps_per_iteration is automatically translated into the new min_sample_timesteps_per_reporting, but it should be translated (only for CQL and other purely offline RL algos) into min_train_timesteps_per_reporting. If timesteps_per_iteration is used, CQL never leaves the first iteration, as it thinks it's not done yet (sample timesteps always remain at 0).	2022-04-29 21:02:34 +01:00
Kai Fricke	65d9a410f7	[ci] Clean up ci/ directory (refactor ci/travis) (#23866 ) Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories. Details: - Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc. - Minor adjustments to some scripts (variable renames) - Removes the outdated (unused) asan tests	2022-04-13 18:11:30 +01:00
Avnish Narayan	fdc6e02c29	[RLlib; testing] Move `num_workers` to RLlib config (#23750 )	2022-04-06 20:06:48 +02:00
Avnish Narayan	161d95c31b	[RLlib] Increase slateq workers to decrease runtime on prod (#23609 )	2022-03-30 17:38:21 -07:00
Artur Niederfahrenhorst	9a64bd4e9b	[RLlib] Simple-Q uses training iteration fn (instead of execution_plan); ReplayBuffer API for Simple-Q (#22842 )	2022-03-29 14:44:40 +02:00
Sven Mika	22c9c4aa39	[RLlib] Slate-Q +GPU torch bug fix. (#23464 )	2022-03-24 17:39:33 +01:00
Avnish Narayan	9040f54060	[RLlib] Pin Gym Everywhere and turn off gpu for recsim tests (#23452 )	2022-03-24 09:17:30 +01:00
Avnish Narayan	754bcd16f8	[rllib] Pin gym everywhere (#23384 ) This PR pins gym in the app config.yaml files for rllib and tune so that release tests are no longer broken by the new gym version.	2022-03-22 09:44:22 +00:00
Avnish Narayan	6c20e9d898	[RLlib] Change the slateq regression learning test with GPU to use torch only (#23168 )	2022-03-16 09:15:59 +01:00
Kai Fricke	8608b64885	[ci/release] Remove old OSS release test infrastructure (#23134 ) Now that we've migrated all OSS release tests to the new infrastructure, we can remove old config files and infra scripts.	2022-03-14 15:10:52 +00:00
Sven Mika	7b687e6cd8	[RLlib] SlateQ: Add a hard-task learning test to weekly regression suite. (#22544 )	2022-02-25 21:58:16 +01:00
Jun Gong	04dd536987	[Release tests] Disable A3C CI tests on torch for now. Also extend performance_test deadline to 3hrs. (#22426 )	2022-02-16 13:06:09 +01:00
Jun Gong	cbd24503b6	[RLlib] Add A3C to RLlib performance regression tests. (#22316 )	2022-02-11 21:18:53 +01:00
Sven Mika	04a5c72ea3	Revert "Revert "[RLlib] Speedup A3C up to 3x (new training_iteration function instead of execution_plan) and re-instate Pong learning test."" (#18708 )	2022-02-10 13:44:22 +01:00
Alex Wu	b122f093c1	Revert "[RLlib] Speedup A3C up to 3x (new `training_iteration` function instead of `execution_plan`) and re-instate Pong learning test." (#22250 ) Reverts ray-project/ray#22126 Breaks rllib:tests/test_io	2022-02-09 09:26:36 -08:00
Sven Mika	ac3e6ab411	[RLlib] Speedup A3C up to 3x (new `training_iteration` function instead of `execution_plan`) and re-instate Pong learning test. (#22126 )	2022-02-08 19:04:13 +01:00
Balaji Veeramani	7f1bacc7dc	[CI] Format Python code with Black (#21975 ) See #21316 and #21311 for the motivation behind these changes.	2022-01-29 18:41:57 -08:00
SangBin Cho	b1308b1c8c	[Test Infra] Unrevert team col (#21700 ) This fixes the previous problems from the team column revert. It has 2 additional changes: 1. The alert handler receives the team argument, which was the root cause of the breakage (https://github.com/ray-project/ray/pull/21289). 2. Previously, tests without a team column raised an exception; the condition is now weaker (warning logs). This will eventually change back to raising an exception, but for a smoother transition a warning is logged instead for a short time.	2022-01-19 13:29:53 -08:00
Jun Gong	1315293dd8	[RLlib] Fix offline RL(BC & MARWIL) weekly learning tests. (#21643 )	2022-01-18 09:29:01 +01:00
Jun Gong	7517aefe05	[RLlib] Bring back BC and Marwil learning tests. (#21574 )	2022-01-14 14:35:32 +01:00
Jun Gong	83955a9407	[RLlib] Extend CQL perf test to 1hr. (#21449 )	2022-01-07 11:35:16 +01:00
mwtian	0b3fed5ef3	Revert "[Nightly Test] Add a team column to each test config. (#21198 )" (#21289 ) This reverts commit `b5b11b2d06`.	2021-12-30 06:44:51 +09:00
SangBin Cho	b5b11b2d06	[Nightly Test] Add a team column to each test config. (#21198 ) Please review e2e.py and the test_suite belonging to your team! This is the first part of https://docs.google.com/document/d/16IrwerYi2oJugnRf5hvzukgpJ6FAVEpB6stH_CiNMjY/edit# This PR adds a team name to each test suite. If the name is not specified, it will be reported as unspecified. If you are running a local test and the new test suite doesn't have a team name specified, it will raise an exception (this way, we can avoid missing team names in the future). Note that we will aggregate all test configs into a single file, nightly_test.yaml.	2021-12-27 14:42:41 -08:00
Jun Gong	767f78eaf8	[RLlib] Always attach latest eval metrics. (#21011 )	2021-12-15 11:42:53 +01:00
gjoliver	724a140795	[rllib] Make sure json can serialize result dict (#20439 ) We may have fields in the result dict that are or None. Make sure our results are json serializable.	2021-11-17 10:27:00 -08:00
Amog Kamsetty	18dcf1ac25	[Release] Use nightly Docker images (#20001 ) * use nightly * switch ml cpu to ray cpu * fix * add pytest * add more pytest * add constraint * add tensorflow * fix merge conflict * add tblib * fix * add back uninstall	2021-11-10 18:00:16 -08:00
gjoliver	b6b4aaa632	[Release] Fix stress_tests (#20233 )	2021-11-10 16:05:46 -08:00
gjoliver	d8a61f801f	[RLlib] Create a set of performance benchmark tests to run nightly. (#19945 ) * Create a core set of algorithms tests to run nightly. * Run release tests under tf, tf2, and torch frameworks. * Fix * Add eager_tracing option for tf2 framework. * make sure core tests can run in parallel. * cql * Report progress while running nightly/weekly tests. * Include SAC in nightly lineup. * Revert changes to learning_tests * rebrand to performance test. * update build_pipeline.py with new performance_tests name. * Record stats. * bug fix, need to populate experiments dict. * Alphabetize yaml files. * Allow specifying frameworks. And do not run tf2 by default. * remove some debugging code. * fix * Undo testing changes. * Do not run CQL regression for now. * LINT. Co-authored-by: sven1977 <svenmika1977@gmail.com>	2021-11-08 18:15:13 +01:00
Amog Kamsetty	3408b60d2b	[Release] Refactor User Tests (#20028 ) * wip * add directory * wip * try again * Revert "try again" This reverts commit 82d33ccea6f92848df025e019b87df73cea49e5d. * finish * formatting * fix merge * fix path * chmod * check * sudo * wip * update * fix horovod * try * typo * reduce num workers	2021-11-05 17:28:37 -07:00
gjoliver	2c1fa459d4	[RLlib] Add an RLlib Tune experiment to UserTest suite. (#19807 ) * Add an RLlib Tune experiment to UserTest suite. * Add ray.init() * Move example script to example/tune/, so it can be imported as module. * add __init__.py so our new module will get included in python wheel. * Add block device to RLlib test instances. * Reduce disk size a little bit. * Add metrics reporting * Allow max of 5 workers to accommodate all the worker tasks. * revert disk size change. * Minor updates * Trigger build * set max num workers * Add a compute cfg for autoscaled cpu and gpu nodes. * use 1gpu instance. * install tblib for debugging worker crashes. * Manually upgrade to pytorch 1.9.0 * -y * torch=1.9.0 * install torch on driver * bump timeout * Write a more informational result dict. * Revert changes to compute config files that are not used. * add smoke test * update * reduce timeout * Reduce the # of env per worker to 1. * Small fix for getting trial_states * Trigger build * simply result dict * lint * more lint * fix smoke test Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>	2021-11-03 17:04:27 -07:00
xwjiang2010	ab15dfd478	[Tune release test] Set 500G disk space for rllib_tests. (#19730 )	2021-10-26 10:12:03 -07:00
gjoliver	e9f66cc394	Reduce success criteria for a few learning tests. (#19484 )	2021-10-18 15:44:38 -07:00
Sven Mika	5611150b1a	Increase rllib stress tests timeout for smoke test (#18810 )	2021-09-22 14:30:42 +01:00
Sven Mika	e6aae61487	[RLlib; testing] Fix bug in stress tests not handling >1 trials per experiment (due to grid-search in IMPALA stress tests). (#18705 )	2021-09-20 15:31:57 +02:00
Sven Mika	ba1c489b79	[RLlib Testing] Lower `--smoke-test` "time_total_s" to make sure it doesn't time out. (#18670 )	2021-09-16 18:22:23 +02:00
gjoliver	df32ed35fd	Extend --smoke-test deadlines for learning and stress regression tests. (#18667 )	2021-09-16 09:18:39 +01:00
Kai Fricke	c8188ea70e	[ci/rllib] wait for stress test cluster (#18603 )	2021-09-14 19:01:22 +01:00
Sven Mika	08c09737fa	[RLlib] Fix R2D2 (torch) multi-GPU issue. (#18550 )	2021-09-14 19:58:10 +02:00
gjoliver	2924afa41e	[Release] Create soft links for libcusolver.so.10 as a temporary fix. (#18562 ) Co-authored-by: Jun Gong <jungong@anyscale.com>	2021-09-13 14:37:12 -07:00
Kai Fricke	7d1e6d3129	[ci/release] Add sanity check for ray wheels hash to release tests (#18489 )	2021-09-10 17:50:31 +01:00
Simon Mo	6d24214085	[Release] Make sure to uninstall ray for rllib_tests (#18448 )	2021-09-08 23:29:40 +01:00
gjoliver	50cdf551ce	[RLlib] Fix test name typo. (#18423 ) Co-authored-by: Jun Gong <jungong@mbpro.local>	2021-09-08 23:30:37 +02:00
Sven Mika	cabaa3b3c6	[RLlib Testing] Add A3C/APPO/BC/DDPPO/MARWIL/CQL/ES/ARS/TD3 to weekly learning tests. (#18381 )	2021-09-07 11:48:41 +02:00
Sven Mika	5292b70fc6	[RLlib] Add multi-GPU attention net tests to nightly test suite (+ R2D2 tests for LSTM and attention nets). (#18368 )	2021-09-06 17:48:05 +02:00

77 commits