hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
gjoliver	724a140795	[rllib] Make sure json can serialize result dict (#20439 ) We may have fields in the result dict that are or None. Make sure our results are json serializable.	2021-11-17 10:27:00 -08:00
Kai Fricke	05d21497db	[rllib/tune] Fix durable trainable in trainer template, add release test (#20422 )	2021-11-16 20:52:42 +00:00
Amog Kamsetty	7e597814aa	[Release] Fix app config for `horovod_tests` (#20393 ) Fixes `horovod_test` weekly test Closes https://github.com/ray-project/ray/issues/20382	2021-11-16 09:06:42 -08:00
Simon Mo	ca90c63483	[Serve] Add serve failure test to CI (#20392 )	2021-11-16 08:12:08 -08:00
Kai Fricke	693063d6f8	[ci/release] fix exit code (use value, not object) (#20427 )	2021-11-16 15:15:39 +00:00
SangBin Cho	5ec63ccc5f	[Regression test] Placement group long running test (#20251 ) Why are these changes needed? In the past, there was a regression where placement group creation time got slower over time. I believe the issue is fixed in master, but this PR verifies that it is actually fixed. This PR adds a long running test for the placement group. The test has 2 purposes. (1) Make sure placement group creation / removal doesn't get slower over time. The test measures the P50 creation time of the first 20 iterations, then runs for many more iterations. After all iterations, it checks that the P50 creation time is not too slow compared to the initial round. (2) Make sure placement group removal / creation works consistently for a long time without an issue. Q: Should we make it a real long running test? (that runs for a day?)	2021-11-16 04:21:18 -08:00
Yiran Wang	f4e8319eaa	Remove .boto files that are no longer needed during docker build (#20407 ) ## Why are these changes needed? The .boto files are already added to the base image and ACL'ed to root, adding them again during app config build causes permission issues. ## Related issue number	2021-11-15 20:49:33 -08:00
Kai Fricke	d191ad2de8	[ci/release] Return exit codes based on different errors (#20289 )	2021-11-15 19:41:00 +00:00
Kai Fricke	91920f1d02	[release/xgboost] xgboost release test fixes via app config (#20325 ) * [xgboost] Fix release test app configs * Revert full app config * Update base docker image * Only change cpu base image * default * Pin xgboost to 1.5. in cpu tests * Remove numpy hack * Revert one line Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>	2021-11-15 10:03:21 -08:00
matthewdeng	ed3cbe48f5	[train][xgboost][release] fix ml_user_tests using ray client (#20345 )	2021-11-15 15:24:23 +00:00
Kai Fricke	4300039d01	[ci/release] Display commit hash in buildkite overview (#20323 )	2021-11-15 10:09:04 +00:00
SangBin Cho	a4f72c6606	[nightly] Fix pg stress test (#20362 ) ## Why are these changes needed? This was mistakenly added to the nightly. Fixing it. ## Related issue number	2021-11-15 00:17:18 -08:00
SangBin Cho	6cc493079b	[Core] Add Placement group performance test (#20218 ) * in progress * ip * Fix issues * done * Address code review.	2021-11-14 09:17:54 +09:00
matthewdeng	e22632dabc	[train] wrap BackendExecutor in ray.remote() (#20123 ) * [train] wrap BackendExecutor in ray.remote() * wip * fix trainer tests * move CheckpointManager to Trainer * [tune] move force_on_current_node to ml_utils * fix import * force on head node * init ray * split test files * update example * move tests to ray client * address comments * move comment * address comments	2021-11-13 15:30:44 -08:00
Amog Kamsetty	4396419a64	[Release] Fix tune_rllib connect test (#20321 ) * [Release] Fix tune_rllib connect test * use canonical app config	2021-11-13 10:11:20 -08:00
gjoliver	7fe42341ed	[release] Switch many_ppo test to use the canonical rllib app cfg as well. (#20310 )	2021-11-12 20:51:28 -08:00
Simon Mo	b6bd4fd5f3	[Serve] Don't recover from current state checkpoint (#19998 )	2021-11-12 09:02:27 -08:00
Kai Fricke	d88fdd6e38	[tune] refactor SyncConfig (#20155 )	2021-11-12 09:36:15 +00:00
architkulkarni	33f680095d	[Test] [runtime env] Retry wheel urls for up to 2h to give time for Mac wheels to build (#19337 )	2021-11-11 21:48:35 -08:00
Edward Oakes	7c9881b73d	[serve] Fix serve_failure test (#20268 )	2021-11-11 19:19:34 -08:00
Jiajun Yao	992ab3e098	[Release] Commit sanity check when a url is provided (#20255 )	2021-11-11 13:33:58 -08:00
SangBin Cho	9fd8c6648c	[Test] Fix newly added nightly tests, threaded actor + chaos testing (#20220 ) * Fix nightly tests * done * done	2021-11-11 05:01:19 -08:00
SangBin Cho	f3e3c04469	[Nightly test] Make report False by default. (#20238 ) * Make report False by default. * fix	2021-11-11 04:58:23 -08:00
SangBin Cho	b2acfd6ff4	[Test] Change the frequency of many nodes actor test (#20232 )	2021-11-10 21:12:22 -08:00
Tobias Kaymak	893f57591d	[serve] Add Google Cloud Storage as a backend (#20104 )	2021-11-10 19:45:19 -08:00
Amog Kamsetty	18dcf1ac25	[Release] Use nightly Docker images (#20001 ) * use nightly * switch ml cpu to ray cpu * fix * add pytest * add more pytest * add constraint * add tensorflow * fix merge conflict * add tblib * fix * add back uninstall	2021-11-10 18:00:16 -08:00
gjoliver	b6b4aaa632	[Release] Fix stress_tests (#20233 )	2021-11-10 16:05:46 -08:00
Amog Kamsetty	f164f3a8b5	[Release] Increase Placement Group timeout (#20224 )	2021-11-10 13:02:38 -08:00
xwjiang2010	2fbbecf1e4	[release] Define worker node type even if no worker node is needed. (#20223 )	2021-11-10 11:19:09 -08:00
matthewdeng	790e22f9ad	[tune] move force_on_current_node to ml_utils (#20211 )	2021-11-10 10:21:24 -08:00
Kai Fricke	4e3e213549	[tune] Allow more versatile experiment analysis loading (#20181 )	2021-11-10 11:46:27 +00:00
Simon Mo	215f47bc53	[CI] Move Serve nightly tests to a separate suite (#20194 ) So we can run them via separate cronjobs	2021-11-09 13:22:50 -08:00
SangBin Cho	90fd38c64a	[Test] Large scale threaded actor workload (#20105 ) * Done * Addressed code review. * lint * Update release/nightly_tests/stress_tests/test_threaded_actors.py Co-authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com> Co-authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com>	2021-11-09 02:28:48 -08:00
SangBin Cho	5c4fb4dc91	[Core]Chaos testing nightly (#20059 ) * Done initial stage. * lint * . * Finished. * Fix lint	2021-11-08 21:57:53 -08:00
gjoliver	d8a61f801f	[RLlib] Create a set of performance benchmark tests to run nightly. (#19945 ) * Create a core set of algorithms tests to run nightly. * Run release tests under tf, tf2, and torch frameworks. * Fix * Add eager_tracing option for tf2 framework. * make sure core tests can run in parallel. * cql * Report progress while running nightly/weekly tests. * Include SAC in nightly lineup. * Revert changes to learning_tests * rebrand to performance test. * update build_pipeline.py with new performance_tests name. * Record stats. * bug fix, need to populate experiments dict. * Alphabetize yaml files. * Allow specifying frameworks. And do not run tf2 by default. * remove some debugging code. * fix * Undo testing changes. * Do not run CQL regression for now. * LINT. Co-authored-by: sven1977 <svenmika1977@gmail.com>	2021-11-08 18:15:13 +01:00
xwjiang2010	99826d2ca6	[Release] Increase node memory by 2X in many_ppo test. (#19591 )	2021-11-08 08:10:09 +09:00
Jiajun Yao	e110d958a1	Support different s3 url formats (#20133 )	2021-11-07 14:58:51 -08:00
Yi Cheng	6a6cc434ba	[nightly] Remove grpc staging test since nightly is stable #20119 (#20119 )	2021-11-05 21:36:58 -07:00
Amog Kamsetty	3408b60d2b	[Release] Refactor User Tests (#20028 ) * wip * add directory * wip * try again * Revert "try again" This reverts commit 82d33ccea6f92848df025e019b87df73cea49e5d. * finish * formatting * fix merge * fix path * chmod * check * sudo * wip * update * fix horovod * try * typo * reduce num workers	2021-11-05 17:28:37 -07:00
gjoliver	1341bb59bf	[RLlib; Release testing] long_running_tests should use RLlib's app_config. (#20095 )	2021-11-05 15:18:56 +01:00
Simon Mo	4d583da7d5	[Serve] Add verbose log for nightly test only (#20088 )	2021-11-04 16:15:22 -07:00
Yi Cheng	04f60c998e	[nightly] Fix pytest missing in nightly test (#20076 ) ## Why are these changes needed? In the nightly test we see ``` Command returned non-success status: 1; Command logs:Traceback (most recent call last): File "dask_on_ray/large_scale_test.py", line 17, in from ray._private.test_utils import monitor_memory_usage File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/test_utils.py", line 18, in import pytest ModuleNotFoundError: No module named 'pytest' ``` This PR fixes this error. ## Related issue number	2021-11-04 13:38:05 -07:00
gjoliver	2c1fa459d4	[RLlib] Add an RLlib Tune experiment to UserTest suite. (#19807 ) * Add an RLlib Tune experiment to UserTest suite. * Add ray.init() * Move example script to example/tune/, so it can be imported as module. * add __init__.py so our new module will get included in python wheel. * Add block device to RLlib test instances. * Reduce disk size a little bit. * Add metrics reporting * Allow max of 5 workers to accommodate all the worker tasks. * revert disk size change. * Minor updates * Trigger build * set max num workers * Add a compute cfg for autoscaled cpu and gpu nodes. * use 1gpu instance. * install tblib for debugging worker crashes. * Manually upgrade to pytorch 1.9.0 * -y * torch=1.9.0 * install torch on driver * bump timeout * Write a more informational result dict. * Revert changes to compute config files that are not used. * add smoke test * update * reduce timeout * Reduce the # of env per worker to 1. * Small fix for getting trial_states * Trigger build * simply result dict * lint * more lint * fix smoke test Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>	2021-11-03 17:04:27 -07:00
Avnish Narayan	026bf01071	[RLlib] Upgrade gym version to 0.21 and deprecate pendulum-v0. (#19535 ) * Fix QMix, SAC, and MADDPA too. * Unpin gym and deprecate pendulum v0 Many tests in rllib depended on pendulum v0, however in gym 0.21, pendulum v0 was deprecated in favor of pendulum v1. This may change reward thresholds, so will have to potentially rerun all of the pendulum v1 benchmarks, or use another environment in favor. The same applies to frozen lake v0 and frozen lake v1 Lastly, all of the RLlib tests and have been moved to python 3.7 * Add gym installation based on python version. Pin python<= 3.6 to gym 0.19 due to install issues with atari roms in gym 0.20 * Reformatting * Fixing tests * Move atari-py install conditional to req.txt * migrate to new ale install method * Make parametric_actions_cartpole return float32 actions/obs * Adding type conversions if obs/actions don't match space * Add utils to make elements match gym space dtypes Co-authored-by: Jun Gong <jungong@anyscale.com> Co-authored-by: sven1977 <svenmika1977@gmail.com>	2021-11-03 16:24:00 +01:00
Amog Kamsetty	f4b425f84c	[Release/Xgboost] Fix master install (#19991 )	2021-11-02 13:50:14 -07:00
Kai Fricke	f96078687f	[xgboost/release] Xgboost/connect gpu test (#19838 ) * [xgboost/release] Add GPU connect user test * Use scaling cluster * typo * Increase xgboost placement group timeout * Much higher timeout * Move os environment timeout * Move os environ * [dev] install xgboost-ray from master * GPU xgboost master * Remove master install after new xgboost release * Install latest * Add master test	2021-11-02 08:40:48 -07:00
Amog Kamsetty	3a52187da8	[Release/Lightning] Add Ray lightning user test (#19812 ) * wip * wip * add ray lightning test * fix * update * merge and add * fix * fix * rename * autoscale * add tblib * gloo backend * typo * upgrade torch * latest and master	2021-11-01 18:29:48 -07:00
Amog Kamsetty	474e44f7e0	[Release/Horovod] Add user test for Horovod (#19661 ) * infra * wip * add test * typo * typo * update * rename * fix * full path * formatting * reorder * update * update * Update release/horovod_tests/workloads/horovod_user_test.py Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> * bump num_workers * update installs * try * add pip_packages * min_workers * fix * bump pg timeout * Fix symlink * fix * fix * cmake * fix * pin filelock * final * update * fix * Update release/horovod_tests/workloads/horovod_user_test.py * fix * fix * separate compute template * test latest and master Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>	2021-11-01 18:28:07 -07:00
matthewdeng	e1e4a45b8d	[train] add simple Ray Train release tests (#19817 ) * [train] add simple Ray Train release tests * simplify tests * update * driver requirements * move to test * remove connect * fix * fix * fix torch * gpu * add assert * remove assert * use gloo backend * fix * finish Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>	2021-11-01 18:25:19 -07:00
xwjiang2010	1803ca13b6	Adding release logs for 1.8.0. (#19867 )	2021-11-01 10:26:04 -07:00

483 commits