hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Kai Fricke	153a8b8fec	[release] convert tune release tests (#15913 )	2021-06-01 11:19:15 -07:00
Sven Mika	c9d220bcda	[RLlib] Upgrade RLlib regression test scripts to new testing tool - RLlib release logs for 1.4. (#16080 )	2021-06-01 17:39:18 +02:00
Amog Kamsetty	da6f28d777	[Release] Add multi-node, multi-GPU SGD release test (#16046 )	2021-05-31 16:23:04 -07:00
SangBin Cho	9fa3b9f6f3	[Nightly test] Test non streaming shuffle (#16150 )	2021-05-31 15:28:02 -07:00
SangBin Cho	94dc06d852	[Nightly test] improve error detection (#16102 ) * improve error detection * improve gitignore * fix	2021-05-27 00:33:21 -07:00
SangBin Cho	ee1ccb569d	[Test] Nightly shuffle test (#15998 ) * shuffle daily test update. * lint * Improve testing. * Download the real nightly. * Addressed code review. * fix typo * fix issue * fix the broken release test * Updated the test.	2021-05-24 15:33:31 -07:00
mwtian	5462c6e7de	Fix link to release checklist from release process doc. (#15793 )	2021-05-13 13:34:54 -07:00
SangBin Cho	259fcbd5bd	[Pubsub] Generalize the pubsub interface and adapt it for ref counting protocol (#15446 ) * Add mock code first * In the initial progress. * Fix the number error * In progress. * in more pgoress. * in progress. * lint. * Prototype done. * Fix compilation bug. * Now it is working with reference counting. * Remove template. * lint. * Fixed issues. * Fix reference count test. * Reference count test passes now. * Fixed the test array problem * Addressed code review. * lint. * Addressed half of code review. * Fix tests. * Addressed the most critical issue. * Make subscriber thread-safe. * Revert "Make subscriber thread-safe." This reverts commit 9a6a52197cfa8463ab60dfaae9530ad3c0ed8790. * Fixed test failures. The only failure now is the asan failure. * Reset test suites and see if it fixes the issue. * Fix a flaky test * Addressed code review.	2021-05-13 09:29:02 -07:00
Eric Liang	0dfd43c61b	Add nightly release test directory and add shuffle release test (#15671 ) * update * udpate * update * update * update * Adjust script/release test json * remove * update * lint Co-authored-by: Kai Fricke <kai@anyscale.com>	2021-05-08 14:21:55 -07:00
Kai Fricke	8db2e5c23a	[release] Move xgboost tune small + microbenchmark release test to new release automation (#15619 )	2021-05-08 20:38:39 +01:00
Kai Fricke	1d52ab819f	[release] release 1.3.0 results and test updates (#15366 ) Convert a number of release tests and add logs for release 1.3.0	2021-05-04 22:10:04 +01:00
Jenna Kwon	15da948214	Support object spilling mode and data load failure mode in dask_on_ra… (#15601 ) * Support object spilling mode and data load failure mode in dask_on_ray_large_scale_test.py * Remove freq and time decimation Co-authored-by: Jenna Kwon <jkkwon@amazon.com>	2021-05-04 10:57:49 -07:00
Amog Kamsetty	ebc44c3d76	[CI] Upgrade flake8 to 3.9.1 (#15527 ) * formatting * format util * format release * format rllib/agents * format rllib/env * format rllib/execution * format rllib/evaluation * format rllib/examples * format rllib/policy * format rllib utils and tests * format streaming * more formatting * update requirements files * fix rllib type checking * updates * update * fix circular import * Update python/ray/tests/test_runtime_env.py * noqa	2021-05-03 14:23:28 -07:00
SangBin Cho	df9329160e	[Tests] Dask on ray release test (#15256 ) * done. * Linting. * Update readme * Update. * Fix issues.	2021-04-15 10:30:17 -07:00
SangBin Cho	d0e83c43ca	[Release Test] Modify parameter to reduce stress (#15048 ) * Fix. * Fix.	2021-04-14 18:27:20 -07:00
Richard Liaw	59bf3a7b22	ray[cluster] -> ray[default] (#15251 )	2021-04-14 09:37:04 -07:00
Edward Oakes	0f9d1bb223	Serve failure release test fix (#15276 ) This test is currently not tested in CI	2021-04-13 17:49:29 +01:00
Edward Oakes	e4ca337e16	[serve] Change remaining tests to use deployment API (#15167 )	2021-04-08 08:15:38 -05:00
Richard Liaw	e72f6b0377	Fix ray[full] -> ray[cluster] #15112 Signed-off-by: Richard Liaw <rliaw@berkeley.edu>	2021-04-05 09:55:00 -07:00
Kai Fricke	b366500938	[tune] fix long running release test WIP (#14866 ) - Use placement groups - Introduce time between checks for failure testing - Use gloo instead of nccl	2021-03-25 11:03:22 +01:00
Amog Kamsetty	233f174984	Update release instructions (#14882 )	2021-03-24 12:41:50 -07:00
SangBin Cho	5f7ce293fe	[Test] Large scale dask on ray test (#14340 ) * Add a test. * Add a test. * d * Modify the release doc. * Addressed code review.	2021-03-23 11:00:35 -07:00
Kai Fricke	7364a7a327	[tune] Move Optuna to ask(fixed_distributions) interface (#14731 ) Adjusting to changes in Optuna 2.6.0. Old interface was marked as deprecated.	2021-03-22 12:25:37 +01:00
Ian Rodney	eb12033612	[Code Cleanup] Switch to use ray.util.get_node_ip_address() (#14741 ) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2021-03-18 13:10:57 -07:00
Kai Fricke	4014168928	[tune] Introduce `durable()` wrapper to convert trainables into durable trainables (#14306 ) * [tune] Introduce `durable()` wrapper to convert trainables into durable trainables * Fix wrong check * Improve docs, add FAQ for tackling overhead * Fix bugs in `tune.with_parameters` * Update doc/source/tune/api_docs/trainable.rst Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Update doc/source/tune/_tutorials/_faq.rst Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2021-02-26 13:59:28 +01:00
SangBin Cho	5740b2391e	Add multi node data processing cluster.yaml (#14198 )	2021-02-19 16:16:55 -08:00
Kai Fricke	a0f73cf3f7	[xgboost] Update XGBoost release test configs (#13941 ) * Update XGBoost release test configs * Use GPU containers * Fix elastic check * Use spot instances for GPU * Add debugging output * Fix success check, failure checking, outputs, sync behavior * Update release checklist, rename mounts	2021-02-17 23:00:49 +01:00
Alex Wu	4846a6c2d0	Release process update (#13798 )	2021-02-15 11:40:49 -08:00
Kai Fricke	1ef2a6790c	[tune] add scalability release tests (#13986 ) * Add scalability tests * Network overhead cluster * Update xgboost tests * Document release tests * Don't raise on failed trial * Update to multi node yamls * Update yamls * Revert xgboost test changes * Fix import * Update release/tune_tests/scalability_tests/workloads/test_bookkeeping_overhead.py Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Pass aws credentials (WIP) * Update durable trainable example * Update xgboost sweep * Change xgboost scope, fix durable trainable stop condition * Fix max depth to limit total test length * Add cluster information to test descriptions. Update release checklist/process docs Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2021-02-10 17:16:31 +01:00
Kai Fricke	1e113d2e6e	[tune/xgboost] Update release test docs (#13880 ) * Update release test docs * Update	2021-02-04 13:10:56 +01:00
Amog Kamsetty	2ba77ae3a2	[Release] Fix SGD+Tune long running distributed release test (#13812 ) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2021-01-31 21:05:50 -08:00
SangBin Cho	c21a79ae6e	[Object Spilling] 100GB shuffle release test (#13729 )	2021-01-29 12:38:06 -08:00
Ian Rodney	b4bcb9b60a	[Docker] Use Cuda 11 (#13691 )	2021-01-27 13:45:30 -08:00
Alex Wu	840987c7af	Scalability Envelope Tests (#13464 )	2021-01-25 18:48:31 -08:00
Simon Mo	fe8262afd0	Add K8s test to release process (#13694 )	2021-01-25 16:53:52 -08:00
Ameer Haj Ali	b7dd7ddb52	deprecate useless fields in the cluster yaml. (#13637 ) * prepare for head node * move command runner interface outside _private * remove space * Eric * flake * min_workers in multi node type * fixing edge cases * eric not idle * fix target_workers to consider min_workers of node types * idle timeout * minor * minor fix * test * lint * eric v2 * eric 3 * min_workers constraint before bin packing * Update resource_demand_scheduler.py * Revert "Update resource_demand_scheduler.py" This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5. * reducing diff * make get_nodes_to_launch return a dict * merge * weird merge fix * auto fill instance types for AWS * Alex/Eric * Update doc/source/cluster/autoscaling.rst * merge autofill and input from user * logger.exception * make the yaml use the default autofill * docs Eric * remove test_autoscaler_yaml from windows tests * lets try changing the test a bit * return test * lets see * edward * Limit max launch concurrency * commenting frac TODO * move to resource demand scheduler * use STATUS UP TO DATE * Eric * make logger of gc freed refs debug instead of info * add cluster name to docker mount prefix directory * grrR * fix tests * moving docker directory to sdk * move the import to prevent circular dependency * smallf fix * ian * fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running * small fix * deflake test_joblib * lint * placement groups bypass * remove space * Eric * first ocmmit * lint * exmaple * documentation * hmm * file path fix * fix test * some format issue in docs * modified docs * joblib strikes again on windows * add ability to not start autoscaler/monitor * a * remove worker_default * Remove default pod type from operator * Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types * deprecate useless fields Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan> Co-authored-by: Alex Wu <alex@anyscale.io> Co-authored-by: Alex Wu <itswu.alex@gmail.com> Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local> Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal> Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>	2021-01-23 12:06:51 -08:00
Kai Fricke	8804758409	[xgboost] Add XGBoost release tests (#13456 ) * Add XGBoost release tests * Add more xgboost release tests * Use failure state manager * Add release test documentation * Fix wording * Automate fault tolerance tests	2021-01-20 18:40:23 +01:00
Simon Mo	c963cbc038	Fix Docker Permission for Serve release test again (#13543 )	2021-01-19 12:23:30 -08:00
Sven Mika	93c0a5549b	[RLlib] Deprecate `vf_share_layers` in top-level PPO/MAML/MB-MPO configs. (#13397 )	2021-01-19 09:51:35 +01:00
SangBin Cho	1179db1fc2	Remove an unnecessary file (#13499 )	2021-01-15 18:29:12 -08:00
Eric Liang	ee6332dbb0	Bump dev branch to 2.0 to avoid endless version bump toil (#13497 ) * wip * fix * fix	2021-01-15 17:41:17 -08:00
SangBin Cho	d09df55b14	Update ID specification doc (#13356 )	2021-01-15 15:15:51 -08:00
Simon Mo	16e8c4a69f	[Release] Fix Serve release test (#13303 ) The Docker image we were using now uses `ray` users so we have to call sudo.	2021-01-14 12:23:53 -08:00
SangBin Cho	0428537d0b	[Object Spilling] Long running object spilling test (#13331 ) * done. * formatting.	2021-01-12 16:53:13 -08:00
Kai Fricke	518427627b	[tune] buffer trainable results (#13236 ) * Working prototype * Pass buffer length, fix tests * Don't buffer per default * Dispatch and process save in one go, added tests * Fix tests * Pass adaptive seconds to train_buffered, stop result processing after STOP decision * Fix tests, add release test * Update tests * Added detailed logs for slow operations * Update python/ray/tune/trial_runner.py Co-authored-by: Richard Liaw <rliaw@berkeley.edu> * Apply suggestions from code review * Revert tests and go back to old tuning loop * nit Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2021-01-12 18:52:47 +01:00
Simon Mo	c32ad2fef5	[Release] Use ray-ml image for logn running test (#13267 )	2021-01-07 10:31:46 -08:00
Max Fitton	5094734205	Update autoscaler-cluster yaml files for release tests (#13114 )	2021-01-07 11:44:57 -06:00
Simon Mo	01dcb993c7	[Serve] Rescale Serve's Long Running Test to Cluster Mode (#13247 ) Now that `HeadOnly` becomes the new default HTTP location, we can re-enable the long running tests to use local multi-clusters. (also fixed the controller's API to match up to date, we should have caught these, I will open issues for this.)	2021-01-07 08:57:24 -08:00
Max Fitton	0d61ea9b06	[Release] Add 1.1.0 release test logs (#13054 ) * Add microbenchmark to release logs * check in many_tasks stress test result * Add results of placement group stress test for 1.1.0 * Add result for test_dead_actors test and correct the name of test_many_tasks.txt * Add rllib regression test result * Add pytorch test results for rllib * remove extraneous log entries	2021-01-06 11:03:16 -08:00
Max Fitton	d018212db5	[Release] Update Release Process Documentation (#13123 )	2021-01-04 11:09:43 -08:00

114 commits