hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 18:41:40 -05:00

Author	SHA1	Message	Date
xwjiang2010	7f6578b81e	[release test] increase air tf gpu benchmark non smoke test timeout from 3600 to 4800. (#27869 ) Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>	2022-08-15 10:38:36 -07:00
xwjiang2010	b88064dbb6	[release test] remove dask/modin_xgboost test completely. (#27865 ) The original script was removed in https://github.com/ray-project/ray/pull/27816. This just cleans up what remained. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>	2022-08-15 08:28:33 -07:00
xwjiang2010	5ce9656b2b	[air] update pytorch_training_e2e.py to use iter_torch_batches. (#27241 ) (#27465 ) update pytorch_training_e2e.py to use iter_torch_batches. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>	2022-08-04 08:09:31 -07:00
Jimmy Yao	365446265b	[ray 2.0 release] fix the release test of ray lightning master (#27395 )	2022-08-03 09:40:57 -07:00
Kai Fricke	0d4d4e14a9	[release/tune/2.0.0] Fix k8s release test + node-to-node syncing (#27365 ) * [air] fix xgboost_benchmark script by passing in args (#27146) * [tune/docs] Update custom syncer example (#27252) There is a small bug in the docs example for custom command based syncers. This PR fixes them and adds a test to test these changes. Signed-off-by: Kai Fricke <kai@anyscale.com> * [tune/release] Do not use spot instances in k8s tests (#27250) Spot instances are not being booted up, so let's go without them. Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: matthewdeng <matt@anyscale.com>	2022-08-02 14:46:12 -07:00
Jun Gong	24976ef23a	[RLlib] Revert `41c9ef70`. (#27243 ) (#27270 ) Why are these changes needed? Also: Add validation to make sure multi-gpu and micro-batch are not used together. Update A2C learning test to hit the microbatching branch. Minor comment updates.	2022-07-29 12:02:23 -07:00
xwjiang2010	4bf33efd5c	[air] Add annotation for Tune module. (#27060 ) (#27210 ) Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com> As a follow up to #27060.	2022-07-29 11:11:45 -07:00
matthewdeng	86718071fe	[tune] Increase volume size for long running pbt failure (#27163 ) (#27247 ) Currently running into an issue: Cluster startup Failed. Error: RuntimeError: botocore.exceptions.ClientError: An error occurred (InvalidBlockDeviceMapping) when calling the RunInstances operation: Volume of size 202GB is smaller than snapshot 'snap-02c4e6a0ad06cf3d6', expect size >= 400GB Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2022-07-29 01:16:40 -07:00
Clark Zinzow	f7b46b3ecc	[AIR - Datasets] Fix AIR release tests dealing with tensor columns. (#27221 ) (#27224 ) This PR fixes some AIR release tests that deal with tensor columns.	2022-07-28 16:40:48 -07:00
Kai Fricke	abca0ba165	[tune/release/2.0.0] Fix tune_cloud_aws_durable_upload_rllib_* release tests (#27180 )	2022-07-28 15:14:49 -07:00
Kai Fricke	dc0b445323	[rllib/release/2.0.0] Fix rllib connect test (#27162 ) Why are these changes needed? Follow-up from #27155 - this will let the connect test pass	2022-07-28 14:23:23 -07:00
Simon Mo	13c3400117	[Serve] Remove release tests for checkpoint_path (#27194 ) (#27206 ) Cherry pick commit `8beb887` to address #27189	2022-07-28 13:00:03 -07:00
scv119	3edfc78ee2	update version number to 2.0.0rc0	2022-07-27 18:43:27 +00:00
Simon Mo	e5a8b1dd55	[Serve] Add API Annotations And Move to _private (#27058 )	2022-07-27 09:08:26 -07:00
SangBin Cho	a6fe2c1e87	[Release test] Add a memory monitor to nightly test long running actor death (#27083 ) Add a memory monitor to nightly test long running actor death. It will be used to detect memory leaks in the test.	2022-07-27 07:32:10 -07:00
Amog Kamsetty	862d10c162	[AIR] Remove ML code from `ray.util` (#27005 ) Removes all ML related code from `ray.util` Removes: - `ray.util.xgboost` - `ray.util.lightgbm` - `ray.util.horovod` - `ray.util.ray_lightning` Moves `ray.util.ml_utils` to other locations Closes #23900 Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com> Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-07-27 14:24:19 +01:00
xwjiang2010	4c30325172	[air] update xgboost test (catch test failures properly). (#27023 ) - Update xgboost test (catch test failures properly) - Remove `path` from `from_model` for XGBoostCheckpoint and LightGbmCheckpoint. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>	2022-07-27 12:18:51 +01:00
Kai Fricke	ce5c5d858b	[ci/release/RLlib] Fix IMPALA long running release test. (#27086 )	2022-07-27 12:38:32 +02:00
Avnish Narayan	f5a9a44b9c	[RLlib] Revert Revert Fix apex long running test (#26928 )	2022-07-26 15:10:25 -07:00
Balaji Veeramani	89f7f2a567	[Datasets] Add `size` parameter to `ImageFolderDatasource` (#26975 ) If you read a folder with differently-sized images, `ImageFolderDatasource` errors. This PR fixes the issue by resizing images to a user-specified size.	2022-07-26 14:57:38 -07:00
matthewdeng	1bb7651e95	[air] add smoke-test flag to tensorflow_benchmark (#26999 ) Increase ratio from 1.15 to 1.2 Signed-off-by: Matthew Deng <matt@anyscale.com>	2022-07-26 15:47:37 +01:00
Sihan Wang	8ecd928c34	[Serve] Make the checkpoint and recover only from GCS (#26753 )	2022-07-25 14:24:53 -07:00
Chen Shen	acbab51d3e	[Nightly] fix microbenchmark scripts (#26947 ) Signed-off-by: scv119 scv119@gmail.com Why are these changes needed? The microbenchmarks failed at `raise ValueError(f"Malformed address: {address}")` with `ValueError: Malformed address:`. This is caused by `55a0f7b`; the fix is to set RAY_ADDRESS="local".	2022-07-24 14:16:43 -07:00
Avnish Narayan	a50a81a13a	Revert "[RLlib] Fix apex breakout release test performance. (#26867 )" (#26927 )	2022-07-23 17:27:50 +02:00
Avnish Narayan	2cfd6c2e97	[RLlib] Fix apex breakout release test performance. (#26867 )	2022-07-23 13:53:03 +02:00
Richard Liaw	96e8027c7e	[air] large tune/torch benchmark (#26763 ) Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2022-07-23 01:17:25 -07:00
Jiao	840b0478aa	[AIR CUJ] Add wait_for_nodes for 4x4 gpu test	2022-07-22 16:04:54 -07:00
Steven Morad	259429bdc3	Bump gym dep to 0.24 (#26190 ) Co-authored-by: Steven Morad <smorad@anyscale.com> Co-authored-by: Avnish <avnishnarayan@gmail.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>	2022-07-22 12:37:16 -07:00
Avnish Narayan	82395c4646	[RLlib] Put learning test into own folders (#26862 ) Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>	2022-07-22 11:20:47 -07:00
Avnish Narayan	67c0a69643	[Rllib] Fix broken cluster env launcher gym pinning (#26865 )	2022-07-21 20:45:16 -07:00
matthewdeng	14e2b2548c	[air] update remaining dict scaling_configs (#26856 )	2022-07-21 18:55:21 -07:00
Balaji Veeramani	ac1d21027d	[AIR] Add framework-specific checkpoints (#26777 )	2022-07-20 19:33:27 -07:00
Archit Kulkarni	e043f49957	[Serve] [CI] Increase instance size and add debug log for `autoscaling_multi_deployment` release test (#26732 )	2022-07-20 16:13:36 -07:00
Kai Fricke	2e35d47bd2	[air/train/benchmark] Add TF GPU 4x4 benchmark (#26776 )	2022-07-20 14:07:51 -07:00
Avnish Narayan	5433c11650	[RLlib] Pin gym to 0.23.1 (#26752 )	2022-07-20 11:49:01 -07:00
matthewdeng	2a425b195c	[air] change default strategy to PACK (#26757 )	2022-07-19 23:01:24 -07:00
Jiao	e7ab969f61	[P0][Release Blocker Fix] Larger headnode for tune_scalability_network_overhead weekly test (#26742 )	2022-07-19 16:40:25 -07:00
Jiajun Yao	2603aea4c9	[CI] Chaos tests for dataset random shuffle 1tb (#26738 ) - Add chaos tests for dataset random shuffle 1tb: both simple shuffle and push-based shuffle - Mark dataset_shuffle_push_based_random_shuffle_1tb as stable	2022-07-19 15:16:51 -07:00
xwjiang2010	75027eb479	[air/benchmarks] train/tune benchmark (#26564 ) Making sure that tuning multiple trials in parallel is not significantly slower than training each individual trial. Some overhead is expected. Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com> Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-07-19 18:24:39 +01:00
Richard Liaw	7e62e1187c	[air/benchmark] Torch benchmarks for 4x4 (#26692 ) Add benchmark data for 4x4 GPU setup. Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com> Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-07-19 17:06:37 +01:00
Riatre	591cd22be7	Revert "Revert "Bump pytest from 5.4.3 to 7.0.1"" (#26525 ) * Revert "Revert "Bump pytest from 5.4.3 to 7.0.1"" This reverts commit `ab10890e90`. Signed-off-by: Riatre Foo <foo@riat.re> * Fix missing test data files dependency in rllib/BUILD See #26334 and #26517 for context. Once this is in, it should be good to roll-forward again. Signed-off-by: Riatre Foo <foo@riat.re> * debug: run all tests Signed-off-by: Riatre Foo <foo@riat.re> * Revert "debug: run all tests" This reverts commit 0c5e796b0eb437d64922f66749c61b0412486970. Signed-off-by: Riatre Foo <foo@riat.re> * fix new tests since last rebase Signed-off-by: Riatre Foo <foo@riat.re>	2022-07-18 21:21:19 -07:00
Sumanth Ratna	759966781f	[air] Allow users to use instances of `ScalingConfig` (#25712 ) Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2022-07-18 15:46:58 -07:00
Jiajun Yao	40a4777bc0	Mark chaos_dataset_shuffle_push_based_sort_1tb and chaos_dataset_shuffle_sort_1tb stable (#26677 ) They passed for the past 7 runs.	2022-07-18 14:34:08 -07:00
Kai Fricke	00947fd949	[air/benchmarks] Add 4x1 GPU benchmark for Torch (#26562 )	2022-07-18 12:14:10 -07:00
matthewdeng	6670708010	[air] add placement group max CPU to data benchmark (#26649 ) Set experimental `_max_cpu_fraction_per_node` to prevent deadlock. This should technically be a no-op with the SPREAD strategy.	2022-07-18 10:34:40 -07:00
Jiao	98a07920d3	[AIR][CUJ] Make distributing training benchmark at silver tier (#26640 )	2022-07-17 22:07:09 -07:00
Jiao	77e2ef2eb6	[AIR] Update Torch benchmarks with documentation (#26631 ) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2022-07-16 17:58:21 -07:00
Eric Liang	0855bcb77e	[air] Use SPREAD strategy by default and don't special case it in benchmarks (#26633 )	2022-07-16 17:37:06 -07:00
Jiao	196e52ad7c	[AIR][CUJ] E2E Pytorch training (#26621 )	2022-07-16 08:23:19 -07:00
Jiao	988ffd494b	[AIR][CUJ] Add GPU bench prediction benchmark (#26614 )	2022-07-16 08:22:37 -07:00

736 commits