hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-05 18:11:42 -05:00

Author	SHA1	Message	Date
Malinda	1d789aee63	[RLlib/Serve/Release tests] Few code refactoring for better use of efficient NumPy functions. (#26284 )	2022-07-27 22:38:35 +02:00
Simon Mo	e5a8b1dd55	[Serve] Add API Annotations And Move to _private (#27058 )	2022-07-27 09:08:26 -07:00
SangBin Cho	a6fe2c1e87	[Release test] Add a memory monitor to nightly test long running actor death (#27083 ) Add a memory monitor to the nightly long-running actor death test. It will be used to detect memory leaks in the test.	2022-07-27 07:32:10 -07:00
Amog Kamsetty	862d10c162	[AIR] Remove ML code from `ray.util` (#27005 ) Removes all ML related code from `ray.util` Removes: - `ray.util.xgboost` - `ray.util.lightgbm` - `ray.util.horovod` - `ray.util.ray_lightning` Moves `ray.util.ml_utils` to other locations Closes #23900 Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com> Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-07-27 14:24:19 +01:00
xwjiang2010	4c30325172	[air] update xgboost test (catch test failures properly). (#27023 ) - Update xgboost test (catch test failures properly) - Remove `path` from `from_model` for XGBoostCheckpoint and LightGbmCheckpoint. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>	2022-07-27 12:18:51 +01:00
Kai Fricke	ce5c5d858b	[ci/release/RLlib] Fix IMPALA long running release test. (#27086 )	2022-07-27 12:38:32 +02:00
Avnish Narayan	f5a9a44b9c	[RLlib] Revert Revert Fix apex long running test (#26928 )	2022-07-26 15:10:25 -07:00
Balaji Veeramani	89f7f2a567	[Datasets] Add `size` parameter to `ImageFolderDatasource` (#26975 ) If you read a folder with differently-sized images, `ImageFolderDatasource` errors. This PR fixes the issue by resizing images to a user-specified size.	2022-07-26 14:57:38 -07:00
matthewdeng	1bb7651e95	[air] add smoke-test flag to tensorflow_benchmark (#26999 ) Increase ratio from 1.15 to 1.2 Signed-off-by: Matthew Deng <matt@anyscale.com>	2022-07-26 15:47:37 +01:00
Sihan Wang	8ecd928c34	[Serve] Make the checkpoint and recover only from GCS (#26753 )	2022-07-25 14:24:53 -07:00
Chen Shen	acbab51d3e	[Nightly] fix microbenchmark scripts (#26947 ) The microbenchmarks failed at `raise ValueError(f"Malformed address: {address}")` with `ValueError: Malformed address:`. This is due to `55a0f7b`; the fix is to set RAY_ADDRESS="local". Signed-off-by: scv119 <scv119@gmail.com>	2022-07-24 14:16:43 -07:00
Avnish Narayan	a50a81a13a	Revert "[RLlib] Fix apex breakout release test performance. (#26867 )" (#26927 )	2022-07-23 17:27:50 +02:00
Avnish Narayan	2cfd6c2e97	[RLlib] Fix apex breakout release test performance. (#26867 )	2022-07-23 13:53:03 +02:00
Richard Liaw	96e8027c7e	[air] large tune/torch benchmark (#26763 ) Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2022-07-23 01:17:25 -07:00
Jiao	840b0478aa	[AIR CUJ] Add wait_for_nodes for 4x4 gpu test	2022-07-22 16:04:54 -07:00
Steven Morad	259429bdc3	Bump gym dep to 0.24 (#26190 ) Co-authored-by: Steven Morad <smorad@anyscale.com> Co-authored-by: Avnish <avnishnarayan@gmail.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>	2022-07-22 12:37:16 -07:00
Avnish Narayan	82395c4646	[RLlib] Put learning test into own folders (#26862 ) Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>	2022-07-22 11:20:47 -07:00
Avnish Narayan	67c0a69643	[Rllib] Fix broken cluster env launcher gym pinning (#26865 )	2022-07-21 20:45:16 -07:00
matthewdeng	14e2b2548c	[air] update remaining dict scaling_configs (#26856 )	2022-07-21 18:55:21 -07:00
Balaji Veeramani	ac1d21027d	[AIR] Add framework-specific checkpoints (#26777 )	2022-07-20 19:33:27 -07:00
Archit Kulkarni	e043f49957	[Serve] [CI] Increase instance size and add debug log for `autoscaling_multi_deployment` release test (#26732 )	2022-07-20 16:13:36 -07:00
Kai Fricke	2e35d47bd2	[air/train/benchmark] Add TF GPU 4x4 benchmark (#26776 )	2022-07-20 14:07:51 -07:00
Avnish Narayan	5433c11650	[RLlib] Pin gym to 0.23.1 (#26752 )	2022-07-20 11:49:01 -07:00
matthewdeng	2a425b195c	[air] change default strategy to PACK (#26757 )	2022-07-19 23:01:24 -07:00
Jiao	e7ab969f61	[P0][Release Blocker Fix] Larger headnode for tune_scalability_network_overhead weekly test (#26742 )	2022-07-19 16:40:25 -07:00
Jiajun Yao	2603aea4c9	[CI] Chaos tests for dataset random shuffle 1tb (#26738 ) - Add chaos tests for dataset random shuffle 1tb: both simple shuffle and push-based shuffle - Mark dataset_shuffle_push_based_random_shuffle_1tb as stable	2022-07-19 15:16:51 -07:00
xwjiang2010	75027eb479	[air/benchmarks] train/tune benchmark (#26564 ) Making sure that tuning multiple trials in parallel is not significantly slower than training each trial individually. Some overhead is expected. Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com> Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-07-19 18:24:39 +01:00
Richard Liaw	7e62e1187c	[air/benchmark] Torch benchmarks for 4x4 (#26692 ) Add benchmark data for 4x4 GPU setup. Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com> Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-07-19 17:06:37 +01:00
Riatre	591cd22be7	Revert "Revert "Bump pytest from 5.4.3 to 7.0.1"" (#26525 ) * Revert "Revert "Bump pytest from 5.4.3 to 7.0.1"" This reverts commit `ab10890e90`. Signed-off-by: Riatre Foo <foo@riat.re> * Fix missing test data files dependency in rllib/BUILD See #26334 and #26517 for context. Once this is in, it should be good to roll forward again. Signed-off-by: Riatre Foo <foo@riat.re> * debug: run all tests Signed-off-by: Riatre Foo <foo@riat.re> * Revert "debug: run all tests" This reverts commit 0c5e796b0eb437d64922f66749c61b0412486970. Signed-off-by: Riatre Foo <foo@riat.re> * fix new tests since last rebase Signed-off-by: Riatre Foo <foo@riat.re>	2022-07-18 21:21:19 -07:00
Sumanth Ratna	759966781f	[air] Allow users to use instances of `ScalingConfig` (#25712 ) Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2022-07-18 15:46:58 -07:00
Jiajun Yao	40a4777bc0	Mark chaos_dataset_shuffle_push_based_sort_1tb and chaos_dataset_shuffle_sort_1tb stable (#26677 ) They passed for the past 7 runs.	2022-07-18 14:34:08 -07:00
Kai Fricke	00947fd949	[air/benchmarks] Add 4x1 GPU benchmark for Torch (#26562 )	2022-07-18 12:14:10 -07:00
matthewdeng	6670708010	[air] add placement group max CPU to data benchmark (#26649 ) Set experimental `_max_cpu_fraction_per_node` to prevent deadlock. This should technically be a no-op with the SPREAD strategy.	2022-07-18 10:34:40 -07:00
Jiao	98a07920d3	[AIR][CUJ] Make distributing training benchmark at silver tier (#26640 )	2022-07-17 22:07:09 -07:00
Jiao	77e2ef2eb6	[AIR] Update Torch benchmarks with documentation (#26631 ) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2022-07-16 17:58:21 -07:00
Eric Liang	0855bcb77e	[air] Use SPREAD strategy by default and don't special case it in benchmarks (#26633 )	2022-07-16 17:37:06 -07:00
Jiao	196e52ad7c	[AIR][CUJ] E2E Pytorch training (#26621 )	2022-07-16 08:23:19 -07:00
Jiao	988ffd494b	[AIR][CUJ] Add GPU bench prediction benchmark (#26614 )	2022-07-16 08:22:37 -07:00
matthewdeng	e3a096f412	[air] add bulk ingest benchmarks (#26618 )	2022-07-15 22:01:23 -07:00
Richard Liaw	5ad4e75831	[air] Add initial benchmark section (#26608 )	2022-07-15 15:33:48 -07:00
xwjiang2010	a241e6a0f5	[air] Add xgboost release test for silver tier(10-node case). (#26460 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2022-07-15 13:21:10 -07:00
Artur Niederfahrenhorst	4ce9686d94	[RLlib] Fixes MARWIL release tests (#26586 )	2022-07-15 11:13:15 -07:00
Kai Fricke	213a96e239	[air/benchmarks] Add distributed Tensorflow benchmarks (CPU only) (#26519 ) Following up from #26436, this PR adds a distributed benchmark test for TensorFlow FashionMNIST training. It compares training with Ray AIR against training with vanilla TensorFlow. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-14 22:08:43 +01:00
Kai Fricke	cd95569b01	[tune/release] Add up/down scaling release test (#25392 ) This adds a nightly release test that asserts that autoscaling a cluster up and down in a Ray Tune run works. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-13 22:57:24 +01:00
Sihan Wang	b606169cb5	[Serve] Promote autoscaling feature (#26393 ) 1. get rid of the private attribute 2. fix unit test 3. docs and workflows	2022-07-13 14:38:38 -05:00
Sven Mika	ab10890e90	Revert "Bump pytest from 5.4.3 to 7.0.1" (breaks lots of RLlib tests for unknown reasons) (#26517 )	2022-07-13 11:19:30 -07:00
Antoni Baum	a8fb194c8b	[CI] Fix nightly horovod test (#26447 ) Removes usage of deprecated Train APIs and uses Ray AIR HorovodTrainer instead.	2022-07-13 16:51:50 +01:00
Kai Fricke	e4a4f7de70	[ci/release] Fix fetching logs from staging clusters (#26515 ) Replaces a formerly hard-coded URI to anyscale prod with the respective env variable. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-13 11:15:50 +01:00
Kai Fricke	cf75cf7232	[air] Add AIR distributed training benchmark for Torch FashionMNIST (#26436 ) This PR adds a distributed benchmark test for PyTorch FashionMNIST training. It compares training with Ray AIR against training with vanilla PyTorch. In both cases, the same training loop is used. For Ray AIR, we use a TorchTrainer with 4 CPU workers. For vanilla PyTorch, we upload a training script and kick it off (using Ray tasks) in subprocesses on each node. In both cases, we collect the end-to-end runtime. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-13 10:53:24 +01:00
Riatre	2cdb76789e	Bump pytest from 5.4.3 to 7.0.1 (#26334 ) See #23676 for context. This is another attempt at that, as I figured out what was going wrong in `bazel test`. Supersedes #24828. Although there are now Python 3.10 wheels for Ray 1.13 and this is no longer a blocker for supporting Python 3.10, I still want to make `bazel test //python/ray/tests/...` work for developing in a 3.10 env, and make it easier to add Python 3.10 tests to CI in future. The change contains three commits with rather descriptive commit messages, which I repeat here: (1) Pass deps to py_test in py_test_module_list: Bazel macro py_test_module_list takes a `deps` argument, but completely ignores it instead of passing it to `native.py_test`. Fixing that, as we are going to use deps of py_test_module_list in BUILD in later changes. cpp/BUILD.bazel depends on the broken behaviour: it depends on a cc_library from a py_test, which isn't working, see upstream issue: https://github.com/bazelbuild/bazel/issues/701. This is fixed by simply removing the (non-working) deps. (2) Depend on conftest and data files in Python tests BUILD files: Bazel requires that all the files used in a test run be represented in the transitive dependencies specified for the test target. For py_test, this means srcs, deps and data. Bazel enforces this constraint by creating a "runfiles" directory, symlinking the files in the dependency closure into it, and running the test in the "runfiles" directory, so that the test cannot see files outside the dependency graph. Unfortunately, the constraint is not effective for a large number of Python tests, because pytest (>=3.9.0, <6.0) resolves these symbolic links during test collection and effectively "breaks out" of the runfiles tree. pytest >= 6.0 introduced a breaking change that removed the symbolic link resolving behaviour, see pytest pull request https://github.com/pytest-dev/pytest/pull/6523 for more context.
Currently, we are underspecifying dependencies in a lot of BUILD files, which blocks us from updating to a newer pytest (for Python 3.10 support). This change hopefully fixes all of them, and at least those in CI, by adding data or source dependencies (mostly for conftest.py files) where needed. (3) Bump pytest version from 5.4.3 to 7.0.1: We want at least pytest 6.2.5 for Python 3.10 support, but not 7.1.0 or later since it drops Python 3.6 support (which Ray still supports), thus the version constraint is set to <7.1. Updating pytest, combined with the earlier BUILD fixes, changed the ground truth of a few error-message-based unit tests; these tests are updated to reflect the change. There are also two small drive-by changes to make test_traceback and test_cli pass under Python 3.10. These were discovered while debugging CI failures (on earlier Python versions) with a local Python 3.10 install. Expect more such issues when adding Python 3.10 to CI.	2022-07-12 21:14:35 -07:00
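
Note on commit 2cdb76789e: the "breaking out of the runfiles tree" effect it describes can be reproduced with a toy model. The sketch below is not pytest or Bazel code. It only assumes that a sandbox directory holds symlinks to declared files, and it compares looking for a sibling `conftest.py` with and without resolving the link. The names `visible_conftest` and `demo` and the directory layout are hypothetical.

```python
import os
import tempfile


def visible_conftest(test_path, resolve_symlinks):
    """Return True if a conftest.py sits in the same directory as test_path.

    resolve_symlinks=True models a collector that follows the symlink back
    to the source tree; False models one that stays inside the sandbox.
    """
    if resolve_symlinks:
        path = os.path.realpath(test_path)  # follows the link to the real file
    else:
        path = os.path.abspath(test_path)   # keeps the sandbox location
    return os.path.exists(os.path.join(os.path.dirname(path), "conftest.py"))


def demo():
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "src")            # stand-in for the source tree
        runfiles = os.path.join(root, "runfiles")  # stand-in for the sandbox
        os.makedirs(src)
        os.makedirs(runfiles)

        # The source tree contains both the test and its conftest.py.
        for name in ("conftest.py", "test_example.py"):
            with open(os.path.join(src, name), "w") as f:
                f.write("# placeholder\n")

        # Only the declared file is linked into the sandbox; conftest.py is
        # the undeclared dependency.
        link = os.path.join(runfiles, "test_example.py")
        os.symlink(os.path.join(src, "test_example.py"), link)

        return visible_conftest(link, True), visible_conftest(link, False)


if __name__ == "__main__":
    resolved, sandboxed = demo()
    print("conftest.py visible when resolving symlinks:", resolved)
    print("conftest.py visible when staying in sandbox:", sandboxed)
```

In this model the undeclared `conftest.py` is found only when the symlink is resolved. This is consistent with the commit's account of why the missing dependencies went unnoticed until the pytest upgrade.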

724 commits