hiro/ray

mirror of https://github.com/vale981/ray, synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Chen Shen	acbab51d3e	[Nightly] fix microbenchmark scripts (#26947 ) Signed-off-by: scv119 scv119@gmail.com Why are these changes needed? Microbenchmarks failed at `raise ValueError(f"Malformed address: {address}")` with `ValueError: Malformed address:`. This is due to `55a0f7b`; the fix is to set RAY_ADDRESS="local".	2022-07-24 14:16:43 -07:00
Avnish Narayan	a50a81a13a	Revert "[RLlib] Fix apex breakout release test performance. (#26867 )" (#26927 )	2022-07-23 17:27:50 +02:00
Avnish Narayan	2cfd6c2e97	[RLlib] Fix apex breakout release test performance. (#26867 )	2022-07-23 13:53:03 +02:00
Richard Liaw	96e8027c7e	[air] large tune/torch benchmark (#26763 ) Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2022-07-23 01:17:25 -07:00
Jiao	840b0478aa	[AIR CUJ] Add wait_for_nodes for 4x4 gpu test	2022-07-22 16:04:54 -07:00
Steven Morad	259429bdc3	Bump gym dep to 0.24 (#26190 ) Co-authored-by: Steven Morad <smorad@anyscale.com> Co-authored-by: Avnish <avnishnarayan@gmail.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>	2022-07-22 12:37:16 -07:00
Avnish Narayan	82395c4646	[RLlib] Put learning test into own folders (#26862 ) Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>	2022-07-22 11:20:47 -07:00
Avnish Narayan	67c0a69643	[Rllib] Fix broken cluster env launcher gym pinning (#26865 )	2022-07-21 20:45:16 -07:00
matthewdeng	14e2b2548c	[air] update remaining dict scaling_configs (#26856 )	2022-07-21 18:55:21 -07:00
Balaji Veeramani	ac1d21027d	[AIR] Add framework-specific checkpoints (#26777 )	2022-07-20 19:33:27 -07:00
Archit Kulkarni	e043f49957	[Serve] [CI] Increase instance size and add debug log for `autoscaling_multi_deployment` release test (#26732 )	2022-07-20 16:13:36 -07:00
Kai Fricke	2e35d47bd2	[air/train/benchmark] Add TF GPU 4x4 benchmark (#26776 )	2022-07-20 14:07:51 -07:00
Avnish Narayan	5433c11650	[RLlib] Pin gym to 0.23.1 (#26752 )	2022-07-20 11:49:01 -07:00
matthewdeng	2a425b195c	[air] change default strategy to PACK (#26757 )	2022-07-19 23:01:24 -07:00
Jiao	e7ab969f61	[P0][Release Blocker Fix] Larger headnode for tune_scalability_network_overhead weekly test (#26742 )	2022-07-19 16:40:25 -07:00
Jiajun Yao	2603aea4c9	[CI] Chaos tests for dataset random shuffle 1tb (#26738 ) - Add chaos tests for dataset random shuffle 1tb: both simple shuffle and push-based shuffle - Mark dataset_shuffle_push_based_random_shuffle_1tb as stable	2022-07-19 15:16:51 -07:00
xwjiang2010	75027eb479	[air/benchmarks] train/tune benchmark (#26564 ) Making sure that tuning multiple trials in parallel is not significantly slower than training each individual trials. Some overhead is expected. Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com> Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-07-19 18:24:39 +01:00
Richard Liaw	7e62e1187c	[air/benchmark] Torch benchmarks for 4x4 (#26692 ) Add benchmark data for 4x4 GPU setup. Signed-off-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com> Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-07-19 17:06:37 +01:00
Riatre	591cd22be7	Revert "Revert "Bump pytest from 5.4.3 to 7.0.1"" (#26525 ) * Revert "Revert "Bump pytest from 5.4.3 to 7.0.1"" This reverts commit `ab10890e90`. Signed-off-by: Riatre Foo <foo@riat.re> * Fix missing test data files dependency in rllib/BUILD See #26334 and #26517 for context. Once this is in, it should be good to roll forward again. Signed-off-by: Riatre Foo <foo@riat.re> * debug: run all tests Signed-off-by: Riatre Foo <foo@riat.re> * Revert "debug: run all tests" This reverts commit 0c5e796b0eb437d64922f66749c61b0412486970. Signed-off-by: Riatre Foo <foo@riat.re> * fix new tests since last rebase Signed-off-by: Riatre Foo <foo@riat.re>	2022-07-18 21:21:19 -07:00
Sumanth Ratna	759966781f	[air] Allow users to use instances of `ScalingConfig` (#25712 ) Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com> Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2022-07-18 15:46:58 -07:00
Jiajun Yao	40a4777bc0	Mark chaos_dataset_shuffle_push_based_sort_1tb and chaos_dataset_shuffle_sort_1tb stable (#26677 ) They passed for the past 7 runs.	2022-07-18 14:34:08 -07:00
Kai Fricke	00947fd949	[air/benchmarks] Add 4x1 GPU benchmark for Torch (#26562 )	2022-07-18 12:14:10 -07:00
matthewdeng	6670708010	[air] add placement group max CPU to data benchmark (#26649 ) Set experimental `_max_cpu_fraction_per_node` to prevent deadlock. This should technically be a no-op with the SPREAD strategy.	2022-07-18 10:34:40 -07:00
Jiao	98a07920d3	[AIR][CUJ] Make distributing training benchmark at silver tier (#26640 )	2022-07-17 22:07:09 -07:00
Jiao	77e2ef2eb6	[AIR] Update Torch benchmarks with documentation (#26631 ) Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2022-07-16 17:58:21 -07:00
Eric Liang	0855bcb77e	[air] Use SPREAD strategy by default and don't special case it in benchmarks (#26633 )	2022-07-16 17:37:06 -07:00
Jiao	196e52ad7c	[AIR][CUJ] E2E Pytorch training (#26621 )	2022-07-16 08:23:19 -07:00
Jiao	988ffd494b	[AIR][CUJ] Add GPU bench prediction benchmark (#26614 )	2022-07-16 08:22:37 -07:00
matthewdeng	e3a096f412	[air] add bulk ingest benchmarks (#26618 )	2022-07-15 22:01:23 -07:00
Richard Liaw	5ad4e75831	[air] Add initial benchmark section (#26608 )	2022-07-15 15:33:48 -07:00
xwjiang2010	a241e6a0f5	[air] Add xgboost release test for silver tier(10-node case). (#26460 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu>	2022-07-15 13:21:10 -07:00
Artur Niederfahrenhorst	4ce9686d94	[RLlib] Fixes MARWIL release tests (#26586 )	2022-07-15 11:13:15 -07:00
Kai Fricke	213a96e239	[air/benchmarks] Add distributed Tensorflow benchmarks (CPU only) (#26519 ) Following up from #26436, this PR adds a distributed benchmark test for Tensorflow FashionMNIST training. It compares training with Ray AIR with training with vanilla PyTorch. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-14 22:08:43 +01:00
Kai Fricke	cd95569b01	[tune/release] Add up/down scaling release test (#25392 ) This adds a nightly release test that asserts that autoscaling a cluster up and down in a Ray Tune run works. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-13 22:57:24 +01:00
Sihan Wang	b606169cb5	[Serve] Promote autoscaling feature (#26393 ) 1. get rid of the private attribute 2. fix unit test 3. docs and workflows	2022-07-13 14:38:38 -05:00
Sven Mika	ab10890e90	Revert "Bump pytest from 5.4.3 to 7.0.1" (breaks lots of RLlib tests for unknown reasons) (#26517 )	2022-07-13 11:19:30 -07:00
Antoni Baum	a8fb194c8b	[CI] Fix nightly horovod test (#26447 ) Removes usage of deprecated Train APIs and uses Ray AIR HorovodTrainer instead.	2022-07-13 16:51:50 +01:00
Kai Fricke	e4a4f7de70	[ci/release] Fix fetching logs from staging clusters (#26515 ) Replaces a formerly hard-coded URI to anyscale prod with the respective env variable. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-13 11:15:50 +01:00
Kai Fricke	cf75cf7232	[air] Add AIR distributed training benchmark for Torch FashionMNIST (#26436 ) This PR adds a distributed benchmark test for Pytorch MNIST training. It compares training with Ray AIR with training with vanilla PyTorch. In both cases, the same training loop is used. For Ray AIR, we use a TorchTrainer with 4 CPU workers. For vanilla PyTorch, we upload a training script and kick it off (using Ray tasks) in subprocesses on each node. In both cases, we collect the end to end runtime. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-13 10:53:24 +01:00
Riatre	2cdb76789e	Bump pytest from 5.4.3 to 7.0.1 (#26334 ) See #23676 for context. This is another attempt at that as I figured out what's going wrong in `bazel test`. Supersedes #24828. Now that there are Python 3.10 wheels for Ray 1.13 and this is no longer a blocker for supporting Python 3.10, I still want to make `bazel test //python/ray/tests/...` work for developing in a 3.10 env, and make it easier to add Python 3.10 tests to CI in future. The change contains three commits with rather descriptive commit messages, which I repeat here: Pass deps to py_test in py_test_module_list: Bazel macro py_test_module_list takes a `deps` argument, but completely ignores it instead of passing it to `native.py_test`. Fixing that as we are going to use deps of py_test_module_list in BUILD in later changes. cpp/BUILD.bazel depends on the broken behaviour: it deps-on a cc_library from a py_test, which isn't working, see upstream issue: https://github.com/bazelbuild/bazel/issues/701. This is fixed by simply removing the (non-working) deps. Depend on conftest and data files in Python tests BUILD files: Bazel requires that all the files used in a test run should be represented in the transitive dependencies specified for the test target. For py_test, it means srcs, deps and data. Bazel enforces this constraint by creating a "runfiles" directory, symlinking the files in the dependency closure into it, and running the test in the "runfiles" directory, so that the test shouldn't see files not in the dependency graph. Unfortunately, the constraint does not apply for a large number of Python tests, due to pytest (>=3.9.0, <6.0) resolving these symbolic links during test collection and effectively breaking out of the runfiles tree. pytest >= 6.0 introduced a breaking change and removed the symbolic link resolving behaviour, see pytest pull request https://github.com/pytest-dev/pytest/pull/6523 for more context. 
Currently, we are underspecifying dependencies in a lot of BUILD files, which blocks us from updating to a newer pytest (for Python 3.10 support). This change hopefully fixes all of them, and at least those in CI, by adding data or source dependencies (mostly for conftest.py-s) where needed. Bump pytest version from 5.4.3 to 7.0.1: We want at least pytest 6.2.5 for Python 3.10 support, but not past 7.1.0 since it drops Python 3.6 support (which Ray still supports), thus the version constraint is set to <7.1. Updating pytest, combined with earlier BUILD fixes, changed the ground truth of a few error-message-based unit tests; these tests are updated to reflect the change. There are also two small drive-by changes for making test_traceback and test_cli pass under Python 3.10. These were discovered while debugging CI failures (on earlier Python) with a Python 3.10 install locally. Expect more such issues when adding Python 3.10 to CI.	2022-07-12 21:14:35 -07:00
Kai Fricke	753f5feaf4	[tune] Remove TrialCheckpoint class (#25406 ) The old user-facing TrialCheckpoint class has been deprecated in favor of `ray.ml.Checkpoint` and will be removed with this PR. The main change in this PR is to delete the old `TrialCheckpoint` class and replace remaining API calls (e.g. `checkpoint.local_path`) with the correct AIR equivalents. One issue that comes up is that with Ray client usage, checkpoint directories are not available on the local node (the client). Thus, we can't construct `Checkpoint` objects easily. (Previously, the TrialCheckpoint object held a reference to the location, even if it is not locally available). There are ongoing discussions on how to resolve this in the future. For now, we print an error when such a checkpoint is requested. Depends on #25805 Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-11 20:08:10 +01:00
Jian Xiao	923209895d	Pipelined training test: change num of windows; log the ingestion perf (#26429 ) Why are these changes needed? Improve test perf and log the perf stats. With 2 windows there is a lot of spilling, which slows down the throughput.	2022-07-11 11:03:35 -07:00
Jiajun Yao	743e2f403a	Set RAY_USAGE_STATS_EXTRA_TAGS for release tests (#26366 ) - Record the test name for the usage stats. - Change the cluster name to indicate if it's smoke test or not.	2022-07-07 21:17:34 -07:00
Antoni Baum	ea94cda1f3	[AIR] Replace `train.` with `session.` (#26303 ) This PR replaces legacy API calls to `train.` with AIR `session.` in Train code, examples and docs. Depends on https://github.com/ray-project/ray/pull/25735	2022-07-07 16:29:04 -07:00
Antoni Baum	b9a4f64f32	[AIR/train] Use new Train API (#25735 ) Uses the new AIR Train API for examples and tests. The `Result` object gets a new attribute - `log_dir`, pointing to the Trial's `logdir` allowing users to access tensorboard logs and artifacts of other loggers. This PR only deals with "low hanging fruit" - tests that need substantial rewriting or Train user guide are not touched. Those will be updated in followup PRs. Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled. Requires https://github.com/ray-project/ray/pull/25943 to be merged in first	2022-07-07 12:28:37 -07:00
xwjiang2010	40f9561f78	[ml/release] fix ptl ml user test. (#26365 ) Between versions 1 and 2 of [this](https://console.anyscale-staging.com/o/anyscale-internal/configurations/app-config-versions/apt_TsCpJCRjMJDpNFhNgJmyCniS) cluster_env, 1 fails and 2 succeeds. btw, we really should start to think about a systematic approach towards our python dependency story. - between client and server - but more importantly server side, and any conflicts among requirements - how pip freeze results are evolving over time	2022-07-07 11:45:46 -07:00
Stephanie Wang	dcc913073f	[testing] Run 100TB shuffle test nightly (#26306 ) Run this test nightly to collect more datapoints on stability and performance of 100TB shuffle.	2022-07-07 09:59:54 -07:00
Amog Kamsetty	6f683c8d1c	[Release] Use nightly base images for release tests (#25373 ) Revert back to using nightly base images instead of pinning to 1.12.1. Pinning the docker image had led to uncaught errors in the past. Instead, we should be using nightly to make sure release tests will work on the most up to date versions of docker/cluster envs. If there are any test failures, the underlying issues should be fixed rather than pinning the docker image. Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-07-05 10:58:53 -07:00
Stephanie Wang	a90e53b76f	[core] Add weekly test for 100TB random shuffle (#25908 ) Adds a CI test for 100TB shuffle. There is a custom config for this nightly test to: (1) make sure each node gets 4TB of storage, (2) head node has 0 CPUs, (3) worker nodes have half their actual vCPU count. Related issue number Closes #24480.	2022-07-01 13:30:07 -07:00
Alex Wu	76c5122357	[ci/release] Fix wait_cluster (#26236 ) Fixes a bug in wait_cluster where we count the total number of nodes ever in the cluster rather than the alive nodes. This has caused infra/autoscaler failures (e.g. #26138) to be mislabeled as test failures (and probably messes with timing too). Co-authored-by: Alex Wu <alex@anyscale.com>	2022-06-30 16:37:32 -07:00
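
Note on 76c5122357 (the wait_cluster fix in the last row above): the message distinguishes nodes that were ever in the cluster from nodes that are alive now. The sketch below only illustrates that distinction and is not the release tooling's code. The helper name `count_alive_nodes` and the sample records are hypothetical, and the record shape (dicts with an `Alive` flag) is an assumption loosely modeled on node listings.

```python
from typing import Dict, List


def count_alive_nodes(nodes: List[Dict]) -> int:
    """Count only the node records currently marked alive."""
    # A record without an "Alive" key is treated as not alive.
    return sum(1 for node in nodes if node.get("Alive", False))


# Hypothetical cluster history: node "b" joined earlier and has since died.
nodes = [
    {"NodeID": "a", "Alive": True},
    {"NodeID": "b", "Alive": False},
    {"NodeID": "c", "Alive": True},
]

total_ever = len(nodes)               # what the buggy check effectively counted
alive_now = count_alive_nodes(nodes)  # what a readiness check should compare against

print(f"ever seen: {total_ever}, alive now: {alive_now}")
```

With these records, a check waiting for three nodes would pass on `total_ever` while only two nodes are usable, which is the kind of mislabeling the commit message describes.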

714 commits