724 commits

Author SHA1 Message Date
Malinda
1d789aee63
[RLlib/Serve/Release tests] Few code refactoring for better use of efficient NumPy functions. (#26284) 2022-07-27 22:38:35 +02:00
Simon Mo
e5a8b1dd55
[Serve] Add API Annotations And Move to _private (#27058) 2022-07-27 09:08:26 -07:00
SangBin Cho
a6fe2c1e87
[Release test] Add a memory monitor to nightly test long running actor death (#27083)
Add a memory monitor to the nightly test for long-running actor death. It will be used to detect memory leaks in the test.
2022-07-27 07:32:10 -07:00
Amog Kamsetty
862d10c162
[AIR] Remove ML code from ray.util (#27005)
Removes all ML related code from `ray.util`

Removes:
- `ray.util.xgboost`
- `ray.util.lightgbm`
- `ray.util.horovod`
- `ray.util.ray_lightning`

Moves `ray.util.ml_utils` to other locations

Closes #23900

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-07-27 14:24:19 +01:00
xwjiang2010
4c30325172
[air] update xgboost test (catch test failures properly). (#27023)
- Update xgboost test (catch test failures properly)
- Remove `path` from `from_model` for XGBoostCheckpoint and LightGbmCheckpoint.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
2022-07-27 12:18:51 +01:00
Kai Fricke
ce5c5d858b
[ci/release/RLlib] Fix IMPALA long running release test. (#27086) 2022-07-27 12:38:32 +02:00
Avnish Narayan
f5a9a44b9c
[RLlib] Revert Revert Fix apex long running test (#26928) 2022-07-26 15:10:25 -07:00
Balaji Veeramani
89f7f2a567
[Datasets] Add size parameter to ImageFolderDatasource (#26975)
If you read a folder with differently-sized images, `ImageFolderDatasource` errors. This PR fixes the issue by resizing images to a user-specified size.
2022-07-26 14:57:38 -07:00
matthewdeng
1bb7651e95
[air] add smoke-test flag to tensorflow_benchmark (#26999)
Increase ratio from 1.15 to 1.2

Signed-off-by: Matthew Deng <matt@anyscale.com>
2022-07-26 15:47:37 +01:00
Sihan Wang
8ecd928c34
[Serve] Make the checkpoint and recover only from GCS (#26753) 2022-07-25 14:24:53 -07:00
Chen Shen
acbab51d3e
[Nightly] fix microbenchmark scripts (#26947)
Signed-off-by: scv119 <scv119@gmail.com>

Why are these changes needed?
The microbenchmarks failed with:

    raise ValueError(f"Malformed address: {address}")
    ValueError: Malformed address:

This is caused by 55a0f7b; the fix is to set RAY_ADDRESS="local".
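
As a rough illustration of the workaround described above, the sketch below builds an environment mapping for a benchmark subprocess with `RAY_ADDRESS` set to `"local"`. The helper name `benchmark_env` is hypothetical and does not come from the Ray repository; this log does not show how the release scripts actually set the variable.

```python
import os


def benchmark_env(base_env=None):
    """Copy an environment mapping and force RAY_ADDRESS to "local".

    Hypothetical helper, for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Overrides whatever value was there before, including an empty string.
    env["RAY_ADDRESS"] = "local"
    return env
```

The returned mapping could be passed as `env=` to `subprocess.run` when launching a benchmark script, leaving the parent process environment unchanged.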
2022-07-24 14:16:43 -07:00
Avnish Narayan
a50a81a13a
Revert "[RLlib] Fix apex breakout release test performance. (#26867)" (#26927) 2022-07-23 17:27:50 +02:00
Avnish Narayan
2cfd6c2e97
[RLlib] Fix apex breakout release test performance. (#26867) 2022-07-23 13:53:03 +02:00
Richard Liaw
96e8027c7e
[air] large tune/torch benchmark (#26763)
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-07-23 01:17:25 -07:00
Jiao
840b0478aa
[AIR CUJ] Add wait_for_nodes for 4x4 gpu test 2022-07-22 16:04:54 -07:00
Steven Morad
259429bdc3
Bump gym dep to 0.24 (#26190)
Co-authored-by: Steven Morad <smorad@anyscale.com>
Co-authored-by: Avnish <avnishnarayan@gmail.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
2022-07-22 12:37:16 -07:00
Avnish Narayan
82395c4646
[RLlib] Put learning test into own folders (#26862)
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
2022-07-22 11:20:47 -07:00
Avnish Narayan
67c0a69643
[Rllib] Fix broken cluster env launcher gym pinning (#26865) 2022-07-21 20:45:16 -07:00
matthewdeng
14e2b2548c
[air] update remaining dict scaling_configs (#26856) 2022-07-21 18:55:21 -07:00
Balaji Veeramani
ac1d21027d
[AIR] Add framework-specific checkpoints (#26777) 2022-07-20 19:33:27 -07:00
Archit Kulkarni
e043f49957
[Serve] [CI] Increase instance size and add debug log for autoscaling_multi_deployment release test (#26732) 2022-07-20 16:13:36 -07:00
Kai Fricke
2e35d47bd2
[air/train/benchmark] Add TF GPU 4x4 benchmark (#26776) 2022-07-20 14:07:51 -07:00
Avnish Narayan
5433c11650
[RLlib] Pin gym to 0.23.1 (#26752) 2022-07-20 11:49:01 -07:00
matthewdeng
2a425b195c
[air] change default strategy to PACK (#26757) 2022-07-19 23:01:24 -07:00
Jiao
e7ab969f61
[P0][Release Blocker Fix] Larger headnode for tune_scalability_network_overhead weekly test (#26742) 2022-07-19 16:40:25 -07:00
Jiajun Yao
2603aea4c9
[CI] Chaos tests for dataset random shuffle 1tb (#26738)
- Add chaos tests for dataset random shuffle 1tb: both simple shuffle and push-based shuffle
- Mark dataset_shuffle_push_based_random_shuffle_1tb as stable
2022-07-19 15:16:51 -07:00
xwjiang2010
75027eb479
[air/benchmarks] train/tune benchmark (#26564)
Making sure that tuning multiple trials in parallel is not significantly slower than training each individual trial.
Some overhead is expected.

Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Kai Fricke <kai@anyscale.com>

Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-07-19 18:24:39 +01:00
Richard Liaw
7e62e1187c
[air/benchmark] Torch benchmarks for 4x4 (#26692)
Add benchmark data for 4x4 GPU setup.

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-07-19 17:06:37 +01:00
Riatre
591cd22be7
Revert "Revert "Bump pytest from 5.4.3 to 7.0.1"" (#26525)
* Revert "Revert "Bump pytest from 5.4.3 to 7.0.1""

This reverts commit ab10890e90.

Signed-off-by: Riatre Foo <foo@riat.re>

* Fix missing test data files dependency in rllib/BUILD

See #26334 and #26517 for context.

Once this is in, it should be good to roll forward again.

Signed-off-by: Riatre Foo <foo@riat.re>

* debug: run all tests

Signed-off-by: Riatre Foo <foo@riat.re>

* Revert "debug: run all tests"

This reverts commit 0c5e796b0eb437d64922f66749c61b0412486970.

Signed-off-by: Riatre Foo <foo@riat.re>

* fix new tests since last rebase

Signed-off-by: Riatre Foo <foo@riat.re>
2022-07-18 21:21:19 -07:00
Sumanth Ratna
759966781f
[air] Allow users to use instances of ScalingConfig (#25712)
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-07-18 15:46:58 -07:00
Jiajun Yao
40a4777bc0
Mark chaos_dataset_shuffle_push_based_sort_1tb and chaos_dataset_shuffle_sort_1tb stable (#26677)
They passed for the past 7 runs.
2022-07-18 14:34:08 -07:00
Kai Fricke
00947fd949
[air/benchmarks] Add 4x1 GPU benchmark for Torch (#26562) 2022-07-18 12:14:10 -07:00
matthewdeng
6670708010
[air] add placement group max CPU to data benchmark (#26649)
Set experimental `_max_cpu_fraction_per_node` to prevent deadlock.

This should technically be a no-op with the SPREAD strategy.
2022-07-18 10:34:40 -07:00
Jiao
98a07920d3
[AIR][CUJ] Make distributing training benchmark at silver tier (#26640) 2022-07-17 22:07:09 -07:00
Jiao
77e2ef2eb6
[AIR] Update Torch benchmarks with documentation (#26631)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-07-16 17:58:21 -07:00
Eric Liang
0855bcb77e
[air] Use SPREAD strategy by default and don't special case it in benchmarks (#26633) 2022-07-16 17:37:06 -07:00
Jiao
196e52ad7c
[AIR][CUJ] E2E Pytorch training (#26621) 2022-07-16 08:23:19 -07:00
Jiao
988ffd494b
[AIR][CUJ] Add GPU bench prediction benchmark (#26614) 2022-07-16 08:22:37 -07:00
matthewdeng
e3a096f412
[air] add bulk ingest benchmarks (#26618) 2022-07-15 22:01:23 -07:00
Richard Liaw
5ad4e75831
[air] Add initial benchmark section (#26608) 2022-07-15 15:33:48 -07:00
xwjiang2010
a241e6a0f5
[air] Add xgboost release test for silver tier(10-node case). (#26460)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-07-15 13:21:10 -07:00
Artur Niederfahrenhorst
4ce9686d94
[RLlib] Fixes MARWIL release tests (#26586) 2022-07-15 11:13:15 -07:00
Kai Fricke
213a96e239
[air/benchmarks] Add distributed Tensorflow benchmarks (CPU only) (#26519)
Following up from #26436, this PR adds a distributed benchmark test for TensorFlow FashionMNIST training. It compares training with Ray AIR against training with vanilla TensorFlow.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-14 22:08:43 +01:00
Kai Fricke
cd95569b01
[tune/release] Add up/down scaling release test (#25392)
This adds a nightly release test that asserts that autoscaling a cluster up and down in a Ray Tune run works.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-13 22:57:24 +01:00
Sihan Wang
b606169cb5
[Serve] Promote autoscaling feature (#26393)
1. get rid of the private attribute
2. fix unit test
3. docs and workflows
2022-07-13 14:38:38 -05:00
Sven Mika
ab10890e90
Revert "Bump pytest from 5.4.3 to 7.0.1" (breaks lots of RLlib tests for unknown reasons) (#26517) 2022-07-13 11:19:30 -07:00
Antoni Baum
a8fb194c8b
[CI] Fix nightly horovod test (#26447)
Removes usage of deprecated Train APIs and uses Ray AIR HorovodTrainer instead.
2022-07-13 16:51:50 +01:00
Kai Fricke
e4a4f7de70
[ci/release] Fix fetching logs from staging clusters (#26515)
Replaces a formerly hard-coded URI to anyscale prod with the respective env variable.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-13 11:15:50 +01:00
Kai Fricke
cf75cf7232
[air] Add AIR distributed training benchmark for Torch FashionMNIST (#26436)
This PR adds a distributed benchmark test for PyTorch FashionMNIST training. It compares training with Ray AIR against training with vanilla PyTorch.

In both cases, the same training loop is used. For Ray AIR, we use a TorchTrainer with 4 CPU workers. For vanilla PyTorch, we upload a training script and kick it off (using Ray tasks) in subprocesses on each node. In both cases, we collect the end to end runtime.
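
The end-to-end timing comparison described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: `time_end_to_end`, `compare_runtimes`, and the two callables are hypothetical names standing in for the Ray AIR run and the vanilla run.

```python
import time


def time_end_to_end(run_fn, *args, **kwargs):
    """Run run_fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = run_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def compare_runtimes(air_run, vanilla_run):
    """Time both variants once and report the AIR/vanilla runtime ratio."""
    _, air_s = time_end_to_end(air_run)
    _, vanilla_s = time_end_to_end(vanilla_run)
    # Guard against a zero-length measurement on coarse clocks.
    ratio = air_s / vanilla_s if vanilla_s > 0 else float("inf")
    return {"air_s": air_s, "vanilla_s": vanilla_s, "ratio": ratio}
```

A release test built on this idea would typically assert that the ratio stays below some agreed threshold; the threshold itself is a policy choice and is not implied by this sketch.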

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-13 10:53:24 +01:00
Riatre
2cdb76789e
Bump pytest from 5.4.3 to 7.0.1 (#26334)
See #23676 for context. This is another attempt at that as I figured out what's going wrong in `bazel test`. Supersedes #24828.

Now that there are Python 3.10 wheels for Ray 1.13 and this is no longer a blocker for supporting Python 3.10, I still want to make `bazel test //python/ray/tests/...` work for developing in a 3.10 env, and make it easier to add Python 3.10 tests to CI in future.

The change contains three commits with rather descriptive commit messages, which I repeat here:

Pass deps to py_test in py_test_module_list

    Bazel macro py_test_module_list takes a `deps` argument, but completely
    ignores it instead of passing it to `native.py_test`. Fixing that as we
    are going to use deps of py_test_module_list in BUILD in later changes.

    cpp/BUILD.bazel depends on the broken behaviour: it deps-on a cc_library
    from a py_test, which isn't working, see upstream issue:
    https://github.com/bazelbuild/bazel/issues/701.
    This is fixed by simply removing the (non-working) deps.
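
The effect of the dropped `deps` argument can be modelled in plain Python. This is only a sketch of the behaviour described above, not the Starlark macro from the Ray repository: both macro functions and `record_targets` are hypothetical, and a recording callable stands in for `native.py_test`.

```python
import os


def py_test_module_list_buggy(files, deps, py_test):
    """Model of the old behaviour: `deps` is accepted but never forwarded."""
    for path in files:
        py_test(name=os.path.splitext(path)[0], srcs=[path])


def py_test_module_list_fixed(files, deps, py_test):
    """Model of the fix: every generated test target receives `deps`."""
    for path in files:
        py_test(name=os.path.splitext(path)[0], srcs=[path], deps=list(deps))


def record_targets(macro, files, deps):
    """Run a macro with a recording stand-in for `native.py_test`."""
    targets = []
    macro(files, deps, lambda **kwargs: targets.append(kwargs))
    return targets
```

Comparing the recorded targets of the two variants shows that only the fixed one carries the declared dependencies into each generated test.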

Depend on conftest and data files in Python tests BUILD files

    Bazel requires that all the files used in a test run should be
    represented in the transitive dependencies specified for the test
    target. For py_test, it means srcs, deps and data.

    Bazel enforces this constraint by creating a "runfiles" directory,
    symlinking the files in the dependency closure into it, and running the
    test in the "runfiles" directory, so that the test should not see files
    that are not in the dependency graph.

    Unfortunately, the constraint does not apply to a large number of
    Python tests, because pytest (>=3.9.0, <6.0) resolves these symbolic
    links during test collection and effectively "breaks out" of the
    runfiles tree.

    pytest >= 6.0 introduces a breaking change and removed the symbolic link
    resolving behaviour, see pytest pull request
    https://github.com/pytest-dev/pytest/pull/6523 for more context.
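
To make the "breaking out" effect concrete, the sketch below builds a tiny runfiles-style tree in a temporary directory and checks where a symlinked test file resolves. The directory and file names are made up for illustration, and the code assumes a platform where `os.symlink` is permitted.

```python
import os
import tempfile


def resolves_outside_runfiles():
    """Return True if resolving a runfiles symlink leaves the runfiles tree."""
    with tempfile.TemporaryDirectory() as tmp:
        # Resolve the base first, since the temp dir itself may be a symlink.
        base = os.path.realpath(tmp)
        source_dir = os.path.join(base, "workspace")
        runfiles_dir = os.path.join(base, "runfiles")
        os.makedirs(source_dir)
        os.makedirs(runfiles_dir)

        real_test = os.path.join(source_dir, "test_example.py")
        with open(real_test, "w") as f:
            f.write("def test_ok():\n    assert True\n")

        # The runfiles tree only contains a symlink to the declared file.
        linked_test = os.path.join(runfiles_dir, "test_example.py")
        os.symlink(real_test, linked_test)

        # Resolving the link yields a path in the source tree, where
        # undeclared neighbours such as conftest.py would also be visible.
        resolved = os.path.realpath(linked_test)
        return not resolved.startswith(runfiles_dir + os.sep)
```

A tool that works with the unresolved path stays inside the runfiles tree, which is why undeclared files stop being found once symlinks are no longer resolved.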

    Currently, we are underspecifying dependencies in a lot of BUILD files,
    which blocks us from updating to a newer pytest (for Python 3.10
    support). This change hopefully fixes all of them, and at least those in
    CI, by adding data or source dependencies (mostly for conftest.py-s)
    where needed.

Bump pytest version from 5.4.3 to 7.0.1

    We want at least pytest 6.2.5 for Python 3.10 support, but not past
    7.1.0 since it drops Python 3.6 support (which Ray still supports), thus
    the version constraint is set to <7.1.
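
The version window described above (at least 6.2.5, below 7.1) can be checked with a small helper. This is a sketch under the assumption that versions are plain dotted integers; the function names are hypothetical, and pre-release suffixes such as `rc1` are not handled.

```python
def parse_version(text):
    """Turn a plain dotted version such as "7.0.1" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def pytest_version_allowed(text):
    """Check the range described above: at least 6.2.5 and below 7.1."""
    version = parse_version(text)
    # Tuple comparison is lexicographic, so (7, 1, 0) is not below (7, 1).
    return (6, 2, 5) <= version < (7, 1)
```

For real dependency handling, a dedicated version-parsing library is the safer choice; the tuple comparison here only illustrates the bounds.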

    Updating pytest, combined with earlier BUILD fixes, changed the ground
    truth of a few error-message-based unit tests; these tests are updated
    to reflect the change.

    There are also two small drive-by changes for making test_traceback and
    test_cli pass under Python 3.10. These were discovered while debugging CI
    failures (on earlier Python) with a Python 3.10 install locally. Expect
    more such issues when adding Python 3.10 to CI.
2022-07-12 21:14:35 -07:00