Commit graph

83 commits

Author SHA1 Message Date
Richard Liaw
7e62e1187c
[air/benchmark] Torch benchmarks for 4x4 (#26692)
Add benchmark data for 4x4 GPU setup.

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-07-19 17:06:37 +01:00
Jiajun Yao
40a4777bc0
Mark chaos_dataset_shuffle_push_based_sort_1tb and chaos_dataset_shuffle_sort_1tb stable (#26677)
They passed for the past 7 runs.
2022-07-18 14:34:08 -07:00
Kai Fricke
00947fd949
[air/benchmarks] Add 4x1 GPU benchmark for Torch (#26562) 2022-07-18 12:14:10 -07:00
Jiao
98a07920d3
[AIR][CUJ] Make distributing training benchmark at silver tier (#26640) 2022-07-17 22:07:09 -07:00
Eric Liang
0855bcb77e
[air] Use SPREAD strategy by default and don't special case it in benchmarks (#26633) 2022-07-16 17:37:06 -07:00
Jiao
196e52ad7c
[AIR][CUJ] E2E Pytorch training (#26621) 2022-07-16 08:23:19 -07:00
Jiao
988ffd494b
[AIR][CUJ] Add GPU bench prediction benchmark (#26614) 2022-07-16 08:22:37 -07:00
matthewdeng
e3a096f412
[air] add bulk ingest benchmarks (#26618) 2022-07-15 22:01:23 -07:00
xwjiang2010
a241e6a0f5
[air] Add xgboost release test for silver tier(10-node case). (#26460)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-07-15 13:21:10 -07:00
Kai Fricke
213a96e239
[air/benchmarks] Add distributed Tensorflow benchmarks (CPU only) (#26519)
Following up from #26436, this PR adds a distributed benchmark test for Tensorflow FashionMNIST training. It compares training with Ray AIR with training with vanilla PyTorch.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-14 22:08:43 +01:00
Kai Fricke
cd95569b01
[tune/release] Add up/down scaling release test (#25392)
This adds a nightly release test that asserts that autoscaling a cluster up and down in a Ray Tune run works.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-13 22:57:24 +01:00
Kai Fricke
cf75cf7232
[air] Add AIR distributed training benchmark for Torch FashionMNIST (#26436)
This PR adds a distributed benchmark test for Pytorch MNIST training. It compares training with Ray AIR with training with vanilla PyTorch.

In both cases, the same training loop is used. For Ray AIR, we use a TorchTrainer with 4 CPU workers. For vanilla PyTorch, we upload a training script and kick it off (using Ray tasks) in subprocesses on each node. In both cases, we collect the end to end runtime.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-13 10:53:24 +01:00
Jian Xiao
923209895d
Pipelined training test: change num of windows; log the ingestion perf (#26429)
Why are these changes needed?
Improve test perf
Log the perf stats
With 2 windows there are a lot of spilling, slowing down the throughput.
2022-07-11 11:03:35 -07:00
Stephanie Wang
dcc913073f
[testing] Run 100TB shuffle test nightly (#26306)
Run this test nightly to collect more datapoints on stability and performance of 100TB shuffle.
2022-07-07 09:59:54 -07:00
Stephanie Wang
a90e53b76f
[core] Add weekly test for 100TB random shuffle (#25908)
Adds a CI test for 100TB shuffle.

There is a custom config for this nightly test to: (1) make sure each node gets 4TB of storage, (2) head node has 0 CPUs, (3) worker nodes have half their actual vCPU count.

Related issue number

Closes #24480.
2022-07-01 13:30:07 -07:00
Kai Fricke
e2d8e7a6ae
[ci/release/ml] Run ML release tests on staging (#26168)
This moves all ML release tests to staging.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-06-30 13:24:28 -07:00
Kai Fricke
f9e787115f
[ci/release/core] Run many_nodes test on staging (#26164)
This moves many_nodes to anyscale staging.
2022-06-29 11:07:32 -07:00
Kai Fricke
7091a32fe1
[ci/release] Support running tests on staging (#25889)
This adds "environments" to the release package that can be used to configure some environment variables. These variables will be loaded either by an `--env` argument or a `env` definition in the test definition and can be used to e.g. run release tests on staging.
2022-06-28 10:14:01 -07:00
Jun Gong
c026374acb
[RLlib] Fix the 2 failing RLlib release tests. (#25603) 2022-06-14 14:51:08 +02:00
Kai Fricke
c3b608f757
[tune] Fix cloud tests, mark as stable (#25583)
#25063 broke release tests, but they've been consistently stable before. This PR fixes the tests and marks tune cloud tests as stable.
2022-06-08 17:47:54 +01:00
Stephanie Wang
f7692e4602
[core] Remove more expensive shuffle tests (#25165)
Now that the "smaller_instances" versions of these tests are stable, we can stop running the version that uses bigger instances.
2022-05-24 18:05:18 -07:00
Jiajun Yao
00cdd8dce5
Add chaos test for dataset shuffle (#25161)
Add chaos tests for dataset shuffle: both push-based and non-push-based.
2022-05-24 15:12:20 -07:00
Jiajun Yao
b825a839f9
Mark dataset_shuffle_push_based_sort_1tb as stable (#25162)
dataset_shuffle_push_based_sort_1tb is consistently passing for weeks.
2022-05-24 11:07:27 -07:00
mwtian
36d6d59169
[Datasets] mark nightly test dataset_shuffle_random_shuffle_1tb_small_instances stable #24861 2022-05-17 09:56:05 -07:00
Kai Fricke
6c5229295e
[ci/release] Support running tests with different python versions (#24843)
OSS release tests currently run with hardcoded Python 3.7 base. In the future we will want to run tests on different python versions. 
This PR adds support for a new `python` field in the test configuration. The python field will determine both the base image used in the Buildkite runner docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments. 

Note that in Buildkite, we will still only wait for the python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait maybe for 5-10 more minutes.
2022-05-17 17:03:12 +01:00
Sven Mika
0cd7bc4054
[RLlib] Re-establish dashboard performance tests. (#24728) 2022-05-16 13:13:49 +02:00
Kai Fricke
de69b0d6d6
[train/release] Fix horovod user test master app config (#24734) 2022-05-14 21:20:45 -07:00
Amog Kamsetty
a36e2a8f51
[Tune] Deprecate DistributedTrainableCreator (#24453)
Fully deprecate DistributedTrainableCreator for Ray 2.0

Closes #24453
2022-05-10 11:06:43 -07:00
mwtian
918d3601c6
[Datasets] mark nightly test dataset_shuffle_sort_1tb_small_instances stable (#24481) 2022-05-06 15:55:59 -07:00
SangBin Cho
295b4436b3
[Nightly tests] Increase wait for nodes timeout (#24457)
Although there's enough quota, it is possible the AWS doesn't have enough capacity to start up new nodes. According to @allenyin55, the current wait for node timeout is too short. This PR increases the timeout to 3000 seconds (50 minutes) from 600 seconds. Let's see if this can resolve the issue. If it makes things worse, I will revert it quickly (I will closely monitor the infra failure rate)
2022-05-04 19:42:21 -07:00
Sihan Wang
3f5da8af7a
[Serve] Add serve handle graph workload nightly tests (#24435) 2022-05-04 09:07:50 -07:00
Stephanie Wang
fbbc9c33d6
Add nightly tests for push-based shuffle (#24352)
Adds 1TB tests for push-based random shuffle and sort. Initially marked unstable.
2022-05-02 11:35:14 -07:00
Jiao
ba7cc1803a
[Deployment Graph] Add release test for long chain & wide fanout pattern (#24246) 2022-04-29 17:03:33 -07:00
SangBin Cho
46cd7f1830
Make large multi tests to nightly + remove k8s tests (#24302)
As discussed, to reduce backlog for large tests, we will (1) remove k8s tests (2) make large multi daily tests to nightly tests
2022-04-29 03:40:12 -07:00
mwtian
afdfd20a5b
[Release tests] Create compute config for new dataset shuffle tests (#24239)
Use a separate compute config that uses smaller instance types and no object store memory limit for the new shuffle implementation. I verified that the config works on master for dataset_shuffle_* tests.

Related issue number

#24176: the added tests would verify the instance types which support the new shuffle implementations.
2022-04-27 11:50:12 -07:00
Amog Kamsetty
ae9c68e75f
[Train] Fully deprecate Ray SGD v1 (#24038)
Ray SGD v1 has been denoted as a deprecated API for a while. This PR fully deprecates Ray SGD v1. An error will be raised if ray.util.sgd package is attempted to be imported.

Closes #16435
2022-04-25 16:12:57 -07:00
Stephanie Wang
1de9f3457e
[nightly tests] Mark Datasets shuffle tests stable (#24175)
dataset_shuffle_random_shuffle_1tb was previously failing due to OOM but has now passed on the last 4 runs due to changing the node type. These tests should be stable now, although we will want to look into the OOM issue later.
2022-04-25 09:01:37 -07:00
Stephanie Wang
71e142b1fa
[core][tests] Add nightly test for datasets random_shuffle and sort (#23807)
Copied from #23784.

Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory.

Modified to fix lint.
2022-04-12 12:53:57 -07:00
Eric Liang
1ff874e8e8
[spelling] Add linter rule for mis-capitalizations of RLLib -> RLlib (#23817) 2022-04-10 16:12:53 -07:00
Archit Kulkarni
7a1a7e1844
Revert "[core][tests] Add nightly test for datasets random_shuffle and sort (#23784)" (#23805)
This reverts commit ba484feac0.

Broke lint.
2022-04-08 13:18:13 -07:00
Stephanie Wang
ba484feac0
[core][tests] Add nightly test for datasets random_shuffle and sort (#23784)
Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory.
2022-04-08 11:31:10 -07:00
Kai Fricke
0b804e5162
[ci/release] Move ML long running tests to sdk file manager (#23745)
What: Long running tests should use sdk file manager
Why: Job submission server seems to crash under load, using the sdk file manager ensures we can still fetch results after a run.
2022-04-06 10:50:49 -07:00
Archit Kulkarni
582bf4e8f8
Add basic jobs release test with Tune script (#23474)
Adds basic jobs release tests that connects to the test cluster and runs a basic tune script.  Specifies `ray[tune]` in the `runtime_env` `pip` dependencies.  Two tests:

(1) Uses a local `working_dir`
(2) Uses a remote working_dir from a zip github URL.
2022-04-05 13:31:11 -05:00
Chen Shen
3e80da7e9f
[ci/release] long running / change failed test to sdk (#23602)
close #23592. Talking with @krfricke and he suggested we move to use sdk for those long running tasks.
2022-03-30 12:57:21 -07:00
Kai Fricke
e8abffb017
[tune/release] Improve Tune cloud release tests for durable storage (#23277)
This PR addresses recent failures in the tune cloud tests.

In particular, this PR changes the following:

    The trial runner will now wait for potential previous syncs to finish before syncing once more if force=True is supplied. This is to make sure that the final experiment checkpoints exist in the most recent version on remote storage. This likely fixes some flakiness in the tests.
    We switched to new cloud buckets that don't interfere with other tests (and are less likely to be garbage collected)
    We're now using dated subdirectories in the cloud buckets so that we don't interfere if two tests are run in parallel. Objects are cleaned up afterwards. The buckets are configured to remove objects after 30 days.
    Lastly, we fix an issue in the cloud tests where the RELEASE_TEST_OUTPUT file was unavailable when run in Ray client mode (as e.g. in kubernetes).

Local release test runs succeeded.

https://buildkite.com/ray-project/release-tests-branch/builds/189
https://buildkite.com/ray-project/release-tests-branch/builds/191
2022-03-30 09:28:33 -07:00
Kai Fricke
922367d158
[ci/release] Fix smoke test compute templates (#23561)
The smoke test definitions of a few tests were faulty for compute template override.

Core tests @rkooo567: https://buildkite.com/ray-project/release-tests-branch/builds/294
2022-03-29 13:48:09 -07:00
Chen Shen
c3e04ab275
[nighly-test] try out spot instances for chaos test #23507 2022-03-27 20:10:21 -07:00
Stephanie Wang
aa6f773283
Switch long running tests to SDK (#23433)
These tests are flakey on the job-based test submission system. Switching them to the SDK-based test runner for now.
2022-03-23 17:44:26 -07:00
Amog Kamsetty
6d776976c1
[Train] Fix multi node horovod bug (#22564)
Closes #20956
2022-03-22 16:22:53 -07:00
Jiajun Yao
bab19e8e68
Add perf metrics for test_many_tasks.py (#23318)
Add perf metrics for test_many_tasks.py
Use the new smoke test structure
2022-03-22 16:16:42 -07:00