58 commits

Author SHA1 Message Date
Sven Mika
0cd7bc4054
[RLlib] Re-establish dashboard performance tests. (#24728) 2022-05-16 13:13:49 +02:00
Kai Fricke
de69b0d6d6
[train/release] Fix horovod user test master app config (#24734) 2022-05-14 21:20:45 -07:00
Amog Kamsetty
a36e2a8f51
[Tune] Deprecate DistributedTrainableCreator (#24453)
Fully deprecate DistributedTrainableCreator for Ray 2.0

Closes #24453
2022-05-10 11:06:43 -07:00
mwtian
918d3601c6
[Datasets] mark nightly test dataset_shuffle_sort_1tb_small_instances stable (#24481) 2022-05-06 15:55:59 -07:00
SangBin Cho
295b4436b3
[Nightly tests] Increase wait for nodes timeout (#24457)
Although there is enough quota, AWS may not have enough capacity to start new nodes. According to @allenyin55, the current wait-for-nodes timeout is too short. This PR increases the timeout from 600 seconds to 3000 seconds (50 minutes). Let's see if this resolves the issue. If it makes things worse, I will revert it quickly (I will closely monitor the infra failure rate).
2022-05-04 19:42:21 -07:00
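The timeout change described in #24457 above can be pictured as a polling loop with a deadline. The following is only a minimal sketch under stated assumptions: `wait_for_nodes`, `count_nodes`, and the 10-second poll interval are hypothetical names and values chosen for illustration, not the release tooling's actual code. Only the 600-second and 3000-second figures come from the commit message.

```python
import time
from typing import Callable


def wait_for_nodes(
    count_nodes: Callable[[], int],  # hypothetical callback returning the live node count
    expected: int,
    timeout_s: float = 3000.0,       # raised from 600 s per the commit message
    poll_interval_s: float = 10.0,   # assumed value, not taken from the source
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Block until `expected` nodes are up; return the seconds waited."""
    start = clock()
    while True:
        if count_nodes() >= expected:
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError(
                f"only {count_nodes()} of {expected} nodes after {timeout_s:.0f}s"
            )
        sleep(poll_interval_s)
```

Passing `clock` and `sleep` as parameters keeps the loop testable without real waiting; a caller that only needs the default behavior can omit them.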
Sihan Wang
3f5da8af7a
[Serve] Add serve handle graph workload nightly tests (#24435) 2022-05-04 09:07:50 -07:00
Stephanie Wang
fbbc9c33d6
Add nightly tests for push-based shuffle (#24352)
Adds 1TB tests for push-based random shuffle and sort. Initially marked unstable.
2022-05-02 11:35:14 -07:00
Jiao
ba7cc1803a
[Deployment Graph] Add release test for long chain & wide fanout pattern (#24246) 2022-04-29 17:03:33 -07:00
SangBin Cho
46cd7f1830
Make large multi tests to nightly + remove k8s tests (#24302)
As discussed, to reduce the backlog for large tests, we will (1) remove the k8s tests and (2) move the large multi daily tests to nightly.
2022-04-29 03:40:12 -07:00
mwtian
afdfd20a5b
[Release tests] Create compute config for new dataset shuffle tests (#24239)
Use a separate compute config that uses smaller instance types and no object store memory limit for the new shuffle implementation. I verified that the config works on master for dataset_shuffle_* tests.

Related issue: #24176. The added tests would verify the instance types that support the new shuffle implementation.
2022-04-27 11:50:12 -07:00
Amog Kamsetty
ae9c68e75f
[Train] Fully deprecate Ray SGD v1 (#24038)
Ray SGD v1 has been marked as a deprecated API for a while. This PR fully deprecates Ray SGD v1. Importing the ray.util.sgd package now raises an error.

Closes #16435
2022-04-25 16:12:57 -07:00
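The "raise an error on import" behavior described in #24038 above is a common way to retire a package. The sketch below illustrates the general pattern with a throwaway package; it is not the code from the PR. The package name `legacy_sgd_demo`, the helper names, and the choice of `DeprecationWarning` as the exception type are all assumptions made for the example.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from typing import Optional


def write_deprecated_package(root: Path, name: str, message: str) -> None:
    """Create a package whose __init__.py raises as soon as it is imported."""
    pkg = root / name
    pkg.mkdir()
    (pkg / "__init__.py").write_text(f"raise DeprecationWarning({message!r})\n")


def import_error_message(name: str) -> Optional[str]:
    """Return the deprecation message raised on import, or None if import works."""
    try:
        importlib.import_module(name)
    except DeprecationWarning as exc:
        return str(exc)
    return None


# Hypothetical package name and message, used only for this demonstration.
MESSAGE = "legacy_sgd_demo has been removed; use the replacement API"

with tempfile.TemporaryDirectory() as tmp:
    write_deprecated_package(Path(tmp), "legacy_sgd_demo", MESSAGE)
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        result = import_error_message("legacy_sgd_demo")
    finally:
        sys.path.remove(tmp)
```

Raising at import time makes the removal impossible to miss, at the cost of breaking any caller that still imports the old path.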
Stephanie Wang
1de9f3457e
[nightly tests] Mark Datasets shuffle tests stable (#24175)
dataset_shuffle_random_shuffle_1tb was previously failing due to OOM but has now passed on the last 4 runs due to changing the node type. These tests should be stable now, although we will want to look into the OOM issue later.
2022-04-25 09:01:37 -07:00
Stephanie Wang
71e142b1fa
[core][tests] Add nightly test for datasets random_shuffle and sort (#23807)
Copied from #23784.

Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory.

Modified to fix lint.
2022-04-12 12:53:57 -07:00
Eric Liang
1ff874e8e8
[spelling] Add linter rule for mis-capitalizations of RLLib -> RLlib (#23817) 2022-04-10 16:12:53 -07:00
Archit Kulkarni
7a1a7e1844
Revert "[core][tests] Add nightly test for datasets random_shuffle and sort (#23784)" (#23805)
This reverts commit ba484feac0.

Broke lint.
2022-04-08 13:18:13 -07:00
Stephanie Wang
ba484feac0
[core][tests] Add nightly test for datasets random_shuffle and sort (#23784)
Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory.
2022-04-08 11:31:10 -07:00
Kai Fricke
0b804e5162
[ci/release] Move ML long running tests to sdk file manager (#23745)
What: Long running tests should use the SDK file manager.
Why: The job submission server seems to crash under load; using the SDK file manager ensures we can still fetch results after a run.
2022-04-06 10:50:49 -07:00
Archit Kulkarni
582bf4e8f8
Add basic jobs release test with Tune script (#23474)
Adds basic jobs release tests that connect to the test cluster and run a basic Tune script. Specifies `ray[tune]` in the `runtime_env` `pip` dependencies. Two tests:

(1) Uses a local `working_dir`.
(2) Uses a remote `working_dir` from a GitHub zip URL.
2022-04-05 13:31:11 -05:00
Chen Shen
3e80da7e9f
[ci/release] long running / change failed test to sdk (#23602)
Closes #23592. I talked with @krfricke, and he suggested we switch to the SDK for those long running tasks.
2022-03-30 12:57:21 -07:00
Kai Fricke
e8abffb017
[tune/release] Improve Tune cloud release tests for durable storage (#23277)
This PR addresses recent failures in the tune cloud tests.

In particular, this PR changes the following:

* The trial runner will now wait for potential previous syncs to finish before syncing once more if force=True is supplied. This is to make sure that the final experiment checkpoints exist in the most recent version on remote storage. This likely fixes some flakiness in the tests.
* We switched to new cloud buckets that don't interfere with other tests (and are less likely to be garbage collected).
* We're now using dated subdirectories in the cloud buckets so that we don't interfere if two tests are run in parallel. Objects are cleaned up afterwards. The buckets are configured to remove objects after 30 days.
* Lastly, we fix an issue in the cloud tests where the RELEASE_TEST_OUTPUT file was unavailable when run in Ray client mode (as e.g. in Kubernetes).

Local release test runs succeeded.

https://buildkite.com/ray-project/release-tests-branch/builds/189
https://buildkite.com/ray-project/release-tests-branch/builds/191
2022-03-30 09:28:33 -07:00
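The dated-subdirectory idea from #23277 above can be sketched as a small helper that gives each run its own prefix inside a shared bucket. This is an illustration only: the function name, the bucket URI, the test name, and the random run suffix are hypothetical and are not taken from the PR.

```python
import uuid
from datetime import date
from typing import Optional


def dated_prefix(
    bucket_uri: str,
    test_name: str,
    today: Optional[date] = None,
    run_id: Optional[str] = None,
) -> str:
    """Build a per-run upload prefix such as <bucket>/<YYYY-MM-DD>/<test>-<run_id>."""
    today = today or date.today()
    # A short random suffix keeps two runs on the same day from colliding.
    run_id = run_id or uuid.uuid4().hex[:8]
    return f"{bucket_uri.rstrip('/')}/{today.isoformat()}/{test_name}-{run_id}"


# Hypothetical bucket and test names.
example = dated_prefix("s3://example-bucket/", "tune_cloud", date(2022, 3, 30), "abc123")
```

Grouping objects by date also fits a bucket lifecycle rule like the 30-day expiry mentioned in the commit message, since leftovers from a crashed run age out on their own.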
Kai Fricke
922367d158
[ci/release] Fix smoke test compute templates (#23561)
The smoke test definitions of a few tests were faulty for compute template override.

Core tests @rkooo567: https://buildkite.com/ray-project/release-tests-branch/builds/294
2022-03-29 13:48:09 -07:00
Chen Shen
c3e04ab275
[nightly-test] try out spot instances for chaos test #23507 2022-03-27 20:10:21 -07:00
Stephanie Wang
aa6f773283
Switch long running tests to SDK (#23433)
These tests are flaky on the job-based test submission system. Switching them to the SDK-based test runner for now.
2022-03-23 17:44:26 -07:00
Amog Kamsetty
6d776976c1
[Train] Fix multi node horovod bug (#22564)
Closes #20956
2022-03-22 16:22:53 -07:00
Jiajun Yao
bab19e8e68
Add perf metrics for test_many_tasks.py (#23318)
Add perf metrics for test_many_tasks.py
Use the new smoke test structure
2022-03-22 16:16:42 -07:00
Kai Fricke
e48c407b13
[release] long running many drivers: Use SDK file manager (#23379)
This will make the test pass again: https://buildkite.com/ray-project/release-tests-branch/builds/226#_
2022-03-21 09:56:59 -07:00
Stephanie Wang
5ab634f285
[core] Disable threaded_actors_stress_test (#23292)
* disable

* smoke
2022-03-18 15:57:53 -07:00
Eric Liang
015181ab9a
Add random access support for Datasets (experimental feature) (#22749)
This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset.

RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.

Performance-wise, you can expect each worker to provide ~3000 records/second via `get_async()`, and ~10000 records/second via `multiget()`.

Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.
2022-03-17 15:01:12 -07:00
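The lookup scheme described in #22749 above (partition by sort key, then binary search) can be illustrated without Ray. The sketch below is a single-process stand-in, not the RandomAccessDataset implementation: the class `SortedPartitions` and its methods are hypothetical, there are no actors or zero-copy blocks, and `multiget` here only mirrors the name used in the commit message.

```python
import bisect
from typing import Any, Dict, List, Optional, Sequence


class SortedPartitions:
    """Records sorted by one key and split into contiguous blocks."""

    def __init__(self, records: Sequence[Dict[str, Any]], key: str, num_partitions: int):
        ordered = sorted(records, key=lambda r: r[key])
        # Ceiling division so every record lands in one of the blocks.
        size = max(1, -(-len(ordered) // num_partitions))
        self._blocks = [ordered[i:i + size] for i in range(0, len(ordered), size)]
        self._block_keys = [[r[key] for r in block] for block in self._blocks]
        # The largest key of each block is enough to route a lookup.
        self._upper_bounds = [keys[-1] for keys in self._block_keys]

    def get(self, value: Any) -> Optional[Dict[str, Any]]:
        # First binary search: which block could hold the key?
        b = bisect.bisect_left(self._upper_bounds, value)
        if b == len(self._blocks):
            return None
        # Second binary search: where is the key inside that block?
        keys = self._block_keys[b]
        i = bisect.bisect_left(keys, value)
        if i < len(keys) and keys[i] == value:
            return self._blocks[b][i]
        return None

    def multiget(self, values: Sequence[Any]) -> List[Optional[Dict[str, Any]]]:
        return [self.get(v) for v in values]


records = [{"id": i, "v": i * i} for i in range(10)]
parts = SortedPartitions(records[::-1], key="id", num_partitions=3)
```

In this toy version each block would correspond to the data served by one worker; routing by upper bound is what lets a lookup touch only a single block.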
SangBin Cho
b350fe9ee8
[Nightly test] Fix additional k8s issues + add new tests (#23231)
Fix bug from the previous fixes.
Add more tests
Stop using m5.xlarge (not supported now)
There are 2 hard blockers from the infra: (1) large disks are not supported, and (2) m5.xlarge is not supported. Both are considered high priority and should be fixed soon.
2022-03-16 16:37:29 -07:00
Stephanie Wang
ce71c5bbbd
[core][tests] Mark threaded_actors_stress_test as unstable 2022-03-16 15:31:19 -07:00
Kai Fricke
e3987d85c3
[tune] Mark cloud OSS release tests as unstable (#23240)
These tests have been flaky for a while. Until this is addressed, mark them as unstable.
2022-03-16 17:37:58 +00:00
Kai Fricke
830238cce2
[ci/release] Migrate ML user tests (#22953)
Most recent tests:

https://buildkite.com/ray-project/release-tests-branch/builds/156
https://buildkite.com/ray-project/release-tests-branch/builds/158
2022-03-14 11:50:16 +00:00
Kai Fricke
430ea3e636
[ci/release] Migrate golden notebook tests (#22949)
Migrating golden notebook tests to new release test package.
Tests are passing: https://buildkite.com/ray-project/release-tests-branch/builds/155
2022-03-13 21:39:41 +00:00
Kai Fricke
956ad95d67
[ci/release] Fix release test config (#23122)
Currently the test is failing due to an invalid config (merged before validation was properly enforced).
2022-03-13 19:48:34 +00:00
Kai Fricke
76a939c820
[ci/release] Migrate long running (+distributed) tests (#22955)
Migrating to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/103
Tests pass: https://buildkite.com/ray-project/release-tests-branch/builds/143#_
2022-03-13 18:47:17 +00:00
SangBin Cho
8c1a6f9138
[Nightly Test] Fix a dataset test (#23106)
Fix a broken dataset test (due to incorrect working dir)
2022-03-12 08:16:08 -08:00
SangBin Cho
c0f8de9c3c
[Nightly tests] Run benchmark tests on k8s as well (#23100)
Run benchmark tests on k8s as well.

Note that until k8s cluster stability is confirmed, we will run the same tests twice, once on AWS and once on k8s. Once all benchmark tests look stable, we will start the full migration.
2022-03-11 19:40:37 -08:00
SangBin Cho
97383e4c1b
[Nightly test] Fix a broken nightly test due to the wrong config (#23097) 2022-03-11 16:47:06 -08:00
SangBin Cho
2b38fe89e2
[Nightly tests] Migrate rest of core tests (#23085)
Migrate the rest of the core tests.
2022-03-11 10:41:14 -08:00
Kai Fricke
a8bed94ed6
[ci/release] Always use full cluster address (#23067)
Not using the full cluster address is deprecated and breaks Job usage for uploads/downloads: https://buildkite.com/ray-project/release-tests-branch/builds/135#2a03e47b-6a9a-42ff-9346-905725eb8d09
2022-03-11 16:31:21 +00:00
SangBin Cho
965d609627
[Nightly test] Fix a minor syntax issue for core nightly tests (#23069)
Add frequency to smoke tests
Remove unnecessary alerts
2022-03-11 04:58:40 -08:00
Kai Fricke
5b2d58674b
[ci/release] Migrate horovod tests (#22951)
Migrating horovod tests to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/125
2022-03-11 09:53:29 +00:00
SangBin Cho
ebac18d163
[Nightly test] Support Job based file manager + runner (#22860)
This PR supports the job-based file manager and runner. It will be the backbone of k8s migration.

The PR handles edge cases that originally existed in the old e2e.py job-based runners.
2022-03-10 15:03:50 -08:00
SangBin Cho
92b50ff5da
Migrate multi nightly tests (#23005) 2022-03-11 01:32:10 +09:00
SangBin Cho
4fa294ca49
[Nightly tests] Stop running broken tests (#22993) 2022-03-10 06:59:51 -08:00
SangBin Cho
e88abe4c8e
[Nightly tests] migrated most of daily tests (#22960)
* migrated most of daily tests

* Addressed code review.
2022-03-10 05:49:16 -08:00
Kai Fricke
007cf03d7a
[ci/release] Migrate RLlib tests (#22967)
Migrate to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/111
2022-03-10 10:26:03 +00:00
Kai Fricke
fee4065daf
[ci/release] Migrate SGD tests (#22966)
Migrate to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/110
2022-03-10 10:23:50 +00:00
Kai Fricke
614dc6b511
[ci/release] Migrate Serve tests (#22965)
Migrate to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/109
2022-03-10 10:23:25 +00:00
Kai Fricke
ccda1555cc
[ci/release] Migrate Runtime Env tests (#22963)
Migrating to new release test package.

https://buildkite.com/ray-project/release-tests-branch/builds/108
2022-03-10 10:22:57 +00:00