Commit graph

754 commits

Author SHA1 Message Date
Kai Fricke
40a8183e05
[ci/release] Fix job-based file download (#23657)
have to wrap download call in a lambda to be compatible with run_with_retry
2022-04-04 08:06:31 -07:00
Kai Fricke
9071b39f3e
[ci/release] Add buildkite output groups (#23658)
This makes the buildkite output easier to parse and interpret.
2022-04-01 13:04:22 -07:00
Kai Fricke
fe27dbcd9a
[air/release] Improve file packing/unpacking (#23621)
We use tarfile to pack/unpack directories in several locations. Instead of using temporary files, we can just use io.BytesIO to avoid unnecessary disk writes.

Note that this functionality is present in 3 different modules - in Ray (AIR), in the release test package, and in a specific release test. The implementations should live in the three modules independently, so we don't add a common utility for this (e.g. the ray_release package should be independent of the Ray package).
2022-04-01 07:38:14 -07:00
Avnish Narayan
161d95c31b
[RLlib] Increase slateq workers to decrease runtime on prod (#23609) 2022-03-30 17:38:21 -07:00
Chen Shen
3e80da7e9f
[ci/release] long running / change failed test to sdk (#23602)
close #23592. Talking with @krfricke and he suggested we move to use sdk for those long running tasks.
2022-03-30 12:57:21 -07:00
Jiajun Yao
2959294f02
[CI] Filter release tests by attr regex (#23485)
Support filtering tests by test attr regex filters. Multiple filters can be specified with one line for each filter. The format is attr:regex (e.g. team:serve)
2022-03-30 09:41:18 -07:00
Kai Fricke
e8abffb017
[tune/release] Improve Tune cloud release tests for durable storage (#23277)
This PR addresses recent failures in the tune cloud tests.

In particular, this PR changes the following:

    The trial runner will now wait for potential previous syncs to finish before syncing once more if force=True is supplied. This is to make sure that the final experiment checkpoints exist in the most recent version on remote storage. This likely fixes some flakiness in the tests.
    We switched to new cloud buckets that don't interfere with other tests (and are less likely to be garbage collected)
    We're now using dated subdirectories in the cloud buckets so that we don't interfere if two tests are run in parallel. Objects are cleaned up afterwards. The buckets are configured to remove objects after 30 days.
    Lastly, we fix an issue in the cloud tests where the RELEASE_TEST_OUTPUT file was unavailable when run in Ray client mode (as e.g. in kubernetes).

Local release test runs succeeded.

https://buildkite.com/ray-project/release-tests-branch/builds/189
https://buildkite.com/ray-project/release-tests-branch/builds/191
2022-03-30 09:28:33 -07:00
Kai Fricke
922367d158
[ci/release] Fix smoke test compute templates (#23561)
The smoke test definitions of a few tests were faulty for compute template override.

Core tests @rkooo567: https://buildkite.com/ray-project/release-tests-branch/builds/294
2022-03-29 13:48:09 -07:00
Artur Niederfahrenhorst
9a64bd4e9b
[RLlib] Simple-Q uses training iteration fn (instead of execution_plan); ReplayBuffer API for Simple-Q (#22842) 2022-03-29 14:44:40 +02:00
Yi Cheng
7de751dbab
[1][core][cleanup] remove enable gcs bootstrap in cpp. (#23518)
This PR remove enable_gcs_bootstrap flag in cpp.
2022-03-28 21:37:24 -07:00
Chen Shen
c3e04ab275
[nighly-test] try out spot instances for chaos test #23507 2022-03-27 20:10:21 -07:00
Sven Mika
22c9c4aa39
[RLlib] Slate-Q +GPU torch bug fix. (#23464) 2022-03-24 17:39:33 +01:00
Avnish Narayan
9040f54060
[RLlib] Pin Gym Everywhere and turn off gpu for recsim tests (#23452) 2022-03-24 09:17:30 +01:00
Stephanie Wang
aa6f773283
Switch long running tests to SDK (#23433)
These tests are flakey on the job-based test submission system. Switching them to the SDK-based test runner for now.
2022-03-23 17:44:26 -07:00
Kai Fricke
724377163f
[ci/release] Unstable tests should only soft fail the build (#23403)
This will leave the tests green if the test is failing but marked as unstable.
2022-03-23 09:38:56 +00:00
Amog Kamsetty
6d776976c1
[Train] Fix multi node horovod bug (#22564)
Closes #20956
2022-03-22 16:22:53 -07:00
Jiajun Yao
bab19e8e68
Add perf metrics for test_many_tasks.py (#23318)
Add perf metrics for test_many_tasks.py
Use the new smoke test structure
2022-03-22 16:16:42 -07:00
SangBin Cho
0cd687cc19
[Nightly test] Fix job download retry (#23401)
Currently when we download a file to the cluster using a job, we don't do the retry.
2022-03-22 08:31:24 -07:00
Kai Fricke
02644ab4d8
[ci/release] Retry cluster env build on failure (#23378)
Failed cluster env builds should be retried.
2022-03-22 09:45:22 +00:00
Avnish Narayan
754bcd16f8
[rllib] Pin gym everywhere (#23384)
This PR Pins gym in the app config.yaml's for rllib and tune so that release tests are no longer broken by the new gym version.
2022-03-22 09:44:22 +00:00
Kai Fricke
e48c407b13
[release] long running many drivers: Use SDK file manager (#23379)
This will make the test pass again: https://buildkite.com/ray-project/release-tests-branch/builds/226#_
2022-03-21 09:56:59 -07:00
Kai Fricke
7085749d50
[tune] Adjust release test timeouts (#23362)
Currently release tests fail because they exceed the (rather arbitrary) timeout by 1-2 seconds.
2022-03-20 17:05:20 +00:00
Avnish Narayan
e008a48ef2
[release tests] Pin gym everywhere (#23349) 2022-03-19 02:52:54 -07:00
Dmitri Gekhtman
561e7a9677
[RELEASE] Add autoscaler env to fix nightly tests (#23345)
The product backend doesn't yet understand that nightly Ray uses GCS-Ray. (This will be fixed when the next time the product control plane is deployed.)
This PR introduces the env required to signal to the product backend that we're using GCS-Ray so that the autoscaler can startup correctly.
2022-03-18 17:48:27 -07:00
Archit Kulkarni
db2c37c760
[serve] [release] Disable smoke test by default (#23334) 2022-03-18 18:40:48 -05:00
Stephanie Wang
5ab634f285
[core] Disable threaded_actors_stress_test (#23292)
* disable

* smoke
2022-03-18 15:57:53 -07:00
Kai Fricke
ca5354ffb1
[ci/release] Fix test_wheels (#23329) 2022-03-18 14:39:36 +00:00
Kai Fricke
3cf8116df2
[ci/release] Re-enable commit sanity check (#23327)
Commit sanity checks are currently seemingly disabled. This PR re-enables them by parsing wheel URLs.
2022-03-18 12:57:41 +00:00
Kai Fricke
da140a80e9
[ci/release] Legacy field should be optional (#23326)
#22749 broke release unit tests by not providing a legacy key - that key should be optional because we will b dealing with non-legacy tests soon.
Additionally, for some reason the unit tests pass on buildkite while they fail locally and in the release test pipeline. I'm investigating this now...
2022-03-18 11:34:05 +00:00
Kai Fricke
e510d81c71
[ci/release] Save test config and results as artifacts (#23278)
It is good to have these information readily available when checking test results, as it will reveal both the original configuration (that could change over time) as well as the achieved results.
Also gets rid of the unneeded old alerts directory.

https://buildkite.com/ray-project/release-tests-branch/builds/190#ef531787-412c-40ec-81e6-beb495830c60
2022-03-18 09:26:42 +00:00
Eric Liang
015181ab9a
Add random access support for Datasets (experimental feature) (#22749)
This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset.

RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.

Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``.

Since Ray actor calls go direct from worker->worker, throughput scales linearly with the number of workers.
2022-03-17 15:01:12 -07:00
mwtian
391901f86b
[Remove Redis Pubsub 2/n] clean up remaining Redis references in gcs_utils.py (#23233)
Continue to clean up Redis and other related Redis references, for
- gcs_utils.py
- log_monitor.py
- `publish_error_to_driver()`
2022-03-16 19:34:57 -07:00
SangBin Cho
b350fe9ee8
[Nightly test] Fix additional k8s issues + add new tests (#23231)
Fix bug from the previous fixes.
Add more tests
Stop using m5.xlarge (not supported now)
There are 2 hard blockers from the infra: 1. Large size disk is not supported. 2. m5.xlarge is not supported. Both are considered as a high priority to be fixed soon.
2022-03-16 16:37:29 -07:00
Stephanie Wang
ce71c5bbbd
[core][tests] Mark threaded_actors_stress_test as unstable 2022-03-16 15:31:19 -07:00
Kai Fricke
e3987d85c3
[tune] Mark cloud OSS release tests as unstable (#23240)
These tests have been flaky for a while. Until this is addressed, mark them as unstable.
2022-03-16 17:37:58 +00:00
Kai Fricke
eca5bcfc87
[ci/release] Reload modules after installing matching Ray (#23227)
Apparently, ray gets imported somewhere before running the client runner (maybe from an anyscale package). This means that we need to reload the ray package after installing a matching local ray wheel.
Additionally, job submission should also install a matching local ray to match with the job submission server.
2022-03-16 15:44:43 +00:00
Avnish Narayan
6c20e9d898
[RLlib] Change the slateq regression learning test with GPU to use torch only (#23168) 2022-03-16 09:15:59 +01:00
Kai Fricke
15aeb33e50
[ci/release] Support PR wheels (#23084)
This PR adds support to find wheels for PRs to run OSS release tests on, i.e. --ray-wheels user:branch to work.
2022-03-14 17:24:13 +00:00
Kai Fricke
8608b64885
[ci/release] Remove old OSS release test infrastructure (#23134)
Now that we've migrated all OSS release tests to the new infrastructure, we can remove old config files and infra scripts.
2022-03-14 15:10:52 +00:00
Kai Fricke
d93fa95dd5
[ci/release] Only report results for scheduled builds (#23135)
Currently, all buildkite runs report per default. Instead, we only want to report when running scheduled builds or when specifically overriding this behavior.
2022-03-14 15:10:16 +00:00
Kai Fricke
fce49694fc
[ci/release] Disable infra retries for now (#23132)
Infra errors are tackled with concurrency groups. Thus we can disable old mitigation methods like automatic infra retry for now.
We keep the script as it does other logic (e.g. checkout local test branch) and infra retry can be enabled via env variable if needed.
2022-03-14 11:51:11 +00:00
Kai Fricke
830238cce2
[ci/release] Migrate ML user tests (#22953)
Most recent tests:

https://buildkite.com/ray-project/release-tests-branch/builds/156
https://buildkite.com/ray-project/release-tests-branch/builds/158
2022-03-14 11:50:16 +00:00
SangBin Cho
2c2d96eeb1
[Nightly tests] Improve k8s testing (#23108)
This PR improves broken k8s tests.

Use exponential backoff on the unstable HTTP path (getting job status sometimes has broken connection from the server. I couldn't really find the relevant logs to figure out why this is happening, unfortunately).
Fix benchmark tests resource leak check. The existing one was broken because the job submission uses 0.001 node IP resource, which means the cluster_resources can never be the same as available resources. I fixed the issue by not checking node IP resources
K8s infra doesn't support instances < 8 CPUs. I used m5.2xlarge instead of xlarge. It will increase the cost a bit, but it wouldn't be very big.
2022-03-14 03:49:15 -07:00
Jiajun Yao
e4620669a1
[Release Test] Add perf metrics for core scalability tests (#23110)
* Add perf metrics for core scalability tests

* lint
2022-03-14 10:20:39 +09:00
Kai Fricke
430ea3e636
[ci/release] Migrate golden notebook tests (#22949)
Migrating golden notebook tests to new release test package.
Tests are passing: https://buildkite.com/ray-project/release-tests-branch/builds/155
2022-03-13 21:39:41 +00:00
Kai Fricke
956ad95d67
[ci/release] Fix release test config (#23122)
Currently the test is failing due to an invalid config (merged before validation was properly enforced).
2022-03-13 19:48:34 +00:00
Kai Fricke
c7303f538c
[ci/release] Validate smoke test fields, enforce frequency (#23075)
Of all smoke test arguments, frequency is the only required one, so we should check for it. Additionally, not all fields should be able to be overwritten (e.g. legacy or name), so we enforce this as well.
2022-03-13 18:48:03 +00:00
Kai Fricke
76a939c820
[ci/release] Migrate long running (+distributed) tests (#22955)
Migrating to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/103
Tests pass: https://buildkite.com/ray-project/release-tests-branch/builds/143#_
2022-03-13 18:47:17 +00:00
SangBin Cho
8c1a6f9138
[Nightly Test] Fix a dataset test (#23106)
Fix a broken dataset test (due to incorrect working dir)
2022-03-12 08:16:08 -08:00
SangBin Cho
c0f8de9c3c
[Nightly tests] Run benchmark tests on k8s as well (#23100)
Run benchmark tests on k8s as well.

Note that until k8s cluster stability is confirmed, we will run the same tests twice at AWS and k8s. Once all benchmark tests look stable, we will start full migration
2022-03-11 19:40:37 -08:00