546 commits

Author SHA1 Message Date
Artur Niederfahrenhorst
9a64bd4e9b
[RLlib] Simple-Q uses training iteration fn (instead of execution_plan); ReplayBuffer API for Simple-Q (#22842) 2022-03-29 14:44:40 +02:00
Yi Cheng
7de751dbab
[1][core][cleanup] remove enable gcs bootstrap in cpp. (#23518)
This PR removes the enable_gcs_bootstrap flag in cpp.
2022-03-28 21:37:24 -07:00
Chen Shen
c3e04ab275
[nighly-test] try out spot instances for chaos test #23507 2022-03-27 20:10:21 -07:00
Sven Mika
22c9c4aa39
[RLlib] Slate-Q +GPU torch bug fix. (#23464) 2022-03-24 17:39:33 +01:00
Avnish Narayan
9040f54060
[RLlib] Pin Gym Everywhere and turn off gpu for recsim tests (#23452) 2022-03-24 09:17:30 +01:00
Stephanie Wang
aa6f773283
Switch long running tests to SDK (#23433)
These tests are flaky on the job-based test submission system. Switching them to the SDK-based test runner for now.
2022-03-23 17:44:26 -07:00
Kai Fricke
724377163f
[ci/release] Unstable tests should only soft fail the build (#23403)
This will leave the tests green if the test is failing but marked as unstable.
2022-03-23 09:38:56 +00:00
Amog Kamsetty
6d776976c1
[Train] Fix multi node horovod bug (#22564)
Closes #20956
2022-03-22 16:22:53 -07:00
Jiajun Yao
bab19e8e68
Add perf metrics for test_many_tasks.py (#23318)
Add perf metrics for test_many_tasks.py
Use the new smoke test structure
2022-03-22 16:16:42 -07:00
SangBin Cho
0cd687cc19
[Nightly test] Fix job download retry (#23401)
Currently, when we download a file to the cluster using a job, we don't retry on failure.
2022-03-22 08:31:24 -07:00
Kai Fricke
02644ab4d8
[ci/release] Retry cluster env build on failure (#23378)
Failed cluster env builds should be retried.
2022-03-22 09:45:22 +00:00
Avnish Narayan
754bcd16f8
[rllib] Pin gym everywhere (#23384)
This PR pins gym in the app config.yaml files for rllib and tune so that release tests are no longer broken by the new gym version.
2022-03-22 09:44:22 +00:00
Kai Fricke
e48c407b13
[release] long running many drivers: Use SDK file manager (#23379)
This will make the test pass again: https://buildkite.com/ray-project/release-tests-branch/builds/226#_
2022-03-21 09:56:59 -07:00
Kai Fricke
7085749d50
[tune] Adjust release test timeouts (#23362)
Currently release tests fail because they exceed the (rather arbitrary) timeout by 1-2 seconds.
2022-03-20 17:05:20 +00:00
Avnish Narayan
e008a48ef2
[release tests] Pin gym everywhere (#23349) 2022-03-19 02:52:54 -07:00
Dmitri Gekhtman
561e7a9677
[RELEASE] Add autoscaler env to fix nightly tests (#23345)
The product backend doesn't yet understand that nightly Ray uses GCS-Ray. (This will be fixed the next time the product control plane is deployed.)
This PR introduces the env required to signal to the product backend that we're using GCS-Ray so that the autoscaler can start up correctly.
2022-03-18 17:48:27 -07:00
Archit Kulkarni
db2c37c760
[serve] [release] Disable smoke test by default (#23334) 2022-03-18 18:40:48 -05:00
Stephanie Wang
5ab634f285
[core] Disable threaded_actors_stress_test (#23292)
* disable

* smoke
2022-03-18 15:57:53 -07:00
Kai Fricke
ca5354ffb1
[ci/release] Fix test_wheels (#23329) 2022-03-18 14:39:36 +00:00
Kai Fricke
3cf8116df2
[ci/release] Re-enable commit sanity check (#23327)
Commit sanity checks currently appear to be disabled. This PR re-enables them by parsing wheel URLs.
2022-03-18 12:57:41 +00:00
Kai Fricke
da140a80e9
[ci/release] Legacy field should be optional (#23326)
#22749 broke release unit tests by not providing a legacy key - that key should be optional because we will be dealing with non-legacy tests soon.
Additionally, for some reason the unit tests pass on buildkite while they fail locally and in the release test pipeline. I'm investigating this now...
2022-03-18 11:34:05 +00:00
Kai Fricke
e510d81c71
[ci/release] Save test config and results as artifacts (#23278)
It is good to have this information readily available when checking test results, as it will reveal both the original configuration (which could change over time) and the achieved results.
Also gets rid of the unneeded old alerts directory.

https://buildkite.com/ray-project/release-tests-branch/builds/190#ef531787-412c-40ec-81e6-beb495830c60
2022-03-18 09:26:42 +00:00
Eric Liang
015181ab9a
Add random access support for Datasets (experimental feature) (#22749)
This PR adds experimental support for random access to datasets. A Dataset can be random access enabled by calling `ds.to_random_access_dataset(key, num_workers=N)`. This creates a RandomAccessDataset.

RandomAccessDataset partitions the dataset across the cluster by the given sort key, providing efficient random access to records via binary search. A number of worker actors are created, each of which has zero-copy access to the underlying sorted data blocks of the Dataset.

Performance-wise, you can expect each worker to provide ~3000 records / second via ``get_async()``, and ~10000 records / second via ``multiget()``.

Since Ray actor calls go directly from worker to worker, throughput scales linearly with the number of workers.
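
The lookup idea described above (partition by sort key, then binary search inside the owning block) can be sketched in plain Python. This is only an illustration and not Ray's implementation: the class `SortedBlockIndex` and its parameters are hypothetical names, and everything runs in a single process, with no actors and no zero-copy access.

```python
import bisect


class SortedBlockIndex:
    """Toy stand-in for key-partitioned random access (hypothetical, not Ray's code)."""

    def __init__(self, records, key, num_blocks):
        ordered = sorted(records, key=lambda r: r[key])
        # Ceiling division so that at most num_blocks blocks are created.
        size = max(1, -(-len(ordered) // num_blocks))
        self.blocks = [ordered[i:i + size] for i in range(0, len(ordered), size)]
        # First key of each block, used to route a lookup to one block.
        self.bounds = [block[0][key] for block in self.blocks]
        # Per-block key lists so bisect can search without a key function.
        self.block_keys = [[r[key] for r in block] for block in self.blocks]

    def get(self, k):
        # Route: the last block whose first key is <= k.
        b = bisect.bisect_right(self.bounds, k) - 1
        if b < 0:
            return None
        # Binary search inside the chosen block.
        keys = self.block_keys[b]
        i = bisect.bisect_left(keys, k)
        if i < len(keys) and keys[i] == k:
            return self.blocks[b][i]
        return None
```

In this sketch each block would correspond to the data held by one worker; a lookup touches only the block boundaries and a single block.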
2022-03-17 15:01:12 -07:00
mwtian
391901f86b
[Remove Redis Pubsub 2/n] clean up remaining Redis references in gcs_utils.py (#23233)
Continue to clean up Redis and other related Redis references, for
- gcs_utils.py
- log_monitor.py
- `publish_error_to_driver()`
2022-03-16 19:34:57 -07:00
SangBin Cho
b350fe9ee8
[Nightly test] Fix additional k8s issues + add new tests (#23231)
Fix bug from the previous fixes.
Add more tests
Stop using m5.xlarge (not supported now)
There are 2 hard blockers from the infra: 1. Large disk sizes are not supported. 2. m5.xlarge is not supported. Both are considered high priority and should be fixed soon.
2022-03-16 16:37:29 -07:00
Stephanie Wang
ce71c5bbbd
[core][tests] Mark threaded_actors_stress_test as unstable 2022-03-16 15:31:19 -07:00
Kai Fricke
e3987d85c3
[tune] Mark cloud OSS release tests as unstable (#23240)
These tests have been flaky for a while. Until this is addressed, mark them as unstable.
2022-03-16 17:37:58 +00:00
Kai Fricke
eca5bcfc87
[ci/release] Reload modules after installing matching Ray (#23227)
Apparently, ray gets imported somewhere before running the client runner (maybe from an anyscale package). This means that we need to reload the ray package after installing a matching local ray wheel.
Additionally, job submission should also install a matching local ray to match with the job submission server.
2022-03-16 15:44:43 +00:00
Avnish Narayan
6c20e9d898
[RLlib] Change the slateq regression learning test with GPU to use torch only (#23168) 2022-03-16 09:15:59 +01:00
Kai Fricke
15aeb33e50
[ci/release] Support PR wheels (#23084)
This PR adds support for finding wheels for PRs to run OSS release tests on, i.e. it makes --ray-wheels user:branch work.
2022-03-14 17:24:13 +00:00
Kai Fricke
8608b64885
[ci/release] Remove old OSS release test infrastructure (#23134)
Now that we've migrated all OSS release tests to the new infrastructure, we can remove old config files and infra scripts.
2022-03-14 15:10:52 +00:00
Kai Fricke
d93fa95dd5
[ci/release] Only report results for scheduled builds (#23135)
Currently, all buildkite runs report by default. Instead, we only want to report when running scheduled builds or when specifically overriding this behavior.
2022-03-14 15:10:16 +00:00
Kai Fricke
fce49694fc
[ci/release] Disable infra retries for now (#23132)
Infra errors are tackled with concurrency groups. Thus we can disable old mitigation methods like automatic infra retry for now.
We keep the script as it does other logic (e.g. checkout local test branch) and infra retry can be enabled via env variable if needed.
2022-03-14 11:51:11 +00:00
Kai Fricke
830238cce2
[ci/release] Migrate ML user tests (#22953)
Most recent tests:

https://buildkite.com/ray-project/release-tests-branch/builds/156
https://buildkite.com/ray-project/release-tests-branch/builds/158
2022-03-14 11:50:16 +00:00
SangBin Cho
2c2d96eeb1
[Nightly tests] Improve k8s testing (#23108)
This PR improves broken k8s tests.

- Use exponential backoff on the unstable HTTP path. (Getting the job status sometimes fails with a broken connection from the server; unfortunately, I couldn't find the relevant logs to figure out why.)
- Fix the benchmark tests' resource leak check. The existing check was broken because job submission uses 0.001 of the node IP resource, which means cluster_resources can never equal the available resources. I fixed this by not checking node IP resources.
- The K8s infra doesn't support instances with fewer than 8 CPUs, so I used m5.2xlarge instead of xlarge. This will increase the cost a bit, but not by much.
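
For readers unfamiliar with the exponential backoff mentioned above, a minimal generic sketch follows. It is not the code from this PR: the helper name `retry_with_backoff`, its parameters, and the choice of `ConnectionError` as the retryable exception are all assumptions made for illustration.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError,), sleep=time.sleep):
    """Call fn(), retrying on retry_on with exponentially growing delays.

    Hypothetical helper for illustration; not the release-test implementation.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            # The delay doubles on each attempt and is capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter so that many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

A status poll could then be wrapped as `retry_with_backoff(lambda: fetch_status(job_id))`, where `fetch_status` stands for whatever call performs the HTTP request.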
2022-03-14 03:49:15 -07:00
Jiajun Yao
e4620669a1
[Release Test] Add perf metrics for core scalability tests (#23110)
* Add perf metrics for core scalability tests

* lint
2022-03-14 10:20:39 +09:00
Kai Fricke
430ea3e636
[ci/release] Migrate golden notebook tests (#22949)
Migrating golden notebook tests to new release test package.
Tests are passing: https://buildkite.com/ray-project/release-tests-branch/builds/155
2022-03-13 21:39:41 +00:00
Kai Fricke
956ad95d67
[ci/release] Fix release test config (#23122)
Currently the test is failing due to an invalid config (merged before validation was properly enforced).
2022-03-13 19:48:34 +00:00
Kai Fricke
c7303f538c
[ci/release] Validate smoke test fields, enforce frequency (#23075)
Of all smoke test arguments, frequency is the only required one, so we should check for it. Additionally, not all fields should be able to be overwritten (e.g. legacy or name), so we enforce this as well.
2022-03-13 18:48:03 +00:00
Kai Fricke
76a939c820
[ci/release] Migrate long running (+distributed) tests (#22955)
Migrating to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/103
Tests pass: https://buildkite.com/ray-project/release-tests-branch/builds/143#_
2022-03-13 18:47:17 +00:00
SangBin Cho
8c1a6f9138
[Nightly Test] Fix a dataset test (#23106)
Fix a broken dataset test (due to incorrect working dir)
2022-03-12 08:16:08 -08:00
SangBin Cho
c0f8de9c3c
[Nightly tests] Run benchmark tests on k8s as well (#23100)
Run benchmark tests on k8s as well.

Note that until k8s cluster stability is confirmed, we will run the same tests twice, once on AWS and once on k8s. Once all benchmark tests look stable, we will start the full migration.
2022-03-11 19:40:37 -08:00
SangBin Cho
97383e4c1b
[Nightly test] Fix a broken nightly test due to the wrong config (#23097) 2022-03-11 16:47:06 -08:00
SangBin Cho
2b38fe89e2
[Nightly tests] Migrate rest of core tests (#23085)
Migrate the rest of the core tests.
2022-03-11 10:41:14 -08:00
Kai Fricke
04ea180dfb
[ci/release] Add "tiny" concurrency group, change limits (#23065)
E.g. long running tests run on small clusters (often 8 CPUs) but block other jobs for a long time. We should thus add more granularity to the concurrency groups.
Additionally, limits have been slightly adjusted to make more sense (e.g. 8 GPUs are now small-gpu, 9+ GPUs large-gpu, instead of 7 for small-gpu and 8 for large-gpu).
2022-03-11 10:19:38 -08:00
Kai Fricke
a8bed94ed6
[ci/release] Always use full cluster address (#23067)
Not using the full cluster address is deprecated and breaks Job usage for uploads/downloads: https://buildkite.com/ray-project/release-tests-branch/builds/135#2a03e47b-6a9a-42ff-9346-905725eb8d09
2022-03-11 16:31:21 +00:00
SangBin Cho
965d609627
[Nightly test] Fix a minor syntax issue for core nightly tests (#23069)
Add frequency to smoke tests
Remove unnecessary alerts
2022-03-11 04:58:40 -08:00
Kai Fricke
5b2d58674b
[ci/release] Migrate horovod tests (#22951)
Migrating horovod tests to new release package.

https://buildkite.com/ray-project/release-tests-branch/builds/125
2022-03-11 09:53:29 +00:00
SangBin Cho
ebac18d163
[Nightly test] Support Job based file manager + runner (#22860)
This PR supports the job-based file manager and runner. It will be the backbone of k8s migration.

The PR handles edge cases that originally existed in the old e2e.py job-based runners.
2022-03-10 15:03:50 -08:00
SangBin Cho
92b50ff5da
Migrate multi nightly tests (#23005) 2022-03-11 01:32:10 +09:00