Commit graph

14 commits

Author SHA1 Message Date
Kai Fricke
c7303f538c
[ci/release] Validate smoke test fields, enforce frequency (#23075)
Of all smoke test arguments, frequency is the only required one, so we should check for it. Additionally, not all fields should be able to be overwritten (e.g. legacy or name), so we enforce this as well.
2022-03-13 18:48:03 +00:00
Kai Fricke
04ea180dfb
[ci/release] Add "tiny" concurrency group, change limits (#23065)
E.g. long running tests run on small clusters (often 8 CPUs) but block other jobs for a long time. We should thus add more granularity to the concurrency groups.
Additionally, limits have been slightly adjusted to make more sense (e.g. 8 GPUs are now small-gpu, 9+ GPUs large-gpu, instead of 7 for small-gpu and 8 for large-gpu).
2022-03-11 10:19:38 -08:00
Kai Fricke
a8bed94ed6
[ci/release] Always use full cluster address (#23067)
Not using the full cluster address is deprecated and breaks Job usage for uploads/downloads: https://buildkite.com/ray-project/release-tests-branch/builds/135#2a03e47b-6a9a-42ff-9346-905725eb8d09
2022-03-11 16:31:21 +00:00
SangBin Cho
ebac18d163
[Nightly test] Support Job based file manager + runner (#22860)
This PR supports the job-based file manager and runner. It will be the backbone of k8s migration.

The PR handles edge cases that originally existed in the old e2e.py job-based runners.
2022-03-10 15:03:50 -08:00
SangBin Cho
d192ec30fd
[Nightly Tests] Readjust the concurrency limit. (#23002)
This PR reduces the concurrency limit. Based on the back of envelope calculation, the current concurrency limit can easily exceed the service quota.

Given large == 2048 vCPUs, it will use about 20K vCPUs, which is slightly larger than the limit.
2022-03-10 07:19:38 -08:00
Kai Fricke
ac654dbb9d
[ci/release] Fix schema validation for single tests / add stable field (#22947)
This currently leads to failing builds for schema validation errors after #22901 was merged (the stable column was incorrectly not added to the schema before).
2022-03-09 15:22:49 +00:00
Kai Fricke
cac9d30909
[ci/release] Add schema validation for release test config (#22919)
To avoid breakage like in #22905, this PR adds schema validation to the release test package.
In a follow-up PR, we'll likely switch this to use pydantic instead.
2022-03-09 09:50:51 +00:00
SangBin Cho
529911ee78
[Nightly tests] Add missing patches (#22862)
These changes are added to the old e2e.py, but not to the new infra
2022-03-07 19:48:43 +00:00
Jiajun Yao
1b5efb588e
[Release Test] Change release test db reporter report_time to report_timestamp_ms (#22844)
This's easier to sort and compare timestamp and avoid timezone issue.
2022-03-07 04:54:19 -08:00
SangBin Cho
9d0148dbbe
[Test] Migrate the first test to the new infra (#22770)
This migrate the simplest nightly test to the new infra. I will also explore k8s migration with this test
2022-03-06 18:24:54 -08:00
Jiajun Yao
23f2862067
[Release Test] Send release test result to db pipeline for new test infra (#22813)
* Send release test result to db pipeline for new test infra

* address comment
2022-03-05 07:34:40 +09:00
Kai Fricke
7425fa6212
[ci/release] Add support for concurrency groups (#22728)
This PR adds concurrency groups to Buildkite release test runs with new release test package. Five concurrency groups are defined (large-gpu, small-gpu, large, medium, small). If not specified manually, concurrency groups are inferred from used cluster resources.

Example pipeline: https://buildkite.com/ray-project/release-tests-branch/builds/55#09109eac-d22e-43bc-889e-078cfb037373 (click on Artifacts --> pipeline.json)
2022-03-02 16:35:54 +01:00
Kai Fricke
3695408a85
[release] Fix special cases in release test package (e.g. smoke test) (#22442)
Fixing special cases (e.g. smoke tests, long running tests) in the release test package infrastructure. Prepare migration of Tune and XGBoost tests.
2022-02-28 21:05:01 +01:00
Kai Fricke
331b71ea8d
[ci/release] Refactor release test e2e into package (#22351)
Adds a unit-tested and restructured ray_release package for running release tests.

Relevant changes in behavior:

Per default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can a) specify a different commit hash, b) a wheels URL (which we will also wait for to be available) or c) specify a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches old default behavior).

The main subpackages are:

    Cluster manager: Creates cluster envs/computes, starts cluster, terminates cluster
    Command runner: Runs commands, e.g. as client command or sdk command
    File manager: Uploads/downloads files to/from session
    Reporter: Reports results (e.g. to database)

Much of the code base is unit tested, but there are probably some pieces missing.

Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_
Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023
2022-02-16 17:35:02 +00:00