The new buildkite pipeline prints out faulty results due to a confusion of -ge/-gt and -le/-lt in the retry script. This is a cosmetic error (so behavior was still correct) that is resolved with this PR.
This currently leads to failing builds for schema validation errors after #22901 was merged (the stable column was incorrectly not added to the schema before).
To avoid breakage like in #22905, this PR adds schema validation to the release test package.
In a follow-up PR, we'll likely switch this to use pydantic instead.
This PR migrates scalability tests to the new infra.
I had to copy the benchmarks folder to the release folder to make it work. I will remove some unnecessary files (e.g., benchmark.yaml or wait_for_cluster file) Alternatively we can support a different path than /release from the tool, but I think this way is cleaner. I am open to suggestion though cc @krfricke
Enables lineage reconstruction, which allows automatic recovery of task outputs, by default.
Also adds an info message to the driver whenever objects need to be reconstructed (not including recursive reconstruction).
horovod_user_test_master is failing with recent horovod release[[link](https://buildkite.com/ray-project/periodic-ci/builds/2960#61dabda8-eea0-4b7b-93bf-9e341926d3fd)].
Error message is saying:
```
AttributeError: Can't get attribute '_ExecutorDriver' on <module 'horovod.ray.runner' from '/home/ray/anaconda3/lib/python3.7/site-packages/horovod/ray/runner.py'>
```
The horovod test is set up in such a way that it has the "driver" (a.k.a. client) part (which is the code that runs in a buildkite agent) and the "cluster" (a.k.a. server) part (which runs in Anyscale cluster). Driver's dependency is specified by `release/ml_user_tests/horovod/driver_setup_master.sh` while cluster's dependency is specified by `release/horovod_tests/app_config_master.yaml`.
The two communicate via Anyscale client.
The above error message is complaining that while client's horovod version has _ExecutorDriver in runner.py, the server's horovod doesn't. This is due to the version mismatch of the above two files. This PR brings the two horovod dependency to both point to horovod master.
This PR adds concurrency groups to Buildkite release test runs with new release test package. Five concurrency groups are defined (large-gpu, small-gpu, large, medium, small). If not specified manually, concurrency groups are inferred from used cluster resources.
Example pipeline: https://buildkite.com/ray-project/release-tests-branch/builds/55#09109eac-d22e-43bc-889e-078cfb037373 (click on Artifacts --> pipeline.json)
This PR **enables the usage stats only on the release test infrastructure** (large scale tests Ray runs on a daily basis in a private infra). Note it is still disabled by default in Ray.
This PR enables stage fusion for dataset pipelines. This also requires:
1. Removing the num_cpus=0.5 default for the read stage, to enable fusion of the read stage.
2. Removing spread_resource_prefix (not supported for now).
- Separate spread scheduling and default hydra scheduling (i.e. SpreadScheduling != HybridScheduling(threshold=0)): they are already separated in the API layer and they have the different end goals so it makes sense to separate their implementations and evolve them independently.
- Simple round robin for spread scheduling: this is just a starting implementation, can be optimized later.
- Prefer not to spill back tasks that are waiting for args since the pull is already in progress.
Adds a unit-tested and restructured ray_release package for running release tests.
Relevant changes in behavior:
Per default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can a) specify a different commit hash, b) a wheels URL (which we will also wait for to be available) or c) specify a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches old default behavior).
The main subpackages are:
Cluster manager: Creates cluster envs/computes, starts cluster, terminates cluster
Command runner: Runs commands, e.g. as client command or sdk command
File manager: Uploads/downloads files to/from session
Reporter: Reports results (e.g. to database)
Much of the code base is unit tested, but there are probably some pieces missing.
Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_
Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023
It seems like the S3 read sometimes fails; #22214. I found out the file actually does exist in S3, so it is highly likely a transient error. This PR adds a retry mechanism to avoid the issue.
This PR adds a comment to build_pipeline.py reminding anyone who makes changes to the test suites to also update the release process doc if necessary.
This is an action item from the Ray 1.10.0 release retrospective.
After this PR (https://github.com/ray-project/ray/pull/22156), for some reasons the driver script has some string that cannot be encoded with ascii. It seems like using utf-8 solves the problem.