OSS release tests currently run with hardcoded Python 3.7 base. In the future we will want to run tests on different python versions.
This PR adds support for a new `python` field in the test configuration. The python field will determine both the base image used in the Buildkite runner docker container (for Ray client compatibility) and the base image for the Anyscale cluster environments.
Note that in Buildkite, we will still only wait for the python 3.7 base image before kicking off tests. That is acceptable, as we can assume that most wheels finish in a similar time, so even if we wait for the 3.7 image and kick off a 3.8 test, that runner will wait maybe for 5-10 more minutes.
Although there's enough quota, it is possible the AWS doesn't have enough capacity to start up new nodes. According to @allenyin55, the current wait for node timeout is too short. This PR increases the timeout to 3000 seconds (50 minutes) from 600 seconds. Let's see if this can resolve the issue. If it makes things worse, I will revert it quickly (I will closely monitor the infra failure rate)
The local environment setup of release tests (in client tests) can sometimes update dependencies of the `anyscale` package to an unsupported version. By re-installing the `anyscale` package after local env setup, we make sure that we can connect to the cluster. Note that this may lead to incompatibilities of the test script, however.
For debugging client environments, it is helpful to print the installed pip packages.
Additionally, a fix for the environment of the ml_user_tune_rllib_connect_test is added. Additionally, anyscale import errors are reported verbosely to help debug missing packages.
This PR supports the job-based file manager and runner. It will be the backbone of k8s migration.
The PR handles edge cases that originally existed in the old e2e.py job-based runners.
Adds a unit-tested and restructured ray_release package for running release tests.
Relevant changes in behavior:
Per default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can a) specify a different commit hash, b) a wheels URL (which we will also wait for to be available) or c) specify a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches old default behavior).
The main subpackages are:
Cluster manager: Creates cluster envs/computes, starts cluster, terminates cluster
Command runner: Runs commands, e.g. as client command or sdk command
File manager: Uploads/downloads files to/from session
Reporter: Reports results (e.g. to database)
Much of the code base is unit tested, but there are probably some pieces missing.
Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_
Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023