Fix the failure to unbreak nightly and unblock 1.13 release.
The root cause is the upgrade of GRPC to 1.45.2 made it slightly slow; this is an acceptable regression which is needed to make this upgrade.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
This PR adds a utility script to automatically fetch release test results from the Buildkite pipeline for a release branch. This was previously a manual process.
Although there's enough quota, it is possible the AWS doesn't have enough capacity to start up new nodes. According to @allenyin55, the current wait for node timeout is too short. This PR increases the timeout to 3000 seconds (50 minutes) from 600 seconds. Let's see if this can resolve the issue. If it makes things worse, I will revert it quickly (I will closely monitor the infra failure rate)
The local environment setup of release tests (in client tests) can sometimes update dependencies of the `anyscale` package to an unsupported version. By re-installing the `anyscale` package after local env setup, we make sure that we can connect to the cluster. Note that this may lead to incompatibilities of the test script, however.
Currently nightly tests are unable to finish in a day because of concurrency group limit on `large` tests. This is an attempt to adjust the limits so buildkite can run / finish more tests. I will observe which tests fall into the `enormous` group and adjust the test resource / concurrency group limits again.
Fix CQL getting stuck when deprecated timesteps_per_iteration is used (use min_train_timesteps_per_reporting instead).
CQL does not perform sampling timesteps and the deprecated timesteps_per_iteration is automatically translated into the new min_sample_timesteps_per_reporting, but should be translated (only for CQL and other purely offline RL algos) into min_train_timesteps_per_reporting.
If timesteps_per_iteration, CQL lever leaves the first iteration as it thinks it's not done yet (sample timesteps always remain at 0).
For debugging client environments, it is helpful to print the installed pip packages.
Additionally, a fix for the environment of the ml_user_tune_rllib_connect_test is added. Additionally, anyscale import errors are reported verbosely to help debug missing packages.
After https://github.com/ray-project/ray/pull/24066, some release tests are running into:
```
ModuleNotFoundError: No module named 'ray.train.impl'
```
This PR simply adds a `__init__.py` file to resolve this.
We also add a 5 wecond delay for client runners in release test to give clusters a bit of slack to come up (and avoid ray client connection errors)
Currently concurrency groups are always calculated based on the full test cluster compute. Instead, smoke tests should use the smoke test cluster compute.
Use a separate compute config that uses smaller instance types and no object store memory limit for the new shuffle implementation. I verified that the config works on master for dataset_shuffle_* tests.
Related issue number
#24176: the added tests would verify the instance types which support the new shuffle implementations.
Ray SGD v1 has been denoted as a deprecated API for a while. This PR fully deprecates Ray SGD v1. An error will be raised if ray.util.sgd package is attempted to be imported.
Closes#16435
dataset_shuffle_random_shuffle_1tb was previously failing due to OOM but has now passed on the last 4 runs due to changing the node type. These tests should be stable now, although we will want to look into the OOM issue later.
* Revert "Revert "[tune] Also interrupt training when SIGUSR1 received" (#24085)"
This reverts commit 00595653ed.
Failure in windows has been addressed by conditionally registering the signal handler if available.
Serve stores context state, including the `_INTERNAL_REPLICA_CONTEXT` and the `_global_client` in `api.py`. However, these data structures are referenced throughout the codebase, causing circular dependencies. This change introduces two new files:
* `context.py`
* Intended to expose process-wide state to internal Serve code as well as `api.py`
* Stores the `_INTERNAL_REPLICA_CONTEXT` and the `_global_client` global variables
* `client.py`
* Stores the definition for the Serve `Client` object, now called the `ServeControllerClient`
Ray Tune currently gracefully stops training on SIGINT. However, the Ray core worker prevents SIGINT (and SIGTERM) to be processed by child tasks, which means that Ray Tune runs that are started in remote tasks (e.g. via Ray client) cannot be gracefully interrupted.
In k8s-based cloud tests that used the Ray client to kick off a Ray Tune run, this lead to test flakiness, as final experiment state could not be gracefully persisted to cloud storage.
This PR adds support for SIGUSR1 in addition to SIGINT to interrupt training gracefully.
In xgboost 1.6, support for older GPU architectures was removed (dmlc/xgboost#7767).
This PR updates the instance types used in our xgboost-ray gpu release tests to use Volta GPUs instead of Kepler GPUs so that xgboost-ray can run successfully with xgboost v1.6.
Closes#24048
Xgboost released a new version a few days ago. Due to caching of the Anyscale cluster env, this resulted in the server having an outdated xgboost version while the client has the most recent version causing the test to fail.
Instead, we reinstall xgboost-ray and xgboost in the post build commands so that these dependencies are not being cached in the cluster env.
A legacy K8s test fails due to incorrect usage of @ray.method which only started raising errors after the Ray 1.12.0 branch cut.
This PR removes the use of @ray.method in the test.
Some context in #23271 and #23471
In addition, I noticed some of the test were flakey due to out-of-memory issues. For that reason, I've doubled the memory request and limits in the legacy operator's example files.
I've also added CPU limits in an example file that was missing them -- it makes the most sense for consistency with Ray's resource model to use CPU limits in K8s configs.
Finally, I added an extra note to the instructions for running the tests.
What: Adds a setting "prefer_smoke_tests" to the Buildkite settings. With this, user can specify to kick off smoke tests, if available.
Why: The filtering interface of the release testing dialog is a bit complicated at the moment - in order to kick off smoke tests, users have to know with which frequency they are configured to run. Instead users should usually just filter the tests they want to run (using frequency ANY) and optionally specify to run smoke tests, if available.
Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.
Details:
- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
What: Quotes pip install packages in local environment setup for client runner.
Why: Strings like pyarrow>=6.0.1<7.0.0 currently don't work as they are interpreted as output redirection.
Copied from #23784.
Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory.
Modified to fix lint.
In test_many_tasks.py case, we usually found the case failing and found the reason.
We sleep for sleep_time seconds to wait all tasks to be finished, but the computation of actual sleep time is done by 0.1 * #rounds, where 0.1 is the sleep time every round.
It looks perfect but one factor was missed, and that's the computation time elapsed. In this case, it is the time consumed by
cur_cpus = ray.available_resources().get("CPU", 0)
min_cpus_available = min(min_cpus_available, cur_cpus)
especially the ray.available_resources() took a quite time when the cluster is large. (in our case it took beyond 1s with 1500 nodes).
The situation we thought it would be:
for _ in range(sleep_time / 0.1):
sleep(0.1)
The actual situation happens:
for _ in range(sleep_time / 0.1):
do_something(); # it costs time, sometimes pretty much
sleep(0.1)
We don't know why ray.available_resources() is slow and if it's logical, but we can add a time checker to make the sleep time precise.
Adding a large-scale nightly test for Datasets random_shuffle and sort. The test script generates random blocks and reports total run time and peak driver memory.