The old user-facing TrialCheckpoint class has been deprecated in favor of `ray.ml.Checkpoint` and will be removed with this PR.
The main change in this PR is to delete the old `TrialCheckpoint` class and replace remaining API calls (e.g. `checkpoint.local_path`) with the correct AIR equivalents.
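As a rough sketch of the replacement pattern (the import path and directory below are illustrative; `ray.ml` is the module name at the time of this PR and may differ in other Ray versions):

```
# Illustrative sketch only, not code from this PR.
from ray.ml import Checkpoint

# Old (deprecated): trial_checkpoint.local_path exposed the checkpoint directory.
# New: wrap the directory in an AIR Checkpoint and materialize it when needed.
checkpoint = Checkpoint.from_directory("/tmp/my_trial_checkpoint")  # path is made up
local_path = checkpoint.to_directory()  # returns a local directory containing the data
```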
One issue that comes up with Ray client usage is that checkpoint directories are not available on the local node (the client), so we can't easily construct `Checkpoint` objects. (Previously, the `TrialCheckpoint` object held a reference to the location even if it was not locally available.) There are ongoing discussions on how to resolve this in the future; for now, we print an error when such a checkpoint is requested.
Depends on #25805
Signed-off-by: Kai Fricke <kai@anyscale.com>
Uses the new AIR Train API for examples and tests.
The `Result` object gets a new attribute, `log_dir`, pointing to the Trial's `logdir` and allowing users to access TensorBoard logs and artifacts of other loggers.
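A minimal sketch of the intended usage (a fragment only - the trainer setup is elided, and any configured AIR Trainer's `fit()` returns a `Result`):

```
# `trainer` stands for any configured AIR Trainer; only `result.log_dir` is new here.
result = trainer.fit()

# The Trial's logdir, containing TensorBoard event files and other logger artifacts.
print(result.log_dir)
```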
This PR only deals with "low-hanging fruit" - tests that need substantial rewriting and the Train user guide are not touched. Those will be updated in follow-up PRs.
Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled.
Requires https://github.com/ray-project/ray/pull/25943 to be merged in first
Revert to using nightly base images instead of pinning to 1.12.1. Pinning the Docker image has led to uncaught errors in the past. Instead, we should use nightly to make sure release tests work with the most up-to-date versions of Docker/cluster envs. If there are any test failures, the underlying issues should be fixed rather than pinning the Docker image.
Co-authored-by: Kai Fricke <kai@anyscale.com>
Adds a CI test for 100TB shuffle.
There is a custom config for this nightly test that (1) gives each node 4TB of storage, (2) gives the head node 0 CPUs, and (3) gives worker nodes half of their actual vCPU count.
Related issue number
Closes #24480.
Fixes a bug in wait_cluster where we count the total number of nodes ever in the cluster rather than the currently alive nodes. This has caused infra/autoscaler failures (e.g. #26138) to be mislabeled as test failures (and probably messes with timing too).
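Not the actual `wait_cluster` implementation, but a minimal sketch of the corrected counting logic, assuming the harness queries node state via `ray.nodes()`:

```
import ray

ray.init(address="auto")  # connect to the existing cluster, as the release harness does

# Count only nodes that are currently alive rather than every node ever seen.
alive_nodes = [node for node in ray.nodes() if node["Alive"]]
print(f"{len(alive_nodes)} nodes alive")
```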
Co-authored-by: Alex Wu <alex@anyscale.com>
This adds "environments" to the release package that can be used to configure some environment variables. These variables will be loaded either by an `--env` argument or a `env` definition in the test definition and can be used to e.g. run release tests on staging.
There is mysterious memory usage growth in Ray clusters that disappears when running with jemalloc. Until we figure out the root cause, using jemalloc by default seems like a good workaround. Because of its efficiency, using jemalloc by default can be beneficial anyway, but we need to run more benchmarks to verify.
Enable checking of the ray core module, excluding serve, workflows, and tune, in ./ci/lint/check_api_annotations.py. This required moving many files to ray._private and associated fixes.
The error message in #25638 indicates we should use protobuf > 3.19.0 to generate code so that it works with Python protobuf >= 4.21.1. Try generating wheels to see if this works.
Error message suggests:
```
Wait timeout after 30 seconds for key(s): 0. You may want to increase the timeout via HOROVOD_GLOO_TIMEOUT_SECONDS
```
Bumped up to 120 seconds.
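For reference, a hedged sketch of supplying the larger timeout via the environment (where exactly it is set in the release test config is not shown here):

```
import os

# Give Horovod's Gloo rendezvous more time before giving up (default is 30 seconds).
os.environ.setdefault("HOROVOD_GLOO_TIMEOUT_SECONDS", "120")
```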
Tests run successfully: https://buildkite.com/ray-project/release-tests-pr/builds/6906
m5.16xlarge instances have 64 CPUs and 256GB memory, which is overkill for scheduling tests that do not involve heavy computation. Use the smaller m5.4xlarge instance type to save cost and make allocating instances easier.
The package "ml" should be renamed to "air".
Main question: Keep a `ml.py` with `from ray.air import *` for some level of backwards compatibility?
I'd go for no to force people to use the new structure.
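For reference, the shim under discussion would just re-export the new namespace, roughly (hypothetical, and per the above not the preferred option):

```
# Hypothetical ray/ml.py backwards-compatibility shim.
from ray.air import *  # noqa: F401,F403
```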
We're currently installing matching wheels on the fly in the Python script for Ray client tests. However, we can't reload modules with changed protobuf configurations, and thus can't reload Ray completely. Since the `anyscale` package depends on Ray, this effectively prevents us from installing matching wheels within the Python script.
There are a few possible solutions to this. First, we could separate the local environment preparation from the test running - this would duplicate some logic and is thus a bit more involved, but should be considered in the future. For now, we adjust the `run_release_tests.sh` shell script to install any passed `--ray-wheels` wheels locally. We only do this on CI instances by default, as these wheels will not be compatible with e.g. macOS.
Link to successful build: https://buildkite.com/ray-project/release-tests-branch/builds/619#_
This line:
```
pip3 install -U --force-reinstall xgboost xgboost_ray lightgbm_ray petastorm
```
also re-installs the dependencies of these packages, and the `--force-reinstall` means we overwrite existing ones. This leads us to re-install the latest ray release, overwriting the wheels to be tested:
```
[INFO] 5/31/2022, 12:12:16 AM: Successfully installed ... ray-1.12.1 ...
[INFO] 5/31/2022, 12:12:17 AM: * Executed RUN pip3 install -U --force-reinstall xgboost xgboost_ray petastorm (ff6ae9f9)
```
Instead, we should use `--no-deps` to avoid re-installing dependencies. Also, the wheels sanity check is moved to after the installation of the additional packages so that errors like this are caught earlier.
Currently, `name:many_actors` matches e.g. both `many_actors` and `many_actors_smoke_test`, but it should match exactly one test. Thus we should use `re.fullmatch` instead of `re.match` (with `re.fullmatch`, matching both would require `name:many_actors.*`).
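A quick stdlib illustration of the difference:

```
import re

# re.match only anchors at the start, so a plain name also matches longer test names.
assert re.match("many_actors", "many_actors_smoke_test") is not None

# re.fullmatch requires the whole string to match, so only the exact name passes.
assert re.fullmatch("many_actors", "many_actors_smoke_test") is None
assert re.fullmatch("many_actors", "many_actors") is not None
```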
Many release tests are currently failing due to CUDA version incompatibilities. Pinning the base image to 1.12.1 seems to resolve the problem for the time being.
Currently the release test runner prefers the first successfully built version of a cluster env instead of the latest version. But sometimes a cluster env may build successfully on Anyscale yet fail to launch a cluster (e.g. version 2 here), or new dependencies need to be installed, so a new version needs to be built. The existing logic always picks up the first successful build and cannot pick up the new cluster env version.
Although this is an edge case (tweaking cluster env versions with the same Ray wheel or cluster env name), I believe it is possible for others to run into it.
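A hypothetical sketch of the intended selection (the `builds` list and `status` field are illustrative, not the actual ray_release data model):

```
# Prefer the most recent successful cluster env build instead of the first one.
def pick_cluster_env_build(builds):
    successful = [build for build in builds if build["status"] == "succeeded"]
    return successful[-1] if successful else None
```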
Also, avoid running most of the CI tests for changes under release/ray_release/.