Move debug_state.txt to the log directory. This makes it easier to find debug_state.txt from the dashboard. See below.
Add debug_state_gcs.txt, which displays the GCS debug state. The GCS also dumps its debug state to this file every 10 seconds.
Periodic printing of the debug state now happens every 1 minute instead of every 10 seconds, because printing every 10 seconds is usually very spammy.
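For reference, a minimal sketch (assuming Ray's default `/tmp/ray/session_latest/logs` log directory) of reading the two debug state files:

```python
import os

# Assumed default Ray log directory; the actual session path may differ.
LOG_DIR = "/tmp/ray/session_latest/logs"

# debug_state.txt now lives in the log directory; debug_state_gcs.txt is the
# new file the GCS dumps its debug state into every 10 seconds.
for name in ("debug_state.txt", "debug_state_gcs.txt"):
    path = os.path.join(LOG_DIR, name)
    if os.path.exists(path):
        print(f"===== {name} =====")
        with open(path) as f:
            print(f.read())
```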
- Remove the scale_to logic from the object store tests. We don't need to scale during these tests, and removing it helps disambiguate infra failures from app failures.
- Run the microbenchmark in core nightly, meaning it will run even more often.
- Run the weekly scalability tests daily instead (they are not too expensive).
- Run some core daily tests separately to avoid infra failures.
This PR mostly implements a "fixture" for nightly tests. Note that the current fixture implementation is not that great, and we can probably improve it in the future after refactoring e2e.py.
Instead of wrapping the whole training run in a remote call, we only query the files on the node in a remote call. XGBoost-Ray is then started from the local node.
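A rough sketch of the pattern (the helper name and path are illustrative, not the actual test code):

```python
import os

import ray

ray.init(address="auto")

# Illustrative helper: the remote task only inspects files on the node;
# it does not run the training itself.
@ray.remote
def list_files_on_node(path: str):
    return sorted(os.listdir(path))

files = ray.get(list_files_on_node.remote("/tmp"))  # path is a placeholder
print(files)

# XGBoost-Ray training is then started from the local (driver) node,
# e.g. by calling xgboost_ray.train(...) here instead of inside a remote task.
```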
## Why are these changes needed?
`base_image: "anyscale/ray-ml:pinned-nightly-py37"` doesn't exist anymore, which fails a lot of nightly tests. Change it to `base_image: "anyscale/ray-ml:nightly-py37-gpu"`.
The ray-ml image depends on numpy~=1.19.2 via the tensorflow==2.6 requirement. Unfortunately, that's incompatible with Dataset (see #20258 (comment)).
This PR upgrades the numpy dependency only for the nightly test.
This should fix the long-running release tests that are failing to build their app configs.
It seems like `pip install ray[all]` now downgrades the Ray version. It's unclear why, but most likely a dependency now pins the Ray version. This PR explicitly installs the version of Ray that we want after `pip install ray[all]` to fix the problem.
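The ordering of the fix is roughly the following (the version string is a placeholder, not the actual pinned wheel):

```python
import subprocess
import sys

# Placeholder: in the real app config this would be the Ray version the
# release test is supposed to run against.
WANTED_RAY_VERSION = "<desired-ray-version>"

# `pip install ray[all]` may pull in a dependency that pins an older ray,
# silently downgrading it ...
subprocess.check_call([sys.executable, "-m", "pip", "install", "ray[all]"])

# ... so the wanted Ray version is explicitly (re)installed afterwards.
subprocess.check_call(
    [sys.executable, "-m", "pip", "install", "-U", f"ray=={WANTED_RAY_VERSION}"]
)
```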
XGBoost's train_small test timed out because of a CPU borrowing feature related to placement groups. The root bug will be fixed in the coming weeks, but this PR makes the release test pass consistently by requesting 0 CPUs for the remote wrapper script.
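The workaround looks roughly like this (the wrapper name and body are illustrative):

```python
import ray

ray.init(address="auto")

# Request 0 CPUs for the remote wrapper so it doesn't occupy a CPU that the
# placement group created by XGBoost-Ray would otherwise need to borrow.
@ray.remote(num_cpus=0)
def train_wrapper():
    # Placeholder: the real wrapper launches the xgboost train_small workload.
    pass

ray.get(train_wrapper.remote())
```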
## Why are these changes needed?
In the past, there was a regression where placement group creation time got slower over time. I believe the issue is fixed on master, but this PR verifies that it is actually fixed.
This PR adds a long running test for placement groups. The test has two purposes:
1. Make sure placement group creation / removal doesn't get slower over time. The test measures the P50 creation time over the first 20 iterations and then runs for many more iterations. After all iterations, it checks that the P50 creation time is not too slow compared to the initial round (see the sketch after this list).
2. Make sure placement group creation / removal works consistently for a long time without issues.
Q: Should we make it a real long running test (one that runs for a day)?
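A minimal sketch of the timing check (iteration counts and the slowdown threshold are illustrative, not the actual test parameters):

```python
import statistics
import time

import ray

ray.init(address="auto")

def create_and_remove_pg():
    """Create a small placement group, wait until it's ready, then remove it."""
    pg = ray.util.placement_group([{"CPU": 1}], strategy="PACK")
    ray.get(pg.ready())
    ray.util.remove_placement_group(pg)

WARMUP_ITERS = 20     # baseline window from the first iterations
TOTAL_ITERS = 1000    # the real test runs for much longer
MAX_SLOWDOWN = 2.0    # assumed tolerance vs. the baseline P50

durations = []
for _ in range(TOTAL_ITERS):
    start = time.perf_counter()
    create_and_remove_pg()
    durations.append(time.perf_counter() - start)

baseline_p50 = statistics.median(durations[:WARMUP_ITERS])
final_p50 = statistics.median(durations[-WARMUP_ITERS:])
assert final_p50 <= baseline_p50 * MAX_SLOWDOWN, (baseline_p50, final_p50)
```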
## Why are these changes needed?
The .boto files are already added to the base image and ACL'ed to root; adding them again during the app config build causes permission issues.
* [xgboost] Fix release test app configs
* Revert full app config
* Update base docker image
* Only change cpu base image
* default
* Pin xgboost to 1.5 in cpu tests
* Remove numpy hack
* Revert one line
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
* use nightly
* switch ml cpu to ray cpu
* fix
* add pytest
* add more pytest
* add constraint
* add tensorflow
* fix merge conflict
* add tblib
* fix
* add back uninstall