Recently there have been a number of CI test failures due to direct or transitive dependency version upgrades. Printing out environment information for each test suite allows us to quickly check the diff between failed and successful runs.
**Notes:**
1. In this PR I just manually added `./ci/env/env_info.sh` to each test suite. We may want to generalize this in the future.
2. This is just for CI now, but is applicable to release tests as well.
Signed-off-by: Matthew Deng <matt@anyscale.com>
This PR adds GPU support for pytorch and tensorflow predictor, as well as automatic setting `use_gpu` flag in `BatchPredictor`.
Notable changes:
- Added `use_gpu` flag in the constructor of `TorchPredictor` and `TensorflowPredictor` (note it's slightly different from our latest design doc that puts this flag at `predict()` call)
- Added `use_gpu` flag to `SklearnPredictor` so its interface is compatible with `BatchPredictor`
- Code to move both model weights and input tensor to default visible GPU at index 0 if flag is set
- parametrized existing predictor tests to use GPU for both CPU & GPU coverage
- Changed BUILD CI tests with an added `gpu` tag (I'm not 100% sure if that's a right way tho)
Follow ups:
https://github.com/ray-project/ray/issues/26249 is created in case our host has multiple GPU devices. It's a bit out of scope for this PR, but for GPU batch inference ideally we should be able to evenly use all GPU devices on host where CPU & DRAM are busy with pre-fetching + data movement to GPU. We might approximately do the same by scheduling same # of Predictor instances on the host, but that's worth verifying once benchmarks are set.
The package "ml" should be renamed to "air".
Main question: Keep a `ml.py` with `from ray.air import *` for some level of backwards compatibility?
I'd go for no to force people to use the new structure.
Currently, we are not running doc notebooks in CI due to a bazel misconfiguration - we are using `glob` in a top level package in order to get the paths for the notebooks, but those are contained inside subpackages, which glob purposefully ignores. Therefore, the lists of notebooks to run are empty. This PR fixes that by:
* Running the `py_test_run_all_notebooks` macro inside the relevant subpackages
* Editing the `test_myst_doc.py` script to allow for recursive search for the target file, allowing to deal with mismatches between `name` and `data` arguments in `py_test_run_all_notebooks`
* Setting the `allow_empty=False` flag inside `glob` calls in our macros to ensure that this oversight is caught early
* Enabling detection of changes in doc folder for `*.ipynb` and `BUILD` files
This PR also adds a GPU runner for doc tests, allowing one of our examples to pass - and setting the infra for more to come. Finally, a misconfigured path for one set of doc tests is also fixed.
Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.
Details:
- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
Instead of installing dependencies in each Buildkite job, let's move this to the Dockerfile instead.
This will update GPU tests to always use Python 3.7.
* [ci/tune] Add Tune GPU pipeline step to CI
* cont.
* add sgd gpu tests
* format yaml, fix imports
* install horovod; fix line wrapping
* set GPU per worker to 0.5
* fix import
* move test to 4gpu machine
* fix lint
* lint
* set visible devices
* pull in tf gpu fix
* Fix Tune GPU pipeline step
* nit
* Disable GPU tests until we have some
* Re-add empty rllib tests
Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>