Automatically enable GPU prediction for Predictors if `num_gpus` is set for the `PredictorDeployment`.
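A minimal sketch of the intended usage, assuming the Serve AIR integration exposes `PredictorDeployment` as below; the import paths, the `bind` arguments, and the checkpoint location are illustrative assumptions, not the confirmed API:
```
from ray import serve
from ray.air.checkpoint import Checkpoint
# Import paths are assumptions; the integration module may live elsewhere.
from ray.serve.air_integrations import PredictorDeployment
from ray.train.torch import TorchPredictor

# Stand-in for an AIR checkpoint produced by a training run (hypothetical URI).
model_checkpoint = Checkpoint.from_uri("s3://my-bucket/torch-checkpoint")

# Requesting num_gpus on the deployment now enables GPU prediction automatically;
# there is no need to pass use_gpu to the predictor by hand.
serve.run(
    PredictorDeployment.options(
        name="torch_predictor",
        ray_actor_options={"num_gpus": 1},
    ).bind(TorchPredictor, model_checkpoint)
)
```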
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Integration between Ray Serve and Gradio. Users of Gradio can wrap their Gradio app in a Serve deployment by using `GradioIngress`, and scale it up through more replicas or more CPU/GPU resources.
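A rough sketch of what this looks like from the user's side; the import path and the exact constructor shape of `GradioIngress` are assumptions here, not the confirmed API:
```
import gradio as gr
from ray import serve
# Import path is an assumption; the integration may live in a different module.
from ray.serve.gradio_integrations import GradioIngress

def build_app():
    # A plain Gradio app; nothing Serve-specific in here.
    return gr.Interface(fn=lambda text: text.upper(), inputs="text", outputs="text")

# Turn the ingress into a Serve deployment and scale it out with more replicas
# or more CPU/GPU resources per replica.
GradioApp = serve.deployment(GradioIngress).options(
    num_replicas=2,
    ray_actor_options={"num_cpus": 2},
)
serve.run(GradioApp.bind(build_app))
```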
Root cause:
https://www.shell-tips.com/bash/source-dot-command/#gsc.tab=0
Using `.` executes a command in the current shell in a bash script. It looks like removing the `.` command from `ci.sh init` means we will lose the `set -eo` configured within `ci.sh init` for the subsequent test commands, because `set -eo` would then be called in a child process, not the current shell (so future commands won't have `set -eo` configured).
Recently there have been a number of CI test failures due to direct or transitive dependency version upgrades. Printing out environment information for each test suite allows us to quickly check the diff between failed and successful runs.
**Notes:**
1. In this PR I just manually added `./ci/env/env_info.sh` to each test suite. We may want to generalize this in the future.
2. This is just for CI now, but is applicable to release tests as well.
Signed-off-by: Matthew Deng <matt@anyscale.com>
Removes all ML-related code from `ray.util`
Removes:
- `ray.util.xgboost`
- `ray.util.lightgbm`
- `ray.util.horovod`
- `ray.util.ray_lightning`
Moves `ray.util.ml_utils` to other locations
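For users of the removed wrappers, the migration should just be a direct import from the standalone packages; for example (the replacement import below is an assumption, shown for `ray.util.xgboost` only):
```
# Before (removed in this PR):
#   from ray.util.xgboost import RayDMatrix, RayParams, train
# After (assuming the wrapper simply re-exported the standalone xgboost_ray package):
from xgboost_ray import RayDMatrix, RayParams, train
```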
Closes #23900
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
When cleaning up after the k8s operator tests, we should always delete the k8s cluster even if something went wrong (in fact, it's not clear we even need to clean up the resources within the cluster).
Signed-off-by: Alex Wu <itswu.alex@gmail.com>
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
The latest PyTorch version has wheels for CUDA 11.6. Per user request, adding a CUDA 11.6 image as part of our build pipeline.
## Why are these changes needed?
When GCS restarts, the raylet sometimes needs a while to reconnect to the GCS; for example, in a k8s env, it takes a while to move the GCS to the service. This PR tries to fix this by allowing a longer timeout for the first ping when GCS restarts.
Once GCS gets the first ping, it'll just use the regular timeout instead.
The previously observed Python grpc warning / logspam seems to have been fixed for grpcio >= 1.48. And users would like to upgrade beyond grpcio 1.43 for better M1 support. However, grpcio 1.48 has not been released yet, so there is still a risk this change needs to be reverted if any problem is discovered later with Ray nightly + grpcio 1.48.
- Stop using the dot command to run the ci.sh script: it doesn't fail the build if the command fails on Windows and is generally dangerous since it makes unexpected changes to the current shell.
- Fix the Windows build issues this uncovered.
This PR adds GPU support for the PyTorch and TensorFlow predictors, as well as automatically setting the `use_gpu` flag in `BatchPredictor`.
Notable changes:
- Added a `use_gpu` flag to the constructors of `TorchPredictor` and `TensorflowPredictor` (note this is slightly different from our latest design doc, which puts the flag on the `predict()` call); see the sketch after this list
- Added a `use_gpu` flag to `SklearnPredictor` so its interface is compatible with `BatchPredictor`
- Code to move both the model weights and the input tensors to the default visible GPU at index 0 if the flag is set
- Parametrized existing predictor tests to use GPU, for both CPU & GPU coverage
- Changed BUILD CI tests with an added `gpu` tag (I'm not 100% sure that's the right way, though)
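A minimal sketch of the constructor-level flag, assuming the import path below; the path and the `BatchPredictor` argument name mentioned in the comment are assumptions:
```
import torch

# Import path is an assumption; the predictor may live under a different AIR module.
from ray.train.torch import TorchPredictor

model = torch.nn.Linear(2, 1)

# With use_gpu=True, the model weights (and the input tensors at predict time)
# are moved to the default visible GPU at index 0, so this needs a GPU to run.
predictor = TorchPredictor(model=model, use_gpu=True)
output = predictor.predict(torch.randn(4, 2).numpy())

# For batch inference, requesting GPUs per worker is expected to set use_gpu on
# the wrapped predictor automatically (the argument name here is an assumption):
#   BatchPredictor.from_checkpoint(ckpt, TorchPredictor).predict(ds, num_gpus_per_worker=1)
```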
Follow ups:
https://github.com/ray-project/ray/issues/26249 was created for the case where our host has multiple GPU devices. It's a bit out of scope for this PR, but for GPU batch inference we should ideally be able to use all GPU devices on the host evenly while the CPU & DRAM are busy with pre-fetching + data movement to the GPU. We might approximate this by scheduling the same # of Predictor instances on the host, but that's worth verifying once benchmarks are set up.
These tests have been running for 1-2 months, and the overall observation is that they are not very useful for catching actual regressions. Basically, we didn't notice any regressions. Stop this test for now to save some resources.
Simplify the isort filters and move them into the isort cfg file.
With this change, isort will no longer apply to diffs other than files that are in a whitelisted directory (isort only supports a blacklist, so we implement that instead). This is much simpler than building our own whitelist logic, since our formatter runs multiple codepaths depending on whether it is formatting a single file, a PR, or the entire repo in CI.
From the message:
```
[ OK ] SyncerTest.TestMToN (13132 ms)
[----------] 5 tests from SyncerTest (43175 ms total)
[----------] Global test environment tear-down
[==========] 8 tests from 2 test suites ran. (43176 ms total)
[ PASSED ] 8 tests.
external/com_github_grpc_grpc/src/core/lib/iomgr/ev_posix.cc:314:19: runtime error: member access within null pointer of type 'const struct grpc_event_engine_vtable'
```
So far this can only be reproduced by running with Bazel test; it won't reproduce under gdb. It seems like some issue with grpc, maybe the reactor API.
Given that the ASAN test, which is supposed to catch this issue, runs well, and considerable time has been spent investigating this one without progress, skip this test for now.
The package "ml" should be renamed to "air".
Main question: Keep an `ml.py` with `from ray.air import *` for some level of backwards compatibility?
I'd go for no, to force people to use the new structure.
In this PR we simulate the case where Serve can continue to function even when the GCS is down, and reconfiguration continues to work once the GCS is back.
To make it close to the real-world case, Docker is used for isolation:
- It starts a head node (0 CPUs) and a worker node
- It tries the basic functionality and makes sure it's working
- It kills the GCS and makes sure everything keeps working
- It restarts the GCS and makes sure reconfiguration continues to work
These are the basic cases for Serve HA. We'll add more once we get better integrations.
The AIR CI build has been failing on master since #25022.
#25022 moved the tests that require credentials, but we still left the bazel command in the build pipeline. So even though all the tests are passing, the buildkite stage itself was failing since it tries to run tests that require credentials, but those tests no longer exist in the directory. This is only a problem for the master build, since we don't run this command for PR builds.
Since Ray supports Redis as a storage backend, we should ensure the code path with Redis as storage is still covered end-to-end.
These tests haven't run for a while since we switched to memory mode by default. This PR fixes that and makes them run with every commit.
In the future, if we support more and more storage backends, this should be revised to be more efficient and selective. But for now I think the cost should be ok.
This PR is part of GCS HA testing-related work.
Currently, we are not running doc notebooks in CI due to a bazel misconfiguration - we are using `glob` in a top level package in order to get the paths for the notebooks, but those are contained inside subpackages, which glob purposefully ignores. Therefore, the lists of notebooks to run are empty. This PR fixes that by:
* Running the `py_test_run_all_notebooks` macro inside the relevant subpackages
* Editing the `test_myst_doc.py` script to allow for a recursive search for the target file, allowing us to deal with mismatches between the `name` and `data` arguments in `py_test_run_all_notebooks` (see the sketch after this list)
* Setting the `allow_empty=False` flag inside `glob` calls in our macros to ensure that this oversight is caught early
* Enabling detection of changes in doc folder for `*.ipynb` and `BUILD` files
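A hypothetical sketch of the recursive lookup added to `test_myst_doc.py`; the real script's argument handling may differ:
```
from pathlib import Path

def find_notebook(search_root: str, target: str) -> Path:
    """Recursively search `search_root` for a file matching `target`'s filename.

    This tolerates mismatches between the `name` and `data` arguments in
    `py_test_run_all_notebooks`, where the path prefix may not line up.
    """
    matches = sorted(Path(search_root).rglob(Path(target).name))
    if not matches:
        raise FileNotFoundError(f"{target} not found under {search_root}")
    return matches[0]
```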
This PR also adds a GPU runner for doc tests, allowing one of our examples to pass - and setting the infra for more to come. Finally, a misconfigured path for one set of doc tests is also fixed.