Commit graph

274 commits

Author SHA1 Message Date
Amog Kamsetty
ea6d53dbf3
[CI/AIR] Cleanup Deprecated ml_utils (#28278)
ray.util.ml_utils was deprecated in Ray 2.0. This PR does some final cleanup of our CI pipeline

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2022-09-06 08:39:21 +01:00
Kai Fricke
57484b28cf
[ci/air] Only run examples that need credentials in branch builds (#28260) 2022-09-02 09:10:36 -07:00
Kai Fricke
5d31f2d4bc
[tune] Run SigOpt tests in CI (#28225) 2022-09-02 09:10:01 -07:00
zcin
4c970cc882
[serve] Visualize Deployment Graph with Gradio (#27897) 2022-09-01 10:46:15 -07:00
Amog Kamsetty
acc4903db1
[AIR/Serve] Auto-enable GPU Prediction (#26549)
Automatically enable GPU prediction for Predictors if num_gpus is set for the PredictorDeployment.

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2022-08-29 13:47:56 -07:00
Akash Patel
96d579a4fe
Add support for Python 3.10 (#21221)
Signed-off-by: acxz <17132214+acxz@users.noreply.github.com>
2022-08-26 11:01:12 -07:00
zcin
8cb09a9fc5
Revert "Revert "[serve] Integrate and Document Bring-Your-Own Gradio Applications"" (#27662) 2022-08-12 15:12:20 -07:00
zcin
64c550a2b1
Revert "[serve] Integrate and Document Bring-Your-Own Gradio Applications (#26403)" (#27587)
This reverts commit 8a9d994dd0.
2022-08-06 21:38:55 -07:00
zcin
8a9d994dd0
[serve] Integrate and Document Bring-Your-Own Gradio Applications (#26403)
Integration between Ray Serve and Gradio. Users of Gradio can wrap their Gradio app in a Serve deployment by using `GradioIngress`, and scale it up through more replicas or more CPU/GPU resources.
2022-08-05 11:31:00 -05:00
SangBin Cho
5298ee83b2
[Test] Revert (partially) Fix windows buildkite (#26615) (#27495)
Root cause:
https://www.shell-tips.com/bash/source-dot-command/#gsc.tab=0
Using . will execute the command in the "current shell" in a bash script. It looks like removing . command from ci.sh init means that we will lose the set -eo command used within ci.sh init applied to next test running commands because set -eo is called within a child process, not the current shell (so the future command won't have the set -eo configured).
2022-08-04 13:55:48 -07:00
matthewdeng
01d473b355
[ci] print out test environment info for all python tests (#27312)
Recently there have been a number of CI test failures due to direct or transitive dependency version upgrades. Printing out environment information for each test suite allows us to quickly check the diff between failed and successful runs.

**Notes:**
1. In this PR I just manually added `./ci/env/env_info.sh` to each test suite. We may want to generalize this in the future.
2. This is just for CI now, but is applicable to release tests as well.

Signed-off-by: Matthew Deng <matt@anyscale.com>
2022-08-01 09:55:13 +01:00
Simon Mo
1f1234fde1
[Serve] Disable Serve on macOS Round 2 (#27271) 2022-07-29 12:04:34 -07:00
Chen Shen
fda345335a
Revert "Allow grpcio >= 1.48 (#26765)" (#27244)
This reverts commit 6acd0a4c9b.
2022-07-28 22:25:21 -07:00
Simon Mo
ca9e8b3d0b
[Serve] Disable macOS tests (#27218) 2022-07-28 16:34:46 -07:00
Amog Kamsetty
862d10c162
[AIR] Remove ML code from ray.util (#27005)
Removes all ML related code from `ray.util`

Removes:
- `ray.util.xgboost`
- `ray.util.lightgbm`
- `ray.util.horovod`
- `ray.util.ray_lightning`

Moves `ray.util.ml_utils` to other locations

Closes #23900

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-07-27 14:24:19 +01:00
Alex Wu
837ef777a3
[ci][kuberay] Always cleanup KinD cluster (#27073)
When cleaning up after the k8s operator tests, we should always delete the k8s cluster even if something went wrong (in fact, it's not clear we even need to clean up the resources within the cluster. 

Signed-off-by: Alex Wu <itswu.alex@gmail.com>
2022-07-26 21:46:19 -07:00
Amog Kamsetty
aa8a7dcb48
[Docker] Add Cuda 11.6 support (#26695)
Signed-off-by: Amog Kamsetty amogkamsetty@yahoo.com

Latest Pytorch version has wheels for CUDA 11.6. Per user request, adding a 11.6 image as part of our build pipeline.
2022-07-26 10:15:53 -07:00
Yi Cheng
33997da299
[core] Introduce a flag which allows a longer timeout for raylet when GCS restarts. (#26919)
## Why are these changes needed?
When GCS restarts, sometimes, raylet needs a while to reconnect to the GCS, for example, in k8s env, it needs a while to move GSC to the service. This PR try to fix this by allowing a longer timeout for the first ping when GCS restarts.

Once GCS get the first ping, it'll just use the regular timeout instead.
2022-07-25 16:57:19 -07:00
mwtian
6acd0a4c9b
Allow grpcio >= 1.48 (#26765)
The previously observed Python grpc warning / logspam seems to have been fixed for grpcio >= 1.48. And users would like to upgrade beyond grpcio 1.43 for better M1 support. However, grpcio 1.48 has not been released yet, so there is still a risk this change needs to be reverted if any problem is discovered later with Ray nightly + grpcio 1.48.
2022-07-21 10:03:41 -07:00
Jiajun Yao
1b2b526a2b
Fix windows buildkite (#26615)
- Stop using dot command to run ci.sh script: it doesn't fail the build if the command fails for windows and is generally dangerous since it will make unexpected changes to the current shell.
- Fix uncovered windows build issues.
2022-07-18 09:15:49 -07:00
Tao Wang
6ddbdaa81a
[CI]Split C++, Java tests in MacOS from the big one (#26434) 2022-07-14 18:33:47 -07:00
Antoni Baum
9b2cd29511
[CI] Install Horovod in doc tests to fix notebook (#26476)
Fixes the Horovod notebook example as found in #26410 by installing Horovod in doc tests jobs.

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2022-07-13 16:27:20 +01:00
xwjiang2010
03671c961e
[CI] run air related doc/example tests as part of pre-submit CI. (#26466)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
2022-07-12 18:30:37 +01:00
Dmitri Gekhtman
8f8f036957
[autoscaler][kuberay] Deflake KubeRay autoscaling test (#26411)
Improves stability of KubeRay autoscaling test.
2022-07-12 00:56:36 -07:00
Jiao
d95dc2f2e5
[AIR][GPU Batch Prediction] Add basic support for GPU batch prediction (#26251)
This PR adds GPU support for pytorch and tensorflow predictor, as well as automatic setting `use_gpu` flag in `BatchPredictor`.

Notable changes:
- Added `use_gpu` flag in the constructor of `TorchPredictor` and `TensorflowPredictor` (note it's slightly different from our latest design doc that puts this flag at `predict()` call)
- Added `use_gpu` flag to `SklearnPredictor` so its interface is compatible with `BatchPredictor`
- Code to move both model weights and input tensor to default visible GPU at index 0 if flag is set 
- parametrized existing predictor tests to use GPU for both CPU & GPU coverage
- Changed BUILD CI tests with an added `gpu` tag (I'm not 100% sure if that's a right way tho)

Follow ups:

https://github.com/ray-project/ray/issues/26249 is created in case our host has multiple GPU devices. It's a bit out of scope for this PR, but for GPU batch inference ideally we should be able to evenly use all GPU devices on host where CPU & DRAM are busy with pre-fetching + data movement to GPU. We might approximately do the same by scheduling same # of Predictor instances on the host, but that's worth verifying once benchmarks are set.
2022-07-11 13:04:15 -07:00
Amog Kamsetty
b01e11d721
[Docker] Add support for Cuda 11.3 (#26233)
Start building Ray docker images with cuda 11.3
2022-07-10 21:50:42 -07:00
Yi Cheng
818bb78542
[ci] Stop syncer staging tests (#26273)
The tests has been running for 1-2 months, and the overall observation is that it's not very useful to catch the actual regression. Basically, we didn't notice any regression. Stop this test for now to save some resources.
2022-07-03 11:17:10 -07:00
Sven Mika
96693055bd
[RLlib] More Trainer -> Algorithm renaming cleanups. (#25869) 2022-06-20 15:54:00 +02:00
clarng
2b270fd9cb
apply isort uniformly for a subset of directories (#25824)
Simplify isort filters and move it into isort cfg file.

With this change, isort will not longer apply to diffs other than to files that are in whitelisted directory (isort only supports blacklist so we implement that instead) This is much simpler than building our own whitelist logic since our formatter runs multiple codepaths depending on whether it is formatting a single file / PR / entire repo in CI.
2022-06-17 13:40:32 -07:00
Jiao
f6735f90c7
[Ray DAG] Move dag project folder out of experimental (#25532) 2022-06-16 19:15:39 -07:00
Simon Mo
503e197f8c
[CI] Upload macOS bazel test files (#25744) 2022-06-15 10:09:04 -07:00
Antoni Baum
5e9a8eb5f6
[AIR/data] Move preprocessors to ray.data (#25599)
Moves ray.air.Preprocessor and ray.air.preprocessors to ray.data to converge on the agreed upon package structure discussed internally.
2022-06-13 12:57:59 -07:00
Amog Kamsetty
1316a2d05e
[AIR/Train] Move ray.air.train to ray.train (#25570) 2022-06-08 21:34:18 -07:00
Amog Kamsetty
80ae651f25
[Train] Clean up ray.train package (#25566) 2022-06-08 10:22:36 -07:00
Yi Cheng
acf210fcac
[flakey] Skip ray_syncer_test for ubsan. (#25477)
From the message:
```
[       OK ] SyncerTest.TestMToN (13132 ms)
[----------] 5 tests from SyncerTest (43175 ms total)

[----------] Global test environment tear-down
[==========] 8 tests from 2 test suites ran. (43176 ms total)
[  PASSED  ] 8 tests.
external/com_github_grpc_grpc/src/core/lib/iomgr/ev_posix.cc:314:19: runtime error: member access within null pointer of type 'const struct grpc_event_engine_vtable'
```

This can only be reproduced by running with Bazel test so far. With gdb, it won't be reproduced. It seems like some issue with the grpc maybe the reactor API. 
Given that the ASAN test, which is supposed to catch the issue, runs well, and a considerable time has been spent investigating this one but no progress, skip this test for now.
2022-06-04 23:06:57 -07:00
Sven Mika
b5bc2b93c3
[RLlib] Move all remaining algos into algorithms directory. (#25366) 2022-06-04 07:35:24 +02:00
Kai Fricke
4b9a89ad90
[air] Move python/ray/ml to python/ray/air (#25449)
The package "ml" should be renamed to "air".

Main question: Keep a `ml.py` with `from ray.air import *` for some level of backwards compatibility?
I'd go for no to force people to use the new structure.
2022-06-03 21:53:44 +01:00
Yi Cheng
fd0f967d2e
Revert "[RLlib] Move (A/DD)?PPO and IMPALA algos to algorithms dir and rename policy and trainer classes. (#25346)" (#25420)
This reverts commit e4ceae19ef.

Reverts #25346

linux://python/ray/tests:test_client_library_integration never fail before this PR.

In the CI of the reverted PR, it also fails (https://buildkite.com/ray-project/ray-builders-pr/builds/34079#01812442-c541-4145-af22-2a012655c128). So high likely it's because of this PR.

And test output failure seems related as well (https://buildkite.com/ray-project/ray-builders-branch/builds/7923#018125c2-4812-4ead-a42f-7fddb344105b)
2022-06-02 20:38:44 -07:00
Sven Mika
e4ceae19ef
[RLlib] Move (A/DD)?PPO and IMPALA algos to algorithms dir and rename policy and trainer classes. (#25346) 2022-06-02 16:47:05 +02:00
Yi Cheng
cb1f08a3c1
[core] Basic end-2-end multi-node tests for GCS HA in CI. (#25114)
In this PR we simulate the case where serve can continue to function even when GCS is down and the reconfig continue to work once GCS is back.

To make it close to the real-world case, the docker is used for isolation:

It starts a head node (0 cpus) and a worker node
It tried the basic function and make sure it's working
It kills GCS and make sure everything is working.
It starts GCS and make sure reconfig continues to work.
This is the basic cases for serve HA. We'll add more once we get better integrations.
2022-06-02 02:41:38 +00:00
Amog Kamsetty
983d8b3db2
[AIR] Fix failing CI on master (#25201)
The AIR CI build has been failing on master since #25022.

#25022 moved the tests that require credentials, but we left the bazel command in the build pipeline still. So even though all the tests are passing, the buildkite stage itself was failing since it tries run tests that require credentials, but these tests no longer exist in the directory. This is only a problem for master build since we don't run this command for PR builds.
2022-05-26 11:34:57 +02:00
Antoni Baum
2b6c6301e2
[CI] Fix typo in CI label (#25185) 2022-05-25 17:31:29 +02:00
Kai Fricke
d57ba750f5
[docs/air] Move upload example to docs (#25022) 2022-05-21 12:16:33 -07:00
Kai Fricke
e76efffec6
[air/docs] Move RL examples to docs (#24962)
Following #24959, this PR moves the RL examples (online/offline/serving) into the Ray ML docs. It also splits the online and offline parts.
2022-05-20 14:55:01 +01:00
Yi Cheng
8ec558dcb9
[core] Reenable GCS test with redis as backend. (#23506)
Since ray supports Redis as a storage backend, we should ensure the code path with Redis as storage is still being covered e2e.

The tests don't run for a while after we switch to memory mode by default. This PR tries to fix this and make it run with every commit.

In the future, if we support more and more storage backends, this should be revised to be more efficient and selective. But now I think the cost should be ok.

This PR is part of GCS HA testing-related work.
2022-05-19 21:46:55 -07:00
mwtian
502c3e132d
Revert "[Core] allow using grpcio > 1.44.0 (#23722)" (#24935)
This reverts commit b02029b29f.
2022-05-18 18:16:39 -07:00
SangBin Cho
fb60d68bbb
[WIP] Run minimal tests against all supported python version (#24830)
Run minimal CI tests to all Python versions.
2022-05-18 09:42:26 -07:00
Antoni Baum
1d5e6d908d
[AIR] HuggingFace Text Classification example (#24402) 2022-05-18 09:35:12 -07:00
Chen Shen
1325cf7876
[python3.10] Build py310 images (#24859)
Build python 3.10 images so we can run release tests.
2022-05-18 08:48:20 -07:00
Antoni Baum
c74886a55e
[CI] Run doc notebooks in CI (#24816)
Currently, we are not running doc notebooks in CI due to a bazel misconfiguration - we are using `glob` in a top level package in order to get the paths for the notebooks, but those are contained inside subpackages, which glob purposefully ignores. Therefore, the lists of notebooks to run are empty. This PR fixes that by:
* Running the `py_test_run_all_notebooks` macro inside the relevant subpackages
* Editing the `test_myst_doc.py` script to allow for recursive search for the target file, allowing to deal with mismatches between `name` and `data` arguments in `py_test_run_all_notebooks`
* Setting the `allow_empty=False` flag inside `glob` calls in our macros to ensure that this oversight is caught early
* Enabling detection of changes in doc folder for `*.ipynb` and `BUILD` files

This PR also adds a GPU runner for doc tests, allowing one of our examples to pass - and setting the infra for more to come. Finally, a misconfigured path for one set of doc tests is also fixed.
2022-05-17 09:50:42 +01:00