Automatically enable GPU prediction for Predictors if num_gpus is set for the PredictorDeployment.
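As a rough sketch of the user-facing effect (import paths and exact signatures below are assumptions and may differ between Ray versions): requesting GPUs on the deployment should now be enough to get GPU prediction, without a separate flag.

```python
from ray import serve
# Import paths are assumptions; these classes may live elsewhere depending on
# the Ray version.
from ray.serve.air_integrations import PredictorDeployment
from ray.train.torch import TorchPredictor
from ray.air.checkpoint import Checkpoint

# Placeholder checkpoint path for illustration only.
checkpoint = Checkpoint.from_directory("path/to/checkpoint")

serve.run(
    PredictorDeployment.options(
        name="torch_predictor",
        # With this change, setting num_gpus here should be enough for the
        # wrapped Predictor to run prediction on GPU.
        ray_actor_options={"num_gpus": 1},
    ).bind(TorchPredictor, checkpoint)
)
```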
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Integration between Ray Serve and Gradio. Users of Gradio can wrap their Gradio app in a Serve deployment by using `GradioIngress`, and scale it up through more replicas or more CPU/GPU resources.
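A rough sketch of the intended usage (the exact `GradioIngress` API is introduced in this PR, so the import path and constructor shown here are assumptions): a regular Gradio app is built as usual and then wrapped in a Serve deployment, which can be scaled through replicas and CPU/GPU resources.

```python
import gradio as gr
from ray import serve
# Import path is an assumption; see the integration added in this PR.
from ray.serve.gradio_integrations import GradioIngress


def greet(name: str) -> str:
    return f"Hello {name}!"


# A plain Gradio app, unchanged from how it would normally be written.
def build_app() -> gr.Interface:
    return gr.Interface(fn=greet, inputs="text", outputs="text")


# Wrap the Gradio app in a Serve deployment; replicas and resources are
# controlled via the usual Serve options.
@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 2})
class MyGradioServer(GradioIngress):
    def __init__(self):
        # Assumed constructor: takes a builder that returns the Gradio app.
        super().__init__(build_app)


serve.run(MyGradioServer.bind())
```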
Recently there have been a number of CI test failures due to direct or transitive dependency version upgrades. Printing out environment information for each test suite allows us to quickly check the diff between failed and successful runs.
**Notes:**
1. In this PR I just manually added `./ci/env/env_info.sh` to each test suite. We may want to generalize this in the future.
2. This is just for CI now, but is applicable to release tests as well.
Signed-off-by: Matthew Deng <matt@anyscale.com>
When cleaning up after the k8s operator tests, we should always delete the k8s cluster, even if something went wrong (in fact, it's not clear we even need to clean up the resources within the cluster).
Signed-off-by: Alex Wu <itswu.alex@gmail.com>
Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
The latest PyTorch version has wheels for CUDA 11.6. Per user request, adding a CUDA 11.6 image as part of our build pipeline.
## Why are these changes needed?
When GCS restarts, the raylet sometimes needs a while to reconnect to it; for example, in a k8s environment, it can take a while for the GCS to be moved behind the service. This PR tries to fix this by allowing a longer timeout for the first ping after GCS restarts.
Once GCS gets the first ping, it just uses the regular timeout instead.
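The actual change lives in the C++ GCS client/raylet code; the following is only a conceptual sketch of the timeout logic, with illustrative names and values.

```python
FIRST_PING_TIMEOUT_S = 60    # generous timeout while GCS is still coming back up
REGULAR_PING_TIMEOUT_S = 5   # normal health-check timeout afterwards


class GcsHealthChecker:
    """Illustrative only; `gcs_client.check_alive` is a placeholder call."""

    def __init__(self, gcs_client):
        self._gcs_client = gcs_client
        self._got_first_ping = False

    def ping(self) -> bool:
        # Use the longer timeout only until the first ping after a (re)start
        # succeeds; after that, fall back to the regular timeout.
        timeout = (
            REGULAR_PING_TIMEOUT_S if self._got_first_ping else FIRST_PING_TIMEOUT_S
        )
        ok = self._gcs_client.check_alive(timeout=timeout)
        if ok:
            self._got_first_ping = True
        return ok
```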
The previously observed Python grpc warning / logspam seems to have been fixed for grpcio >= 1.48, and users would like to upgrade beyond grpcio 1.43 for better M1 support. However, grpcio 1.48 has not been released yet, so there is still a risk that this change needs to be reverted if a problem is discovered later with Ray nightly + grpcio 1.48.
- Stop using the dot command to run the ci.sh script: it doesn't fail the build when the command fails on Windows, and it is generally dangerous since it makes unexpected changes to the current shell.
- Fix Windows build issues uncovered by this change.
The tests have been running for 1-2 months, and the overall observation is that they are not very useful for catching actual regressions; basically, we didn't notice any regressions. Stop this test for now to save some resources.
From the message:
```
[ OK ] SyncerTest.TestMToN (13132 ms)
[----------] 5 tests from SyncerTest (43175 ms total)
[----------] Global test environment tear-down
[==========] 8 tests from 2 test suites ran. (43176 ms total)
[ PASSED ] 8 tests.
external/com_github_grpc_grpc/src/core/lib/iomgr/ev_posix.cc:314:19: runtime error: member access within null pointer of type 'const struct grpc_event_engine_vtable'
```
So far, this can only be reproduced by running the test with Bazel; it doesn't reproduce under gdb. It seems to be an issue with grpc, possibly the reactor API.
Given that the ASAN test, which is supposed to catch this kind of issue, runs fine, and that considerable time has been spent investigating this one without progress, skip this test for now.
In this PR we simulate the case where Serve can continue to function even when GCS is down, and reconfiguration continues to work once GCS is back.
To make it close to the real-world case, Docker is used for isolation:
- It starts a head node (0 CPUs) and a worker node.
- It tries the basic functionality and makes sure it's working.
- It kills GCS and makes sure everything keeps working.
- It restarts GCS and makes sure reconfiguration continues to work.
These are the basic cases for Serve HA; we'll add more once we have better integrations (a conceptual sketch of the flow is shown below).
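A conceptual sketch of the flow, purely illustrative: the real test drives Docker containers, and helper names like `kill_gcs`, `restart_gcs`, and `redeploy_app` are placeholders rather than actual APIs.

```python
import requests


def test_serve_survives_gcs_failure(cluster):
    # 1. Basic functionality works on a head node (0 CPUs) + worker node setup.
    assert requests.get("http://localhost:8000/app").text == "ok"

    # 2. Kill GCS: already-deployed applications must keep serving traffic.
    cluster.kill_gcs()
    assert requests.get("http://localhost:8000/app").text == "ok"

    # 3. Bring GCS back: reconfiguration (e.g. a redeploy) must work again.
    cluster.restart_gcs()
    cluster.redeploy_app()
    assert requests.get("http://localhost:8000/app").text == "ok"
```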
Since Ray supports Redis as a storage backend, we should ensure the code path with Redis as storage is still being covered e2e.
These tests haven't run for a while since we switched to memory mode by default. This PR fixes this and makes them run with every commit.
In the future, if we support more and more storage backends, this should be revised to be more efficient and selective, but for now the cost should be acceptable.
This PR is part of GCS HA testing-related work.
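As a minimal sketch of what exercising the Redis-backed path end to end could look like (assuming the external Redis address is passed via the `RAY_REDIS_ADDRESS` environment variable; the exact mechanism may differ across Ray versions):

```python
import os
import subprocess

import ray

# Point the head node at an external Redis instance (assumed mechanism).
os.environ["RAY_REDIS_ADDRESS"] = "127.0.0.1:6379"
subprocess.check_call(["ray", "start", "--head"])

ray.init(address="auto")


@ray.remote
def f() -> int:
    return 1


# Even a trivial task goes through GCS metadata paths now backed by Redis.
assert ray.get(f.remote()) == 1
```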
Currently, all jobs that build wheels put them into the artifacts directory and upload them. This leads to the wheels being overwritten on S3 multiple times. This is not a huge problem since ingress is free, but in order to have a single point of reference, it might be beneficial to limit wheel uploading to a single Buildkite job. Recently, this has led to interference with stale artifact directories.
The downside here is that if the "Wheels & Jars" build fails randomly, the wheels will not be available on S3 - previously, they were also uploaded by several other jobs.
- Closes #23874 by fixing a typo ("num_gpus" -> "num-gpus").
- Adds end-to-end test logic confirming the fix.
- Adds end-to-end test logic confirming autoscaling with custom resources works.
- Slightly refines developer instructions.
- Deflakes test logic a bit by allowing for the event that the head pod changes its identity as the Ray cluster starts up.
This PR focuses on updating syncer-related code and comments from #23660 to reduce the code size.
- Rename Snapshot/Update -> CreateSyncMessage/ConsumeSyncMessage.
- Make the ray syncer test work even when we add more components to the protobuf.
- Make the ray syncer able to reconnect to a new node.
Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.
Details:
- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
## Why are these changes needed?
Linkcheck is inherently flaky, so separate it from the normal LINT build which is never flaky. This also separates the verbose linkcheck logs, making it easier to read the LINT output.
This PR adds a test of KubeRay autoscaler integration to the Ray CI.
- Tests scaling with autoscaler.sdk.request_resources (see the sketch below)
- Tests autoscaler response to RayCluster CR change
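The first scenario uses the public autoscaler SDK; a small sketch of that part (the actual assertions about pod counts live in the CI test):

```python
import ray
from ray.autoscaler.sdk import request_resources

# Connect to the running KubeRay cluster.
ray.init(address="auto")

# Ask the autoscaler for at least 4 CPUs worth of workers; KubeRay should
# respond by adding worker pods until the request is satisfied.
request_resources(num_cpus=4)

# Dropping the request back to 0 allows the cluster to scale back down.
request_resources(num_cpus=0)
```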
#22749 broke release unit tests by not providing a legacy key - that key should be optional because we will be dealing with non-legacy tests soon.
Additionally, for some reason the unit tests pass on Buildkite while they fail locally and in the release test pipeline. I'm investigating this now...
Adds a unit-tested and restructured ray_release package for running release tests.
Relevant changes in behavior:
By default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can a) specify a different commit hash, b) specify a wheels URL (which we will also wait for to become available), or c) specify a branch (or user/branch combination), in which case the latest available wheels will be used (e.g., if master is passed, the behavior matches the old default behavior).
The main subpackages are:
- Cluster manager: creates cluster envs/computes, starts the cluster, terminates the cluster
- Command runner: runs commands, e.g. as client commands or SDK commands
- File manager: uploads/downloads files to/from the session
- Reporter: reports results (e.g. to a database)
Much of the code base is unit tested, but there are probably some pieces missing.
Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_
Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023
Adding a minimal test suite to catch any regressions from accidentally adding backend imports (e.g. `torch`, `tensorflow`, `horovod`) to the main import path.
**Example:** If I'm running Ray Train with `tensorflow`, I should not be required to have `torch` installed.
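A minimal sketch of the kind of check this test suite performs (test and module names are illustrative, not the actual test files): importing the main Ray Train entry point must not pull in any backend.

```python
import subprocess
import sys


def test_no_backend_imports_on_train_import():
    # Run in a fresh interpreter so modules imported by other tests don't interfere.
    code = (
        "import sys; import ray.train; "
        "assert not any(m in sys.modules for m in ('torch', 'tensorflow', 'horovod'))"
    )
    subprocess.check_call([sys.executable, "-c", code])
```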