- Closes#23874 by fixing a typo ("num_gpus" -> "num-gpus").
- Adds end-to-end test logic confirming the fix.
- Adds end-to-end test logic confirming autoscaling with custom resources works.
- Slightly refines developer instructions.
- Deflakes test logic a bit by allowing for the event that the head pod changes its identity as the Ray cluster starts up.
This PR focuses on updating syncer-related code and comments from this #23660 to reduce the code size.
Update Snapshot/Update -> CreateSyncMessage/ConsumeSyncMessage
Make ray syncer test work even when we add more components in the protobuf
Make ray syncer able to reconnect to a new node.
Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.
Details:
- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
Why are these changes needed?
Linkcheck is inherently flaky, so separate it from the normal LINT build which is never flaky. This also separates the verbose linkcheck logs, making it easier to read the LINT output.
We sometimes end up with stale wheel uploads from previous runs of a Buildkite agent. The result is that commit wheels are being overwritten from old build jobs - effectively breaking the wheel build logic.
Example:
This Agent: https://buildkite.com/organizations/ray-project/agents/4b955117-2f6c-4849-b703-3457daf69f89
- builds wheels (in post-wheels tests) for a35ebc945b
- and then runs both the Ray CPP worker and the Train + Tune tests in 6746e9f
- Usually these two tests shouldn't provide artifacts at all, but they do - these are the wheels from a35ebc945b though! Meaning these are uncleaned leftovers from the first build task.
- See here for proof of artifact upload: https://buildkite.com/ray-project/ray-builders-pr/builds/27622#d11bc514-ebd8-4e0c-a2ce-826b9bad27de
The solution is thus to always clean up the artifacts directory in the worker, i.e. `rm -rf /artifact-mount/*`
This PR adds two of such clean up instructions - once before commands are run and once after artifacts are uploaded. We can probably just do either, but it doesn't hurt to have both.
This PR adds a test of KubeRay autoscaler integration to the Ray CI.
- Tests scaling with autoscaler.sdk.request_resources
- Tests autoscaler response to RayCluster CR change
#22749 broke release unit tests by not providing a legacy key - that key should be optional because we will b dealing with non-legacy tests soon.
Additionally, for some reason the unit tests pass on buildkite while they fail locally and in the release test pipeline. I'm investigating this now...
Follow-up to #22748, enabling tests in CI.
Conditions: A new RAY_CI_ML_AFFECTED condition is added for this test suite. The package currently depends on Ray Data, and will be triggered accordingly.
Dependencies: Adding DATA_PROCESSING_TESTING dependencies (set for install-dependencies.sh) for now.
There are problems with running C++ tests in MacOS 10.15 Catalina, when upgrading to the newest grpc due to dynamic linking: #22384 (comment). The problem does not exist for Python tests in Catalina, or in C++ tests of other systems.
Upgrading MacOS CI from Catalina is also blocked in the short term: ray-project/buildkite-ci-stack#24 (comment)
So working around the issue by using static linking for C++ tests on Mac.
This updates the GPU image to run on the same Ubuntu version as the regular (non-GPU) image. This implicitly updates cmake etc for compatibility with newer versions of downstream libraries, e.g. Horovod.
Adds a unit-tested and restructured ray_release package for running release tests.
Relevant changes in behavior:
Per default, Buildkite will wait for the wheels of the current commit to be available. Alternatively, users can a) specify a different commit hash, b) a wheels URL (which we will also wait for to be available) or c) specify a branch (or user/branch combination), in which case the latest available wheels will be used (e.g. if master is passed, behavior matches old default behavior).
The main subpackages are:
Cluster manager: Creates cluster envs/computes, starts cluster, terminates cluster
Command runner: Runs commands, e.g. as client command or sdk command
File manager: Uploads/downloads files to/from session
Reporter: Reports results (e.g. to database)
Much of the code base is unit tested, but there are probably some pieces missing.
Example build (waited for wheels to be built): https://buildkite.com/ray-project/kf-dev/builds/51#_
Wheel build: https://buildkite.com/ray-project/ray-builders-branch/builds/6023
Adding a minimal test suite to catch any regressions from accidentally adding backend imports (e.g. `torch`, `tensorflow`, `horovod`) to the main import path.
**Example:** If I'm running Ray Train with `tensorflow`, I should not be required to have `torch` installed.
Instead of installing dependencies in each Buildkite job, let's move this to the Dockerfile instead.
This will update GPU tests to always use Python 3.7.
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation.
Note that this PR requires to introduce "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them to minimal makes thing very easy. Please see the below for the reasoning.
In test_client_reconnect.py, each test case starts a Ray cluster via client server's default_connect_handler(). The Ray cluster shuts down implicitly when the start_middleman_server() ended and Python GC'es the client server. After turning on GCS pubsub, the time when client server is GC'ed changes. Sometimes the Ray cluster from a previous test cases stays alive after the next test case starts and shuts down later, leading to test failures due to lost data or crashes (race during worker shutdown, will be investigated separately).
This PR makes sure each test case shuts down its Ray cluster.
This PR fixed and reenabled tests in HA mode
- //python/ray/tests:test_healthcheck
- //python/ray/tests:test_autoscaler_drain_node_api
- //python/ray/tests:test_ray_debugger
External Redis should still be supported with GCS bootstrapping, to avoid breaking users.
In GCS mode, some logic are removed for external Redis:
- Printing external Redis addresses to terminal: hard to implement across `ray start`, `ray.init()` and Ray cluster util.
- Starting local Redis if external Redis is unavailable: failing loudly here seems more appropriate.
Also, re-enable a few tests which restarts GCS in GCS bootstrapping mode, by using external Redis for KV storage.
Currently we install OpenSSH on the fly in fake multinode docker testing. Instead we can speed testing up a fair bit by building a Docker image which includes OpenSSH first and then run tests with this image.
(Comment from the PR:)
If a GRPC call exceeds timeout, the calls is cancelled at client side but server may still reply to it, leading to missed messages and test failures. Using a sequence number to ensure no message is dropped can be the long term solution,
but its complexity and the fact the Ray subscribers do not use deadline in production makes it less preferred.
Therefore, a simpler workaround is used instead: a different subscriber is used for each get_error_message() call.
Also, re-enable some additional tests in GCS HA mode.
Following #18987 this PR adds a docker-compose based local multi node cluster.
The fake multinode docker comprises two parts. The docker_monitor.py script is a watch script calling docker compose up whenever the docker-compose.yaml changes. The node provider creates and updates the docker compose according to the autoscaling requirements.
This mode fully supports autoscaling and comes with test utilities to start and connect to docker-compose autoscaling environments. There's also a sample test case showing how this can be used.