Instead of installing dependencies in each Buildkite job, let's move this to the Dockerfile instead.
This will update GPU tests to always use Python 3.7.
This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation.
Note that this PR requires to introduce "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them to minimal makes thing very easy. Please see the below for the reasoning.
In test_client_reconnect.py, each test case starts a Ray cluster via client server's default_connect_handler(). The Ray cluster shuts down implicitly when the start_middleman_server() ended and Python GC'es the client server. After turning on GCS pubsub, the time when client server is GC'ed changes. Sometimes the Ray cluster from a previous test cases stays alive after the next test case starts and shuts down later, leading to test failures due to lost data or crashes (race during worker shutdown, will be investigated separately).
This PR makes sure each test case shuts down its Ray cluster.
This PR fixed and reenabled tests in HA mode
- //python/ray/tests:test_healthcheck
- //python/ray/tests:test_autoscaler_drain_node_api
- //python/ray/tests:test_ray_debugger
External Redis should still be supported with GCS bootstrapping, to avoid breaking users.
In GCS mode, some logic are removed for external Redis:
- Printing external Redis addresses to terminal: hard to implement across `ray start`, `ray.init()` and Ray cluster util.
- Starting local Redis if external Redis is unavailable: failing loudly here seems more appropriate.
Also, re-enable a few tests which restarts GCS in GCS bootstrapping mode, by using external Redis for KV storage.
Currently we install OpenSSH on the fly in fake multinode docker testing. Instead we can speed testing up a fair bit by building a Docker image which includes OpenSSH first and then run tests with this image.
(Comment from the PR:)
If a GRPC call exceeds timeout, the calls is cancelled at client side but server may still reply to it, leading to missed messages and test failures. Using a sequence number to ensure no message is dropped can be the long term solution,
but its complexity and the fact the Ray subscribers do not use deadline in production makes it less preferred.
Therefore, a simpler workaround is used instead: a different subscriber is used for each get_error_message() call.
Also, re-enable some additional tests in GCS HA mode.
Following #18987 this PR adds a docker-compose based local multi node cluster.
The fake multinode docker comprises two parts. The docker_monitor.py script is a watch script calling docker compose up whenever the docker-compose.yaml changes. The node provider creates and updates the docker compose according to the autoscaling requirements.
This mode fully supports autoscaling and comes with test utilities to start and connect to docker-compose autoscaling environments. There's also a sample test case showing how this can be used.
CoreWorker hangs there before exiting if gcs exits first due to in correct ordering of destruction. This PR fixed this. It'll stop gcs client first and then job the thread.
After this change in GCS bootstrapping mode, Redis no longer starts and `address` is treated as the GCS address of the Ray cluster.
Co-authored-by: Yi Cheng <chengyidna@gmail.com>
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
This is part of gcs ha project. This PR try to bootstrap dashboard with gcs address instead of redis.
Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
This is part of #21129
This PR tries to cover the cpp/ray part of the bootstrap, some updates there:
remove the unused function/tests
some API updates
Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
`//python/ray/tests:test_client_reconnect` seems to only flake under GCS HA build. The client server starts to shutdown under injected failures, unlike the behavior without GCS KV or pubsub.
`//python/ray/tests:test_multi_node_3` seems to flake more often under GCS HA build, although it is still flaky without GCS HA feature flags. It seems raylet termination did not notify other processes properly.
Disable these two tests before they are fixed.
This is part work of redis removal. In this PR we introduced a new mode for internal kv, memory mode.
There are two ways to address this:
- Update store client and use store client in internal kv
- Add memory table into internal kv directly.
The former one actually is a better choice since it put everything related to storage into a lowerlevel. But it's pretty hard to do this now, since internal kv use hset/hget and redis store client use set/get, so the data will not be compatible and it'll be a brake change.
So the easier way to do this is 2) and it's what this PR doing.
Next: use the flag for store client
## Why are these changes needed?
This change adds Python publisher and subscriber in `gcs_utils.py`, and GRPC handler on GCS for publishing iva GCS. Error info is migrated to use the GCS-based pubsub, if feature flag `RAY_gcs_grpc_based_pubsub=true`.
Also, add a `--gcs-address` flag to some Python processes. It is not set anywhere yet, but will be set aftering Redis-less bootstrapping work.
Unit tests are added for the Python publisher and subscriber. Migrated error info publishers and subscribers are tested with existing unit tests, e.g. tests calling `ray._private.test_utils.get_error_message()` to ensure error info is published.
GCS based pubsub has gaps in handling deadline, cancelled requests and GCS restarts. So 3 more unit tests are disabled in the `HA GCS` mode. They will be addressed in a separate change.
## Related issue number