In test_client_reconnect.py, each test case starts a Ray cluster via client server's default_connect_handler(). The Ray cluster shuts down implicitly when the start_middleman_server() ended and Python GC'es the client server. After turning on GCS pubsub, the time when client server is GC'ed changes. Sometimes the Ray cluster from a previous test cases stays alive after the next test case starts and shuts down later, leading to test failures due to lost data or crashes (race during worker shutdown, will be investigated separately).
This PR makes sure each test case shuts down its Ray cluster.
This PR fixed and reenabled tests in HA mode
- //python/ray/tests:test_healthcheck
- //python/ray/tests:test_autoscaler_drain_node_api
- //python/ray/tests:test_ray_debugger
When cleanup the function table, we use the prefix to delete the data. But right now prefix contains binary data and it won't work well with redis keys/scan which use `*` in the pattern.
For example, when job id increases to 41, it'll delete the keys for job 1 which leads to the new worker failing to import the function.
This PR uses hex of job id to avoid this.
This PR adds pandas block format support by implementing `PandasRow`, `PandasBlockBuilder`, `PandasBlockAccessor`.
Note that `sort_and_partition`, `combine`, `merge_sorted_blocks`, `aggregate_combined_blocks` in `PandasBlockAccessor` redirects to arrow block format implementation for now. They'll be implemented in a later PR.
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Why are these changes needed?
fix dlmalloc allocate bug, details in here #21310
* fix dlmalloc bug
* make lint happy
* make lint happy
* fix by comment
* use _check_spilled_mb
* add cpp UT
If a task is re-executed on failure, it will deterministically generate the same IDs for any ray.put or .remote task calls because it uses its own task ID as a seed. This can cause problems if those objects conflict with previous versions that still exist in the cluster.
This PR adds the execution attempt number to the current task ID seed. This avoids collisions with any ObjectIDs generated by the previous execution attempt of the task.
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
In Ray, functions are exported to the function table during runtime. But it's not cleaned up after use. This PR garbage collects the resource when there is no job/detached actor referencing the resource.
Ideally, we should move the function table imports/exports feature to core, so gcs function manager is introduced, and currently, it's for reference counting only.
External Redis should still be supported with GCS bootstrapping, to avoid breaking users.
In GCS mode, some logic are removed for external Redis:
- Printing external Redis addresses to terminal: hard to implement across `ray start`, `ray.init()` and Ray cluster util.
- Starting local Redis if external Redis is unavailable: failing loudly here seems more appropriate.
Also, re-enable a few tests which restarts GCS in GCS bootstrapping mode, by using external Redis for KV storage.
Currently we install OpenSSH on the fly in fake multinode docker testing. Instead we can speed testing up a fair bit by building a Docker image which includes OpenSSH first and then run tests with this image.
This PR adds a new method to the Searcher class, add_evaluated_trials. This method wraps around add_evaluated_point and allows the user to pass a Trial, list of Trials or ExperimentAnalysis to load into the searcher. Furthermore, this PR updates the HEBO version to the latest and removes outdated documentation, and adds add_evaluated_point methods to Dragonfly and SkOpt searchers.
tune does not run smoothly on Windows. This cleans up some blockers:
- use cross-platform shutils.get_terminal_size instead of Popen(stty)
- somehow Trainer.workers is None at the end of test_commands.py, so the cleanup command was erroring. The error was not fatal, but was printing in the logs.
- if run locally, the log files are all written to the same location, so the rync-based syncing solution is not needed. This is the real fix for issue #20747
`test_failure_2.py::test_gcs_server_failiure_report` and `test_gcs_fault_tolerance.py::test_gcs_server_restart_during_actor_creation` cannot pass in GCS pubsub mode with the existing logic. Disable these tests in GCS pubsub mode and add comment about how we may fix them.
Also, suppress exceptions when sync subscribers are disconnected from GCS.
I can push changes in this PR to #21513 as well.
Fixes a small bug where we pop from the resources dict without making a copy, emptying the head node resources. This sometimes leads to empty head node resources.
(Comment from the PR:)
If a GRPC call exceeds timeout, the calls is cancelled at client side but server may still reply to it, leading to missed messages and test failures. Using a sequence number to ensure no message is dropped can be the long term solution,
but its complexity and the fact the Ray subscribers do not use deadline in production makes it less preferred.
Therefore, a simpler workaround is used instead: a different subscriber is used for each get_error_message() call.
Also, re-enable some additional tests in GCS HA mode.
Following #18987 this PR adds a docker-compose based local multi node cluster.
The fake multinode docker comprises two parts. The docker_monitor.py script is a watch script calling docker compose up whenever the docker-compose.yaml changes. The node provider creates and updates the docker compose according to the autoscaling requirements.
This mode fully supports autoscaling and comes with test utilities to start and connect to docker-compose autoscaling environments. There's also a sample test case showing how this can be used.
These changes add a set of improvements to enable automatic creation and update of CloudWatch dashboards when provisioning AWS Autoscaling clusters. Successful implementation of these improvements will allow AWS Autoscaler users to:
1. Get rapid insights into their cluster state via CloudWatch dashboards.
2. Allow users to update their CloudWatch dashboard JSON configuration files during Ray up execution time.
Notes:
1. This PR is a follow-up PR for #18619, adds dashboard support.
In a [recent review](https://discuss.python.org/t/experience-with-python-3-11-in-fedora/12911) of the experience of the Fedora team porting packages to the upcoming python 3.11, they remarked that most of the work was in removing deprecated aliases in unittest. I came across a few of these when looking at unrelated test failures, the DeprecationWarnings caught my eye. So a made a quick sweep of the code, using `git grep` to find occurances of the deprecated aliases:
old | new
---|---
assertEquals | assertEqual
assertNotEquals | assertNotEqual
assertRaisesRegexp | assertRaisesRegex