In some cases, a task that's added to `running_tasks` is never removed, which introduces wait time for all subsequent tasks due to the worker cap. One such case is lease request cancellation: the request is cancelled after `PopWorker` is called, so the task is never removed from `running_tasks`.
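A purely hypothetical sketch of the bookkeeping leak (the names below are illustrative and do not correspond to the actual C++ worker pool code):
```python
# Hypothetical illustration: if the cancellation path forgets to clean up the
# bookkeeping entry, the worker cap fills up with phantom tasks.
running_tasks = set()
WORKER_CAP = 4

def pop_worker(task_id):
    # The task counts against the worker cap as soon as a worker is requested.
    if len(running_tasks) >= WORKER_CAP:
        return None  # later tasks have to wait
    running_tasks.add(task_id)
    return f"worker-for-{task_id}"

def cancel_lease_request(task_id):
    # The fix: cancellation must also remove the task from running_tasks,
    # otherwise the slot is leaked forever.
    running_tasks.discard(task_id)
```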
This PR adds a new method to the Searcher class, `add_evaluated_trials`. This method wraps `add_evaluated_point` and allows the user to pass a Trial, a list of Trials, or an ExperimentAnalysis to load into the searcher. Furthermore, this PR updates HEBO to the latest version, removes outdated documentation, and adds `add_evaluated_point` methods to the Dragonfly and SkOpt searchers.
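A minimal usage sketch, assuming the HEBO searcher and a previous run to warm-start from (the import path and the exact signature of `add_evaluated_trials` are assumptions and may differ across Ray versions):
```python
from ray import tune
from ray.tune.suggest.hebo import HEBOSearch  # import path assumed

def trainable(config):
    tune.report(loss=(config["x"] - 2) ** 2)

search_space = {"x": tune.uniform(0, 10)}

# A previous experiment whose evaluated trials we want the searcher to reuse.
analysis = tune.run(trainable, config=search_space, metric="loss",
                    mode="min", num_samples=8)

searcher = HEBOSearch(metric="loss", mode="min")
# Accepts a Trial, a list of Trials, or an ExperimentAnalysis.
searcher.add_evaluated_trials(analysis, metric="loss", mode="min")

tune.run(trainable, config=search_space, search_alg=searcher, num_samples=8)
```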
Tune does not run smoothly on Windows. This cleans up some blockers:
- Use the cross-platform `shutil.get_terminal_size` instead of `Popen(stty)` (see the sketch after this list).
- Somehow `Trainer.workers` is `None` at the end of `test_commands.py`, so the cleanup command was erroring. The error was not fatal, but it was printed in the logs.
- If run locally, the log files are all written to the same location, so the rsync-based syncing solution is not needed. This is the real fix for issue #20747.
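For reference, `shutil.get_terminal_size` is in the standard library and works on both Windows and POSIX; a minimal sketch:
```python
import shutil

# Falls back to the given (columns, lines) when the size cannot be determined,
# e.g. when stdout is not attached to a terminal.
size = shutil.get_terminal_size(fallback=(80, 24))
print(size.columns, size.lines)
```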
`test_failure_2.py::test_gcs_server_failiure_report` and `test_gcs_fault_tolerance.py::test_gcs_server_restart_during_actor_creation` cannot pass in GCS pubsub mode with the existing logic. Disable these tests in GCS pubsub mode and add a comment about how we may fix them.
Also, suppress exceptions when sync subscribers are disconnected from GCS.
I can push changes in this PR to #21513 as well.
Fixes a small bug where we pop from the resources dict without making a copy, which sometimes empties the head node resources.
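The pattern in question, as a simplified sketch (the names here are illustrative, not the actual autoscaler code):
```python
head_node_resources = {"CPU": 8, "GPU": 1}

def strip_gpu(resources):
    # Buggy version: calling resources.pop("GPU", None) directly would mutate
    # the caller's dict and eventually empty the head node resources.
    resources = resources.copy()  # fix: work on a copy
    resources.pop("GPU", None)
    return resources

print(strip_gpu(head_node_resources))  # {'CPU': 8}
print(head_node_resources)             # unchanged: {'CPU': 8, 'GPU': 1}
```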
(Comment from the PR:)
If a gRPC call exceeds its timeout, the call is cancelled on the client side, but the server may still reply to it, leading to missed messages and test failures. Using a sequence number to ensure no message is dropped could be the long-term solution, but its complexity, and the fact that Ray subscribers do not use deadlines in production, make it less preferable.
Therefore, a simpler workaround is used instead: a different subscriber is used for each `get_error_message()` call.
Also, re-enable some additional tests in GCS HA mode.
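A sketch of the workaround's shape (the subscriber interface here is hypothetical, not the actual test utility):
```python
def get_error_message(make_subscriber, num_errors, timeout=20):
    """Illustrative only: each call creates its own, fresh subscriber so a late
    server reply to a previous, already timed-out poll can never be delivered
    to (and dropped by) a reused subscriber."""
    subscriber = make_subscriber()  # hypothetical factory for a GCS error subscriber
    subscriber.subscribe()
    try:
        messages = []
        for _ in range(num_errors):
            msg = subscriber.poll(timeout=timeout)
            if msg is None:  # poll timed out
                break
            messages.append(msg)
        return messages
    finally:
        subscriber.close()  # the subscriber is never reused across calls
```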
I tried reproducing the placement group mini integration failure from this PR: https://github.com/ray-project/ray/pull/21216, but I failed to do so. (This was the only test that became flaky when we turned on the flag last time.)
I tried:
- Running `tests:test_placement_group_mini_integration` 5 times instead of 3 (the default).
- Re-running the PR 3 times.
So I think it is worth trying to re-enable it again.
In Xlang (Python calling Java), a Java method that overrides a `default` method of its super interface cannot be invoked successfully, because we treat it as an overloaded method instead of an overridden method. This PR handles the case where a method overrides a `default` method correctly.
Before this PR, the following usage could not be invoked from Python -> Java.
```Java
public interface ExampleInterface {
  default String echo(String inp) {
    return inp;
  }
}

public class ExampleImpl implements ExampleInterface {
  @Override
  public String echo(String inp) {
    return inp + " echo";
  }
}
```
```python
# Invoke it in Python.
cls = ray.java_actor_class("io.ray.serve.util.ExampleImpl")
handle = cls.remote()
print(ray.get(handle.echo.remote("hi")))
```
Following #18987, this PR adds a docker-compose based local multi-node cluster.
The fake multi-node Docker setup comprises two parts: the `docker_monitor.py` script is a watch script that calls `docker compose up` whenever `docker-compose.yaml` changes, and the node provider creates and updates the compose file according to the autoscaling requirements.
This mode fully supports autoscaling and comes with test utilities to start and connect to docker-compose autoscaling environments. There's also a sample test case showing how this can be used.
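A minimal sketch of the watch pattern described above (this is not the real `docker_monitor.py`, just an illustration of the idea):
```python
import subprocess
import time
from pathlib import Path

COMPOSE_FILE = Path("docker-compose.yaml")  # path assumed for illustration

def watch(poll_interval=1.0):
    last_mtime = None
    while True:
        mtime = COMPOSE_FILE.stat().st_mtime if COMPOSE_FILE.exists() else None
        if mtime is not None and mtime != last_mtime:
            # Re-apply the compose file whenever the node provider rewrites it.
            subprocess.run(
                ["docker", "compose", "-f", str(COMPOSE_FILE), "up", "-d"],
                check=False,
            )
            last_mtime = mtime
        time.sleep(poll_interval)

if __name__ == "__main__":
    watch()
```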
These changes add a set of improvements that enable automatic creation and updating of CloudWatch dashboards when provisioning AWS autoscaling clusters. These improvements allow AWS Autoscaler users to:
1. Get rapid insights into their cluster state via CloudWatch dashboards.
2. Update their CloudWatch dashboard JSON configuration files at `ray up` execution time.
Notes:
1. This PR is a follow-up to #18619 and adds dashboard support.
In a [recent review](https://discuss.python.org/t/experience-with-python-3-11-in-fedora/12911) of the Fedora team's experience porting packages to the upcoming Python 3.11, they remarked that most of the work was in removing deprecated aliases in unittest. I came across a few of these when looking at unrelated test failures, and the DeprecationWarnings caught my eye. So I made a quick sweep of the code, using `git grep` to find occurrences of the deprecated aliases:
old | new
---|---
assertEquals | assertEqual
assertNotEquals | assertNotEqual
assertRaisesRegexp | assertRaisesRegex
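For example, swapping the aliases in a test is a pure rename:
```python
import unittest

class ExampleTest(unittest.TestCase):
    def test_renamed_asserts(self):
        # Previously: assertEquals, assertNotEquals, assertRaisesRegexp.
        self.assertEqual(1 + 1, 2)
        self.assertNotEqual(1 + 1, 3)
        with self.assertRaisesRegex(ValueError, "invalid literal"):
            int("not a number")

if __name__ == "__main__":
    unittest.main()
```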
CoreWorker hangs before exiting if GCS exits first, due to incorrect destruction ordering. This PR fixes this: it stops the GCS client first and then joins the thread.
This PR moves the internal KV namespace logic into C++ to reduce the logic in Python, for the following reasons:
- Internal KV is used cross-language, so we have to move it to C++ so that all languages can benefit.
- For https://github.com/ray-project/ray/issues/8822 we need to delete resources in GCS when a job finishes.
One extra field is also added to the delete operation so that we can delete by prefix instead of just a single key (a sketch follows below).
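A hedged sketch of how prefix deletion could be exercised from Python (the `del_by_prefix` argument name is assumed from the description above; the actual API may differ):
```python
import ray
from ray.experimental.internal_kv import (
    _internal_kv_put,
    _internal_kv_get,
    _internal_kv_del,
)

ray.init()

# Store a few keys under a common prefix.
_internal_kv_put(b"job:123:state", b"RUNNING")
_internal_kv_put(b"job:123:owner", b"driver-1")

# With the new field, a single delete call can remove every key that starts
# with the given prefix (argument name assumed for illustration).
_internal_kv_del(b"job:123:", del_by_prefix=True)

assert _internal_kv_get(b"job:123:state") is None
```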
Some tests are flaky because the GCS client fails to be created, but there is not enough information for debugging. With this PR, the exception message is printed after a GCS client creation failure. Also, this PR breaks GCS client creation down into two steps, reading the GCS address from Redis and creating the GCS client, which should help locate the issue.
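A hypothetical sketch of the two-step breakdown (the class and key names are illustrative, not the actual Ray internals):
```python
import logging

logger = logging.getLogger(__name__)

def create_gcs_client(redis_client, gcs_client_cls):
    """Illustrative only: split GCS client creation into two observable steps."""
    # Step 1: read the GCS address from Redis.
    try:
        gcs_address = redis_client.get("GcsServerAddress")  # key name assumed
    except Exception:
        logger.exception("Failed to read the GCS address from Redis")
        raise
    # Step 2: create the GCS client against that address.
    try:
        return gcs_client_cls(address=gcs_address)
    except Exception:
        logger.exception("Failed to create a GCS client for %s", gcs_address)
        raise
```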