Currently, when an actor has `max_restarts` > 0 and has crashed, the actor will enter RESTARTING state and then ALIVE. Imagine this scenario: an online service provides HTTP service and the proxy actor receives requests, forwards them to worker actors, and replies to clients with the execution results from worker actors.
```
-> Worker A (actor)
/
/
HTTP requests -------> Proxy (actor with HTTP server) ---> Worker B (actor)
\
\
-> ...
```
For each HTTP request, the proxy picks one worker (e.g. worker A) based on some algorithm, sends the request to it, and calls `ray.get()` to wait for the result. If for some reason the picked worker crashed, Ray will restart the actor, and `ray.get()` will throw an error. The proxy may pick another worker (e.g. worker B) and re-send the request to it. This is OK.
But new requests keep coming. The proxy may pick worker A again. But because worker A is still in RESTARTING state, it's not ready to serve requests. `ray.get()` on subsequent requests sent to worker A will hang until worker A is back online (ALIVE state). The proxy won't be able to reschedule these requests to another worker because currently there's no way to know if worker A is alive or not before sending a request. We can't say worker A is not alive just based on whether `ray.get()` hangs either.
To solve this issue, we change the semantics of `max_task_retries`.
* When max_task_retries is 0 (which is the default value), if the callee actor is in the RESTARTING state, subsequently submitted tasks will fail immediately with a RayActorError. Users can catch the RayActorError and implement their own fallback strategies to improve service availability and mitigate service outages.
* When max_task_retries is not 0, subsequently submitted tasks will be queued on the caller side and we only send them to the callee when the callee actor is back to the ALIVE state.
TODO
- [x] Add test cases.
- [ ] Update docs.
- [x] API change review.
This PR adds more example data for ongoing feature guide work. In addition to adding the new datasets, this also puts all example data under examples/data in order to separate it from the example code.
Fix the failure to unbreak nightly and unblock 1.13 release.
The root cause is the upgrade of GRPC to 1.45.2 made it slightly slow; this is an acceptable regression which is needed to make this upgrade.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-32-136.us-west-2.compute.internal>
To facilitate easy demo/documentation/testing with realistic, small-sized yet ML-familiar data. Have it as a source file with code will make it self-contained, i.e. after user "pip install" Ray, they are all set to run it.
IRIS is a great fit: super classic ML dataset, simple schema, only 150 rows.
This is a notebook showing how to tune an xgboost model and analyze the results.
Also adds a `get_dataframe()` method to `ResultsGrid` to fetch the trial results.
Depends on #24483 for toctree.
Aiming to:
1. addressing the bug about concurrency group, see #19593
2. improving the stability of the ray call latency perf in online applications.
we're proposing using async post instead of `PostBlocking` in threadpool.
Note that since we have already had back pressure in the caller side, I believe this change is safe to merge and it doesn't break any behavior.
This PR uses the job id and group name as the prefix for storing meta information, aiming to provide the isolate ability for different jobs and different groups.
Before this PR, we can't use 2 groups in 1 Ray cluster, and we can not rerun a collective job once the last one is failed at initializing.
Some tests relying on AutoScalingCluster are flaky because ray.init() after AutoscalingCluster.start() is not guaranteed to work. Sometimes, it cannot find any running ray instances.
This was a holdover from when local resources for URIs were deleted directly from the runtime env agent, and the URI name itself needed to store the information of which plugin it corresponded to so the appropriate plugin's `delete()` function could be called. After the URI reference refactor, I don't think this is needed anymore.
Adds `get_dataframe()` method to pass through experiment analysis to result grid. Also fixes config dictionary returns which previously did not flatten the dict (even though the docstring suggested it would)
This PR supports setting the jars for an actor in Ray API. The API looks like:
```java
class A {
public boolean findClass(String className) {
try {
Class.forName(className);
} catch (ClassNotFoundException e) {
return false;
}
return true;
}
}
RuntimeEnv runtimeEnv = new RuntimeEnv.Builder()
.addJars(ImmutableList.of("https://github.com/ray-project/test_packages/raw/main/raw_resources/java-1.0-SNAPSHOT.jar"))
.build();
ActorHandle<A> actor1 = Ray.actor(A::new).setRuntimeEnv(runtimeEnv).remote();
boolean ret = actor1.task(A::findClass, "io.testpackages.Foo").remote().get();
System.out.println(ret); // true
```
This PR support reconnection of the GCS client in gRPC channel layer. Previously this is implemented in the application layer:
- Health check is in the application layer by starting a new channel.
- Monitor the GCS address change and do resubscribe.
- Always retry the failed request and do reconnection in case of a failure.
However, there are several issues with this approach:
- We need service to discover for GCS address change.
- Monitor is too heavy since it always creates a channel.
- Reconnection is a blocking call that prevents the code from running.
This new approach moves the reconnection to gRPC layer directly to fix these issues.
- DNS name resolution is done by gRPC so we don't need to write this.
- Health check is done by checking the channel state.
- Queue the failed call and retry once GCS is up so that it's not a blocking call.
Under heap memory pressure, the raylet is often killed by the OS OOM killer. This is bad because it can cause whole system crashes and it is difficult to find the error afterwards. This PR adjusts the OOM score for any workers that the raylet spawns so that the OOM killer will hopefully prioritize killing those instead of the raylet.