ray/java
Kai Yang f5c6c7d28f
[Core] Allow failing new tasks immediately while the actor is restarting (#22818)
Currently, when an actor with `max_restarts` > 0 crashes, it enters the RESTARTING state and then returns to ALIVE. Imagine this scenario: an online service exposes an HTTP endpoint; a proxy actor receives requests, forwards them to worker actors, and replies to clients with the results returned by the workers.

```
                                                        -> Worker A (actor)
                                                       /
                                                      /
HTTP requests -------> Proxy (actor with HTTP server) ---> Worker B (actor)
                                                      \
                                                       \
                                                        -> ...
```

For each HTTP request, the proxy picks one worker (e.g. worker A) based on some algorithm, sends the request to it, and calls `ray.get()` to wait for the result. If the picked worker crashes for some reason, Ray restarts the actor and `ray.get()` throws an error. The proxy can then pick another worker (e.g. worker B) and re-send the request to it. This is OK.

But new requests keep coming, and the proxy may pick worker A again. Because worker A is still in the RESTARTING state, it's not ready to serve requests, so `ray.get()` on subsequent requests sent to worker A hangs until worker A is back online (ALIVE state). The proxy can't reschedule these requests to another worker because there's currently no way to know whether worker A is alive before sending a request. Nor can we conclude that worker A is down just because `ray.get()` hangs.

To solve this issue, we change the semantics of `max_task_retries`.

* When `max_task_retries` is 0 (which is the default value), if the callee actor is in the RESTARTING state, subsequently submitted tasks will fail immediately with a `RayActorError`. Users can catch the `RayActorError` and implement their own fallback strategies to improve service availability and mitigate service outages.
* When `max_task_retries` is not 0, subsequently submitted tasks will be queued on the caller side and we only send them to the callee when the callee actor is back to the ALIVE state.
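
The fallback that the first case enables can be sketched in plain Python. This is only an illustration and does not call Ray: `ActorUnavailableError` stands in for `RayActorError`, and `FakeWorker` and `dispatch` are hypothetical names that simulate the fail-fast behavior rather than the real actor API.

```python
class ActorUnavailableError(Exception):
    """Stand-in for RayActorError in this sketch (not the real Ray class)."""


class FakeWorker:
    """Hypothetical worker that simulates an actor's ALIVE/RESTARTING state."""

    def __init__(self, name, state="ALIVE"):
        self.name = name
        self.state = state

    def handle(self, request):
        # Simulates max_task_retries == 0: a task submitted while the
        # actor is RESTARTING fails immediately instead of hanging.
        if self.state == "RESTARTING":
            raise ActorUnavailableError(f"worker {self.name} is restarting")
        return f"{self.name} handled {request}"


def dispatch(workers, request):
    """Try each worker in order, falling back when one fails fast."""
    last_error = None
    for worker in workers:
        try:
            return worker.handle(request)
        except ActorUnavailableError as err:
            # The failure is immediate, so the proxy can move on to the
            # next worker without waiting for the restart to finish.
            last_error = err
    raise RuntimeError("no worker available") from last_error


# Worker A is restarting, so the request falls through to worker B.
workers = [FakeWorker("A", state="RESTARTING"), FakeWorker("B")]
result = dispatch(workers, "GET /")
```

In a real proxy the `except` clause would catch `RayActorError` around the `ray.get()` call; the worker-selection order here is arbitrary.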

TODO

- [x] Add test cases.
- [ ] Update docs.
- [x] API change review.
2022-05-14 10:48:47 +08:00
| Name | Last commit | Date |
| --- | --- | --- |
| .. | | |
| api | [runtime env] [java] Support jars in runtime env for Java (#24170) | 2022-05-12 09:34:40 +08:00 |
| performance_test | [Java] Remove auto-generated pom.xml files. (#19475) | 2021-10-19 17:35:37 +08:00 |
| runtime | [runtime env] [java] Support jars in runtime env for Java (#24170) | 2022-05-12 09:34:40 +08:00 |
| serve | [Serve] Add test for controller managing Java Replica (#22628) | 2022-02-28 23:13:56 -08:00 |
| test | [Core] Allow failing new tasks immediately while the actor is restarting (#22818) | 2022-05-14 10:48:47 +08:00 |
| build-jar-multiplatform.sh | Remove streaming deploying process. (#21603) | 2022-01-17 23:37:48 +08:00 |
| BUILD.bazel | [Java] Add javac.activative dependency for java worker. (#22538) | 2022-02-23 16:24:47 +08:00 |
| checkstyle-suppressions.xml | [Java] Format ray java code (#13056) | 2020-12-29 10:36:16 +08:00 |
| checkstyle.xml | [Java] Support parallel actor in experimental. (#21701) | 2022-04-21 22:54:33 +08:00 |
| cleanup.sh | Shellcheck comments (#9595) | 2020-07-21 16:47:09 -05:00 |
| dependencies.bzl | [Java] upgrade protobuf-java version (#23627) | 2022-03-31 09:12:58 -07:00 |
| generate_jni_header_files.sh | Use javac -h instead of javah. (#19311) | 2021-10-12 22:37:14 +08:00 |
| java-release-guide.md | [Java] Add Java release guideline. (#22288) | 2022-02-11 14:56:20 +08:00 |
| pom.xml | [Java] Support parallel actor in experimental. (#21701) | 2022-04-21 22:54:33 +08:00 |
| shade_rule | [Java] Shade jackson to avoid conflict. (#24535) | 2022-05-07 10:44:31 +08:00 |
| test.sh | [Java] Shade some widely used dependencies in bazel_jar_jar rule. (#21237) | 2021-12-23 16:54:31 +08:00 |
| testng.xml | [Serve] Define Java Backend (#16169) | 2021-07-01 20:41:17 -07:00 |