mirror of
https://github.com/vale981/ray
synced 2025-03-04 17:41:43 -05:00
Currently, when an actor has `max_restarts` > 0 and has crashed, the actor will enter RESTARTING state and then ALIVE.

Imagine this scenario: an online service provides HTTP service and the proxy actor receives requests, forwards them to worker actors, and replies to clients with the execution results from worker actors.

```
                                        -> Worker A (actor)
                                       /
                                      /
HTTP requests -------> Proxy (actor with HTTP server) ---> Worker B (actor)
                                      \
                                       \
                                        -> ...
```

For each HTTP request, the proxy picks one worker (e.g. worker A) based on some algorithm, sends the request to it, and calls `ray.get()` to wait for the result.

If for some reason the picked worker crashed, Ray will restart the actor, and `ray.get()` will throw an error. The proxy may pick another worker (e.g. worker B) and re-send the request to it. This is OK.

But new requests keep coming. The proxy may pick worker A again. But because worker A is still in RESTARTING state, it's not ready to serve requests. `ray.get()` on subsequent requests sent to worker A will hang until worker A is back online (ALIVE state).

The proxy won't be able to reschedule these requests to another worker because currently there's no way to know if worker A is alive or not before sending a request. We can't say worker A is not alive just based on whether `ray.get()` hangs either.

To solve this issue, we change the semantics of `max_task_retries`.

* When `max_task_retries` is 0 (which is the default value), if the callee actor is in the RESTARTING state, subsequently submitted tasks will fail immediately with a `RayActorError`. Users can catch the `RayActorError` and implement their own fallback strategies to improve service availability and mitigate service outages.
* When `max_task_retries` is not 0, subsequently submitted tasks will be queued on the caller side and we only send them to the callee when the callee actor is back to the ALIVE state.

TODO

- [x] Add test cases.
- [ ] Update docs.
- [x] API change review.
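The fallback strategy described for `max_task_retries` = 0 could look roughly like the sketch below. It does not import Ray: `ActorUnavailableError`, `FakeWorker`, and `dispatch` are hypothetical stand-ins used only for illustration, with `ActorUnavailableError` playing the role of `RayActorError` and `FakeWorker.handle` playing the role of submitting a task and calling `ray.get()` on it. It is a minimal simulation of the proxy's retry logic, not the actual implementation.

```python
class ActorUnavailableError(Exception):
    """Hypothetical stand-in for RayActorError, so the sketch runs without Ray."""


class FakeWorker:
    """Hypothetical stand-in for a worker actor handle."""

    def __init__(self, name, restarting=False):
        self.name = name
        self.restarting = restarting

    def handle(self, request):
        # Assumed behaviour with max_task_retries == 0: a task submitted to an
        # actor in the RESTARTING state fails immediately instead of hanging.
        if self.restarting:
            raise ActorUnavailableError(f"worker {self.name} is restarting")
        return f"{self.name} handled {request}"


def dispatch(workers, request):
    """Try each worker in order and fall back when one is unavailable."""
    for worker in workers:
        try:
            return worker.handle(request)
        except ActorUnavailableError:
            # Fail fast on this worker and reschedule to the next one.
            continue
    raise RuntimeError("no worker available")


# Worker A is restarting, so the request is rescheduled to worker B.
workers = [FakeWorker("A", restarting=True), FakeWorker("B")]
result = dispatch(workers, "req-1")
print(result)
```

In a real proxy actor, the `except` clause would catch the error raised by `ray.get()`, and the worker selection would follow whatever load-balancing algorithm the service already uses.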
||
---|---|---|
.. | ||
api | ||
performance_test | ||
runtime | ||
serve | ||
test | ||
build-jar-multiplatform.sh | ||
BUILD.bazel | ||
checkstyle-suppressions.xml | ||
checkstyle.xml | ||
cleanup.sh | ||
dependencies.bzl | ||
generate_jni_header_files.sh | ||
java-release-guide.md | ||
pom.xml | ||
shade_rule | ||
test.sh | ||
testng.xml |