Fault Tolerance
===============

This document describes the handling of failures in Ray.

Machine and Process Failures
----------------------------

Currently, each **local scheduler** and each **plasma manager** sends heartbeats
to a **monitor** process. If the monitor does not receive any heartbeats from a
given process for some duration of time (about ten seconds), then it will mark
that process as dead. The monitor process will then clean up the associated
state in the Redis servers. If a plasma manager is marked as dead, the object
table will be updated to remove all occurrences of that manager so that other
managers don't try to fetch objects from the dead manager. If a local scheduler
is marked as dead, all of the tasks that are marked as executing on that local
scheduler in the task table will be marked as lost, and all actors associated
with that local scheduler will be recreated by other local schedulers.
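
As a rough sketch of this timeout-based failure detection (the ``Monitor``
class and ``HEARTBEAT_TIMEOUT_S`` constant below are hypothetical
illustrations, not Ray's internal API):

.. code-block:: python

    import time

    HEARTBEAT_TIMEOUT_S = 10.0  # "about ten seconds", per the text above

    class Monitor(object):
        def __init__(self):
            # Maps a process ID to the time of its most recent heartbeat.
            self.last_heartbeat = {}

        def record_heartbeat(self, process_id):
            self.last_heartbeat[process_id] = time.time()

        def find_dead_processes(self):
            # Any process silent for longer than the timeout is marked dead,
            # after which its state would be cleaned up in Redis.
            now = time.time()
            return [pid for pid, last in self.last_heartbeat.items()
                    if now - last > HEARTBEAT_TIMEOUT_S]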

Lost Objects
------------

If an object is needed but is lost or was never created, then the task that
created the object will be re-executed to create the object. If necessary, tasks
needed to create the input arguments to the task being re-executed will also be
re-executed.
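
At the API level, this reconstruction is transparent to the caller. In the
(ordinary) program below, if the object holding ``f``'s result were lost, ``f``
would be re-executed, and if ``g``'s result (the input to ``f``) were also
lost, ``g`` would be re-executed first:

.. code-block:: python

    import ray

    ray.init()

    @ray.remote
    def g():
        return 1

    @ray.remote
    def f(x):
        return x + 1

    # No failure-handling code is needed here: a lost result is recomputed
    # from its lineage whenever ray.get needs it.
    result_id = f.remote(g.remote())
    print(ray.get(result_id))  # 2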

Actors
------

When a local scheduler is marked as dead, all actors associated with that local
scheduler that were still alive will be recreated by other local schedulers. By
default, all of the actor methods will be re-executed in the same order that
they were initially executed. If actor checkpointing is enabled, then the actor
state will be loaded from the most recent checkpoint, and the actor methods that
occurred after the checkpoint will be re-executed. Note that actor checkpointing
is currently an experimental feature.
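
For example, replaying methods in their original order means a recreated actor
deterministically reaches the same state. The counter below is ordinary Ray
actor code; no checkpointing calls are shown, since that API is experimental:

.. code-block:: python

    import ray

    ray.init()

    @ray.remote
    class Counter(object):
        def __init__(self):
            self.value = 0

        def increment(self):
            self.value += 1
            return self.value

    counter = Counter.remote()
    ids = [counter.increment.remote() for _ in range(3)]
    # If the local scheduler hosting this actor dies, the actor is recreated
    # on another local scheduler and the increment calls are replayed in
    # their original order, reproducing the same state and return values.
    print(ray.get(ids))  # [1, 2, 3]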

Current Limitations
-------------------

At the moment, Ray does not handle all failure scenarios. We are working on
addressing these known problems.

Process Failures
~~~~~~~~~~~~~~~~

1. Ray does not recover from the failure of any of the following processes:
   a Redis server, the global scheduler, or the monitor process.
2. If a driver fails, that driver will not be restarted and the job will not
   complete.

Lost Objects
~~~~~~~~~~~~

1. If an object is constructed by a call to ``ray.put`` on the driver, is then
   evicted, and is later needed, Ray will not reconstruct this object (see the
   sketch after this list).
2. If an object is constructed by an actor method, is then evicted, and is later
   needed, Ray will not reconstruct this object.
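
To make the first case concrete: an object created by ``ray.put`` on the driver
has no task in its lineage, so after eviction there is nothing Ray can
re-execute to recreate it:

.. code-block:: python

    import ray

    ray.init()

    # ray.put stores the list in the object store and returns an object ID.
    # Unlike a task's return value, this object was not produced by a task,
    # so if it is evicted it cannot be reconstructed by re-execution.
    x_id = ray.put([1, 2, 3])
    print(ray.get(x_id))  # [1, 2, 3], while the object is still in the store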

Actor Reconstruction
~~~~~~~~~~~~~~~~~~~~

1. Actor reconstruction follows the order of initial execution, but new tasks
   may get interleaved with the re-executed tasks.