Fault Tolerance
===============

This document describes how Ray handles machine and process failures.

Tasks
-----

When a worker is executing a task, if the worker dies unexpectedly, either
because the process crashed or because the machine failed, Ray will rerun
the task until either the task succeeds or the maximum number of retries is
exceeded. The default number of retries is 3 and can be overridden by
specifying ``max_retries`` in the ``@ray.remote`` decorator. Specifying -1
allows infinite retries, and 0 disables retries.

You can experiment with this behavior by running the following code.

.. code-block:: python

    import numpy as np
    import os
    import ray
    import time

    ray.init(ignore_reinit_error=True)

    @ray.remote(max_retries=1)
    def potentially_fail(failure_probability):
        time.sleep(0.2)
        if np.random.random() < failure_probability:
            os._exit(0)
        return 0

    for _ in range(3):
        try:
            # If this task crashes, Ray will retry it up to one additional
            # time. If either of the attempts succeeds, the call to ray.get
            # below will return normally. Otherwise, it will raise an
            # exception.
            ray.get(potentially_fail.remote(0.5))
            print('SUCCESS')
        except ray.exceptions.WorkerCrashedError:
            print('FAILURE')

.. _object-reconstruction:

Ray also implements *lineage reconstruction* to recover task outputs that are
lost from the distributed object store. This can occur during node failures.
Ray will first automatically attempt to recover the value by looking for copies
of the same object on other nodes. If none are found, then Ray will
automatically recover the value by re-executing the task that created the
value. Arguments to the task are recursively reconstructed with the same
method.
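
The following sketch illustrates the kind of task lineage Ray can use for this
recovery. The ``generate`` and ``total`` tasks are hypothetical examples, and
the node failure itself is not simulated here; on a single machine the code
simply runs to completion.

.. code-block:: python

    import ray

    ray.init(ignore_reinit_error=True)

    @ray.remote
    def generate():
        # The return value is stored in the distributed object store.
        return list(range(100))

    @ray.remote
    def total(values):
        return sum(values)

    data_ref = generate.remote()
    sum_ref = total.remote(data_ref)

    # If the node holding the value of `data_ref` dies and no other copy
    # exists, Ray re-executes `generate` (and, recursively, any upstream
    # tasks) to reconstruct it before rerunning `total`, so this call
    # still returns the correct result.
    print(ray.get(sum_ref))  # 4950
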

Note that lineage reconstruction can cause higher than usual driver memory
usage because the driver keeps the descriptions of any tasks that may be
re-executed in case of a failure. To limit the amount of memory used by
lineage, set the environment variable ``RAY_max_lineage_bytes`` (default 1GB);
lineage will be evicted once this threshold is exceeded.

To disable this behavior, set the environment variable
``RAY_lineage_pinning_enabled=0`` during ``ray start`` or ``ray.init``. With
this setting, if there are no copies of an object left, an ``ObjectLostError``
will be raised.
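
A minimal sketch of disabling lineage reconstruction for a local cluster
started from Python is shown below. The variable is part of Ray's system
configuration, so it must be set before the runtime starts; on a multi-node
cluster it would instead be exported in the environment of ``ray start`` on
each node. The ``create_object`` task is only illustrative.

.. code-block:: python

    import os

    # Must be set before the Ray runtime starts.
    os.environ["RAY_lineage_pinning_enabled"] = "0"

    import ray

    ray.init(ignore_reinit_error=True)

    @ray.remote
    def create_object():
        # Large enough to live in the distributed object store.
        return bytes(10 * 1024 * 1024)

    ref = create_object.remote()

    try:
        print(len(ray.get(ref)))
    except ray.exceptions.ObjectLostError:
        # With lineage pinning disabled, a value lost to a node failure is
        # not re-created by re-executing `create_object`; this error is
        # raised to the application instead.
        print('object lost and not reconstructed')
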

.. _actor-fault-tolerance:

Actors
------

Ray will automatically restart actors that crash unexpectedly.
This behavior is controlled using ``max_restarts``,
which sets the maximum number of times that an actor will be restarted.
If 0, the actor won't be restarted. If -1, it will be restarted infinitely.
The default value of ``max_restarts`` is 0.
When an actor is restarted, its state will be recreated by rerunning its
constructor.
After the specified number of restarts, subsequent actor methods will
raise a ``RayActorError``.

You can experiment with this behavior by running the following code.

.. code-block:: python

    import os
    import ray
    import time

    ray.init(ignore_reinit_error=True)

    @ray.remote(max_restarts=5)
    class Actor:
        def __init__(self):
            self.counter = 0

        def increment_and_possibly_fail(self):
            self.counter += 1
            time.sleep(0.2)
            if self.counter == 10:
                os._exit(0)
            return self.counter

    actor = Actor.remote()

    # The actor will be restarted up to 5 times. After that, methods will
    # always raise a `RayActorError` exception. The actor is restarted by
    # rerunning its constructor. Methods that were sent or executing when the
    # actor died will also raise a `RayActorError` exception.
    for _ in range(100):
        try:
            counter = ray.get(actor.increment_and_possibly_fail.remote())
            print(counter)
        except ray.exceptions.RayActorError:
            print('FAILURE')

By default, actor tasks execute with at-most-once semantics
(``max_task_retries=0`` in the ``@ray.remote`` decorator). This means that if an
actor task is submitted to an actor that is unreachable, Ray will report the
error with ``RayActorError``, a Python-level exception that is thrown when
``ray.get`` is called on the future returned by the task. Note that this
exception may be thrown even though the task did indeed execute successfully.
For example, this can happen if the actor dies immediately after executing the
task.

Ray also offers at-least-once execution semantics for actor tasks
(``max_task_retries=-1`` or ``max_task_retries > 0``). This means that if an
actor task is submitted to an actor that is unreachable, the system will
automatically retry the task until it receives a reply from the actor. With
this option, the system will only throw a ``RayActorError`` to the application
if one of the following occurs: (1) the actor’s ``max_restarts`` limit has been
exceeded and the actor cannot be restarted anymore, or (2) the
``max_task_retries`` limit has been exceeded for this particular task. The
limit can be set to infinity with ``max_task_retries = -1``.

You can experiment with this behavior by running the following code.

.. code-block:: python

    import os
    import ray

    ray.init(ignore_reinit_error=True)

    @ray.remote(max_restarts=5, max_task_retries=-1)
    class Actor:
        def __init__(self):
            self.counter = 0

        def increment_and_possibly_fail(self):
            # Exit after every 10 tasks.
            if self.counter == 10:
                os._exit(0)
            self.counter += 1
            return self.counter

    actor = Actor.remote()

    # The actor will be reconstructed up to 5 times. The actor is
    # reconstructed by rerunning its constructor. Methods that were
    # executing when the actor died will be retried and will not
    # raise a `RayActorError`. Retried methods may execute twice, once
    # on the failed actor and a second time on the restarted actor.
    for _ in range(50):
        counter = ray.get(actor.increment_and_possibly_fail.remote())
        print(counter)  # Prints the sequence 1-10 5 times.

    # After the actor has been restarted 5 times, all subsequent methods will
    # raise a `RayActorError`.
    for _ in range(10):
        try:
            counter = ray.get(actor.increment_and_possibly_fail.remote())
            print(counter)  # Unreachable.
        except ray.exceptions.RayActorError:
            print('FAILURE')  # Prints 10 times.

For at-least-once actors, the system will still guarantee execution ordering
according to the initial submission order. For example, any tasks submitted
after a failed actor task will not execute on the actor until the failed actor
task has been successfully retried. The system will not attempt to re-execute
any tasks that executed successfully before the failure (unless :ref:`object reconstruction <object-reconstruction>` is enabled).

At-least-once execution is best suited for read-only actors or actors with
ephemeral state that does not need to be rebuilt after a failure. For actors
that have critical state, it is best to take periodic checkpoints and either
manually restart the actor or automatically restart the actor with at-most-once
semantics. If the actor’s exact state at the time of failure is needed, the
application is responsible for resubmitting all tasks since the last
checkpoint.
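
A minimal sketch of this pattern is shown below. The ``CheckpointedCounter``
actor and its checkpoint file are hypothetical and not part of the Ray API:
the actor writes its state after every update and reloads it in its
constructor, so a restarted replica resumes from the last checkpoint. A real
application would typically checkpoint less often and use storage that is
reachable from every node.

.. code-block:: python

    import json
    import os
    import ray

    ray.init(ignore_reinit_error=True)

    CHECKPOINT_PATH = "/tmp/counter_checkpoint.json"  # Illustrative location.

    # Start from a clean slate for this example.
    if os.path.exists(CHECKPOINT_PATH):
        os.remove(CHECKPOINT_PATH)

    @ray.remote(max_restarts=5)
    class CheckpointedCounter:
        def __init__(self):
            # On (re)start, resume from the last checkpoint if one exists.
            if os.path.exists(CHECKPOINT_PATH):
                with open(CHECKPOINT_PATH) as f:
                    self.counter = json.load(f)["counter"]
            else:
                self.counter = 0

        def increment(self):
            self.counter += 1
            # Checkpoint after every update.
            with open(CHECKPOINT_PATH, "w") as f:
                json.dump({"counter": self.counter}, f)
            return self.counter

    counter = CheckpointedCounter.remote()
    print(ray.get(counter.increment.remote()))  # 1

    # If the actor process dies here, the restarted actor reloads the
    # checkpoint in its constructor and the next call returns 2, not 1.
    print(ray.get(counter.increment.remote()))  # 2
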

.. note::

  For :ref:`async or threaded actors <async-actors>`, the tasks might
  be executed out of order. Upon actor restart, the system will only retry
  *incomplete* tasks. Previously completed tasks will not be
  re-executed.