ray/python
Stephanie Wang 6ef26cd8ff
[core] Cancel pending dependency resolution before failing a task (#26267)
Actor tasks are sometimes failed while their dependencies are still being resolved. This can cause hangs or crashes when we later resolve the dependencies for a task that has already been canceled. In particular, the ref counter can crash when, for the same actor, actor task 1 depends on actor task 2. The sequence is:

    1. Actor tasks 1 and 2 are queued; 1 depends on 2.
    2. Fail actor task 1. We clear its refs, including its dependency on 2.
    3. Fail actor task 2. We store an error as its return value. Since task 1 depends on it, we inline the dependency and try to clear task 1's refs again, causing a ref counting error because we already cleared them in step 2.

This PR fixes the issue by canceling dependency resolution for tasks before failing them. This involves some refactoring of the LocalDependencyResolver. Most of the changes are for testing (split out the unit tests for LocalDependencyResolver into their own suite).
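
The failure sequence and the fix can be illustrated with a small, self-contained Python sketch. This is a toy model, not Ray's implementation: the actual change is in Ray's core, and `ToyRefCounter`, `ToyResolver`, and `run_scenario` are hypothetical names introduced only for illustration. The sketch assumes that failing a task releases its refs once, and that a dependency becoming available releases them again unless the pending resolution was canceled first.

```python
class ToyRefCounter:
    """Hypothetical ref counter that raises if a count would drop below zero."""

    def __init__(self):
        self.counts = {}

    def add(self, obj_id):
        self.counts[obj_id] = self.counts.get(obj_id, 0) + 1

    def remove(self, obj_id):
        if self.counts.get(obj_id, 0) == 0:
            raise RuntimeError(f"ref count underflow for {obj_id}")
        self.counts[obj_id] -= 1


class ToyResolver:
    """Hypothetical stand-in for a dependency resolver (not Ray's real API)."""

    def __init__(self):
        # task_id -> (object id the task waits on, callback to run once resolved)
        self.pending = {}

    def resolve(self, task_id, dep_id, on_resolved):
        self.pending[task_id] = (dep_id, on_resolved)

    def cancel(self, task_id):
        # Drop the pending resolution so its callback never runs.
        return self.pending.pop(task_id, None) is not None

    def object_available(self, obj_id):
        # Fire callbacks for every task that was waiting on obj_id.
        ready = [t for t, (dep, _) in self.pending.items() if dep == obj_id]
        for task_id in ready:
            _, callback = self.pending.pop(task_id)
            callback(task_id)


def run_scenario(cancel_before_fail):
    refs = ToyRefCounter()
    resolver = ToyResolver()

    # Step 1: tasks 1 and 2 are queued; task 1 holds a ref on task 2's return value.
    refs.add("task2_return")
    resolver.resolve("task1", "task2_return", lambda _task: refs.remove("task2_return"))

    # Step 2: fail task 1 and clear its refs.
    if cancel_before_fail:
        resolver.cancel("task1")  # the fix: cancel resolution before failing
    refs.remove("task2_return")

    # Step 3: fail task 2; storing its error makes the dependency available.
    resolver.object_available("task2_return")
    return refs.counts["task2_return"]
```

In this sketch, `run_scenario(cancel_before_fail=False)` raises `RuntimeError` in step 3 because the callback releases a ref that step 2 already released, while `run_scenario(cancel_before_fail=True)` completes with a count of 0 because the canceled callback never runs.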
Related issue number

Closes #18908.
2022-07-13 14:39:11 -07:00
..
ray [core] Cancel pending dependency resolution before failing a task (#26267) 2022-07-13 14:39:11 -07:00
requirements [Tune/CI] Fix tune-sklearn notebook example (#26470) 2022-07-13 18:14:36 +01:00
asv.conf.json [docs] Move all /latest links to /master (#11897) 2020-11-10 10:53:28 -08:00
build-wheel-macos-arm64.sh [python3.10] build python310 wheels (#24829) 2022-05-16 12:36:33 -07:00
build-wheel-macos.sh [python3.10] build python310 wheels (#24829) 2022-05-16 12:36:33 -07:00
build-wheel-manylinux2014.sh [python3.10] build python310 wheels (#24829) 2022-05-16 12:36:33 -07:00
build-wheel-windows.sh [python3.10] build python310 wheels (#24829) 2022-05-16 12:36:33 -07:00
MANIFEST.in [hotfix] Revert "Exclude Bazel build files from Ray wheels (#25679)" (#25950) 2022-06-20 20:59:48 -07:00
README-building-wheels.md [build] Build wheels with manylinux2014 (#11621) 2020-11-03 19:36:32 -08:00
requirements.txt Revert "Bump pytest from 5.4.3 to 7.0.1" (breaks lots of RLlib tests for unknown reasons) (#26517) 2022-07-13 11:19:30 -07:00
requirements_linters.txt Add import sorting to format.sh (#25678) 2022-06-13 14:08:51 -07:00
requirements_ml_docker.txt [AIR] Add distributed torch_geometric example (#23580) 2022-04-21 09:48:43 -07:00
setup.py Add ray/widgets/templates/ files to wheel (fix #26452) (#26457) 2022-07-12 11:23:57 -07:00