ray/python
Amog Kamsetty 02cb974c6c
[Train] Fix fault tolerance for Tensorflow (#22508)
Soft restarts don't work for TensorFlow because leftover communication state in the actors can lead to undefined behavior, such as training hanging.

Instead, this PR changes the failure handling for TensorFlow to match Torch and Horovod: all workers are recreated when a failure occurs. It also adds a test that checks fault tolerance on an actual TensorFlow example. Run locally, the test failed before this change and passes after it.
2022-02-24 11:50:20 -08:00
ray [Train] Fix fault tolerance for Tensorflow (#22508) 2022-02-24 11:50:20 -08:00
requirements [rllib] Upper bound gym version (#22510) 2022-02-18 17:39:22 -08:00
asv.conf.json [docs] Move all /latest links to /master (#11897) 2020-11-10 10:53:28 -08:00
build-wheel-macos-arm64.sh Upgrade cython to 0.29.26 for py310 (#21244) 2021-12-26 20:26:08 -08:00
build-wheel-macos.sh Upgrade cython to 0.29.26 for py310 (#21244) 2021-12-26 20:26:08 -08:00
build-wheel-manylinux2014.sh Upgrade cython to 0.29.26 for py310 (#21244) 2021-12-26 20:26:08 -08:00
build-wheel-windows.sh [CI] Migrate Windows Wheels to Buildkite (#21388) 2022-01-05 12:49:19 -08:00
MANIFEST.in Includes .pyi files in package data. (#21247) 2021-12-27 11:50:02 -08:00
README-building-wheels.md [build] Build wheels with manylinux2014 (#11621) 2020-11-03 19:36:32 -08:00
requirements.txt [ci] Fix grpcio 1.44 break test_output (#22494) 2022-02-22 13:59:25 -08:00
requirements_linters.txt [CI] Add support for Black formatting (#21281) 2022-01-03 10:06:41 -08:00
requirements_ml_docker.txt [Deps] Bump tensorflow on Docker image and add Codeowners (#20041) 2021-11-05 00:58:34 -07:00
setup.py [Usage Stats] Implement usage stats report "Turned off by default". (#22249) 2022-02-22 15:32:02 -08:00