mirror of
https://github.com/vale981/ray
synced 2025-03-05 18:11:42 -05:00
![]() Soft restarts don't work for tensorflow since there is still some leftover communication state in the actors which may lead to undefined behavior, such as causing training to hang. Instead, this PR changes the failure handling for tensorflow to match torch and horovod, and recreates all the workers in case of failure. Also adds a test to check if fault tolerance works correctly for an actual tensorflow example. When testing locally, the test failed before the change, but passes after. |
||
---|---|---|
.. | ||
ray | ||
requirements | ||
asv.conf.json | ||
build-wheel-macos-arm64.sh | ||
build-wheel-macos.sh | ||
build-wheel-manylinux2014.sh | ||
build-wheel-windows.sh | ||
MANIFEST.in | ||
README-building-wheels.md | ||
requirements.txt | ||
requirements_linters.txt | ||
requirements_ml_docker.txt | ||
setup.py |