Why are these changes needed?
Right now the failure signal handler registered in Python worker is skipped on crashes like segfault, because C++ core worker overrides the failure signal handler here and does not call the previously registered handler. This prevents Python stack trace from being printed on crashes. The fix is to make the C++ fault signal handler to call the previous signal handler registered in Python. For example with the script below which segfaults,
import ray
ray.init()
@ray.remote
def f():
import ctypes;
ctypes.string_at(0)
ray.get(f.remote())
Ray currently only prints the following stack trace:
(pid=26693) *** SIGSEGV received at time=1634418743 ***
(pid=26693) PC: @ 0x7fff203d9552 (unknown) _platform_strlen
(pid=26693) [2021-10-16 14:12:23,331 E 26693 12194577] logging.cc:313: *** SIGSEGV received at time=1634418743 ***
(pid=26693) [2021-10-16 14:12:23,331 E 26693 12194577] logging.cc:313: PC: @ 0x7fff203d9552 (unknown) _platform_strlen
With this change, Python stack trace will be printed in addition to the stack trace above:
(pid=26693) Fatal Python error: Segmentation fault
(pid=26693)
(pid=26693) Stack (most recent call first):
(pid=26693) File "/Users/mwtian/opt/anaconda3/envs/ray/lib/python3.7/ctypes/__init__.py", line 505 in string_at
(pid=26693) File "stack.py", line 7 in f
(pid=26693) File "/Users/mwtian/work/ray-project/ray/python/ray/worker.py", line 425 in main_loop
(pid=26693) File "/Users/mwtian/work/ray-project/ray/python/ray/workers/default_worker.py", line 212 in <module>
This should make debugging crashes in Python worker easier, for users and Ray devs.
Also, try to initialize symbolizer in GCS, Raylet and core worker. This is a no-op on MacOS and some Linux environments (e.g. Ray on Ubuntu 20.04 already produces symbolized stack traces), but should make Ray more likely to have symbolized stack traces on other platforms.
## Why are these changes needed?
There are some issues left from previous PRs.
- Put the gcs_actor_scheduler_mock_test back
- Add comment for named actor creation behavior
- Fix the comment for some flags.
## Related issue number
Why are these changes needed?
Related issue number
##19177
Quoting #19177 (comment) here,
The following tests fail when not skipped,
=================================== short test summary info ====================================
FAILED python\ray\tests\test_basic.py::test_user_setup_function - subprocess.CalledProcessErro...
FAILED python\ray\tests\test_basic.py::test_disable_cuda_devices - subprocess.CalledProcessErr...
FAILED python\ray\tests\test_basic.py::test_wait_timing - assert (1634209333.6099107 - 1634209...
Results (395.22s):
36 passed
3 failed
- ray\tests/test_basic.py:197 test_user_setup_function
- ray\tests/test_basic.py:220 test_disable_cuda_devices
- ray\tests/test_basic.py:265 test_wait_timing
=================================== short test summary info ====================================
FAILED python\ray\tests\test_basic_3.py::test_fair_queueing - AssertionError: 23
Results (198.33s):
1 failed
- ray\tests/test_basic_3.py:169 test_fair_queueing
The following test passed when not skipped. Opening a PR to verify that.
def test_oversized_function(ray_start_shared_local_modes)
Why are these changes needed?
There are only 2 simple threaded actor tests in Ray repo. This PR adds more complicated threaded actor tests to make sure it is well tested.
The third tests print a lot of
(pid=42032) [2021-10-13 19:02:36,102 E 42032 10779969] core_worker.cc:270: The global worker has already been shutdown. This happens when the language frontend accesses the Ray's worker after it is shutdown. The process will exit
which was the bug @scv119 fixed. Maybe we can start debugging this to make sure when this happens and fix the real shutdown bugs.
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
Unit tests
Release tests
This PR is not tested :
This lint rule cannot apply only to changed lines because currently Ray has `-Winconsistent-missing-override` as a build flag. Either all or none of member functions from a derived class can have the `override` / `final` annocation.
Why are these changes needed?
The theory around #19270 is there are two create actor requests sent to the same threaded actor due to retry logic. Specifically:
the first request comes and calls CoreWorkerDirectTaskReceiver::HandleTask, it's queued to be executed by thread pool;
then the second request comes and calls CoreWorkerDirectTaskReceiver::HandleTask again, before first request being executed and calls worker_context_.SetCurrentTask;
this fails the current dedupe logic and leads to SetMaxActorConcurrency be called twice, which fails the RAY_CHECK.
In this PR, we fix the dedupe logic by adding SetCurrentActorId and calling it in the task execution thread. this ensures the dedupe logic works for threaded actor.
we also noticed that the WorkerContext is actually not thread safe in threaded actors, thus make it thread safe in this PR as well.
Related issue number
Closes#19270
Checks
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
Unit tests
Release tests
This PR is not tested :(
## Why are these changes needed?
There are two issues fixed in this PR:
- make sure wait for session count alive node
- upgrade the machine to match what's tested in oss ray.
## Related issue number
https://github.com/ray-project/ray/issues/19084