ray/cpp
SangBin Cho d89c8aa9f9
[Core] Add more accurate worker exit (#24468)
This PR adds precise details about why a worker failed. The information is available from either:
- `ray list workers`
- exceptions raised by actor failures

Here's an example of an actor killed by SIGKILL (e.g., by the OOM killer):
```
RayActorError: The actor died unexpectedly before finishing this task.
	class_name: G
	actor_id: e818d2f0521a334daf03540701000000
	pid: 61251
	namespace: 674a49b2-5b9b-4fcc-b6e1-5a1d4b9400d2
	ip: 127.0.0.1
The actor is dead because its worker process has died. Worker exit type: UNEXPECTED_SYSTEM_EXIT Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
```
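
As a rough illustration of how the exit type and detail could be pulled out of such a message programmatically, the sketch below parses the text with a regular expression. The helper name `parse_worker_exit` is hypothetical and not a Ray API, and the message layout is assumed to match the example above only.

```python
import re

# Assumption: the message contains
# "Worker exit type: <TYPE> Worker exit detail: <free text>", as in the example above.
_EXIT_RE = re.compile(
    r"Worker exit type: (?P<type>\w+)\s+Worker exit detail: (?P<detail>.*)",
    re.DOTALL,
)


def parse_worker_exit(message):
    """Return (exit_type, exit_detail), or None if the pattern is absent."""
    match = _EXIT_RE.search(message)
    if match is None:
        return None
    return match.group("type"), match.group("detail").strip()


# Shortened copy of the error text from the example above.
sample = (
    "The actor is dead because its worker process has died. "
    "Worker exit type: UNEXPECTED_SYSTEM_EXIT "
    "Worker exit detail: Worker unexpectedly exits with a connection "
    "error code 2. End of file."
)
exit_type, exit_detail = parse_worker_exit(sample)
```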

## Design
Raylet reports worker failures through 2 paths:
(1) The core worker calls `Disconnect`.
(2) The worker is killed unexpectedly and its socket is closed, so raylet reports the worker failure itself.

This PR ensures that all worker failures are reported through `Disconnect`, and includes more detailed information in the reported metadata.
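
The idea of funneling both paths into one reporting function can be sketched as follows. This is a minimal model, not the raylet implementation: `RayletSketch`, `WorkerExitRecord`, and all method names are hypothetical, the exit type string is taken from the example output above, and keeping the first report per worker is an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class WorkerExitRecord:
    exit_type: str
    exit_detail: str


@dataclass
class RayletSketch:
    # worker_id -> record of how the worker exited
    reported: Dict[str, WorkerExitRecord] = field(default_factory=dict)

    def disconnect(self, worker_id, exit_type, exit_detail):
        """Single reporting path: every worker failure ends up here."""
        # Assumption: keep the first report so that a later socket close
        # cannot overwrite a reason the worker already supplied.
        self.reported.setdefault(
            worker_id, WorkerExitRecord(exit_type, exit_detail)
        )

    def on_core_worker_disconnect(self, worker_id, exit_type, exit_detail):
        # Path (1): the core worker calls Disconnect with its own reason.
        self.disconnect(worker_id, exit_type, exit_detail)

    def on_socket_closed(self, worker_id, error_message):
        # Path (2): the worker died without calling Disconnect, so the
        # raylet fills in the exit type and detail on its behalf.
        detail = f"Worker unexpectedly exits with a connection error. {error_message}"
        self.disconnect(worker_id, "UNEXPECTED_SYSTEM_EXIT", detail)
```

With this shape, a consumer only has to read one record per worker regardless of which path produced it.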

## Exit types
Previously, worker exit types were arbitrary and not correctly categorized. This PR reduces the number of worker exit types and adds details to each one so that users can easily figure out the root cause of worker crashes.

### Status quo
- SYSTEM ERROR EXIT
    - Failure from the connection (core worker dead)
    - Unexpected exception or exit with exit_code !=0 on core worker
    - Direct call failure
- INTENDED EXIT
    - Shutdown driver
    - Exit_actor
    - exit(0)
    - Actor kill request
    - Task cancel request
- UNUSED_RESOURCE_REMOVED
    - Upon GCS restart, it kills bundles that are not registered to GCS to synchronize the state
- PG_REMOVED
    - When a pg is removed, all of its workers fate-share
- CREATION_TASK (INIT ERROR)
    - When actor init has an error
- IDLE
    - When a worker is idle and num workers > soft limit (by default num cpus)
- NODE DIED
    - Can only be detected when the owner's node is dead (needs improvement)

### New proposal
Remove unnecessary states and add a “details” field. Failures can be categorized into 4 types:

- UNEXPECTED_SYSTEM_ERROR_EXIT
    - When the worker crashes unexpectedly
    - Failure from the connection (core worker dead)
    - Unexpected exception or exit with exit_code !=0 on core worker
    - Node died
    - Direct call failure
- INTENDED_USER_EXIT
    - When users request that the worker be killed. No workflow is required; just store the state correctly.
    - Shutdown driver
    - Exit_actor
    - exit(0)
    - Actor kill request
    - Task cancel request
- INTENDED_SYSTEM_EXIT
    - When the system requests that the worker be killed (without an explicit user request)
    - Unused resource removed
    - Pg removed
    - Idle
- ACTOR_INIT_FAILURE (CREATION_TASK_FAILED)
    - When actor init fails, the worker process fate-shares with the actor.
    - Actor init failed
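
To make the regrouping concrete, here is a small sketch that maps each status quo category onto one of the 4 proposed types and keeps the old category in the detail string. The enum members use the names from the list above; the dictionary keys are just the status quo labels copied as text, and `WorkerExitType`, `OLD_TO_NEW`, and `migrate` are hypothetical names, not Ray definitions.

```python
from enum import Enum, auto


class WorkerExitType(Enum):
    UNEXPECTED_SYSTEM_ERROR_EXIT = auto()
    INTENDED_USER_EXIT = auto()
    INTENDED_SYSTEM_EXIT = auto()
    ACTOR_INIT_FAILURE = auto()


# Status quo category label -> proposed exit type, following the lists above.
OLD_TO_NEW = {
    "SYSTEM ERROR EXIT": WorkerExitType.UNEXPECTED_SYSTEM_ERROR_EXIT,
    "NODE DIED": WorkerExitType.UNEXPECTED_SYSTEM_ERROR_EXIT,
    "INTENDED EXIT": WorkerExitType.INTENDED_USER_EXIT,
    "UNUSED_RESOURCE_REMOVED": WorkerExitType.INTENDED_SYSTEM_EXIT,
    "PG_REMOVED": WorkerExitType.INTENDED_SYSTEM_EXIT,
    "IDLE": WorkerExitType.INTENDED_SYSTEM_EXIT,
    "CREATION_TASK": WorkerExitType.ACTOR_INIT_FAILURE,
}


def migrate(old_category):
    """Return (new_type, detail); the detail preserves the finer-grained cause."""
    new_type = OLD_TO_NEW[old_category]
    return new_type, f"Previously reported as {old_category}."
```

The point of the detail string is that collapsing 7 categories into 4 loses nothing: the specific cause moves from the type into the accompanying text.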

## Limitation (Follow up)
Worker failures are not reported under the following circumstances:
- The worker fails before it registers its information to GCS (this usually comes from a critical system bug and is extremely uncommon).
- The node fails. In this case, we should track a Node ID -> Worker ID mapping at GCS and record worker metadata when the node fails.
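
The Node ID -> Worker ID tracking suggested for the node failure case might look roughly like this. It is only a sketch of the follow-up idea, not existing GCS code: `GcsWorkerTrackerSketch` and its methods are hypothetical, and the exit type follows the proposal above, which lists "Node died" under UNEXPECTED_SYSTEM_ERROR_EXIT.

```python
from collections import defaultdict


class GcsWorkerTrackerSketch:
    def __init__(self):
        # node_id -> set of worker_ids registered on that node
        self._workers_by_node = defaultdict(set)
        # worker_id -> exit metadata recorded when its node failed
        self.dead_workers = {}

    def register_worker(self, node_id, worker_id):
        self._workers_by_node[node_id].add(worker_id)

    def on_node_failed(self, node_id):
        """Record exit metadata for every worker that lived on the failed node."""
        worker_ids = sorted(self._workers_by_node.pop(node_id, set()))
        for worker_id in worker_ids:
            self.dead_workers[worker_id] = {
                "exit_type": "UNEXPECTED_SYSTEM_ERROR_EXIT",
                "exit_detail": f"Node {node_id} died.",
            }
        return worker_ids
```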

I will create issues to track these problems.
2022-05-19 19:48:52 -07:00
example [Doc] Fix bad doc and recover doc of c++ api (#22213) 2022-02-08 19:04:37 +08:00
include/ray [C++ Worker]Python call cpp actor (#23061) 2022-03-15 19:54:10 -07:00
src/ray [Core] Add more accurate worker exit (#24468) 2022-05-19 19:48:52 -07:00
BUILD.bazel [C++ Worker]Python call cpp worker (#22820) 2022-03-10 11:06:14 -08:00
test_python_call_cpp.py [C++ Worker]Python call cpp actor (#23061) 2022-03-15 19:54:10 -07:00