## Why are these changes needed?
Previously, we did not send a new request while another request was still in flight. This is actually harmful, because it prevents the raylet from getting the latest information. For example, if a request takes 200 ms to reach the raylet, the raylet misses one update, and the next request only arrives after 200 + 100 + (in-flight time) ms. So we should still send the request.
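Conceptually, the reporting loop changes from "skip the tick if a request is in flight" to "always send the latest snapshot". A minimal Python sketch of that idea, assuming hypothetical helpers `get_latest_snapshot` and `send_to_raylet` (the real reporting path is in C++):
```python
import threading
import time

def report_loop(get_latest_snapshot, send_to_raylet, interval_s=0.1):
    """Periodically report the latest state to the raylet.

    Old behavior: skip the tick if the previous request has not completed,
    which delays fresh data. New behavior: always send the latest snapshot.
    """
    while True:
        snapshot = get_latest_snapshot()
        # Fire and forget: do not gate the next report on this one finishing.
        threading.Thread(target=send_to_raylet, args=(snapshot,), daemon=True).start()
        time.sleep(interval_s)
```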
TODO:
- Push the snapshot to the raylet if a message is lost.
- Handle message loss in the raylet more gracefully.
## Related issue number
#19438
This RPC is from legacy code and is not needed anymore (the task spec is already in the actor table), but it adds quite a large number of keys to Redis.
Below is the total value size per key prefix (I am not sure whether it is exactly bytes; I summed the length of each value returned when querying Redis) while running many_ppo. As you can see, the `TASK&` and `TASK:` prefixes account for a large share even though they are not really used.
```
In [82]: b
Out[82]:
defaultdict(int,
            {b'WORKE': 1080864,
             b'ACTOR': 1470931,
             b'TASK&': 1020646,
             b'TASK:': 870551,
             b'PROFI': 360000,
             b'PLACE': 10107,
             b'JOB:\x01': 8,
             b'JOB:\x04': 8,
             b'NODE:': 99,
             b'NODE_': 126,
             b'INTER': 44,
             b'JOB:\x03': 8,
             b'redis': 16,
             b'JOB:\x02': 8,
             b'JOB:\x05': 8})
```
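The per-prefix totals above can be reproduced with a script along these lines (a hedged sketch, not the exact script used; the Redis address, the 5-byte prefix length, and using the serialized value length as a size proxy are assumptions):
```python
from collections import defaultdict

import redis

# Assumption: the GCS Redis server is reachable on localhost:6379.
client = redis.Redis(host="localhost", port=6379)

sizes = defaultdict(int)
for key in client.scan_iter(count=1000):
    value = client.dump(key)  # serialized value, works regardless of Redis type
    if value is not None:
        # Group by the first 5 bytes of the key, matching the output above.
        sizes[key[:5]] += len(value)

print(dict(sizes))
```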
## Why are these changes needed?
When Ray spills back a task, it checks through the GCS whether the target node still exists, so there is a race condition, and sometimes the raylet crashes because of it.
This PR filters out nodes that are not available when selecting a node.
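The selection change is conceptually just an availability filter. A minimal Python sketch (the function and field names are illustrative; the real logic lives in the C++ scheduler):
```python
def select_spillback_node(candidate_nodes, alive_node_ids):
    """Pick a spillback target only among nodes the GCS still reports as alive."""
    available = [node for node in candidate_nodes if node.node_id in alive_node_ids]
    if not available:
        return None  # caller keeps the task local instead of spilling back
    # Any existing selection policy (e.g. least loaded) runs on the filtered list.
    return min(available, key=lambda node: node.load)
```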
## Related issue number
#19438
## Why are these changes needed?
The most significant change in this PR is the `GcsPublisher` wrapper added to `src/ray/gcs/pubsub/gcs_pub_sub.h`. It forwards publishing to the underlying `GcsPubSub` (Redis-based) or `pubsub::Publisher` (GCS-based) depending on the migration status, which allows incremental migration channel by channel.
- Since it was decided that we want to use typed ID and messages for GCS-based publishing, each member function of `GcsPublisher` accepts a typed message.
Most of the modified files are from migrating publishing logic in GCS to use `GcsPublisher` instead of `GcsPubSub`.
Later on, `GcsPublisher` member functions will be migrated to use GCS-based publishing.
This change should not alter any functionality. If this looks OK, a similar change will be made for subscribers in the GCS client.
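The wrapper is a straightforward forwarding facade. A Python rendering of the pattern (the real `GcsPublisher` is a C++ class in `src/ray/gcs/pubsub/gcs_pub_sub.h`; the method and field names below are illustrative):
```python
class GcsPublisherSketch:
    """Forwards each typed publish call to the Redis-based or GCS-based backend.

    Deciding per channel is what allows migrating one channel at a time
    without touching the call sites.
    """

    def __init__(self, redis_pubsub, grpc_publisher, migrated_channels):
        self._redis_pubsub = redis_pubsub        # legacy GcsPubSub analogue
        self._grpc_publisher = grpc_publisher    # pubsub::Publisher analogue
        self._migrated_channels = migrated_channels

    def publish_actor(self, actor_id, actor_table_data):
        # Typed message in, routed to whichever backend owns this channel.
        if "ACTOR" in self._migrated_channels:
            self._grpc_publisher.publish("ACTOR", actor_id, actor_table_data)
        else:
            self._redis_pubsub.publish("ACTOR", actor_id, actor_table_data)
```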
## Related issue number
## Why are these changes needed?
Recently we found that the GCS spends a lot of CPU scheduling actors, because the code is not well organized. This PR improves the `SelectNodes` function. Profiling the many-nodes actor test shows that about 50% of the CPU time is wasted and can be saved here.
## Related issue number
Although the event framework seems to work, it has an issue: it prints ERROR-severity events to stderr, which is eventually streamed to the driver. Before we ship this to production, we should fix this issue. To have enough time to fix it, we will turn the feature off temporarily.
## Why are these changes needed?
This PR ports the concurrency groups functionality to Python, implemented with asyncio.
### API
```python
@ray.remote(concurrency_groups={"io": 2, "compute": 4})
class AsyncActor:
    def __init__(self):
        pass

    @ray.method(concurrency_group="io")
    async def f1(self):
        pass

    @ray.method(concurrency_group="io")
    def f2(self):
        pass

    @ray.method(concurrency_group="compute")
    def f3(self):
        pass

    @ray.method(concurrency_group="compute")
    def f4(self):
        pass

    def f5(self):
        pass
```
The annotation above the actor class `AsyncActor` declares that this actor has 2 concurrency groups with the given max concurrencies, plus a default concurrency group. Every concurrency group has its own async event loop and its own Python thread to execute the methods defined in it.
Method `f1` will be invoked in the `io` concurrency group, `f2` in `io`, `f3` in `compute`, and so on.
Note that `f5` and `__init__` will be invoked in the default concurrency group.
In the call below, `f2` will be invoked in the `compute` concurrency group instead, since specifying the group dynamically at call time takes higher priority.
```python
a.f2.options(concurrency_group="compute").remote()
```
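For completeness, a minimal usage sketch tying the pieces together (assuming the `AsyncActor` definition above):
```python
import ray

ray.init()

a = AsyncActor.remote()   # __init__ runs in the default group
ray.get(a.f1.remote())    # runs in the "io" group (up to 2 concurrent)
ray.get(a.f3.remote())    # runs in the "compute" group (up to 4 concurrent)
ray.get(a.f2.options(concurrency_group="compute").remote())  # dynamic override
```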
### Implementation
The straightforward implementation details are:
- Previously an asyncio actor had 1 event loop bound to 1 Python thread. Now we create 1 event loop bound to 1 Python thread for every concurrency group of the asyncio actor (see the sketch after this list).
- Previously there was 1 fiber state per caller in the asyncio actor. Now we create a `FiberStateManager` for every caller in the asyncio actor, and the `FiberStateManager` manages the fiber states of the concurrency groups.
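The "one event loop plus one thread per concurrency group" idea can be sketched in plain Python as follows (a conceptual illustration only, not the actual Ray core-worker implementation):
```python
import asyncio
import threading

class ConcurrencyGroupExecutorSketch:
    """One dedicated event loop running on one dedicated thread per group."""

    def __init__(self, group_names):
        self._loops = {}
        for name in list(group_names) + ["default"]:
            loop = asyncio.new_event_loop()
            thread = threading.Thread(target=loop.run_forever, daemon=True)
            thread.start()
            self._loops[name] = loop

    def submit(self, coro, group="default"):
        # Schedule the coroutine on the event loop that belongs to the group.
        return asyncio.run_coroutine_threadsafe(coro, self._loops[group])
```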
## Related issue number
#16047
## Why are these changes needed?
It looks like the metrics set on the server side are wrong: the time at which a query is constructed is sometimes not the time at which we receive it. This PR fixes that.
## Related issue number
## Why are these changes needed?
Right now the failure signal handler registered in the Python worker is skipped on crashes like segfaults, because the C++ core worker overrides the failure signal handler and does not call the previously registered handler. This prevents the Python stack trace from being printed on crashes. The fix is to make the C++ fault signal handler call the signal handler previously registered in Python. For example, with the script below, which segfaults,
```python
import ray

ray.init()

@ray.remote
def f():
    import ctypes
    ctypes.string_at(0)

ray.get(f.remote())
```
Ray currently only prints the following stack trace:
```
(pid=26693) *** SIGSEGV received at time=1634418743 ***
(pid=26693) PC: @ 0x7fff203d9552 (unknown) _platform_strlen
(pid=26693) [2021-10-16 14:12:23,331 E 26693 12194577] logging.cc:313: *** SIGSEGV received at time=1634418743 ***
(pid=26693) [2021-10-16 14:12:23,331 E 26693 12194577] logging.cc:313: PC: @ 0x7fff203d9552 (unknown) _platform_strlen
```
With this change, the Python stack trace will be printed in addition to the stack trace above:
```
(pid=26693) Fatal Python error: Segmentation fault
(pid=26693)
(pid=26693) Stack (most recent call first):
(pid=26693)   File "/Users/mwtian/opt/anaconda3/envs/ray/lib/python3.7/ctypes/__init__.py", line 505 in string_at
(pid=26693)   File "stack.py", line 7 in f
(pid=26693)   File "/Users/mwtian/work/ray-project/ray/python/ray/worker.py", line 425 in main_loop
(pid=26693)   File "/Users/mwtian/work/ray-project/ray/python/ray/workers/default_worker.py", line 212 in <module>
```
This should make debugging crashes in the Python worker easier for both users and Ray devs.
Also, try to initialize the symbolizer in the GCS, raylet, and core worker. This is a no-op on macOS and some Linux environments (e.g. Ray on Ubuntu 20.04 already produces symbolized stack traces), but it should make Ray more likely to produce symbolized stack traces on other platforms.
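The core idea of the fix, chaining to the previously registered handler instead of replacing it, can be illustrated in Python (a conceptual sketch only; the actual change is in the C++ signal/logging code of the core worker):
```python
import signal

# Save whatever handler was registered before (e.g. by Python's faulthandler).
_previous_handler = signal.getsignal(signal.SIGSEGV)

def chained_handler(signum, frame):
    # ... do our own crash reporting first (e.g. print the C++ stack trace) ...
    # Then invoke the previously registered handler so its output is not lost.
    if callable(_previous_handler):
        _previous_handler(signum, frame)

signal.signal(signal.SIGSEGV, chained_handler)
```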
## Why are these changes needed?
There are a few issues left over from previous PRs:
- Put the `gcs_actor_scheduler_mock_test` back.
- Add a comment for the named actor creation behavior.
- Fix the comments for some flags.
## Related issue number
## Why are these changes needed?
The theory around #19270 is that two create-actor requests are sent to the same threaded actor due to retry logic. Specifically:
- The first request arrives and calls `CoreWorkerDirectTaskReceiver::HandleTask`; it is queued to be executed by the thread pool.
- The second request then arrives and calls `CoreWorkerDirectTaskReceiver::HandleTask` again, before the first request has been executed and has called `worker_context_.SetCurrentTask`.
- This defeats the current dedupe logic and leads to `SetMaxActorConcurrency` being called twice, which fails the `RAY_CHECK`.

In this PR, we fix the dedupe logic by adding `SetCurrentActorId` and calling it in the task execution thread; this ensures the dedupe logic works for threaded actors. We also noticed that `WorkerContext` is not actually thread safe in threaded actors, so this PR makes it thread safe as well. A minimal sketch of the dedupe idea follows below.
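The dedupe boils down to a thread-safe "set once, then ignore duplicates" guard keyed on the actor ID. A Python sketch of that idea (class and method names are illustrative; the real code is the C++ `WorkerContext`):
```python
import threading

class WorkerContextSketch:
    """Set the current actor ID exactly once, even under concurrent calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current_actor_id = None

    def set_current_actor_id(self, actor_id):
        with self._lock:
            if self._current_actor_id is None:
                self._current_actor_id = actor_id
                return True  # first (and only) initialization wins
            # Duplicate create-actor request: ignore instead of re-initializing,
            # which is what previously tripped the RAY_CHECK.
            return self._current_actor_id == actor_id
```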
## Related issue number
Closes #19270
## Why are these changes needed?
Before this PR, there is a race condition where:
- job registration starts
- the driver starts to launch an actor
- the GCS registers the actor ===> crash
- job registration ends

Actor registration should be forced to happen after driver registration. This PR enforces that.
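The enforced ordering is essentially "wait for the job/driver registration to finish before registering any actor". A minimal Python sketch of that ordering (names are illustrative; the real enforcement is inside the GCS and core worker):
```python
import threading

driver_registered = threading.Event()

def on_driver_registration_done():
    # Called when the GCS acknowledges the job/driver registration.
    driver_registered.set()

def register_actor(actor_spec, do_register):
    # Block actor registration until the driver registration has completed,
    # closing the race that previously crashed the GCS.
    driver_registered.wait()
    do_register(actor_spec)
```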
## Related issue number
Closes #19172