## Why are these changes needed?
When Ray spills a task back, it checks through the GCS whether the target node still exists. This introduces a race condition, and the raylet sometimes crashes because of it.
This PR filters out nodes that are not available when selecting a node.
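The filtering idea can be sketched in a few lines. This is only an illustration in Python, not the raylet code (which is C++); `select_spillback_node`, `alive_nodes`, and the resource dictionaries are hypothetical names introduced for this sketch.

```python
from typing import Dict, Optional, Set


def select_spillback_node(
    candidates: Dict[str, Dict[str, float]],
    alive_nodes: Set[str],
    demand: Dict[str, float],
) -> Optional[str]:
    """Return a feasible node that is still alive, or None if there is none.

    `candidates` maps a (hypothetical) node id to its available resources.
    """
    for node_id, available in candidates.items():
        # Skip nodes that are no longer known to be alive, so the caller never
        # has to look up a node that has already been removed.
        if node_id not in alive_nodes:
            continue
        # A node is feasible if it can satisfy every requested resource.
        if all(available.get(name, 0.0) >= amount for name, amount in demand.items()):
            return node_id
    return None
```

Because dead nodes are dropped during selection, the caller no longer needs a separate existence check after a node has been picked, which is where the race would otherwise occur.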
## Related issue number
#19438
## Why are these changes needed?
The most significant change of the PR is the `GcsPublisher` wrapper added to `src/ray/gcs/pubsub/gcs_pub_sub.h`. It forwards publishing to the underlying `GcsPubSub` (Redis-based) or `pubsub::Publisher` (GCS-based) depending on the migration status, so it allows incremental migration by channel.
- Since it was decided that we want to use typed IDs and messages for GCS-based publishing, each member function of `GcsPublisher` accepts a typed message.
Most of the modified files are from migrating publishing logic in GCS to use `GcsPublisher` instead of `GcsPubSub`.
Later on, `GcsPublisher` member functions will be migrated to use GCS-based publishing.
This change should not alter any functionality. If this looks OK, a similar change will be made for subscribers in the GCS client.
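The forwarding pattern can be sketched roughly as follows. This is a simplified Python illustration, not the C++ `GcsPublisher` itself: `PublisherWrapper`, `migrated_channels`, and the single untyped `publish` method are assumptions made for brevity, whereas the real wrapper exposes one typed member function per message kind.

```python
from typing import Callable, Iterable

# Hypothetical backend signature: (channel, key, payload) -> None.
PublishFn = Callable[[str, str, bytes], None]


class PublisherWrapper:
    """Forwards each publish call to one of two backends, chosen per channel."""

    def __init__(
        self,
        legacy_publish: PublishFn,
        new_publish: PublishFn,
        migrated_channels: Iterable[str] = (),
    ):
        self._legacy_publish = legacy_publish
        self._new_publish = new_publish
        self._migrated_channels = set(migrated_channels)

    def publish(self, channel: str, key: str, payload: bytes) -> None:
        # Channels that have been migrated use the new backend; everything
        # else keeps going through the legacy one.
        if channel in self._migrated_channels:
            self._new_publish(channel, key, payload)
        else:
            self._legacy_publish(channel, key, payload)
```

Callers only ever talk to the wrapper, so moving a channel to the new backend becomes a change inside the wrapper rather than at every call site.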
## Related issue number
## Why are these changes needed?
Recently we found that the GCS uses a lot of CPU when scheduling actors because the code is not well organized. This PR improves the `SelectNodes` function. Profiling of the many-nodes actor test shows that 50% of the CPU time is wasted and could be saved here.
## Related issue number
* round robin on failure to launch
* still round-robin spot instances
* prioritize first AZ
* no more round-robining
* doc updates
* Order subnets by AZ
* add spot instance advisor link
* ensure we try all AZs
* fix typos
Although the event framework seems to work, it prints ERROR-severity events to stderr, which are eventually streamed to the driver. We should fix this issue before enabling the feature in production. To give ourselves enough time to fix it, we are turning the feature off temporarily.
## Why are these changes needed?
This PR ports the concurrency groups functionality to asyncio actors in Python.
### API
```python
import ray


@ray.remote(concurrency_groups={"io": 2, "compute": 4})
class AsyncActor:
    def __init__(self):
        pass

    @ray.method(concurrency_group="io")
    async def f1(self):
        pass

    @ray.method(concurrency_group="io")
    def f2(self):
        pass

    @ray.method(concurrency_group="compute")
    def f3(self):
        pass

    @ray.method(concurrency_group="compute")
    def f4(self):
        pass

    def f5(self):
        pass
```
The decorator on the actor class `AsyncActor` declares two concurrency groups, `io` and `compute`, along with their maximum concurrency. The actor also has a default concurrency group. Every concurrency group has its own asyncio event loop and a Python thread that executes the methods assigned to it.
Methods `f1` and `f2` are invoked in the `io` concurrency group, and `f3` and `f4` in `compute`.
Note that `f5` and `__init__` are invoked in the default concurrency group.
The following call invokes `f2` in the `compute` concurrency group, because a group specified dynamically at call time takes priority over the one declared in the method decorator.
```python
a.f2.options(concurrency_group="compute").remote()
```
### Implementation
The main implementation details are:
- Previously, an asyncio actor had a single event loop bound to a single Python thread. Now we create one event loop bound to one Python thread for every concurrency group of the asyncio actor.
- Previously, there was one fiber state for every caller in the asyncio actor. Now we create a `FiberStateManager` for every caller, and the `FiberStateManager` manages the fiber states for the concurrency groups.
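The "one event loop and one thread per concurrency group" idea can be illustrated with plain `asyncio` and `threading`. This is a minimal sketch under stated assumptions, not Ray's implementation: `GroupLoops`, `submit`, and `shutdown` are hypothetical names, and the sketch ignores the per-group maximum concurrency and the fiber states.

```python
import asyncio
import threading
from typing import Dict, Iterable, List


class GroupLoops:
    """One asyncio event loop, running in its own thread, per concurrency group."""

    def __init__(self, groups: Iterable[str]):
        self._loops: Dict[str, asyncio.AbstractEventLoop] = {}
        self._threads: List[threading.Thread] = []
        # Always add a "default" group for methods without an explicit group.
        for name in list(groups) + ["default"]:
            loop = asyncio.new_event_loop()
            thread = threading.Thread(
                target=loop.run_forever, name=f"loop-{name}", daemon=True
            )
            thread.start()
            self._loops[name] = loop
            self._threads.append(thread)

    def submit(self, coro, group: str = "default"):
        # Schedule the coroutine on the loop that belongs to the group and
        # return a concurrent.futures.Future for its result.
        return asyncio.run_coroutine_threadsafe(coro, self._loops[group])

    def shutdown(self) -> None:
        # Stop every loop from its own thread, wait for the threads, then close.
        for loop in self._loops.values():
            loop.call_soon_threadsafe(loop.stop)
        for thread in self._threads:
            thread.join()
        for loop in self._loops.values():
            loop.close()
```

With this layout, a slow coroutine submitted to `"io"` cannot block coroutines submitted to `"compute"`, since each group is driven by a separate loop on a separate thread.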
## Related issue number
#16047
## Why are these changes needed?
It looks like the metrics set on the server side are wrong: the time at which the query is constructed is not always the time at which we receive the query. This PR fixes this.
## Related issue number