Signed-off-by: Clarence Ng <clarence.wyng@gmail.com>
## Why are these changes needed?
This PR adds a memory monitor in C++ that runs periodically to check whether the node's memory usage is above a certain threshold. The caller may provide a callback that the monitor executes at each interval to determine whether an action should be taken.
This PR is a no-op since the monitor is disabled by default. Another PR based on this will make the monitor take action when memory is running low.
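Roughly, the behavior can be pictured with this Python sketch (the actual monitor is C++ and disabled by default; `MemoryMonitor`, the callback signature, and the `/proc/meminfo` read here are illustrative, not the real API):

```python
import threading

# Illustrative sketch only; the real monitor lives in the C++ raylet code.
class MemoryMonitor:
    def __init__(self, usage_threshold, check_interval_s, on_check):
        self.usage_threshold = usage_threshold    # e.g. 0.95 of node memory
        self.check_interval_s = check_interval_s  # how often to poll usage
        self.on_check = on_check                  # callback(is_above_threshold)
        self._stop = threading.Event()

    def _used_fraction(self):
        # Placeholder: read node memory usage from /proc/meminfo (Linux only).
        fields = {}
        with open("/proc/meminfo") as f:
            for line in f:
                name, value = line.split(":")
                fields[name] = int(value.split()[0])  # kB
        return 1.0 - fields["MemAvailable"] / fields["MemTotal"]

    def start(self):
        def loop():
            while not self._stop.wait(self.check_interval_s):
                # Every interval, the caller-provided callback decides
                # whether an action should be taken.
                self.on_check(self._used_fraction() > self.usage_threshold)
        threading.Thread(target=loop, daemon=True).start()

    def stop(self):
        self._stop.set()

# Disabled by default: the monitor is constructed but not started.
monitor = MemoryMonitor(0.95, 1.0, lambda above: print("above threshold:", above))
```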
## Why are these changes needed?
When GCS restarts, the raylet sometimes needs a while to reconnect to it; for example, in a Kubernetes environment, it takes a while for the GCS to be attached to its service. This PR fixes this by allowing a longer timeout for the first ping after GCS restarts.
Once GCS receives the first ping, it switches back to the regular timeout.
Actor tasks are sometimes failed while their dependencies are still being resolved. This can cause hanging or crashes when we resolve the dependencies for a task that has already been canceled. It can lead to a crash from the ref counter when, for the same actor, actor task 1 depends on actor task 2. The sequence is:
1. Actor tasks 1 and 2 queued, 1 depends on 2.
2. Fail actor task 1. We clear its refs, including its dependency on 2.
3. Fail actor task 2. We store an error as its return value. Since task 1 depends on it, we inline the dependency and try to clear task 1's refs again, causing a ref counting error because we already cleared them in step 2.
This PR fixes the issue by canceling dependency resolution for tasks before failing them. This involves some refactoring of the LocalDependencyResolver. Most of the changes are for testing (split out the unit tests for LocalDependencyResolver into their own suite).
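The ordering fix, sketched in Python (class and method names here are illustrative, not the actual LocalDependencyResolver API):

```python
# Illustrative sketch of the fix: cancel dependency resolution before failing.
class Resolver:
    def __init__(self):
        self.pending = {}  # task id -> set of object ids it still waits on

    def resolve_dependencies(self, task_id, deps):
        self.pending[task_id] = set(deps)

    def cancel(self, task_id):
        # Drop the in-flight resolution so a later dependency fulfillment
        # (e.g. task 2's stored error) cannot touch this task's refs again.
        self.pending.pop(task_id, None)

def fail_task(resolver, task_id, clear_refs):
    # Cancel resolution first, then clear refs exactly once.
    resolver.cancel(task_id)
    clear_refs(task_id)
```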
## Related issue number
Closes #18908.
CheckAlive in GCS currently only checks GCS's liveness, but we also need to check the liveness of raylets.
In KubeRay, we can monitor the raylet process directly, but that is not good enough: the raylet process being alive does not mean the raylet is alive from the cluster's point of view.
For example, during a network partition the raylet cannot connect to GCS, and GCS marks the raylet as dead. Although the raylet process is still alive, it cannot be treated as alive because GCS has told all the other nodes that it is dead.
So KubeRay also needs to talk to GCS to check whether a raylet is alive or not.
This PR extends the CheckAlive API to include raylet addresses so that we can query GCS to get the cluster status directly.
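From the Python side, the GCS-sourced liveness of each raylet is what `ray.nodes()` already reports; the check KubeRay needs looks conceptually like this (the extended CheckAlive RPC itself takes raylet addresses and returns liveness bits):

```python
import ray

ray.init(address="auto")

# GCS is the source of truth: a raylet whose process is up but that GCS has
# marked dead must be treated as dead by the rest of the cluster.
for node in ray.nodes():
    addr = f'{node["NodeManagerAddress"]}:{node["NodeManagerPort"]}'
    print(addr, "alive" if node["Alive"] else "dead (per GCS)")
```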
To enable one storage backend to be shared by multiple Ray clusters, a special prefix is added to isolate the data between clusters: `<EXTERNAL_STORAGE_NAMESPACE>@`.
The namespace is given by the environment variable `RAY_external_storage_namespace` when starting the head node: `RAY_external_storage_namespace=1234 ray start --head`
This flag is very important in an HA GCS environment. For example, with the Ray Serve operator, when the operator tries to bring up a new cluster, it is hard to start a new DB, but it is relatively easy to generate a new cluster ID.
Another example: the user might only be able to maintain one HA Redis DB, and the namespace enables the user to start multiple Ray clusters that share the same DB.
This config should be moved to the storage config in the future once we build that.
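A minimal sketch of how the prefix isolates keys in a shared store (illustrative only; the real prefixing happens inside the GCS storage layer):

```python
# Two clusters sharing one key-value store, isolated by the namespace prefix.
shared_store = {}

def make_key(external_storage_namespace, key):
    # "<EXTERNAL_STORAGE_NAMESPACE>@" is prepended to every key.
    return f"{external_storage_namespace}@{key}"

shared_store[make_key("1234", "job:1")] = b"state for cluster 1234"
shared_store[make_key("5678", "job:1")] = b"state for cluster 5678"

# The same logical key never collides across clusters.
assert len(shared_store) == 2
```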
## Why are these changes needed?
This PR adds data truncation when there are more than N entries. The policy is as follows (sketched in code after the list):
- By default, we return at most 100 entries. Users can adjust this value, but we won't allow it to be increased beyond 10K.
- By default, all internal RPCs truncate data if it's > 10K.
- For distributed sources, we query each source with a 10K limit and apply the limit again at the end.
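Roughly, the limits compose like this (a Python sketch; constants and helpers are illustrative):

```python
DEFAULT_LIMIT = 100   # entries returned by default
MAX_LIMIT = 10_000    # hard cap; user-supplied limits cannot exceed this

def effective_limit(user_limit=None):
    return DEFAULT_LIMIT if user_limit is None else min(user_limit, MAX_LIMIT)

def query_all(sources, user_limit=None):
    # Each distributed source is queried with the 10K cap, then the final
    # (possibly smaller) limit is applied to the merged result.
    merged = []
    for query in sources:
        merged.extend(query(limit=MAX_LIMIT))
    return merged[: effective_limit(user_limit)]
```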
## Related issue number
Closes https://github.com/ray-project/ray/issues/25984#issue-1279280673
Part of https://github.com/ray-project/ray/issues/25718#issue-1268968400
In [PR](https://github.com/ray-project/ray/pull/24764) we moved the reconnection logic into GcsRPCClient: in case of a GCS failure, we queue the requests and resend them once GCS is back.
This breaks requests with timeouts, because such requests get queued and never receive a response. This PR fixes that.
Each request is now stored along with the time at which it is supposed to time out. While GCS is down, we check the queued requests, and if one has timed out, we reply immediately with a Timeout error message.
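The deadline bookkeeping, sketched in Python (illustrative; the real logic is in the C++ GCS RPC client):

```python
import time

pending = []  # (deadline, request, reply) queued while GCS is down

def queue_request(request, timeout_s, reply):
    pending.append((time.monotonic() + timeout_s, request, reply))

def check_timeouts():
    # Called periodically while GCS is down: anything past its deadline gets
    # an immediate Timeout reply instead of waiting forever in the queue.
    now = time.monotonic()
    still_pending = []
    for deadline, request, reply in pending:
        if now >= deadline:
            reply("Timeout: GCS is unavailable")
        else:
            still_pending.append((deadline, request, reply))
    pending[:] = still_pending
```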
Ray (on K8s) fails silently when running out of disk space.
Today, when running a script that has a large amount of object spilling, if the disk runs out of space then Kubernetes will silently terminate the node. Autoscaling will kick in and replace the dead node. There is no indication that there was a failure due to disk space.
Instead, we should fail tasks with a good error message when the disk is full.
We monitor disk usage; when node disk usage grows over a predefined capacity (e.g. 90%), we fail new tasks/actors/object puts that allocate new objects.
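Conceptually the check is simple; a Python sketch (the threshold and paths are illustrative, and the real check runs inside the raylet/object store):

```python
import shutil

DISK_USAGE_THRESHOLD = 0.90  # fail new allocations above 90% usage

def disk_full(path="/tmp"):
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= DISK_USAGE_THRESHOLD

def create_object(data, path="/tmp"):
    if disk_full(path):
        # Fail with a clear error instead of letting Kubernetes silently
        # terminate the node when the disk fills up.
        raise OSError(
            "Local disk is almost full; rejecting new object creation."
        )
    ...  # proceed with the allocation / object put
```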
## Why are these changes needed?
When scheduling actors on a placement group, instead of iterating over all nodes in the cluster resources, this optimization directly queries the corresponding nodes by looking at the placement group location index.
This reduces the complexity of the algorithm from O(N) to O(1), where N is the number of nodes. The more nodes there are in large-scale clusters, the bigger the improvement.
**This PR only optimizes GCS-based scheduling; I will submit a PR for raylet scheduling later.**
At Ant Group, we have applied this optimization in the GCS scheduling mode and obtained the following performance test results:
1. The average time to select nodes is reduced from 330us to 30us, about an 11x improvement.
2. The total time to create and execute 12,000 actors drops from 271s to 225s on average, a 17% reduction in time consumption.
More detailed solution information is in the issue.
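Conceptually, the change replaces a scan over every node with a lookup in an index keyed by the placement group, as in this sketch (data structures are illustrative):

```python
# Before: O(N) scan over all nodes to find where the PG's bundles live.
def find_nodes_scan(all_nodes, pg_id):
    return [n["id"] for n in all_nodes if pg_id in n["placement_groups"]]

# After: O(1) lookup in an index maintained when bundles are committed.
pg_location_index = {}  # pg_id -> list of node ids hosting its bundles

def find_nodes_indexed(pg_id):
    return pg_location_index.get(pg_id, [])
```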
## Related issue number
[Core/PG/Schedule]Optimize the scheduling performance of actors/tasks with PG specified #23881
Turns out https://github.com/ray-project/ray/pull/25342 wasn't the root cause of the ray shutdown flakiness. I realized there's another PR that could affect this test suite. Let's try reverting it and see if things get better.
I have seen errors where the Prometheus view data dictionary is changed while iterating over it, so make a copy (which is atomic) before iterating (see the snippet below).
Also, use a separate thread for GCS client heartbeat RPC. This avoids the issue of missing heartbeat when raylet is too busy.
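The copy-before-iterate fix is the usual snapshot pattern, e.g.:

```python
view_data = {"metric_a": 1, "metric_b": 2}  # stands in for the view data dict

# dict(view_data) takes a snapshot in one atomic step, so a concurrent writer
# can no longer trigger "RuntimeError: dictionary changed size during iteration".
for name, value in dict(view_data).items():
    print(name, value)
```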
This PR supports reconnection of the GCS client at the gRPC channel layer. Previously this was implemented in the application layer:
- Health checks were done in the application layer by starting a new channel.
- The GCS address was monitored for changes, and resubscription was done manually.
- Failed requests were always retried, with reconnection on failure.
However, there are several issues with this approach:
- We need service discovery to detect GCS address changes.
- The monitor is too heavy since it always creates a channel.
- Reconnection is a blocking call that prevents other code from running.
The new approach moves the reconnection directly into the gRPC layer to fix these issues (a rough sketch follows the list):
- DNS name resolution is done by gRPC, so we don't need to implement it ourselves.
- Health checks are done by checking the channel state.
- Failed calls are queued and retried once GCS is up, so reconnection is no longer a blocking call.
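In Python terms, the channel-state health check and the non-blocking retry queue look roughly like this (the real client is C++; `grpc.channel_ready_future` is used here only to illustrate checking channel state rather than opening a new channel):

```python
import grpc

channel = grpc.insecure_channel("gcs-host:6379")  # address is illustrative

def gcs_channel_alive(timeout_s=1.0):
    # Health check via the existing channel's state, not a brand-new channel.
    try:
        grpc.channel_ready_future(channel).result(timeout=timeout_s)
        return True
    except grpc.FutureTimeoutError:
        return False

failed_calls = []  # calls queued while the channel is down

def call_or_queue(make_call):
    if gcs_channel_alive():
        return make_call()
    # Non-blocking: queue the call and retry once the channel reconnects.
    failed_calls.append(make_call)
    return None
```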
* Provide a utility to ping a Ray cluster and verify that it has the same Ray version. This is useful for checking whether a Ray cluster is available at a given address without connecting to the cluster via the more heavyweight ray.init(). This utility is integrated with `ray memory` to provide a better error message when the Ray cluster is unavailable. There also seems to be user demand for exposing this as an API.
* Improve the error message when the address provided to Ray does not contain a port.
Clean up the ci/ directory. This means getting rid of the travis/ path completely and moving the files into sensible subdirectories.
Details:
- Moves everything under ci/travis into subdirectories, e.g. ci/build, ci/lint, etc.
- Minor adjustments to some scripts (variable renames)
- Removes the outdated (unused) asan tests
Instead of relying on the node-ip custom resource for static task-to-node placement, this PR introduces an explicit NodeAffinitySchedulingStrategy with the following benefits:
1. Specify the node by ID instead of IP, since an IP may not be unique for each node.
2. Support a soft constraint so the task can be tolerant of node failures.
After this PR, the node-ip custom resource can be deprecated.
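Usage looks like this (API names per the public `ray.util.scheduling_strategies` module; `get_node_id()` is available in recent Ray releases):

```python
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init()

@ray.remote
def where_am_i():
    return ray.get_runtime_context().get_node_id()

target_node_id = ray.get_runtime_context().get_node_id()  # e.g. this node

# soft=True: prefer the node but allow rescheduling if it becomes unavailable;
# soft=False would fail the task instead.
ref = where_am_i.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(
        node_id=target_node_id, soft=True
    )
).remote()
assert ray.get(ref) == target_node_id
```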
As we (@scv119 @iycheng @raulchen @Chong-Li @WangTaoTheTonic ) discussed offline, the GcsResourceScheduler on the GCS side should be unified to ClusterResourceScheduler.
There is already a big PR (#23268) to do this, but in order to make review easier, I will split it into two or more small PRs.
This is [3/n]:
- Move the implementation of all policies from gcs_resource_scheduler to bundle_scheduling_policy
- Delete gcs_resource_scheduler
- Refactor gcs_resource_scheduler_test to cluster_resource_scheduler_2_test
BTW: the interfaces inside ISchedulingPolicy should be refactored in another PR; see the discussion in #23323 (comment).
To be clear:
Scorer-related code is moved out of gcs_resource_scheduler into scorer.h/.cc with no logic changes.
Policy-related code is moved out of gcs_resource_scheduler into bundle_scheduling_policy.h/.cc, and a small part of the logic in "GcsResourceScheduler::Schedule" is distributed into each policy.
Some code inside gcs_placement_group_scheduler.h/.cc is changed to adapt to the new data structures (SchedulingResult and SchedulingContext).
## Why are these changes needed?
This PR refactors the resource syncer to decouple it from GCS and the raylet. GCS and the raylet will use the same module to sync data. The integration will happen in the next PR.
There are several newly introduced components:
* RaySyncer: the place where remote and local information sits. It's a coordinator layer.
* NodeState: keeps track of the local status, similar to NodeSyncConnection.
* NodeSyncConnection: keeps track of the sending and receiving information and makes sure we don't send information the remote node already knows.
The core protocol is that each node sends {what it has} - {what the target has} to the target (a small sketch of this rule follows the walkthrough below).
For example, for nodes A <-> B, A sends everything A has, excluding what B already has, to B.
Whenever there is new information (from NodeState or NodeSyncConnection), it is passed to RaySyncer to broadcast.
NodeSyncConnection is the communication layer. It has two implementations, Client and Server:
* Server => Client: the client sends a long-polling request and the server responds every 100ms if there is data to be sent.
* Client => Server: the client checks every 100ms whether there is new data to be sent. If there is, it uses an RPC call to send the data.
Here is one example:
```mermaid
flowchart LR;
A-->B;
B-->C;
B-->D;
```
It means A initializes the connection to B, and B initializes the connections to C and D.
Now suppose C generates a message M:
1. [C] RaySyncer checks whether a new message has been generated in C and gets M.
2. [C] RaySyncer pushes M to the NodeSyncConnection of the local component (B).
3. [C] ServerSyncConnection waits until B sends a long-polling request and then sends the data to B.
4. [B] B receives the message from C and pushes it to its local sync connections (C, A, D).
5. [B] The ClientSyncConnection of C does not push it to its local queue, since the message was received from this channel.
6. [B] The ClientSyncConnection of D sends this message to D.
7. [B] The ServerSyncConnection of A is used to send this message to A (long-polling here).
8. [B] B updates NodeState (the local component) with this message M.
9. [D] D's pipeline is similar to steps 5 (with ServerSyncConnection) and 8.
10. [A] A's pipeline is similar to steps 5 and 8.
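The {what it has} - {what the target has} rule can be sketched as per-peer bookkeeping of what has already been delivered (illustrative Python, not the real RaySyncer/NodeSyncConnection classes):

```python
# Each connection remembers what the peer already knows and sends only the delta.
class SyncNode:
    def __init__(self, name):
        self.name = name
        self.messages = {}    # message id -> payload this node knows about
        self.peer_known = {}  # peer name -> set of message ids the peer has

    def generate(self, msg_id, payload):
        self.messages[msg_id] = payload

    def send_to(self, peer):
        known = self.peer_known.setdefault(peer.name, set())
        # {what it has} - {what the target has}
        delta = {m: p for m, p in self.messages.items() if m not in known}
        for msg_id, payload in delta.items():
            peer.messages.setdefault(msg_id, payload)
            known.add(msg_id)
            # The receiver also records that this peer has the message, so it
            # is never echoed back (step 5 in the walkthrough above).
            peer.peer_known.setdefault(self.name, set()).add(msg_id)

# A <-> B <-> C: C generates M, B relays it to A without echoing it back to C.
a, b, c = SyncNode("A"), SyncNode("B"), SyncNode("C")
c.generate("M", "payload")
c.send_to(b)
b.send_to(a)
b.send_to(c)  # sends nothing: B knows C already has M
assert "M" in a.messages
```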
This is the first PR to refactor scheduler data structures (See #22850).
Major changes:
- Hid the implementation details in the `ResourceRequest` and `TaskResourceInstances` classes, which expose public methods such as algebra operators and comparison operators (a rough sketch is shown after the lists below).
- Hid the differences between "predefined" and "custom" resources inside these 2 classes. Call sites can simply use the resource ID to access the resource, no matter whether it is predefined or custom.
- The predefined_resources vector now always has the full length, so no more "resize"s are needed.
- Removed the `ResourceCapacity` class. Now "total" and "available" resources are stored in separate fields in "NodeResources".
- Moved helper functions for FixedPoint vectors from "cluster_resource_data.h" to "fixed_point.h"
- "ResourceID" now has static methods to get the resource ids of predefined resources, e.g. "ResourceID::CPU()".
- Encapsulated unit-instance resource logic to "ResourceID"
Other planned changes that are not included in this PR:
- Rename ResourceRequest to ResourceSet, and move it to its own file.
- Remove the predefined vectors and always use maps.
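The shape of the new abstraction, sketched in Python (the real classes are C++; the bodies below are illustrative, not the actual implementation):

```python
# A single map keyed by resource ID hides whether a resource is "predefined"
# (CPU, GPU, ...) or custom, and operators work on whole requests.
class ResourceRequest:
    def __init__(self, amounts=None):
        self._amounts = dict(amounts or {})  # resource id -> amount

    def get(self, resource_id):
        return self._amounts.get(resource_id, 0.0)

    def __add__(self, other):
        ids = set(self._amounts) | set(other._amounts)
        return ResourceRequest({i: self.get(i) + other.get(i) for i in ids})

    def __le__(self, other):
        return all(self.get(i) <= other.get(i) for i in self._amounts)

# Call sites use the same ID-based access for predefined and custom resources.
demand = ResourceRequest({"CPU": 2, "my_custom_res": 1})
available = ResourceRequest({"CPU": 8, "GPU": 1, "my_custom_res": 4})
assert demand <= available
```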
Co-authored-by: Chong-Li <lc300133@antgroup.com>
This is the next move after #19220. This PR replaces unordered_map with flat_hash_map in most GCS code and some util & common modules.
The placement group part, which exposes user interfaces in Java/Python, is excluded as it's a little bit more complicated.
Follow-up PRs will migrate the core worker, placement group, and other modules.
To hide the symbols of third-party libraries, this pull request reuses the internal namespace, which can be imported by any external native project without side effects.
In addition, we suggest that contributors use third-party libraries inside Ray scopes/namespaces, and that only ray::internal be exported.
More details in https://github.com/ray-project/ray/pull/22526
Mobius has applied this change in https://github.com/ray-project/mobius/pull/28.
Co-authored-by: 林濯 <lingxuan.zlx@antgroup.com>
Currently object transfers assume that the object size is fixed. This is a bad assumption during failures, especially with lineage reconstruction enabled and tasks with nondeterministic outputs.
This PR cleans up the handling and hopefully guards against two cases where the object size may change during a transfer:
1. The object manager's size information does not match the object in the local plasma store (due to async notifications). --> the object manager overwrites its own information if it finds that the physical object has a different size.
2. The receiver's created buffer size does not match the sender's object size. --> the receiver destroys the previous buffer and creates a new buffer with the correct size. This might cause some transient errors but eventually object transfer should succeed.
Unfortunately I couldn't trigger this from Python because it depends on some pretty specific timing conditions. However, I did add some unit tests for case 2 (this is the majority of the PR).
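Case 2 in sketch form (illustrative Python; the real handling is in the C++ object manager):

```python
def receive_chunk(buffers, object_id, declared_size, offset, chunk):
    buf = buffers.get(object_id)
    if buf is not None and len(buf) != declared_size:
        # The sender's object size changed (e.g. reconstruction of a task with
        # nondeterministic output): destroy the stale buffer and start over.
        buffers.pop(object_id)
        buf = None
    if buf is None:
        buf = bytearray(declared_size)
        buffers[object_id] = buf
    buf[offset:offset + len(chunk)] = chunk
```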
In Ray, functions are exported to the function table during runtime. But it's not cleaned up after use. This PR garbage collects the resource when there is no job/detached actor referencing the resource.
Ideally, we should move the function table import/export feature to core. As a step toward that, a GCS function manager is introduced; currently, it is used for reference counting only.
- This PR moves the `ObjectManager::Wait` related logic to a separate WaitManager class.
- Fix the wait hang issue by not relying on async object location notifications, and instead checking whether the wait is complete when the local object is added; at that time the object is guaranteed to be local.
We need more than just the ray_namespace config of a job. In this PR, we cache the job_configs instead of ray_namespaces so that we can use them in other PRs (for example, PR #21249 needs the num_java_worker_per_process item).
Also, before this PR the ray_namespaces_ cache was never cleared; we clear the cache in this PR.
For the actor channel, GCS clients subscribe to a single actor while the dashboard subscribes to all actors. This change makes supporting both cases possible.
Most of the added code is in `integration_test.cc`, which tests the publisher and subscriber together.
Also, add the basic support for dashboard reporter pubsub.