Commit graph

2756 commits

Author SHA1 Message Date
Chong-Li
1807cff9b6
Replace the legacy ResourceSet & SchedulingResources at Raylet (#23173) 2022-04-22 14:46:38 +08:00
SangBin Cho
30ab5458a7
[State Observability] Tasks and Objects API (#23912)
This PR implements ray list tasks and ray list objects APIs.

NOTE: You can ignore the merge conflict for now. It is because the first PR was reverted. There's a fix PR open now.
2022-04-21 18:45:03 -07:00
mwtian
02b0d82cf8
[Ray client] return None from internal KV for non-existent keys (#24058)
This fixes the behavior diff between client and non-client internal KV.
2022-04-21 10:55:57 -07:00
jon-chuang
ddcc252b51
[Core] Ray logs API (1/n) (#23435)
Expose HTTP endpoint to retrieve logs from ray cluster
2022-04-20 23:11:02 -07:00
Yi Cheng
04611edf5a
[scheduler] Update syncer API and add reconnect feature. (#23929)
This PR focuses on updating syncer-related code and comments from this #23660 to reduce the code size.

Update Snapshot/Update -> CreateSyncMessage/ConsumeSyncMessage
Make ray syncer test work even when we add more components in the protobuf
Make ray syncer able to reconnect to a new node.
2022-04-20 14:31:24 -07:00
Gagandeep Singh
554831fad1
Increase register timeout seconds (#23223) 2022-04-20 12:25:01 -07:00
Jiajun Yao
6cfec51d1e
Spread even if nodes are not available (#23445)
Several changes to make spread scheduling work better under load:

* When nodes are not available, spread among feasible nodes.
* If grant_or_reject is true, don't spill back if the selected node is not available.
* Don't spill due to waiting for dependencies for spread tasks.
2022-04-20 07:35:15 -07:00
mwtian
34fb092656
[Pubsub] reduce memory usage for channels that do not require total memory cap (#23985)
In a1e06f64ae, memory bound was added for each subscribed entity in the publisher. It adds two extra `std::deque` per subscribed entity, which turns out to cost a lot more memory when there are a large number of `ObjectRef`s: https://github.com/ray-project/ray/pull/23853#issuecomment-1098382286

This PR avoids the extra memory usage for entities in channels unlikely to grow too large, i.e. all channels except those for logs and error info. Subscribed entity memory usage no longer shows up in the memory profile when there are 1M object refs. Raw data: [profile006.pb.gz](https://github.com/ray-project/ray/files/8508547/profile006.pb.gz)
2022-04-19 17:44:15 -07:00
jon-chuang
e0c0ea2e59
[Core] Add node_name field to GcsNodeInfo (#23543)
Make it easier to identify nodes by a string identifier separate from their IP address.
2022-04-19 05:03:12 -07:00
mwtian
ea66192a38
[GCS] Use gRPC instead of socket for GCS client health check (#23939)
A user has reported a crash in GCS client where the client was unable to connect to the GCS server after retries, even when GCS server has always been running. I was not able to reproduce the exact issue, but noticed that the health check logic with socket has unexpected behavior sometimes, e.g. it is much slower to use socket for health check compared to using gRPC (~40s vs < 1s sometimes). The user issue could be related to this slowness, so this PR updates the logic to use gRPC health check.
2022-04-18 16:55:29 -07:00
mwtian
d5d2ef4249
[Core] Add a utility to check GCS / Ray cluster health (#23382)
* Provide a utility to ping a Ray cluster and verify it has the same Ray version. This is useful to check if a Ray cluster is available at a given address, without connecting to the cluster with the more heavyweight ray.init(). This utility is integrated with ray memory to provide a better error message when the Ray cluster is unavailable. There seem to be user demand for exposing this as an API as well.
* Improve the error message when the address provided to Ray does not contain port.
2022-04-18 09:58:45 -07:00
Lingxuan Zuo
b7d148815e
[CoreWorker API] collect mobius used core worker api to internal (#23961)
To remove symbols conflict effect on core worker linked different ray versions. This PR extracts an united core worker api (not all) and collect them into a internal library, so native devs can use them anywhere no matter the core worker implementation changes.
2022-04-18 16:03:20 +08:00
Jiajun Yao
5d7f45fc8f
Unify AddSpilledUrl into UpdateObjectLocationBatch RPC (#23872)
- Logically these two rpcs are about notifying the owner about the object location changes, so we should just have one rpc for that purpose. This prevents out-of-order updates seen by the owner (i.e. receiving object removed from object store before spill update). Also by using UpdateObjectLocationBatch, we get batch update for free.
- Maintain a FIFO order for object location updates so we won't have starvation.
2022-04-17 21:48:29 -07:00
mwtian
16227683f6
[Core] trim size of Reference struct (#23853)
During large scale shuffle (number of partitions used >= 1000), driver uses significant amount of memory for storing ObjectRefs. On Intel MacOS, each Reference struct currently takes up 592 bytes. We can reduce per-Reference memory footprint:

    - During shuffle, no ObjectRef borrowing or nesting happens. And in this case fields related to borrowing or nesting should not take up memory. This reduces sizeof(Reference) from 592 to 400.
    - Fields in the Reference struct can be reordered to enhance packing. This reduces sizeof(Reference) from 400 to 368.

On Intel MacOS running the shuffle benchmark with 1000 partitions and 10MB partition size, RSS at the end of shuffle drops from ~5GB to ~4.5GB.

Related issue number

#23604
2022-04-14 13:25:06 -07:00
Jiajun Yao
95714cc281
Node affinity scheduling strategy (#23381)
Instead of relying on the node-ip custom resource for static task-to-node placement, this PR introduces an explicit NodeAffinitySchedulingStrategy with the following benefits:

1. Specify node using id instead of ip since ip may not be unique for each node.
2. Support soft constraint so the task can be tolerant to node failures.

After this PR, the node-ip custom resource can be deprecated.
2022-04-12 21:31:26 -07:00
Tao Wang
6aefe9b36e
[Core]Save task spec in separate table (#22650)
This is a rebase version of #11592. As task spec info is only needed when gcs create or start an actor, so we can remove it from actor table and save the serialization time and memory/network cost when gcs clients get actor infos from gcs.

As internal repository varies very much from the community. This pr just add some manual check with simple cherry pick. Welcome to comment first and at the meantime I'll see if there's any test case failed or some points were missed.
2022-04-12 12:24:26 -07:00
Chen Shen
15c32a210c
[GCS][Storage unification 1/n] store index in memory instead of storage (#23754)
Today we have two storage interfaces in Gcs, one is InternalKvInterface which exposes key value interfaces, another is StoreClient which is kv interface with secondary index support.
To make GCS storage pluggable, we need to narrow down and unify the storage interface. This is a try to only use kv store and build index purely in memory.

known limitations:

we need to rebuild index during GCS startup
there might be consistency issues when concurrent change (write/delete) to the same key; but the current redis based solution also suffer from the same issue.
2022-04-10 15:56:08 -07:00
mwtian
a1e06f64ae
[Pubsub] Add memory buffer limit in publisher for each subscribed entity (#23707)
We want to limit the maximum memory used for each subscribed entity, e.g. to avoid having GCS run out of memory if some workers publish a huge amount of logs and subscribers cannot keep up.

After this change, Ray publishers maintain one message buffer for all subscribers of an entity, and one message buffer for all subscribers of all entities in the channel.

The limit can be configured with publisher_entity_buffer_max_bytes. The default value is 10MiB.
2022-04-08 00:36:38 -07:00
ZhuSenlin
7a46f5176a
[scheduler] Refactor scheduling policy interface (#23628)
As we (@scv119 @raulchen @Chong-Li @WangTaoTheTonic) discussed offline, and @scv119 also mentioned it in (#23323 (comment)). I refactor the interface of ISchedulingPolicy and make it expose only one batch interface., and still provide a SingleSchedulePolicy to be compatible with single scheduler.

Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
2022-04-07 20:59:43 -07:00
Stephanie Wang
1c972d5d2d
[core] Spill at least the object fusion size instead of at most (#22750)
Copied from #22571:

Whenever we spill, we try to spill all spillable objects. We also try to fuse small objects together to reduce total IOPS. If there aren't enough objects in the object store to meet the fusion threshold, we spill the objects anyway to avoid liveness issues. However, currently we spill at most the object fusion size when instead we should be spilling at least the fusion size. Then we use the max number of fused objects as a cap.

This PR fixes the fusion behavior so that we always spill at minimum the fusion size. If we reach the end of the spillable objects, and we are under the fusion threshold, we'll only spill it if we don't have other spills pending too. This gives the pending spills time to finish, and then we can re-evaluate whether it's necessary to spill the remaining objects. Liveness is also preserved.

Increases some test timeouts to allow tests to pass.
2022-04-05 10:57:42 -07:00
liuyang-my
bdd3b9a0ab
[Serve] Unified Controller API for Cross Language Client (#23004) 2022-04-05 09:20:02 -07:00
jon-chuang
9c950e8979
[Core] Placement Group: Fix Flakey Test placement_group_test_5 and Typo (#23350)
placement_group_test_5 is flakey. Reason is requesting PG with exact object store memory as node. If object store has object, then PG scheduling fails.

Also fix bug - typo.
2022-04-05 05:33:43 -07:00
Yi Cheng
99ca8ee8e4
[flaky] Deflaky ray_syncer_test (#23703)
```
src/ray/common/test/ray_syncer_test.cc:495: Failure
  | Expected: (s1.GetNumConsumedMessages(s2.syncer->GetLocalNodeID())) < (max_sends * 2 + 3), actual: 5 vs 5
```
This is measuring number of request send. For extreme case, they should equal.  This PR fixed this.
2022-04-04 19:38:58 -07:00
Yi Cheng
e1a974aa9c
[gcs] Remove not useful options in redis client options. (#23572)
This PR removes not useful options in Redis client options.
2022-04-01 14:41:15 -07:00
Tao Wang
2ce3cd0073
[Hotfix]Fix compile failure (#23651) 2022-04-01 13:23:11 -07:00
Stephanie Wang
b43426bc33
[core] Add metrics for disk and network I/O (#23546)
Adds some metrics useful for object-intensive workloads:

    Per raylet/object manager:
        Add num bytes pending restore to spill manager
        Add num requests cumulative to PullManager
        Num bytes pushed/pulled from other nodes cumulative
        Histogram for request latencies in PullManager:
            total life time of request, from start to cancel
            request satisfaction time, from start to object local
            pull time, from object activation to object local
    Per-node disk read/write speed, IOPS
2022-04-01 11:15:34 -07:00
Jiajun Yao
c5c5c24e8f
Remove unused ObjectDirectory::LookupLocations() (#23647)
Remove dead code.
2022-04-01 10:03:37 -07:00
Hao Chen
75f1861625
Remove predefined resources vector in ResourceRequest (#23584)
"ResourceRequest" now uses 2 containers: a vector for predefined resources, and a map for custom resources. 
This was intended to be a perf optimization. However, in practice, this makes the code more complex, and, moreover, prevents optimizations for some methods (e.g., "ResourceIds", "Size").

This PR removes the vector and makes ResourceRequest use only one map for all resources. Also, "ResourceIds" now returns a "boost:range" to allow iterating resource IDs without having to construct temporary sets. 

microbenchmark shows a slight perf improvement.
last nightly: `placement group create/removal per second 837.76 +- 16.68`.
this PR: `placement group create/removal per second 895.76 +- 16.99`.
2022-03-31 17:16:11 -07:00
Yi Cheng
5a2ab76af8
[flaky] Release gcs client in test (#23644)
To deflaky gcs_client_test, this PR tries to release the client object.
2022-03-31 16:57:50 -07:00
Yi Cheng
8d7f71601d
deflaky ray syncer test (#23641) 2022-03-31 13:42:30 -07:00
Yi Cheng
87dc57df26
[3][cleanup][gcs] Remove redis based pubsub. (#23520)
This PR removes redis based pubsub.
2022-03-31 00:13:55 -07:00
Yi Cheng
df3e761b18
[gcs] Change old syncer to gcs_syncer namespace (#23623)
Before integration of the newly introduced ray_syncer, there is a conflict in naming. This PR move the old ray syncer to another namespace.
2022-03-30 21:54:02 -07:00
ZhuSenlin
e79a63db64
[GCS] [3 / n] Refactor gcs_resource_scheduler to cluster_resource_scheduler #23371
As we (@scv119 @iycheng @raulchen @Chong-Li @WangTaoTheTonic ) discussed offline, the GcsResourceScheduler on the GCS side should be unified to ClusterResourceScheduler.

There is already a big PR( #23268 ) to do this, but in order to make review easy, I will split it to two or mall small PRs.
This is [3/n]:

Move the implementation of all policies from gcs_resource_scheduler to bundle_scheduling_plocy
Delete gcs_resource_scheduler
Refactor gcs_resource_scheduler_test to cluster_resource_scheduler_2_test
BTW: The interface inside ISchedulingPolicy should be refactor in another PR, see the discussion #23323 (comment)

To be clear:

scorer related codes are moved out from gcs_resoruce_scheduler to scorer.h/.cc and no logic changes.
Policy related codes are moved out from gcs_resoruce_scheduler to bundle_scheduling_policy.h/.cc, and a small part of the logic in "GcsResourceScheduler::Schedule" is distributed into each policy.
Some codes inside gcs_placement_group_scheduler.h/.cc are changed to adapt to new data structure (SchedulingResult and SchedulingContext)
2022-03-30 17:39:46 -07:00
Yi Cheng
d01f947ff1
[gcs] Make core worker test compilable. (#23608)
It seems like core worker test is not running and it breaks the build. This PR fixed this.
2022-03-30 17:26:38 -07:00
jon-chuang
54ddcedd1a
[Core] Chore: move test to right dir #23096 2022-03-30 09:29:38 -07:00
Yi Cheng
781c46ae44
[scheduling][5] Refactor resource syncer. (#23270)
## Why are these changes needed?

This PR refactor the resource syncer to decouple it from GCS and raylet. GCS and raylet will use the same module to sync data. The integration will happen in the next PR.

There are several new introduced components:

* RaySyncer: the place where remote and local information sits. It's a coordinator layer.
* NodeState: keeps track of the local status, similar to NodeSyncConnection.
* NodeSyncConnection: keeps track of the sending and receiving information and make sure not sending the information the remote node knows.

The core protocol is that each node will send {what it has} - {what the target has} to the target.
For example, think about node A <-> B. A will send all A has exclude what B has to B.

Whenever when there is new information (from NodeState or NodeSyncConnection), it'll be passed to RaySyncer broadcast message to broadcast. 

NodeSyncConnection is for the communication layer. It has two implementations Client and Server:

* Server => Client: client will send a long-polling request and server will response every 100ms if there is data to be sent.
* Client => Server: client will check every 100ms to see whether there is new data to be sent. If there is, just use RPC call to send the data.

Here is one example:

```mermaid
flowchart LR;
    A-->B;
    B-->C;
    B-->D;
```

It means A initialize the connection to B and B initialize the connections to C and D

Now C generate a message M:

1. [C] RaySyncer check whether there is new message generated in C and get M
2. [C] RaySyncer will push M to NodeSyncConnection in local component (B)
3. [C] ServerSyncConnection will wait until B send a long polling and send the data to B
4. [B] B received the message from C and push it to local sync connection (C, A, D)
5. [B] ClientSyncConnection of C will not push it to its local queue since it's received by this channel.
6. [B] ClientSyncConnection of D will send this message to D
7. [B] ServerSyncConnection of A will be used to send this message to A (long-polling here)
8. [B] B will update NodeState (local component) with this message M
9. [D] D's pipelines is similar to 5) (with ServerSyncConnection) and 8)
10. [A] A's pipeline is similar to 5) and 8)
2022-03-29 23:52:39 -07:00
Stephanie Wang
da7901f3fc
[core] Filter out self node from the list of object locations during pull (#23539)
Running Datasets shuffle with 1TB data and 2k partitions sometimes times out due to a failed object fetch. This happens because the object directory notifies the PullManager that the object is already on the local node, even though it isn't. This seems to be a bug in the object directory.

To work around this on the PullManager side, this PR filters out the current node from the list of locations provided by the object directory. @jjyao confirmed that this fixes the issue for Datasets shuffle.
2022-03-29 15:18:14 -07:00
Yi Cheng
61c9186b59
[2][cleanup][gcs] Cleanup GCS client options. (#23519)
This PR cleanup GCS client options.
2022-03-29 12:01:58 -07:00
Hao Chen
b7d32df8b0
Refactor scheduler data structures (#22854)
This is the first PR to refactor scheduler data structures (See #22850).

Major changes:
- Hid the implementation details in the `ResourceRequest` and `TaskResourceInstnaces` classes, which expose public methods such as algebra operators and comparison operators. 
- Hid the differences between "predefined" and "custom" resources inside these 2 classes. Call sites can simply use the resource ID to access the resource, no matter it is predefined or custom.
- The predefined_resources vector now always has the full length. So no more "resize"s are needed. 
- Removed the `ResourceCapacity` class. Now "total" and "available" resources are stored in separate fields in "NodeResources". 
- Moved helper functions for FixedPoint vectors from "cluster_resource_data.h" to "fixed_point.h"
- "ResourceID" now has static methods to get the resource ids of predefined resources, e.g. "ResourceID::CPU()". 
- Encapsulated unit-instance resource logic to "ResourceID"

Other planned changes that are not included in this PR:
- Rename ResourceRequest to ResourceSet, and move it to its own file.
- Remove the predefined vectors and always use maps.

Co-authored-by: Chong-Li <lc300133@antgroup.com>
2022-03-29 19:44:59 +08:00
Matti Picus
77c4c1e48e
WINDOWS: enable and fix failures in test_runtime_env_complicated (#22449) 2022-03-29 00:56:42 -07:00
Yi Cheng
7de751dbab
[1][core][cleanup] remove enable gcs bootstrap in cpp. (#23518)
This PR remove enable_gcs_bootstrap flag in cpp.
2022-03-28 21:37:24 -07:00
Chen Shen
51bdefc2c8
[scheduler][monitoring] dump detailed spilling metrics (#23321)
Dump the detailed spilling metrics in scheduler.
2022-03-28 10:49:04 -07:00
Qing Wang
ef5b9b87d3
[Java] Add set runtime env api for normal task. (#23412)
This PR adds the API `setRuntimeEnv` for submitting a normal task, for the usage:
```java
RuntimeEnv runtimeEnv =
    new RuntimeEnv.Builder()
        .addEnvVar("KEY1", "A")
        .build();

/// Return `A`
Ray.task(RuntimeEnvTest::getEnvVar, "KEY1").setRuntimeEnv(runtimeEnv).remote().get();
```
2022-03-24 15:57:24 +08:00
mwtian
26f1a7ef7d
[Core] Account for spilled objects when reporting object store memory usage (#23425) 2022-03-23 22:25:22 -07:00
Eric Liang
38925f60d2
Add a get_if_exists option for simpler creation of named actors (#23344)
Getting or creating a named actor is a common pattern, however it is somewhat esoteric in how to achieve this. Add a utility function and test that it doesn't cause any scary error messages.

Actor.options(name="my_singleton", get_if_exists=True).remote(args)
2022-03-23 22:02:58 -07:00
Chong-Li
6e0e46ea56
[GCS] Make gcs scheduler accommodate cluster/local task managers (#22942)
* Accommodate cluster and local task managers

* Fix warning

* Fix bug

* Format

* Format

* fix torch

* Fix comments

* lint

Co-authored-by: Chong-Li <lc300133@antgroup.com>
2022-03-23 15:58:59 -07:00
Jiajun Yao
dfebf7ffae
Fix metric type for NumSpilledTasks to gauge (#23391)
The metric type for NumSpilledTasks should be gauge since the sum already happens in SchedulerStats.
2022-03-22 16:17:00 -07:00
Guyang Song
69af9764b2
[runtime env] URI reference refactor (#22828)
- Move the URI reference logic from raylet to agent.
- Redefine the runtime env agent RPC to `CreateRuntimeEnvOrGet` and `DeleteRuntimeEnvIfPossible`
- More details https://github.com/ray-project/ray/issues/21695#issuecomment-1032161528

Future works
- We don't remove the `RuntimeEnvUris` from `RuntimeEnv` protobuf in current PR because gcs also uses those URIs to do GC by runtime_env_manager. We should also clear this.
- Ray client server shouldn't interact with agent directly. Or Ray client server should also decrease the reference count.
- Currently, `WorkerPool::HandleJobStarted` will be called multiple times for one job. So we should make sure this function is idempotent. Can we change this logic and make this function be called only once?
2022-03-21 11:21:15 -05:00
Larry
81dcf9ff35
[Placement Group] Make PlacementGroupID generate from JobID (#23175) 2022-03-21 17:09:16 +08:00
ZhuSenlin
871f749baf
[GCS] [2 / n] Refactor gcs_resource_scheduler to cluster_resource_scheduler (#23323)
* Add new interface to policy for batch scheduling and unify the scheduling result and context

* Remove the dependence of GcsClient on ClusterResourceScheduler

* fix compile error

* fix lint error

Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
2022-03-20 15:03:14 -07:00