This PR implements the `ray list tasks` and `ray list objects` APIs.
NOTE: You can ignore the merge conflict for now; it exists because the first PR was reverted, and a fix PR is already open.
This PR focuses on updating syncer-related code and comments from #23660 to reduce the code size:
* Rename Snapshot/Update to CreateSyncMessage/ConsumeSyncMessage.
* Make the ray syncer test keep working when more components are added to the protobuf.
* Make the ray syncer able to reconnect to a new node.
Several changes to make spread scheduling work better under load (a short usage sketch follows this list):
* When nodes are not available, spread among feasible nodes.
* If grant_or_reject is true, don't spill back if the selected node is not available.
* For spread tasks, don't spill back due to waiting for dependencies.
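For context, the spread tasks mentioned above are tasks submitted with the `SPREAD` scheduling strategy. A minimal usage sketch:

```python
import ray

ray.init()

@ray.remote(num_cpus=1)
def work(i):
    return i

# Tasks submitted with scheduling_strategy="SPREAD" are the ones affected by
# the changes above (spreading among feasible nodes under load, etc.).
refs = [work.options(scheduling_strategy="SPREAD").remote(i) for i in range(8)]
print(ray.get(refs))
```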
In a1e06f64ae, a memory bound was added for each subscribed entity in the publisher. It adds two extra `std::deque`s per subscribed entity, which turns out to cost a lot more memory when there are a large number of `ObjectRef`s: https://github.com/ray-project/ray/pull/23853#issuecomment-1098382286
This PR avoids the extra memory usage for entities in channels unlikely to grow too large, i.e. all channels except those for logs and error info. Subscribed entity memory usage no longer shows up in the memory profile when there are 1M object refs. Raw data: [profile006.pb.gz](https://github.com/ray-project/ray/files/8508547/profile006.pb.gz)
A user reported a crash in the GCS client where the client was unable to connect to the GCS server after retries, even though the GCS server had been running the whole time. I was not able to reproduce the exact issue, but noticed that the socket-based health check logic sometimes behaves unexpectedly, e.g. it can be much slower than a gRPC health check (~40s vs < 1s sometimes). The user's issue could be related to this slowness, so this PR updates the logic to use a gRPC health check.
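For illustration, a minimal sketch of a gRPC health check using the standard gRPC health-checking protocol (this only illustrates the pattern, not Ray's actual code path, and assumes the server exposes the standard health service):

```python
import grpc
from grpc_health.v1 import health_pb2, health_pb2_grpc

def server_is_healthy(address: str, timeout_s: float = 5.0) -> bool:
    """Return True if the gRPC server at `address` reports SERVING."""
    channel = grpc.insecure_channel(address)
    try:
        stub = health_pb2_grpc.HealthStub(channel)
        resp = stub.Check(health_pb2.HealthCheckRequest(service=""), timeout=timeout_s)
        return resp.status == health_pb2.HealthCheckResponse.SERVING
    except grpc.RpcError:
        return False
    finally:
        channel.close()
```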
* Provide a utility to ping a Ray cluster and verify it is running the same Ray version. This is useful to check whether a Ray cluster is available at a given address, without connecting to the cluster with the more heavyweight `ray.init()`. This utility is integrated with `ray memory` to provide a better error message when the Ray cluster is unavailable. There also seems to be user demand for exposing this as an API.
* Improve the error message when the address provided to Ray does not contain a port.
To remove the effect of symbol conflicts on core workers linked against different Ray versions, this PR extracts a unified core worker API (not the full API) and collects it into an internal library, so native developers can use it anywhere regardless of changes to the core worker implementation.
- Logically these two RPCs are about notifying the owner about object location changes, so we should just have one RPC for that purpose. This prevents out-of-order updates seen by the owner (i.e. receiving an "object removed from object store" update before the spill update). Also, by using UpdateObjectLocationBatch, we get batched updates for free.
- Maintain a FIFO order for object location updates so we won't have starvation.
During large-scale shuffle (number of partitions used >= 1000), the driver uses a significant amount of memory for storing ObjectRefs. On Intel MacOS, each Reference struct currently takes up 592 bytes. We can reduce the per-Reference memory footprint:
- During shuffle, no ObjectRef borrowing or nesting happens, so in that case the fields related to borrowing or nesting should not take up memory. This reduces sizeof(Reference) from 592 to 400.
- Fields in the Reference struct can be reordered to improve packing. This reduces sizeof(Reference) from 400 to 368.
On Intel MacOS, running the shuffle benchmark with 1000 partitions and a 10MB partition size, RSS at the end of the shuffle drops from ~5GB to ~4.5GB.
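A small illustration of the field-reordering point, using `ctypes` structs with made-up field names (these are not the actual Reference fields; sizes are for a typical 64-bit platform):

```python
import ctypes

class Interleaved(ctypes.Structure):
    # 1-byte fields interleaved with 8-byte fields leave padding holes.
    _fields_ = [
        ("flag_a", ctypes.c_bool),
        ("count_a", ctypes.c_uint64),
        ("flag_b", ctypes.c_bool),
        ("count_b", ctypes.c_uint64),
    ]

class Reordered(ctypes.Structure):
    # Grouping the 8-byte fields first removes most of the padding.
    _fields_ = [
        ("count_a", ctypes.c_uint64),
        ("count_b", ctypes.c_uint64),
        ("flag_a", ctypes.c_bool),
        ("flag_b", ctypes.c_bool),
    ]

print(ctypes.sizeof(Interleaved))  # 32
print(ctypes.sizeof(Reordered))    # 24
```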
Related issue number: #23604
Instead of relying on the node-ip custom resource for static task-to-node placement, this PR introduces an explicit NodeAffinitySchedulingStrategy with the following benefits:
1. Specify the node using its ID instead of its IP, since IPs may not be unique across nodes.
2. Support a soft constraint so the task can tolerate node failures.
After this PR, the node-ip custom resource can be deprecated. A short usage sketch of the new strategy follows.
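A minimal sketch of how the new strategy is used from Python (assuming current Ray APIs; here the target node ID is taken from `ray.nodes()`):

```python
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init()

@ray.remote
def whoami():
    return ray.get_runtime_context().get_node_id()

# Pin the task to a specific node by ID. soft=True would instead let the
# task run elsewhere if the target node becomes unavailable.
target_node_id = ray.nodes()[0]["NodeID"]
ref = whoami.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=target_node_id, soft=False)
).remote()
print(ray.get(ref) == target_node_id)
```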
This is a rebased version of #11592. Since task spec info is only needed when GCS creates or starts an actor, we can remove it from the actor table, saving serialization time and memory/network cost when GCS clients get actor info from GCS.
Since the internal repository differs a lot from the community one, this PR just adds some manual checks via a simple cherry-pick. Comments are welcome; in the meantime I'll check whether any test cases fail or any points were missed.
Today we have two storage interfaces in GCS: InternalKvInterface, which exposes a key-value interface, and StoreClient, which is a kv interface with secondary index support.
To make GCS storage pluggable, we need to narrow down and unify the storage interface. This is an attempt to use only the kv store and build the index purely in memory (see the sketch after the limitations list).
Known limitations:
- We need to rebuild the index during GCS startup.
- There might be consistency issues under concurrent changes (write/delete) to the same key; however, the current Redis-based solution suffers from the same issue.
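An illustrative sketch of the idea (not GCS's actual classes): a StoreClient-style secondary index kept purely in memory on top of a plain kv store, rebuilt by scanning the store at startup.

```python
class IndexedKvStore:
    """Plain kv store plus a purely in-memory secondary index."""

    def __init__(self, kv: dict):
        self.kv = kv          # key -> value (the only persisted state)
        self.index = {}       # index_key -> set of primary keys

    def rebuild_index(self, index_fn):
        # Must be re-run on startup, since the index is not persisted.
        self.index.clear()
        for key, value in self.kv.items():
            self.index.setdefault(index_fn(value), set()).add(key)

    def put(self, key, value, index_key):
        self.kv[key] = value
        self.index.setdefault(index_key, set()).add(key)

    def get_by_index(self, index_key):
        return [self.kv[k] for k in self.index.get(index_key, ())]
```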
We want to limit the maximum memory used for each subscribed entity, e.g. to avoid having GCS run out of memory if some workers publish a huge amount of logs and subscribers cannot keep up.
After this change, Ray publishers maintain one message buffer for all subscribers of an entity, and one message buffer for all subscribers of all entities in the channel.
The limit can be configured with publisher_entity_buffer_max_bytes. The default value is 10MiB.
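Assuming the limit is exposed like other RayConfig values through `_system_config` on the head node (an assumption, not verified here), overriding it would look roughly like:

```python
import ray

ray.init(
    _system_config={
        # Raise the per-entity publisher buffer cap from the default
        # 10 MiB to 20 MiB (key name taken from this PR).
        "publisher_entity_buffer_max_bytes": 20 * 1024 * 1024,
    }
)
```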
As we (@scv119 @raulchen @Chong-Li @WangTaoTheTonic) discussed offline, and as @scv119 also mentioned in #23323 (comment), I refactored the interface of ISchedulingPolicy to expose only one batch interface, while still providing a SingleSchedulePolicy to stay compatible with the single scheduler.
Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
Copied from #22571:
Whenever we spill, we try to spill all spillable objects. We also try to fuse small objects together to reduce total IOPS. If there aren't enough objects in the object store to meet the fusion threshold, we spill the objects anyway to avoid liveness issues. However, we currently spill at most the object fusion size, when we should instead be spilling at least the fusion size, with the max number of fused objects used as a cap.
This PR fixes the fusion behavior so that we always spill at minimum the fusion size. If we reach the end of the spillable objects while still under the fusion threshold, we only spill the batch if there are no other spills pending. This gives the pending spills time to finish, after which we can re-evaluate whether it's necessary to spill the remaining objects. Liveness is also preserved.
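A rough sketch of the intended batching rule (names and structure are illustrative, not Ray's actual implementation):

```python
def pick_objects_to_spill(spillable, min_fusion_bytes, max_fused_count, num_pending_spills):
    # Fuse objects until we reach at least the fusion threshold,
    # capped by the maximum number of fused objects.
    batch, batch_bytes = [], 0
    for obj in spillable:
        if batch_bytes >= min_fusion_bytes or len(batch) >= max_fused_count:
            break
        batch.append(obj)
        batch_bytes += obj.size
    # Under-sized batch at the end of the spillable list: only spill it if
    # nothing else is already being spilled; otherwise wait and re-evaluate.
    if batch_bytes < min_fusion_bytes and num_pending_spills > 0:
        return []
    return batch
```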
Increases some test timeouts to allow tests to pass.
placement_group_test_5 is flaky. The reason is that it requests a PG with exactly as much object store memory as the node has; if the object store already contains an object, PG scheduling fails.
Also fixes a bug (typo):
```
src/ray/common/test/ray_syncer_test.cc:495: Failure
Expected: (s1.GetNumConsumedMessages(s2.syncer->GetLocalNodeID())) < (max_sends * 2 + 3), actual: 5 vs 5
```
This measures the number of requests sent; in the extreme case the two values can be equal. This PR fixes that.
Adds some metrics useful for object-intensive workloads:
Per raylet/object manager:
* Num bytes pending restore in the spill manager
* Num requests cumulative in the PullManager
* Num bytes pushed/pulled from other nodes, cumulative
* Histograms for request latencies in the PullManager:
  * total lifetime of a request, from start to cancel
  * request satisfaction time, from start to object local
  * pull time, from object activation to object local
* Per-node disk read/write speed, IOPS
"ResourceRequest" now uses 2 containers: a vector for predefined resources, and a map for custom resources.
This was intended to be a perf optimization. However, in practice, this makes the code more complex, and, moreover, prevents optimizations for some methods (e.g., "ResourceIds", "Size").
This PR removes the vector and makes ResourceRequest use only one map for all resources. Also, "ResourceIds" now returns a "boost:range" to allow iterating resource IDs without having to construct temporary sets.
microbenchmark shows a slight perf improvement.
last nightly: `placement group create/removal per second 837.76 +- 16.68`.
this PR: `placement group create/removal per second 895.76 +- 16.99`.
As we (@scv119 @iycheng @raulchen @Chong-Li @WangTaoTheTonic ) discussed offline, the GcsResourceScheduler on the GCS side should be unified to ClusterResourceScheduler.
There is already a big PR (#23268) to do this, but in order to make the review easy, I will split it into several small PRs.
This is [3/n]:
* Move the implementation of all policies from gcs_resource_scheduler to bundle_scheduling_policy.
* Delete gcs_resource_scheduler.
* Refactor gcs_resource_scheduler_test into cluster_resource_scheduler_2_test.
BTW: the interface inside ISchedulingPolicy should be refactored in another PR; see the discussion in #23323 (comment).
To be clear:
Scorer-related code is moved out of gcs_resource_scheduler into scorer.h/.cc, with no logic changes.
Policy-related code is moved out of gcs_resource_scheduler into bundle_scheduling_policy.h/.cc, and a small part of the logic in "GcsResourceScheduler::Schedule" is distributed into each policy.
Some code inside gcs_placement_group_scheduler.h/.cc is changed to adapt to the new data structures (SchedulingResult and SchedulingContext).
## Why are these changes needed?
This PR refactors the resource syncer to decouple it from GCS and the raylet. GCS and the raylet will use the same module to sync data. The integration will happen in the next PR.
There are several new introduced components:
* RaySyncer: the place where remote and local information sits. It's a coordinator layer.
* NodeState: keeps track of the local status, similar to NodeSyncConnection.
* NodeSyncConnection: keeps track of what has been sent and received, and makes sure not to resend information the remote node already knows.
The core protocol is that each node will send {what it has} - {what the target has} to the target.
For example, think about nodes A <-> B. A will send everything A has, excluding what B already has, to B.
Whenever there is new information (from NodeState or a NodeSyncConnection), it is passed to RaySyncer to broadcast.
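A small sketch of the "{what it has} - {what the target has}" rule, assuming each side tracks a version number per (node, component) entry (illustrative only):

```python
def messages_to_send(local_versions, versions_known_to_peer):
    """Send only the entries whose local version is newer than the version
    we believe the peer already holds."""
    return {
        key: version
        for key, version in local_versions.items()
        if version > versions_known_to_peer.get(key, -1)
    }
```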
NodeSyncConnection is the communication layer. It has two implementations, Client and Server:
* Server => Client: the client sends a long-polling request, and the server responds every 100ms if there is data to be sent.
* Client => Server: the client checks every 100ms whether there is new data to be sent. If there is, it just uses an RPC call to send the data.
Here is one example:
```mermaid
flowchart LR;
A-->B;
B-->C;
B-->D;
```
It means A initializes the connection to B, and B initializes the connections to C and D.
Now suppose C generates a message M (a small sketch of the relay rule follows this walkthrough):
1. [C] RaySyncer checks whether there is a new message generated in C and gets M.
2. [C] RaySyncer pushes M to the local NodeSyncConnection (the one for B).
3. [C] The ServerSyncConnection waits until B sends a long-polling request and then sends the data to B.
4. [B] B receives the message from C and pushes it to its local sync connections (C, A, D).
5. [B] The ClientSyncConnection for C will not push it into its queue, since the message was received through this channel.
6. [B] The ClientSyncConnection for D sends this message to D.
7. [B] The ServerSyncConnection for A is used to send this message to A (via long-polling).
8. [B] B updates its NodeState (the local component) with message M.
9. [D] D's pipeline is similar to 5) (with a ServerSyncConnection) and 8).
10. [A] A's pipeline is similar to 5) and 8).
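A sketch of the relay rule from steps 4-8, with illustrative names (not the actual C++ classes): on receipt, a node applies the message to its own NodeState and forwards it to every sync connection except the one it arrived on.

```python
def on_message_received(message, source_connection, connections, node_state):
    # Step 8: update the local component with the received message.
    node_state.consume(message)
    for conn in connections:
        # Step 5: never push a message back to the channel it came from.
        if conn is not source_connection:
            # Steps 6 and 7: forward to the remaining peers.
            conn.push(message)
```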
Running Datasets shuffle with 1TB data and 2k partitions sometimes times out due to a failed object fetch. This happens because the object directory notifies the PullManager that the object is already on the local node, even though it isn't. This seems to be a bug in the object directory.
To work around this on the PullManager side, this PR filters out the current node from the list of locations provided by the object directory. @jjyao confirmed that this fixes the issue for Datasets shuffle.
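A sketch of the workaround on the PullManager side (illustrative, not the actual C++ code): drop the local node from the location set reported by the object directory before deciding where to pull from.

```python
def usable_locations(reported_locations, self_node_id):
    # The object directory may wrongly claim the object is already local;
    # only consider genuinely remote nodes as pull sources.
    return {loc for loc in reported_locations if loc != self_node_id}
```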
This is the first PR to refactor scheduler data structures (See #22850).
Major changes:
- Hid the implementation details in the `ResourceRequest` and `TaskResourceInstances` classes, which expose public methods such as algebra operators and comparison operators.
- Hid the differences between "predefined" and "custom" resources inside these 2 classes. Call sites can simply use the resource ID to access the resource, no matter whether it is predefined or custom.
- The predefined_resources vector now always has the full length. So no more "resize"s are needed.
- Removed the `ResourceCapacity` class. Now "total" and "available" resources are stored in separate fields in "NodeResources".
- Moved helper functions for FixedPoint vectors from "cluster_resource_data.h" to "fixed_point.h"
- "ResourceID" now has static methods to get the resource ids of predefined resources, e.g. "ResourceID::CPU()".
- Encapsulated unit-instance resource logic to "ResourceID"
Other planned changes that are not included in this PR:
- Rename ResourceRequest to ResourceSet, and move it to its own file.
- Remove the predefined vectors and always use maps.
Co-authored-by: Chong-Li <lc300133@antgroup.com>
This PR adds the API `setRuntimeEnv` for submitting a normal task. Usage:
```java
RuntimeEnv runtimeEnv =
    new RuntimeEnv.Builder()
        .addEnvVar("KEY1", "A")
        .build();
/// Return `A`
Ray.task(RuntimeEnvTest::getEnvVar, "KEY1").setRuntimeEnv(runtimeEnv).remote().get();
```
Getting or creating a named actor is a common pattern, yet achieving it today is somewhat esoteric. Add a utility so that `Actor.options(name="my_singleton", get_if_exists=True).remote(args)` does it in one call, and test that it doesn't cause any scary error messages.
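A minimal end-to-end sketch of the pattern (the actor class and method names are illustrative):

```python
import ray

ray.init()

@ray.remote
class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

# The first call creates the named actor; subsequent calls from any driver
# or worker return a handle to the existing actor instead of erroring.
counter = Counter.options(name="my_singleton", get_if_exists=True).remote()
print(ray.get(counter.increment.remote()))
```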
- Move the URI reference logic from raylet to agent.
- Redefine the runtime env agent RPC to `CreateRuntimeEnvOrGet` and `DeleteRuntimeEnvIfPossible`
- More details https://github.com/ray-project/ray/issues/21695#issuecomment-1032161528
Future work
- We don't remove `RuntimeEnvUris` from the `RuntimeEnv` protobuf in this PR because GCS also uses those URIs to do GC via runtime_env_manager. We should clean this up as well.
- Ray client server shouldn't interact with agent directly. Or Ray client server should also decrease the reference count.
- Currently, `WorkerPool::HandleJobStarted` will be called multiple times for one job. So we should make sure this function is idempotent. Can we change this logic and make this function be called only once?
* Add new interface to policy for batch scheduling and unify the scheduling result and context
* Remove the dependence of GcsClient on ClusterResourceScheduler
* fix compile error
* fix lint error
Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>