hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-05 10:01:43 -05:00

Author	SHA1	Message	Date
Yi Cheng	7de751dbab	[1][core][cleanup] remove enable gcs bootstrap in cpp. (#23518 ) This PR remove enable_gcs_bootstrap flag in cpp.	2022-03-28 21:37:24 -07:00
ZhuSenlin	125ef0e5a6	[GCS] integrate cluster_resource_manager into gcs_resource_manager and gcs_resource_scheduler (#23105 ) * refactor gcs_resource_manager * fix lint error * fix lint error * fix compile error * fix test * fix test * fix test * add unit test * refactor UpdateNodeNormalTaskResources * fix comment Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>	2022-03-16 16:27:14 -07:00
Chen Shen	5a2ebc281c	[Scheduler] separate scheduler code to its own build target (#23124 ) * wip * comments * fix build * fix-test * fix format	2022-03-14 23:23:58 -07:00
Tao Wang	10c03cb126	Migrating to flat hash map [GCS&util&common] (#22932 ) Next move of #19220. This pr replace unordered_map to flat_hash_map in most GCS code and some util & common modules. The placement group part, which exposes user interfaces in Java/Python, is exclusive as it's a little bit complicated. The follow-up PRs would be migrating in core worker, placement group and others.	2022-03-11 18:35:06 +09:00
Chen Shen	bc3f7a7684	[scheduling policy 3/n][rfc] Refactor SchedulingPolicy into interface and implementations (#22907 ) * scheduling policy * update Co-authored-by: Gagandeep Singh <gdp.1807@gmail.com>	2022-03-08 18:47:56 -08:00
Chen Shen	3e3db8e9cd	[scheduler] hide StringIDMap under BaseSchedulingID (#22722 ) * add * address comments	2022-03-01 22:50:53 -08:00
Chen Shen	dfcb0f5de5	[clean up ClusterResourceScheduler 1/n] move IsSchedulable logic into ClusterResourceManager #22711	2022-02-28 20:37:56 -08:00
Lingxuan Zuo	94caac8722	Remove exporting symbols (#22623 ) To hidden symbols of thirdparty library, this pull request reuses internal namespace that can be imported by any external native projects without side effects. Besides, we suggest all of contributors to make sure it'd better use thirdparty library in ray scopes/namspaces and only ray::internal should be exported. More details in https://github.com/ray-project/ray/pull/22526 Mobius has applied this change in https://github.com/ray-project/mobius/pull/28. Co-authored-by: 林濯 <lingxuan.zlx@antgroup.com>	2022-02-28 09:41:10 +08:00
Stephanie Wang	634ca9afdb	[core] Cleanup handling for nondeterministic object size during transfer (#22639 ) Currently object transfers assume that the object size is fixed. This is a bad assumption during failures, especially with lineage reconstruction enabled and tasks with nondeterministic outputs. This PR cleans up the handling and hopefully guards against two cases where the object size may change during a transfer: 1. The object manager's size information does not match the object in the local plasma store (due to async notifications). --> the object manager overwrites its own information if it finds that the physical object has a different size. 2. The receiver's created buffer size does not match the sender's object size. --> the receiver destroys the previous buffer and creates a new buffer with the correct size. This might cause some transient errors but eventually object transfer should succeed. Unfortunately I couldn't trigger this from Python because it depends on some pretty specific timing conditions. However, I did add some unit tests for case 2 (this is the majority of the PR).	2022-02-25 09:39:14 -08:00
mwtian	839bc5019f	Fix building Windows wheels (#22388 ) (#22391 ) This fixes Windows wheel build issue on master and releases/1.11.0 branch. If the issue happens more often we can try to run iwyu.	2022-02-15 15:24:10 -08:00
Lingxuan Zuo	ec62d7f510	[Streaming]Farewell : remove all of streaming related from ray repo. (#21770 ) New repo url is https://github.com/ray-project/mobius Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>	2022-01-23 17:53:41 +08:00
Yi Cheng	e4ba51f25b	[core] Add GC for function table (#21509 ) In Ray, functions are exported to the function table during runtime. But it's not cleaned up after use. This PR garbage collects the resource when there is no job/detached actor referencing the resource. Ideally, we should move the function table imports/exports feature to core, so gcs function manager is introduced, and currently, it's for reference counting only.	2022-01-13 18:06:05 -08:00
SangBin Cho	f5fdbeb594	Refactor event tracker out of asio class (#21215 ) This refactors the event tracker to be decoupled from the asio class. Co-authored-by: Eric Liang <ekhliang@gmail.com>	2022-01-12 22:43:31 -08:00
Jiajun Yao	aec37d4b60	Add container utils (#21444 ) - Add debug_string helper functions for common containers. - Add map_find_or_die helper function	2022-01-10 15:29:29 -08:00
Jiajun Yao	48a5208645	Refactor ObjectManager wait logic to WaitManager (#21369 ) - This PR moves the `ObjectManager::Wait` related logic to a separate WaitManager class. - Fix the wait hang issue by not relying on the async object location notification, but checking if wait is complete when the local object is added, at that time the object is guaranteed to be local.	2022-01-06 10:42:31 -05:00
Qing Wang	3c68370fcf	[Core] Cache job_configs instead of ray_namespace. (#21279 ) We need to get not only ray_namespace config of a job. In this PR, we cache the job_configs instead of ray_namespaces, so that we can use it for other PR(For example, this PR #21249 needs the num_java_worker_pre_process item). Also, before this PR, ray_namespaces_ cache will not be cleared, and we clear the cache in this PR.	2022-01-05 17:48:06 -08:00
mwtian	2410ec5ef0	[Core][Dashboard Pubsub 1/n] Allow a channel to have subscribers to a key and to the whole channel concurrently (#20954 ) For actor channel, GCS clients subscribe to a single actor but dashboard subscribes to all actors. This change makes supporting this possible. Most of the added code is in `integration_test.cc`, which tests the publisher and subscriber together. Also, add the basic support for dashboard reporter pubsub.	2021-12-09 15:00:38 -08:00
SangBin Cho	5298a9046c	[Internal Observability] [Part 1] Centralize existing metrics to metric_defs.h (#20728 ) This PR centralizes all existing metrics to `metric_defs.h`. Previously, each file relies on implicit import of metric_def.h within the stats module. After this PR we only precisely import `metric_defs.h` for each file.	2021-12-08 14:06:05 -08:00
Yi Cheng	442b1025cd	[1/gcs-mem-kv] Memory mode for internal kv (#20881 ) This is part work of redis removal. In this PR we introduced a new mode for internal kv, memory mode. There are two ways to address this: - Update store client and use store client in internal kv - Add memory table into internal kv directly. The former one actually is a better choice since it put everything related to storage into a lowerlevel. But it's pretty hard to do this now, since internal kv use hset/hget and redis store client use set/get, so the data will not be compatible and it'll be a brake change. So the easier way to do this is 2) and it's what this PR doing. Next: use the flag for store client	2021-12-08 10:40:35 -08:00
Lixin Wei	96dc10a95a	[Core] Fix Crash in ObjectDirectory (#20540 ) Here we met a crash in line 446's RAY_CHECK `d26c9e67e8/src/ray/object_manager/ownership_based_object_directory.cc (L441-L450)` And we found out that it's because we didn't set the node_id for dead nodes. If there are dead nodes and we are trying to LookupRemoteConnectionInfo in it. This crash will happen. This PR fixes this crash.	2021-12-07 23:03:49 -08:00
Yi Cheng	ea1d081aac	[core] Simple chaos testing for asio (#19970 ) Right now in ray, a lot of edge cases related to grpc are not tested. This PR is just a simple try to give the developer some way to delay grpc request. It could be used with manual testing and also e2e test since it's supporting delay for specific grpc method. To use this feature, just simple set os env `RAY_TESTING_ASIO_DELAY_US="method1=10:20,method2=20:30,*=200:200"` This means, for `method1` it'll delay 10-20us, for method2 it'll delay 20-30us. For all the rest, it'll delay 200us.	2021-12-07 14:47:07 -08:00
Qing Wang	116bda8f05	[Core] Remove duplicated implementations of concurrency group executor. (#20467 ) ## Why are these changes needed? ThreadPoolManager and FiberStateManager have the same functionality and logic. This PR aims to remove the duplicate implementations of them. Add a ConcurrencyGroupExecutor class to do that logic. `ConcurrencyGroupExecutor<FiberState>` is used as FiberStateManager, `ConcurrencyGroupExecutor<BoundedExecutor>` is used as ThreadPoolManager.	2021-11-27 12:57:40 +08:00
Gagandeep Singh	f22a24aca4	Replace time based seed generation with `absl::BitGen` and `absl::Uniform` (#20696 )	2021-11-24 14:36:35 -08:00
Guyang Song	53630ee03b	Revert "Revert "[runtime env] redefine runtime env to protobuf"" and fix windows compiling (#20692 ) - Fix windows compiling and revert https://github.com/ray-project/ray/pull/20641 - Seems the pr https://github.com/ray-project/ray/pull/20670 can solve the windows compiling issue.	2021-11-24 09:01:01 -08:00
Alex Wu	9388d28233	Revert "[runtime env] redefine runtime env to protobuf" (#20641 ) Reverts #19511 Breaks windows compilation	2021-11-22 13:11:30 -08:00
Guyang Song	ad56b9b432	[runtime env] redefine runtime env to protobuf (#19511 )	2021-11-20 16:54:42 +08:00
Chen Shen	f02b53a810	[Core][actor out-of-order execution 3/n] Introducing out-of-order actor submit queue (#20150 ) Why are these changes needed? This is the third PR in the stack that supports out or order execution for threaded/async actors. Previous PR #20149 Next PR #20160 At a high level, threaded actor/async actor already don't guarantee execution order, and the current "sequential" order implementation has caused some confusion and inconvenience. Please refer to #19822 for detailed discussion. In this PR, we implemented the out-of-order of queue that supports out of order execution. Conceptually it's very simple: it sends the requests as soon as the dependency is resolved.	2021-11-16 10:48:49 -08:00
Tao Wang	507bd9186b	[Core]Make convertion between ray/grpc status more specific (#20047 ) * [Core]Make convertion between ray/grpc status more specific * per comments * lint * per comments * use ABORT instead of UNKNOWN, add some tests * lint * lint	2021-11-10 00:48:05 -08:00
Lingxuan Zuo	97259e33b2	Relink grpc/absl for streaming.so (#20136 ) To avoid exporting thrirdparty library symbol globally, these absl/grpc libs have been applied in _streaming.so. Side-effect: Static variables might be uninitialized if core worker lib and streaming lib both use them.	2021-11-09 14:13:53 +08:00
mwtian	ef4b6e4648	[Core][GCS] remove gcs object manager (#19963 )	2021-11-02 16:20:53 -07:00
SangBin Cho	99b5932d06	Add a simple node failure integration test + clean up spammy logs upon node failures (#19695 ) * . * Done * clean up * lint * fix a bug * lint * fix issue * Remove no-op from StartRayLog * Addressed code review.	2021-10-29 18:42:35 -04:00
Lixin Wei	56301e34b2	[Refactor] Remove ServiceBased Abstraction (#19694 ) ## Why are these changes needed? Prior to this PR, we have: ```cpp class XxxAccessor {} class ServiceBasedXxxAccessor : public XxxAccessor{} class GcsClient {} class ServiceBasedGcsClient : public GcsClient{} ``` However, XxxAccessor has only one implementation: ServiceBasedXxxAccessor. And GcsClient has only one implementation: ServiceBasedGcsClient. I think this abstraction is not necessary and will make development hard(I have to modify two files every time). This PR removes all ServiceBasedXxx and moves its implementations to the base class. Now we only have: ```cpp class XxxAccessor {} class GcsClient {} ```	2021-10-29 10:16:14 -07:00
mwtian	b238297bfb	[Core][Pubsub] Support subscribing to GCS via Ray pubsub (#19687 ) This PR adds more infrastructure for subscribing to GCS via ray::pubsub instead of Redis. Most important logic added are GCS subscriber RPC interface in src/ray/protobuf/gcs_service.proto GCS subscriber handler in src/ray/gcs/gcs_server/pubsub_handler.{h,cc} GCS wrapper for ray::pubsub subscriber in src/ray/gcs/pubsub/gcs_pub_sub.{h,cc} Other files are modified for adding boilerplates, plumbing, removing dead code and cleanups. This PR can also be reviewed commit by commit. 418f065, 3279430 are cleanups. 028939c is a pure-refactoring of how GCS clients subscribe to GCS updates that should not change behavior yet, similar to [Pubsub] Wrap Redis-based publisher in GCS to allow incrementally switching to the GCS-based publisher #19600. 286161f parameterized gcs_server_test to test GCS pubsub. The rest of commits have new logic added. All new logic are behind the gcs_grpc_based_pubsub flag, so this PR should not affect Ray's default behavior. The added subscriber logic was tested by enabling gcs_grpc_based_pubsub in service_based_gcs_client_test.cc and adding basic handling logic for TaskLease. Since TaskLease pubsub will be removed, the change will not be checked in. Next step is to support SubscribeAll entities for a channel in ray::pubsub, and test migrating more channels.	2021-10-28 01:18:54 +08:00
Yi Cheng	48fb86a978	[core] Fix the spilling back failure in case of node missing (#19564 ) ## Why are these changes needed? When ray spill back, it'll check whether the node exists or not through gcs, so there is a race condition and sometimes raylet crashes due to this. This PR filter out the node that's not available when select the node. ## Related issue number #19438	2021-10-22 11:22:07 -07:00
mwtian	530f2d7c5e	[Pubsub] Wrap Redis-based publisher in GCS to allow incrementally switching to the GCS-based publisher (#19600 ) ## Why are these changes needed? The most significant change of the PR is the `GcsPublisher` wrapper added to `src/ray/gcs/pubsub/gcs_pub_sub.h`. It forwards publishing to the underlying `GcsPubSub` (Redis-based) or `pubsub::Publisher` (GCS-based) depending on the migration status, so it allows incremental migration by channel. - Since it was decided that we want to use typed ID and messages for GCS-based publishing, each member function of `GcsPublisher` accepts a typed message. Most of the modified files are from migrating publishing logic in GCS to use `GcsPublisher` instead of `GcsPubSub`. Later on, `GcsPublisher` member functions will be migrated to use GCS-based publishing. This change should make no functionality difference. If this looks ok, a similar change would be made for subscribers in GCS client. ## Related issue number	2021-10-22 10:52:36 -07:00
Yi Cheng	a3dc07b1ee	[core] Fix some legacy issues (#19392 ) ## Why are these changes needed? There are some issues left from previous PRs. - Put the gcs_actor_scheduler_mock_test back - Add comment for named actor creation behavior - Fix the comment for some flags. ## Related issue number	2021-10-15 18:06:01 -07:00
Matti Picus	9ca34c7192	add dependencies to BUILD.bazel and update windows bazel to 4.2.1 (#19132 ) * add dependencies to BUILD.bazel and update windows bazel to 4.2.1 * fixes from review	2021-10-11 10:25:19 -07:00
Yi Cheng	056c3af699	[core] Update placement group retry implementation (#18842 ) * exp backoff * up * format * up * up * up * up * up * format * fix * up * format * adjust ordering * up * Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)" This reverts commit `2e99fb215f`. * up * update * format * up * format * fix * Revert "Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)"" This reverts commit 93425fdb986059e53699623a0fc8590c062e139b. * up * format * fix lint * up * up * up * up * check * add test1 * format * up * add test * up * up * up * fix * up * up * up * add test * format * up * up * fix lint * format * fix * format * fix * up	2021-10-04 21:31:56 -07:00
Yi Cheng	16cf719aff	[core] hot fix of build failure (#18963 )	2021-09-28 20:29:28 -07:00
Yi Cheng	e3dd1e3751	Revert "Revert "[test] add unit test for PR #17634 (#18585 )" (#18830 )" (#18871 ) * Revert "Revert "[test] add unit test for PR #17634 (#18585)" (#18830)" This reverts commit `8dd3057644`. * up	2021-09-28 05:53:52 -07:00
Jiajun Yao	e79f271b05	Fix nil redis array element (#18813 )	2021-09-24 20:11:43 -07:00
Yi Cheng	8dd3057644	Revert "[test] add unit test for PR #17634 (#18585 )" (#18830 ) This reverts commit `73c3cff18b`.	2021-09-22 16:51:02 -07:00
Yi Cheng	73c3cff18b	[test] add unit test for PR #17634 (#18585 )	2021-09-22 14:39:30 -07:00
Yi Cheng	07babd807c	Revert "Revert "[core] Async submitting actor registerring (#18009 )" (#18719 )" (#18722 )	2021-09-20 19:17:00 -07:00
Yi Cheng	cf64ab5b90	Revert "[core] Async submitting actor registerring (#18009 )" (#18719 ) This reverts commit `8ce01ea2cc`.	2021-09-17 13:34:12 -07:00
Yi Cheng	8ce01ea2cc	[core] Async submitting actor registerring (#18009 )	2021-09-17 10:03:35 -07:00
Simon Mo	317a34c523	[Serve] Use BackendConfig Protobuf (#17835 )	2021-09-16 11:08:23 -07:00
Stephanie Wang	be7cb70c30	[core] Fix ref counting during actor construction (#18646 ) * test * fix * cpp * skip windows Co-authored-by: Eric Liang <ekhliang@gmail.com>	2021-09-15 22:16:53 -07:00
Edward Oakes	7736cdd91d	[dashboard] Rename "new_dashboard" -> "dashboard" (#18214 )	2021-09-15 11:17:15 -05:00
Chong-Li	d314d0c10e	[GCS] Fix the Windows build of GCS actor scheduling (#18012 )	2021-09-10 17:17:25 -07:00

1 2 3 4 5 ...

327 commits