hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 18:41:40 -05:00

Author	SHA1	Message	Date
Jiajun Yao	dfebf7ffae	Fix metric type for NumSpilledTasks to gauge (#23391 ) The metric type for NumSpilledTasks should be gauge since the sum already happens in SchedulerStats.	2022-03-22 16:17:00 -07:00
Guyang Song	69af9764b2	[runtime env] URI reference refactor (#22828) - Move the URI reference logic from raylet to agent. - Redefine the runtime env agent RPC to `CreateRuntimeEnvOrGet` and `DeleteRuntimeEnvIfPossible`. - More details: https://github.com/ray-project/ray/issues/21695#issuecomment-1032161528 Future work: - This PR does not remove `RuntimeEnvUris` from the `RuntimeEnv` protobuf because GCS also uses those URIs for GC via runtime_env_manager. This should be cleaned up as well. - Ray client server shouldn't interact with the agent directly, or it should also decrease the reference count. - Currently, `WorkerPool::HandleJobStarted` is called multiple times for one job, so this function must be idempotent. Can this logic be changed so that the function is called only once?	2022-03-21 11:21:15 -05:00
Larry	81dcf9ff35	[Placement Group] Make PlacementGroupID generate from JobID (#23175 )	2022-03-21 17:09:16 +08:00
ZhuSenlin	871f749baf	[GCS] [2 / n] Refactor gcs_resource_scheduler to cluster_resource_scheduler (#23323 ) * Add new interface to policy for batch scheduling and unify the scheduling result and context * Remove the dependence of GcsClient on ClusterResourceScheduler * fix compile error * fix lint error Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>	2022-03-20 15:03:14 -07:00
mwtian	909cdea3cd	[Python Worker] add feature flag to support forking from workers (#23260) Make sure Python dependencies can be imported on demand, without the background importer thread. Use cases: (1) if the pubsub notification for a new export is lost, importing can still be done; (2) the background importer thread can be left off without affecting Ray's functionality. Add a feature flag to support forking from Python workers by (1) enabling fork support in gRPC and (2) disabling the importer thread so that only the main thread remains in the Python worker. The importer thread will not run after forking anyway.	2022-03-18 14:47:18 -07:00
Jialing He	4a83bc3dc2	[runtime env] Support set timeout for runtime env setup (#23082) Interface example: ```python @ray.remote(runtime_env=RuntimeEnv(..., config=RuntimeEnvConfig(setup_timeout_s=10))) def f(): pass @ray.remote(runtime_env={..., "config": {"setup_timeout_s": 10}}) def f(): pass ``` Adds support for setting a timeout, in seconds, for runtime environment creation. Co-authored-by: 捕牛 <hejialing.hjl@antgroup.com>	2022-03-18 12:52:59 -05:00
ZhuSenlin	d3f92cca33	rename gcs_resource_scheduler to cluster_resource_scheduler (#23274 )	2022-03-18 13:19:33 +08:00
Tao Wang	b4bc8809dc	[Core][Tiny]Shorter thread name (#23222) On Linux, a thread name cannot be longer than 15 characters. In tools like top, similar thread names such as `resource_report_poller` and `resource_report_broadcaster` are easy to confuse because both show up as `resource_report`. This PR uses abbreviations to make the thread names shorter.	2022-03-18 09:58:32 +08:00
Chris K. W	6416c65505	Revert "Revert "[Client] chunked get requests (#22455 )"" (#23261 ) * revert revertchunkedgets * exit early if all chunks received, tighter exception handler for stream in proxy	2022-03-17 16:24:30 -07:00
ZhuSenlin	125ef0e5a6	[GCS] integrate cluster_resource_manager into gcs_resource_manager and gcs_resource_scheduler (#23105 ) * refactor gcs_resource_manager * fix lint error * fix lint error * fix compile error * fix test * fix test * fix test * add unit test * refactor UpdateNodeNormalTaskResources * fix comment Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>	2022-03-16 16:27:14 -07:00
Tao Wang	4614536572	Migrating to flat hash map [core worker&object manager] (#23126) Follow-up to #22932. This PR replaces unordered_map with flat_hash_map in the core worker and object manager modules. Some interfaces, like GetAllReferenceCounts, that are exposed to users in Java/Python are excluded because they are a little more complicated; they will be handled together with placement groups. Follow-up PRs will migrate reference counting, placement groups, and other code.	2022-03-15 22:16:28 -07:00
Qing Wang	149d06442b	[Core][Java][Remove JVM FullGC 3/N] Disable every 10min FullGC. (#21443) This PR disables the FullGC that ran every 10 minutes in the Java worker without being triggered by a global GC event. Specifically, it adds a `triggered_by_global_gc` flag that indicates whether a GC event was triggered by a global GC event. If it was, FullGC still runs. Co-authored-by: Qing Wang <jovany.wq@antgroup.com>	2022-03-16 11:18:12 +08:00
qicosmos	d8de5a445a	[C++ Worker]Python call cpp actor (#23061) The [last PR](https://github.com/ray-project/ray/pull/22820) added support for calling C++ normal tasks from Python; this PR adds support for calling C++ actor tasks from Python.	2022-03-15 19:54:10 -07:00
Qing Wang	f51cb09e02	[Core][Java][Remove JVM FullGC 2/N] Make JVM be aware of in-memory store pressure. (#21441 )	2022-03-15 19:25:27 +08:00
Guyang Song	f65971756d	[dashboard agent] Catch agent port conflict (#23024 )	2022-03-15 16:09:15 +08:00
Chen Shen	5a2ebc281c	[Scheduler] separate scheduler code to its own build target (#23124 ) * wip * comments * fix build * fix-test * fix format	2022-03-14 23:23:58 -07:00
Kai Yang	35c7275bfc	[Object Spilling] Handle IO worker failures correctly (#20752 ) Currently, when a spill/restore worker fails and the state of it in the worker pool is idle, the worker pool will not clean up the metadata of the worker. Subsequent spill/restore requests will reuse this dead worker and RPC requests cannot succeed. This results in broken object spilling functionality. This PR addresses the issue by removing disconnected IO workers from `registered_io_workers` and `idle_io_workers`.	2022-03-15 12:14:14 +08:00
Jialing He	39a6c054d3	[runtime env][feature] introduce pip_check_enable and pip_version (#22826 )	2022-03-14 23:41:19 +08:00
Kai Yang	e9755d87a6	[Lint] One parameter/argument per line for C++ code (#22725) It's really annoying to deal with parameter/argument conflicts. This is even more frustrating when we merge code from the community into Ant's internal code base, with hundreds of conflicts caused by parameters/arguments. In this PR, I updated the clang-format style so that parameters/arguments are placed on separate lines if they can't fit on a single line. There are several benefits: * Conflict resolution is easier. * Fewer potential human mistakes when resolving conflicts. * Git history and Git blame are more straightforward. * Better readability. * Alignment with the new Python format style.	2022-03-13 17:05:44 +08:00
Chong-Li	f7e1343d39	[GCS] Fix the normal task resources at GCS (#22857 ) * Fix the normal task resources at GCS * Fix comments * Leave a TODO * Bring back a UT * consider object memory * Fix Co-authored-by: Chong-Li <lc300133@antgroup.com>	2022-03-11 21:54:03 -08:00
jon-chuang	0b54d9c780	[GCS] Non-STRICT_PACK PGs should be sorted by resource priority, size (#22762) Previously, placement groups had suboptimal bin-packing, resulting in unexpected placement group stalls for users. The root cause was that PG bundles were not sorted by resource priority and size. This PR implements a naive priority mechanism for bundles in the GCS resource scheduler that can be improved upon (and even made user-configurable in the future). The behaviour is to schedule "GPU" first, custom resources in int64_t order next, and finally memory and then "CPU" last.	2022-03-11 21:47:07 -08:00
Jialing He	0cbbb8c1d0	[runtime env][core] Use Proto message `RuntimeEnvInfo` between user code and core_worker (#22856 )	2022-03-11 22:14:18 +08:00
Tao Wang	10c03cb126	Migrating to flat hash map [GCS&util&common] (#22932) Follow-up to #19220. This PR replaces unordered_map with flat_hash_map in most GCS code and some util and common modules. The placement group part, which exposes user interfaces in Java/Python, is excluded because it is a little more complicated. Follow-up PRs will migrate the core worker, placement groups, and other code.	2022-03-11 18:35:06 +09:00
Yi Cheng	ec88eb7d1d	[4][resource reporting] Remove ray syncer from gcs_resource_manager (#22832 ) This PR is part of resource reporting refactoring. In this PR ray syncer is moved from gcs_resource_manager to gcs_placement_group_scheduler. With this one, gcs_resource_manager is totally decoupled from resource broadcasting.	2022-03-11 01:15:25 -08:00
Chen Shen	3ebc4ae289	fix comments and typo (#23008 ) Fix comments and typos for scheduler code.	2022-03-10 11:40:31 -08:00
Yi Cheng	9f275c9bb8	[3][resource reporting] Use GCS to report the placement group creation information instead of reporting by raylet (#22597 )	2022-03-10 11:08:21 -08:00
qicosmos	e4a9517739	[C++ Worker]Python call cpp worker (#22820 )	2022-03-10 11:06:14 -08:00
ZhuSenlin	a15890be58	[GCS] refactor the resource related data structures on the GCS (#22924 ) * refactor resource data structure in gcs * fix comment * fix lint error * fix * DISABLED_TestRejectedRequestWorkerLeaseReply as it depends on the update of normal task Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>	2022-03-09 08:22:02 -08:00
Chen Shen	bc3f7a7684	[scheduling policy 3/n][rfc] Refactor SchedulingPolicy into interface and implementations (#22907 ) * scheduling policy * update Co-authored-by: Gagandeep Singh <gdp.1807@gmail.com>	2022-03-08 18:47:56 -08:00
Chen Shen	cd0354e06d	[scheduling-policy 2/n] refactor scheduling policy API (#22885 ) * add scheduling-options * address comments	2022-03-08 09:29:00 -08:00
ZhuSenlin	1e4d7bc1f4	[Core] make StringIdMap thread safe (#22893 ) * make StringIdMap thread safe * fix comment Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>	2022-03-08 09:23:41 -08:00
Tao Wang	4576f53fe3	[HOTFIX]fix some compilation failures in core worker test (#22855) There are some compilation failures in the core worker test when building the project with `bazel build //:all`. The test seems to be broken and not integrated into CI.	2022-03-08 16:14:14 +08:00
mwtian	3f4a59c506	[Core] clean up pubsub to prepare for refactor (#22819 ) To prepare for additional changes in pubsub to fix #22339 and #22340, - Use structs instead of std::pair to hold per-subscription data, in case we need to expand the data fields. - Rename variables in tests to indicate non-object pubsub testing. - Pass full request to long poll handler in Publisher. - Simplify logic when possible. There should be no behavior change. Most of the code changes are based on #20276	2022-03-07 17:21:04 -08:00
Chen Shen	fbdf3e96f2	[scheduling-policy 1/n] pass check-node-liveness by constructor #22880	2022-03-07 16:55:29 -08:00
Jiajun Yao	2302b4eea8	Stop and join actor asyncio threads during exit (#22810 )	2022-03-07 14:45:08 -08:00
Stephanie Wang	cb218d03b9	[core] Enable lineage reconstruction by default (#22816 ) Enables lineage reconstruction, which allows automatic recovery of task outputs, by default. Also adds an info message to the driver whenever objects need to be reconstructed (not including recursive reconstruction).	2022-03-07 17:40:30 -05:00
SangBin Cho	79e8405fda	Revert "[GCS] refactor the resource related data structures on the GCS (#22817 )" (#22863 ) This reverts commit `549466a42f`.	2022-03-07 08:48:17 -08:00
ZhuSenlin	549466a42f	[GCS] refactor the resource related data structures on the GCS (#22817 )	2022-03-07 18:43:33 +08:00
Yi Cheng	5bbbfac5e8	[gcs] Fix resource updating incorrectly (#22644) When the local raylet has no task being scheduled for a given scheduling class, the backlog resource is not reported. This usually happens when the core worker tries to schedule the task on another node and reports the backlog to the local node. This leads to incorrect demands.	2022-03-04 14:32:54 -08:00
mwtian	55166f0780	Revert "Revert "Disable scheduler_report_pinned_bytes_only (#22132 )" (#22786 )" (#22808 ) This reverts commit `b98c9c77f1`.	2022-03-03 12:32:28 -08:00
Chen Shen	f0ba0a3d3d	[LocalResourceManager] unify (Add/Subtract)(CPU/GPU)ResourceInstances (#22777 ) * add * more	2022-03-03 09:15:49 -08:00
mwtian	b98c9c77f1	Revert "Disable scheduler_report_pinned_bytes_only (#22132 )" (#22786 ) This reverts commit `88d2e21585`.	2022-03-02 18:29:31 -08:00
Chen Shen	e8c823791b	[scheduling-ids] enforce thread-private #22775	2022-03-02 16:27:49 -08:00
mwtian	02d09da7b4	[Core] remove verbose logs (#22785 ) IIUC, these log statements added in #22612 do not seem intended.	2022-03-02 16:00:26 -08:00
Chen Shen	3e3db8e9cd	[scheduler] hide StringIDMap under BaseSchedulingID (#22722 ) * add * address comments	2022-03-01 22:50:53 -08:00
Yi Cheng	271ed44143	[2][resource reporting] Encapsulate poller and broadcaster into syncer in gcs (#22464) This PR moves the poller and broadcaster from the GCS server to the ray syncer. TODO in the next PR: deprecate the code path of placement group resource reporting and move the broadcaster out of the GCS cluster resource manager.	2022-03-01 21:51:14 -08:00
Eric Liang	06d4444b4a	Never re-use task workers for actors or GPU tasks (#22482 ) Don't re-use task workers for actors, since those workers may own objects that will be lost on actor exit. This adds a slight performance penalty for actor startup.	2022-03-01 16:46:18 -08:00
Eric Liang	1a170f7234	[RFC] Disable actor queueing warning for concurrent actors (#22720 ) The warning was not implemented properly for out of order actors. Disable it for now.	2022-03-01 14:28:19 -08:00
Archit Kulkarni	127b69bc21	[runtime env] Fix protobuf serialization/deserialization (#22672 ) This PR fixes some minor bugs in `to_dict` and `from_dict` for the runtime env protobuf and adds a test to cover this codepath. The test checks that `to_dict` and `from_dict` are inverses. This PR contains all fixes required to make the test pass.	2022-03-01 12:34:50 -06:00
Eric Liang	482b0117e8	Basic log observability for spilling (#22612 )	2022-03-01 09:40:51 -08:00
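
Note on b4bc8809dc (shorter thread names): the collision described in that commit message can be reproduced with a small sketch. It assumes the name is cut to its first 15 bytes, as the message describes. The helper names `visible_thread_name` and `colliding_names` are hypothetical and are not part of Ray.

```python
# Assumption: only the first 15 bytes of a thread name are visible in tools
# such as top, matching the limit quoted in the commit message.
LINUX_THREAD_NAME_MAX = 15


def visible_thread_name(name: str) -> str:
    """Return the portion of a thread name that fits in the 15-byte limit."""
    return name.encode()[:LINUX_THREAD_NAME_MAX].decode(errors="ignore")


def colliding_names(names):
    """Group full names by their truncated form and keep only the collisions."""
    groups = {}
    for name in names:
        groups.setdefault(visible_thread_name(name), []).append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


if __name__ == "__main__":
    # The two thread names quoted in the commit message.
    names = ["resource_report_poller", "resource_report_broadcaster"]
    print(colliding_names(names))
```

Both names share the same first 15 characters, so they fall into one group. A check like this could be run over any list of candidate thread names before they are adopted.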

2710 commits