Support the ability to specify a default lifetime for actors whose lifetime is not specified at creation time. This is a job-level configuration item.
#### API Change
The Python API looks like:
```python
ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
```
The Java API looks like:
```java
System.setProperty("ray.job.default-actor-lifetime", defaultActorLifetime.name());
Ray.init();
```
One example usage is:
```python
ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
a1 = A.options(lifetime="non_detached").remote() # a1 is a non-detached actor.
a2 = A.remote() # a2 is a detached actor because the job default lifetime is "detached".
```
Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
The C++ API needs to be able to call Python and Java workers; this PR adds support for calling Python workers. Calling a Python worker is similar to calling a C++ worker: the user passes PyFunction, PyActorClass, and PyActorMethod.
## Calling a Python normal task
```python
#test_cross_language_invocation.py
import ray
@ray.remote
def py_return_input(v):
    return v
```
Calling the Python function from the C++ API:
```c++
auto py_obj1 = ray::Task(ray::PyFunction</*ReturnType*/int>{/*module_name=*/"test_cross_language_invocation",
/*function_name=*/"py_return_input"})
.Remote(42);
EXPECT_EQ(42, *py_obj1.Get());
```
The user needs to fill in the Python module name and function name, then pass the arguments to `Remote`.
The user also needs to specify the return type and argument types of the Python function; these are used for static type checking and to retrieve the result.
## Calling a Python actor task
```python
#test_cross_language_invocation.py
@ray.remote
class Counter(object):
    def __init__(self, value):
        self.value = int(value)

    def increase(self, delta):
        self.value += int(delta)
        return str(self.value)
```
Calling the Python actor from the C++ API:
```c++
// Create python actor
auto py_actor_handle =
ray::Actor(ray::PyActorClass{/*module_name=*/"test_cross_language_invocation", /*class_name=*/"Counter"})
.Remote(1);
EXPECT_TRUE(!py_actor_handle.ID().empty());
// Call python actor task
auto py_actor_ret =
py_actor_handle.Task(ray::PyActorMethod</*ReturnType*/std::string>{/*actor_function_name=*/"increase"}).Remote(1);
EXPECT_EQ("2", *py_actor_ret.Get());
```
The user needs to fill in the Python module name and class name when creating the Python actor.
`PyActorMethod` only needs the function name to be filled in.
This is similar to calling a C++ actor task and likewise has compile-time type checking.
Previously, ObjectRef arguments were handled incorrectly: the object ref itself was serialized instead of the resolved RayObject being passed as the argument buffer to the user function.
CoreWorker is the component responsible for ensuring that all ObjectReferences are resolved and serialized into `RayObject`s by the time the `task_execution_callback` is invoked, not any component downstream of the callback.
This resulted in the following error for large objects, which are not inlined into `TaskArg::value` because they are over 100KB.
```
C++ exception with description "Invalid: invalid arguments: std::bad_cast" thrown in the test body.
```
This was not caught due to a lack of testing for large objects; such a test has now been added.
This PR moves the internal KV namespace logic into C++ to reduce the amount of logic in Python, for the following reasons:
- Internal KV is used cross-language, so it has to live in C++ so that all languages can benefit.
- For https://github.com/ray-project/ray/issues/8822 we need to delete resources in GCS when a job finishes.
One extra field is also added to the Del request so that we can delete by prefix instead of only by an exact key.
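A rough sketch of what the prefix delete enables, using the Python internal KV wrapper. The exact wrapper signature, in particular the `del_by_prefix` keyword, is an assumption here and may differ from the final API:
```python
import ray
from ray.experimental import internal_kv

ray.init()

# Two keys sharing a common prefix in the internal KV store.
internal_kv._internal_kv_put(b"my_prefix:config", b"v1")
internal_kv._internal_kv_put(b"my_prefix:state", b"v2")

# With the extra field on the Del request, one call removes every key
# starting with the given prefix instead of a single exact key.
internal_kv._internal_kv_del(b"my_prefix:", del_by_prefix=True)
```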
With some compilers, the static Ray runtime in the runtime holder may be a new, uninitialized instance inside a dynamic library,
so we need to initialize the runtime holder in the dynamic library to make sure the new instance is valid.
Previously, GcsClient only accepted a Redis address. To make it work without Redis, we need to be able to pass the GCS address to the GCS client as well.
In this PR, we add the GCS-related information to GcsClientOptions so that we can connect to the GCS directly with a GCS address.
This PR is part of GCS bootstrap. In a follow-up PR, we'll add functionality to set the correct GcsClientOptions based on flags.
Resubmit the PR https://github.com/ray-project/ray/pull/19936
I've figured out that the test case `//rllib:tests/test_gpus::test_gpus_in_local_mode` failed due to a deadlock in local mode.
In local mode, if the user code submits another task during the execution of the current task, `CoreWorker::actor_task_mutex_` may cause a deadlock.
The solution is quite simple: release the lock before executing the task in local mode.
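Roughly the pattern that triggered the deadlock, as a minimal sketch; the actual regression test is `test_local_mode_deadlock`, and the actor/task names below are illustrative:
```python
import ray

ray.init(local_mode=True)

@ray.remote
def nested_task():
    return 1

@ray.remote
class Submitter:
    def run(self):
        # Submitting and waiting on another task while this actor task is
        # still executing used to deadlock in local mode, because the actor
        # task mutex was held across the nested submission.
        return ray.get(nested_task.remote())

s = Submitter.remote()
assert ray.get(s.run.remote()) == 1
```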
In the commit 7c2f61c76c:
1. Release the lock in local mode to fix the bug. @scv119
2. `test_local_mode_deadlock` added to cover the case. @rkooo567
3. Left a trivial change in `rllib/tests/test_gpus.py` to make `RAY_CI_RLLIB_DIRECTLY_AFFECTED` take effect.
A worker can crash right after putting its return values into the object store. Then, the owner will receive the worker crashed error, but the return objects will still be in the remote object store. Later, if the task is retried, the worker will crash on [this line](https://github.com/ray-project/ray/blob/master/src/ray/core_worker/transport/direct_actor_transport.cc#L105) because the object already exists.
Another way this can happen is if a task has multiple return values, and one of those return values is transferred to another node. If the task is later re-executed on that node, the task will fail because of the same error.
This PR fixes the crash so that:
1. If an object already exists, we try to pin that copy. Ideally, we should destroy the old copy and create the new one to make sure that metadata like the owner address is in sync, but this is pretty complicated to do right now.
2. If the pinning fails, we store an OBJECT_LOST error to throw to the application.
3. On the raylet, we check whether we already have the object pinned, and only subscribe to the owner's eviction message if the object is not pinned.
4. Also fixes bugs in the analogous case for `ray.put` (previously this would hang, now the application will receive an error if a `ray.put` object already exists).
This PR adds a per-task/actor scheduling strategy; currently the only strategies are PlacementGroupSchedulingStrategy and DefaultSchedulingStrategy.
Going forward, people should use `scheduling_strategy=PlacementGroupSchedulingStrategy(...)` to specify the placement group for an actor/task. The old way will be deprecated.
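A minimal sketch of the new-style usage in Python, assuming the strategy class is exposed as `ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy`:
```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(num_cpus=2)

# Reserve a bundle and wait until it is ready.
pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())

@ray.remote(num_cpus=1)
def f():
    return "scheduled inside the placement group"

# New style: pass the placement group through scheduling_strategy
# instead of the old placement_group= option.
ref = f.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
).remote()
print(ray.get(ref))
```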
## Why are these changes needed?
This is a part of Redis removal. This PR removes the Redis KV usage in the function table.
The rpush-related code is not updated in this PR.
## Related issue number
This PR removes global named actors and global PGs.
I believe these APIs are not widely used in OSS.
CPP part is not included in this PR.
@kfstorm @clay4444 @raulchen Please take a look if this change is reasonable.
IMPORTANT NOTE: This is a Java API change and will introduce backward incompatibility in Java global named actor and global PG usage.
This PR includes:
- Remove `setGlobalName()` and `getGlobalActor()` APIs.
- Remove `getGlobalPlacementGroup()` and `setGlobalPG`.
- Add `getActor(name, namespace)` API.
- Add `getPlacementGroup(name, namespace)` API.
- Update doc pages.
## Why are these changes needed?
This is part of the Redis removal project. In this PR, all direct usage of Redis is removed except for the function table.
The function table will be migrated in the next PR.
## Related issue number
#19443
## Why are these changes needed?
## Related issue number
Final part of #13984
## Checks
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [x] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
## Why are these changes needed?
Currently, `WorkerContext::GetCurrentTaskID()` returns a random task ID in user-created threads, and the returned task ID doesn't include the job ID. In this case, subsequent non-actor tasks, their return values, and objects created by `ray.put()` don't include the job ID either. This makes it hard to find the correct job ID from a task or object ID.
This PR updates the task ID generation code to always encode the job ID.
A side effect of this PR is a change in the probability of task ID collisions in user-created threads, because the job ID part is now fixed. Without this PR: `sqrt(pi * 256 ^ 12 / 2)` ~= 352 trillion tasks before a likely collision. With this PR: `sqrt(pi * 256 ^ 8 / 2)` ~= 5 billion tasks. This should be OK because the job ID part of task IDs in non-user-created threads is always fixed, so user-created threads won't be worse off than non-user-created threads.
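For reference, these estimates follow from the standard birthday-bound approximation `sqrt(pi * N / 2)`, where N is the number of possible random task IDs (12 random bytes without this PR, 8 with it); a quick check:
```python
import math

# Expected number of IDs before a likely collision, by the birthday bound.
def collision_threshold(random_bytes):
    n = 256 ** random_bytes
    return math.sqrt(math.pi * n / 2)

print(f"{collision_threshold(12):.3e}")  # ~3.5e14 (~352 trillion) w/o this PR
print(f"{collision_threshold(8):.3e}")   # ~5.4e9  (~5 billion)    w/ this PR
```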
## Related issue number
## Checks
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
## Why are these changes needed?
This PR ports the concurrency groups functionality to asyncio actors in Python.
### API
```python
import ray

@ray.remote(concurrency_groups={"io": 2, "compute": 4})
class AsyncActor:
    def __init__(self):
        pass

    @ray.method(concurrency_group="io")
    async def f1(self):
        pass

    @ray.method(concurrency_group="io")
    def f2(self):
        pass

    @ray.method(concurrency_group="compute")
    def f3(self):
        pass

    @ray.method(concurrency_group="compute")
    def f4(self):
        pass

    def f5(self):
        pass
```
The decorator on the actor class `AsyncActor` declares that this actor has two concurrency groups with the given maximum concurrencies, plus a default concurrency group. Every concurrency group has its own async event loop and Python thread to execute the methods defined in it.
Method `f1` will be invoked in the `io` concurrency group, `f2` in `io`, `f3` in `compute`, and so on.
Note that `f5` and `__init__` will be invoked in the default concurrency group.
In the following call, `f2` will be invoked in the `compute` concurrency group instead, since specifying the group at call time has higher priority:
```python
a.f2.options(concurrency_group="compute").remote()
```
### Implementation
The implementation details are straightforward:
- Before, an asyncio actor had one event loop bound to one Python thread. Now we create one event loop bound to one Python thread for every concurrency group of the asyncio actor.
- Before, we had one fiber state per caller in the asyncio actor. Now we create a FiberStateManager per caller, and the FiberStateManager manages the fiber states for the concurrency groups.
## Related issue number
#16047
* Update build rules and patches for darwin_arm64 platform.
Changes include:
Update the nelhage/rules_boost package from the current version (08/5/2020) to the 5/27/2021 version.
Remove rules_boost-undefine-boost_fallthrough.patch, since BOOST_FALLTHROUGH seems to be defined now.
Minor changes to rules_boost-windows-linkopts.patch to use the default condition to add the -lpthread flag on all platforms.
Add darwin_arm64 config to BUILD files for the civetweb library pulled in via the prometheus dependency.
* Upgrade boost to 1.74.0 from 1.71.0 to match the updated build file for windows.
* Fix ray_cpp_pkg
* Use boost/bind/bind.hpp
boost/bind.hpp and global namespace placeholders are deprecated.
* lint
* Use absl::bind_front when possible. Otherwise, NOLINT
* lint
* lint
* lint
* lint
* more lint
* final lint
* trigger build