This reverts commit cf7305a, and unreverts #25896.
This was reverted due to a failing Windows test: #26287
We can merge once the failing Windows test (and all other relevant tests) pass.
Currently, the C++ worker doesn't support `ActorHandle` type parameters.
When we pass an `ActorHandle` object to a task, it triggers the following error:

The reason is that the caller just deserializes the actor handle but doesn't register it with the core worker, so when we call the actor's tasks, the handle cannot be found locally.
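For reference, here is the equivalent pattern in Python, where the deserialized handle is registered with the core worker and actor calls succeed (a minimal runnable sketch, not code from this PR):

```python
import ray

ray.init()

@ray.remote
class Counter:
    def __init__(self):
        self.n = 0

    def incr(self):
        self.n += 1
        return self.n

@ray.remote
def use_handle(counter):
    # The worker executing this task deserializes the handle and
    # registers it with its core worker, so calling actor tasks works.
    return ray.get(counter.incr.remote())

counter = Counter.remote()
print(ray.get(use_handle.remote(counter)))  # -> 1
```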
* Revert "Revert "Bump pytest from 5.4.3 to 7.0.1""
This reverts commit ab10890e90.
Signed-off-by: Riatre Foo <foo@riat.re>
* Fix missing test data files dependency in rllib/BUILD
See #26334 and #26517 for context.
Once this is in, it should be good to roll forward again.
Signed-off-by: Riatre Foo <foo@riat.re>
* debug: run all tests
Signed-off-by: Riatre Foo <foo@riat.re>
* Revert "debug: run all tests"
This reverts commit 0c5e796b0eb437d64922f66749c61b0412486970.
Signed-off-by: Riatre Foo <foo@riat.re>
* fix new tests since last rebase
Signed-off-by: Riatre Foo <foo@riat.re>
This PR adds a new experimental flag to the placement group API to prevent a placement group from taking all CPUs on each node. It is used internally by AIR to keep placement groups created by Tune from using all the CPU resources that are needed by Dataset.
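For illustration, a hedged sketch of what using such a flag could look like; the flag name `_max_cpu_fraction_per_node` and the exact signature are assumptions, not necessarily the final API:

```python
import ray
from ray.util.placement_group import placement_group

ray.init(num_cpus=4)

# Assumed experimental flag: cap the fraction of each node's CPUs that
# placement group bundles may reserve, leaving headroom for Dataset tasks.
pg = placement_group(
    [{"CPU": 2}],
    strategy="PACK",
    _max_cpu_fraction_per_node=0.8,
)
ray.get(pg.ready())
```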
See #23676 for context. This is another attempt at that as I figured out what's going wrong in `bazel test`. Supersedes #24828.
Now that there are Python 3.10 wheels for Ray 1.13 and this is no longer a blocker for supporting Python 3.10, I still want to make `bazel test //python/ray/tests/...` work for developing in a 3.10 env, and to make it easier to add Python 3.10 tests to CI in the future.
The change contains three commits with rather descriptive commit messages, which I repeat here:
Pass deps to py_test in py_test_module_list
The Bazel macro `py_test_module_list` takes a `deps` argument but completely ignores it instead of passing it to `native.py_test`. This fixes that, as we are going to use the `deps` of `py_test_module_list` in BUILD files in later changes.
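A minimal sketch of the fix, assuming the macro is shaped roughly like Ray's `py_test_module_list` (the details and defaults here are assumptions):

```python
# Sketch: forward deps instead of dropping it.
def py_test_module_list(files, size, deps = [], extra_srcs = [], **kwargs):
    for file in files:
        native.py_test(
            name = file.rsplit(".", 1)[0],
            size = size,
            srcs = extra_srcs + [file],
            deps = deps,  # previously ignored; now passed to py_test
            **kwargs
        )
```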
cpp/BUILD.bazel depends on the broken behaviour: it deps-on a cc_library from a py_test, which doesn't work; see the upstream issue https://github.com/bazelbuild/bazel/issues/701.
This is fixed by simply removing the (non-working) deps.
Depend on conftest and data files in Python tests BUILD files
Bazel requires that all files used in a test run be represented in the transitive dependencies specified for the test target. For py_test, this means srcs, deps, and data.
Bazel enforces this constraint by creating a "runfiles" directory, symlinking the files in the dependency closure into it, and running the test inside that directory, so the test cannot see files outside the dependency graph.
Unfortunately, the constraint is not enforced for a large number of Python tests, because pytest (>=3.9.0, <6.0) resolves these symbolic links during test collection and effectively "breaks out" of the runfiles tree.
pytest >= 6.0 introduced a breaking change that removed the symlink-resolving behaviour; see pytest pull request https://github.com/pytest-dev/pytest/pull/6523 for more context.
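To see the mechanism concretely, here is a small self-contained demonstration of the symlink resolution involved (paths are stand-ins, not an actual bazel layout):

```python
import os
import tempfile

# Older pytest (>=3.9, <6.0) calls realpath() during collection, which
# follows symlinks and thus escapes bazel's runfiles tree.
src_tree = tempfile.mkdtemp(prefix="srctree_")
runfiles = tempfile.mkdtemp(prefix="runfiles_")

real_test = os.path.join(src_tree, "test_x.py")
open(real_test, "w").close()
os.symlink(real_test, os.path.join(runfiles, "test_x.py"))

# Resolves back into the source tree, outside the runfiles sandbox.
print(os.path.realpath(os.path.join(runfiles, "test_x.py")))
```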
Currently we are under-specifying dependencies in a lot of BUILD files, which blocks us from upgrading to a newer pytest (needed for Python 3.10 support). This change hopefully fixes all of them, or at least those in CI, by adding data or source dependencies (mostly for conftest.py files) where needed.
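For example, a hypothetical BUILD entry showing the shape of the fix (the names, data files, and labels are made up for illustration):

```python
py_test(
    name = "test_foo",
    size = "small",
    srcs = ["test_foo.py", "conftest.py"],  # conftest.py now declared
    data = ["test_data/foo.json"],          # hypothetical data file
    deps = ["//:ray_lib"],                  # hypothetical dep label
)
```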
Bump pytest version from 5.4.3 to 7.0.1
We want at least pytest 6.2.5 for Python 3.10 support, but nothing past 7.1.0, since that release drops Python 3.6 support (which Ray still supports); thus the version constraint is set to <7.1.
Updating pytest, combined with the earlier BUILD fixes, changed the ground truth of a few error-message-based unit tests; these tests are updated to reflect the change.
There are also two small drive-by changes to make test_traceback and test_cli pass under Python 3.10. These were discovered while debugging CI failures (on earlier Pythons) with a local Python 3.10 install. Expect more such issues when adding Python 3.10 to CI.
Move the update logic for the session dir and log dir into `config_internal`, making it more concise and consistent with Python/Java.
Store the driver's log dir in `config_internal` so it can be used later.
* Avoid depending on `CoreWorkerProcess::GetCoreWorker()` in local mode.
* Fix bug in `LocalModeObjectStore::PutRaw`.
* Remove unused `TaskExecutor::Execute` method.
* Use `Process::Wait` instead of sleep when invoking `ray start` and `ray stop`.
This PR adds support for specifying an exception allowlist (`List[Exception]`) as the `retry_exceptions` argument, so that an application-level exception is retried only if it is in the allowlist.
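A sketch of the resulting usage (the exception class and task body are invented for illustration):

```python
import random

import ray

ray.init()

class TransientError(Exception):
    pass

# Only TransientError is retried; any other application-level
# exception fails the task immediately.
@ray.remote(max_retries=3, retry_exceptions=[TransientError])
def fetch():
    if random.random() < 0.5:
        raise TransientError("flaky dependency")
    return "ok"

print(ray.get(fetch.remote()))
```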
Simplify the isort filters and move them into the isort config file.
With this change, isort no longer applies to diffs except for files in whitelisted directories (isort only supports a blacklist, so we implement the whitelist in terms of it). This is much simpler than building our own whitelist logic, since our formatter runs multiple code paths depending on whether it is formatting a single file, a PR, or the entire repo in CI.
This PR adds precise reason details for worker failures. All information is available either via
- `ray list workers`
- exceptions from actor failures.
Here's an example where the actor is killed by SIGKILL (e.g., by the OOM killer):
```
RayActorError: The actor died unexpectedly before finishing this task.
class_name: G
actor_id: e818d2f0521a334daf03540701000000
pid: 61251
namespace: 674a49b2-5b9b-4fcc-b6e1-5a1d4b9400d2
ip: 127.0.0.1
The actor is dead because its worker process has died. Worker exit type: UNEXPECTED_SYSTEM_EXIT Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
```
## Design
Worker failures are reported by the raylet via two paths:
(1) When the core worker calls `Disconnect`.
(2) When the worker is killed unexpectedly, the socket is closed and the raylet reports the worker failure.
This PR ensures all worker failures are reported through `Disconnect`, which now carries more detailed information in its metadata.
## Exit types
Previously, the worker exit types were arbitrary and not correctly categorized. This PR reduces the number of worker exit types while including details for each exit type, so that users can easily figure out the root cause of worker crashes.
### Status quo
- SYSTEM ERROR EXIT
- Failure from the connection (core worker dead)
- Unexpected exception or exit with exit_code !=0 on core worker
- Direct call failure
- INTENDED EXIT
- Shutdown driver
- Exit_actor
- exit(0)
- Actor kill request
- Task cancel request
- UNUSED_RESOURCE_REMOVED
- Upon GCS restart, it kills bundles that are not registered to GCS to synchronize the state
- PG_REMOVED
- When a placement group is removed, its workers fate-share with it
- CREATION_TASK (INIT ERROR)
- When actor init has an error
- IDLE
- When the worker is idle and the number of workers exceeds the soft limit (by default, the number of CPUs)
- NODE DIED
- Can only be detected when the owner's node is dead (needs improvement)
### New proposal
Remove unnecessary states and add a "details" field. We can categorize failures into 4 types (see the enum sketch after this list):
- UNEXPECTED_SYSTEM_ERROR_EXIT
- When the worker crashes unexpectedly
- Failure from the connection (core worker dead)
- Unexpected exception or exit with exit_code !=0 on core worker
- Node died
- Direct call failure
- INTENDED_USER_EXIT
- When the worker is requested to be killed by users. No new workflow is required; we just record the state correctly.
- Shutdown driver
- Exit_actor
- exit(0)
- Actor kill request
- Task cancel request
- INTENDED_SYSTEM_EXIT
- When the worker is requested to be killed by the system (without an explicit user request)
- Unused resource removed
- Pg removed
- Idle
- ACTOR_INIT_FAILURE (CREATION_TASK_FAILED)
- When actor init fails, the process fate-shares with the actor.
- Actor init failed
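As a reading aid, here is the proposed taxonomy expressed as an enum; the names mirror the list above, not Ray's actual protobuf definitions:

```python
from enum import Enum

class WorkerExitType(Enum):
    UNEXPECTED_SYSTEM_ERROR_EXIT = 1  # crashes, connection failures, node death
    INTENDED_USER_EXIT = 2            # driver shutdown, exit_actor, kill/cancel
    INTENDED_SYSTEM_EXIT = 3          # unused resource removed, PG removed, idle
    ACTOR_INIT_FAILURE = 4            # creation task failed; process fate-shares
```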
## Limitation (Follow-up)
Worker failures are not reported under the following circumstances:
- The worker fails before registering its information with the GCS (this usually stems from a critical system bug and is extremely uncommon).
- The node fails. In this case, we should track the Node ID -> Worker ID mapping at the GCS and record worker metadata when the node fails.
I will create issues to track these problems.
This is the first PR toward removing the code path for multiple core workers in one process. It aims to remove the flags and APIs related to `num_workers`.
After this PR is checked in, we no longer need to consider multiple core workers.
The follow-up PRs will cover the deeper logic refactor, such as eliminating the gap between the core worker and the core worker process, and removing multiple-worker logic from the worker pool, the GCS, etc.
**BREAKING CHANGE**
This PR removes these APIs:
- Ray.wrapRunnable();
- Ray.wrapCallable();
- Ray.setAsyncContext();
- Ray.getAsyncContext();
And the following APIs are not allowed to be invoked from a user-created thread in local mode:
- Ray.getRuntimeContext().getCurrentActorId();
- Ray.getRuntimeContext().getCurrentTaskId();
Note that this PR shouldn't be merged to 1.x.
It's really annoying to deal with parameter/argument conflicts. It is even more frustrating when we merge code from the community into Ant's internal code base, with hundreds of conflicts caused by parameters/arguments.
In this PR, I updated the clang-format style so that parameters/arguments are placed on separate lines if they can't fit on a single line.
There are several benefits:
* Conflict resolving is easier.
* Fewer potential human mistakes when resolving conflicts.
* Git history and Git blame are more straightforward.
* Better readability.
* Align with the new Python format style.
When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope.
This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.
This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.
This is a re-merge for #21719 with a fix for removing the owned object ref if creation fails.
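A toy model (not Ray's code) of the ordering change; `backend` and `frontend` below stand in for the C++ core worker and the language frontend:

```python
counts = {}

def backend_put(obj_id):
    # After this change: the backend takes the initial local ref
    # atomically with creating the object, so the count never reads 0
    # while the ObjectRef is conceptually in scope.
    counts[obj_id] = counts.get(obj_id, 0) + 1
    return obj_id

def frontend_make_object_ref(obj_id, skip_initial_ref=True):
    # The frontend skips the increment the backend already performed;
    # previously it incremented here, leaving a window with count == 0.
    if not skip_initial_ref:
        counts[obj_id] = counts.get(obj_id, 0) + 1
    return obj_id

ref = frontend_make_object_ref(backend_put("obj1"))
assert counts["obj1"] == 1  # never observed at 0 while in scope
```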
Support specifying a default lifetime for actors whose lifetime is not specified at creation time. This is a job-level configuration item.
#### API Change
The Python API looks like:
```python
ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
```
The Java API looks like:
```java
System.setProperty("ray.job.default-actor-lifetime", defaultActorLifetime.name());
Ray.init();
```
One example usage is:
```python
ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
a1 = A.options(lifetime="non_detached").remote() # a1 is a non-detached actor.
a2 = A.remote() # a2 is a detached actor.
```
Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
The C++ API needs to be able to call Python and Java workers; this PR adds support for calling Python workers. Calling a Python worker is similar to calling a C++ worker: you need to pass `PyFunction`, `PyActorClass`, and `PyActorMethod`.
## Call a Python normal task
```python
# test_cross_language_invocation.py
import ray

@ray.remote
def py_return_input(v):
    return v
```
The C++ API calls the Python function:
```c++
auto py_obj1 = ray::Task(ray::PyFunction</*ReturnType*/ int>{
                   /*module_name=*/"test_cross_language_invocation",
                   /*function_name=*/"py_return_input"})
                   .Remote(42);
EXPECT_EQ(42, *py_obj1.Get());
```
The user needs to fill in the Python module name and function name, then pass the arguments to `Remote`. The user also needs to specify the return type and the argument types of the Python function; these are used for static type checking and for retrieving the result.
## Call a Python actor task
```python
# test_cross_language_invocation.py
@ray.remote
class Counter(object):
    def __init__(self, value):
        self.value = int(value)

    def increase(self, delta):
        self.value += int(delta)
        return str(self.value)
```
The C++ API calls the Python actor:
```c++
// Create a Python actor.
auto py_actor_handle =
    ray::Actor(ray::PyActorClass{/*module_name=*/"test_cross_language_invocation",
                                 /*class_name=*/"Counter"})
        .Remote(1);
EXPECT_TRUE(!py_actor_handle.ID().empty());
// Call a Python actor task.
auto py_actor_ret =
    py_actor_handle
        .Task(ray::PyActorMethod</*ReturnType*/ std::string>{
            /*actor_function_name=*/"increase"})
        .Remote(1);
EXPECT_EQ("2", *py_actor_ret.Get());
```
The user needs to fill in the Python module name and class name when creating a Python actor. `PyActorMethod` only needs the function name. This is similar to calling a C++ actor task and likewise has compile-time type checking.
Previously, reference arguments were handled incorrectly: the object ref itself was serialized instead of the `RayObject` being passed as the args buffer to the user function. The `CoreWorker` is the component responsible for ensuring that all object references are resolved and serialized into `RayObject`s by the time the `task_execution_callback` is invoked, not any component downstream of the callback.
This resulted in the following error for large objects, which are not inlined into `TaskArg::value` because they exceed 100KB:
```
C++ exception with description "Invalid: invalid arguments: std::bad_cast" thrown in the test body.
```
This was not caught earlier due to a lack of testing with large objects; such a test has now been added.
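A Python sketch of the kind of large-argument coverage added (the actual test lives in the C++ worker test suite; the 100KB inline threshold is taken from the description above):

```python
import numpy as np
import ray

ray.init()

@ray.remote
def echo_sum(arr):
    return int(arr.sum())

# An argument over 100KB is put in the object store rather than being
# inlined into TaskArg::value, exercising the ref-arg path fixed here.
big = np.zeros(200_000, dtype=np.uint8)  # ~200KB
assert ray.get(echo_sum.remote(big)) == 0
```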