178 commits

Author SHA1 Message Date
Tao Wang
b6fe6156f5
[C++ worker]Support ActorHandle type return value (#28077)
We previously added support for `ActorHandle` as a parameter type; this PR adds support for `ActorHandle` as a return type.
2022-08-25 10:05:05 +08:00
Clark Zinzow
293452dcba
[Core] Unrevert "Add retry exception allowlist for user-defined filtering of retryable application-level errors." (#26449)
This reverts commit cf7305a, and unreverts #25896.

This was reverted due to a failing Windows test: #26287

We can merge once the failing Windows test (and all other relevant tests) pass.
2022-08-05 16:07:13 -07:00
Jiajun Yao
d7dcb1f938
Replace boost::filesystem with std::filesystem (#27522)
This redoes #27319

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-04 21:33:51 -07:00
SangBin Cho
afd6597056
Revert "Replace boost::filesystem with std::filesystem (#27338)" (#27483)
This reverts commit c50faa126c.
2022-08-04 02:18:59 -07:00
Tao Wang
d4a1cebaa3
[C++ worker]Support ActorHandle type parameter (#27364)
The C++ worker currently does not support `ActorHandle` as a parameter type.
Passing an `ActorHandle` object to a task produces this error:
![image](https://user-images.githubusercontent.com/5276001/182349872-a616ff55-6a2b-454d-9831-18877b56c228.png)
The cause is that the caller only deserializes the actor handle and does not register it with the core worker, so when we call the actor's tasks, the actor cannot be found locally.
2022-08-04 16:39:52 +08:00
Jiajun Yao
c50faa126c
Replace boost::filesystem with std::filesystem (#27338)
Redo #27319

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-01 17:12:23 -07:00
Jiajun Yao
36d5e5f99d
Revert "Replace boost::filesystem with std::filesystem (#27319)" (#27337)
This reverts commit 8e5c51d7d7.
2022-08-01 13:46:45 -07:00
Jiajun Yao
8e5c51d7d7
Replace boost::filesystem with std::filesystem (#27319)
std::filesystem ships with C++17, so there is no need to depend on Boost for this.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-01 11:44:39 -07:00
Guyang Song
0b60d90283
[Hotfix] Fix the failure of C++ tests (#27249)
Signed-off-by: 久龙 <guyang.sgy@antfin.com>
2022-07-30 00:31:02 +08:00
Guyang Song
06b0e715c7
[runtime env] plugin refactor [7/n]: support runtime env in C++ API (#27010)
Signed-off-by: 久龙 <guyang.sgy@antfin.com>
2022-07-27 18:24:31 +08:00
Tao Wang
4f2747f12a
[Core][C++ worker] Add GetNamespace api (#26509) 2022-07-20 11:17:14 +08:00
Riatre
591cd22be7
Revert "Revert "Bump pytest from 5.4.3 to 7.0.1"" (#26525)
* Revert "Revert "Bump pytest from 5.4.3 to 7.0.1""

This reverts commit ab10890e90.

Signed-off-by: Riatre Foo <foo@riat.re>

* Fix missing test data files dependency in rllib/BUILD

See # 26334 and # 26517 for context.

Once this is in, it should be good to roll forward again.

Signed-off-by: Riatre Foo <foo@riat.re>

* debug: run all tests

Signed-off-by: Riatre Foo <foo@riat.re>

* Revert "debug: run all tests"

This reverts commit 0c5e796b0eb437d64922f66749c61b0412486970.

Signed-off-by: Riatre Foo <foo@riat.re>

* fix new tests since last rebase

Signed-off-by: Riatre Foo <foo@riat.re>
2022-07-18 21:21:19 -07:00
SangBin Cho
0f0102666a
[Core] Support max cpu allocation per node for placement group scheduling (#26397)
This PR adds a new experimental flag to the placement group API that prevents a placement group from taking all CPUs on each node. It is used internally by AIR so that placement groups created by Tune do not consume all the CPU resources that datasets need.
2022-07-16 01:47:30 -07:00
Sven Mika
ab10890e90
Revert "Bump pytest from 5.4.3 to 7.0.1" (breaks lots of RLlib tests for unknown reasons) (#26517) 2022-07-13 11:19:30 -07:00
Riatre
2cdb76789e
Bump pytest from 5.4.3 to 7.0.1 (#26334)
See #23676 for context. This is another attempt at that as I figured out what's going wrong in `bazel test`. Supersedes #24828.

Now that there are Python 3.10 wheels for Ray 1.13 and this is no longer a blocker for supporting Python 3.10, I still want to make `bazel test //python/ray/tests/...` work for developing in a 3.10 env, and make it easier to add Python 3.10 tests to CI in future.

The change contains three commits with rather descriptive commit messages, which I repeat here:

Pass deps to py_test in py_test_module_list

    Bazel macro py_test_module_list takes a `deps` argument, but completely
    ignores it instead of passing it to `native.py_test`. Fixing that as we
    are going to use deps of py_test_module_list in BUILD in later changes.

    cpp/BUILD.bazel depends on the broken behaviour: it deps-on a cc_library
    from a py_test, which isn't working, see upstream issue:
    https://github.com/bazelbuild/bazel/issues/701.
    This is fixed by simply removing the (non-working) deps.

Depend on conftest and data files in Python tests BUILD files

    Bazel requires that all the files used in a test run should be
    represented in the transitive dependencies specified for the test
    target. For py_test, it means srcs, deps and data.

    Bazel enforces this constraint by creating a "runfiles" directory,
    symlinking the files in the dependency closure into it, and running the
    test in the "runfiles" directory, so that the test should not see files
    that are not in the dependency graph.

    Unfortunately, the constraint is not enforced for a large number of
    Python tests, because pytest (>=3.9.0, <6.0) resolves these symbolic
    links during test collection and effectively "breaks out" of the
    runfiles tree.

    pytest >= 6.0 introduced a breaking change that removed the symbolic
    link resolving behaviour, see pytest pull request
    https://github.com/pytest-dev/pytest/pull/6523 for more context.

    Currently, we are underspecifying dependencies in a lot of BUILD files,
    which blocks us from updating to a newer pytest (for Python 3.10
    support). This change hopefully fixes all of them, and at least those in
    CI, by adding data or source dependencies (mostly for conftest.py-s)
    where needed.
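
The "breaking out" effect described above can be reproduced without Bazel or pytest. The following is a minimal sketch, assuming a POSIX-like system where symlinks can be created; the directory layout and the helper names `build_trees`, `conftest_visible`, and `demo` are hypothetical and only mimic the idea of a runfiles tree. It is not how pytest or Bazel are implemented.

```python
import os
import tempfile
from typing import Tuple


def build_trees(root: str) -> str:
    """Create a source tree and a runfiles-like tree that links one file.

    Returns the path of the symlinked test file inside the runfiles tree.
    """
    src = os.path.join(root, "src")
    runfiles = os.path.join(root, "runfiles")
    os.makedirs(src)
    os.makedirs(runfiles)
    # conftest.py exists in the source tree but is NOT a declared
    # dependency, so it is never linked into the runfiles tree.
    for name in ("test_foo.py", "conftest.py"):
        with open(os.path.join(src, name), "w") as f:
            f.write("# placeholder\n")
    link = os.path.join(runfiles, "test_foo.py")
    os.symlink(os.path.join(src, "test_foo.py"), link)
    return link


def conftest_visible(test_path: str, resolve_symlinks: bool) -> bool:
    """Report whether a conftest.py sits next to the (possibly resolved) test file."""
    if resolve_symlinks:
        test_path = os.path.realpath(test_path)
    directory = os.path.dirname(test_path)
    return os.path.exists(os.path.join(directory, "conftest.py"))


def demo() -> Tuple[bool, bool]:
    """Return (visible when resolving symlinks, visible when not resolving)."""
    with tempfile.TemporaryDirectory() as root:
        link = build_trees(root)
        return (
            conftest_visible(link, resolve_symlinks=True),
            conftest_visible(link, resolve_symlinks=False),
        )
```

When the symlink is resolved, the lookup lands in the source tree and finds the undeclared `conftest.py`; when it is not resolved, the lookup stays inside the runfiles-like tree and the file is missing, which is why the missing dependencies only surface once symlinks stop being resolved.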

Bump pytest version from 5.4.3 to 7.0.1

    We want at least pytest 6.2.5 for Python 3.10 support, but not past
    7.1.0 since it drops Python 3.6 support (which Ray still supports), thus
    the version constraint is set to <7.1.

    Updating pytest, combined with the earlier BUILD fixes, changed the
    ground truth of a few error-message-based unit tests; these tests are
    updated to reflect the change.

    There are also two small drive-by changes for making test_traceback and
    test_cli pass under Python 3.10. These were discovered while debugging CI
    failures (on earlier Python) with a local Python 3.10 install. Expect
    more such issues when adding Python 3.10 to CI.
2022-07-12 21:14:35 -07:00
Tao Wang
f4a602a290
[core][c++ worker]store log dir of driver in internal config (#26354)
Move the update logic for the session dir and log dir into config_internal, making it more compact and consistent with Python/Java.
Store the driver's log dir in config_internal so it can be used later.
2022-07-12 18:44:04 +08:00
Tao Wang
bb6c805bd7
[Java worker][Cpp worker]Support Java call Cpp Task (#26182) 2022-07-12 17:49:22 +08:00
Larry
009c65ecb8
fix cpp hide symbols cause ut failure and compile error on mac (#26438) 2022-07-12 11:00:17 +08:00
Tao Wang
1de0d35cda
[core][c++ worker]Add namespace support for c++ worker (#26327) 2022-07-12 09:58:26 +08:00
Guyang Song
857e51aadc
[C++ API] script exits if a single command fails(#26344) 2022-07-07 16:54:53 +08:00
Guyang Song
cf7305a2c9
Revert "[Core] Add retry exception allowlist for user-defined filteri… (#26289)
Closes #26287.
2022-07-05 15:17:36 -07:00
Kai Yang
7ea9d91e1a
[C++ worker] Refine worker context and more (#26281)
* Avoid depending on `CoreWorkerProcess::GetCoreWorker()` in local mode.
* Fix bug in `LocalModeObjectStore::PutRaw`.
* Remove unused `TaskExecutor::Execute` method.
* Use `Process::Wait` instead of sleep when invoking `ray start` and `ray stop`.
2022-07-05 13:47:28 +08:00
Clark Zinzow
2a4d22fbd2
[Core] Add retry exception allowlist for user-defined filtering of retryable application-level errors. (#25896)
This PR adds support for specifying an exception allowlist (List[Exception]) as the retry_exceptions argument, such that an application-level exception will only be retried if it is in the allowlist.
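
The allowlist semantics can be illustrated with a minimal plain-Python sketch. It does not use Ray; the helper `call_with_retries`, its `max_retries` parameter, and the `Flaky` class are hypothetical names chosen for illustration, and Ray's actual retry machinery may behave differently in its details.

```python
from typing import Callable, Sequence, Type


def call_with_retries(
    fn: Callable[[], object],
    retry_exceptions: Sequence[Type[BaseException]],
    max_retries: int = 3,
) -> object:
    """Call fn, retrying only when the raised exception is in the allowlist.

    Hypothetical helper for illustration; not part of Ray.
    """
    attempts = 0
    while True:
        try:
            return fn()
        except Exception as exc:
            allowed = isinstance(exc, tuple(retry_exceptions))
            if not allowed or attempts >= max_retries:
                raise  # not allowlisted, or retries exhausted
            attempts += 1


class Flaky:
    """Raises ConnectionError on the first two calls, then succeeds."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls < 3:
            raise ConnectionError("transient")
        return "ok"
```

With `retry_exceptions=[ConnectionError]` the flaky callable is retried until it succeeds; with an allowlist that does not contain `ConnectionError`, the first failure propagates immediately.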
2022-07-01 20:06:02 -07:00
Larry
c8a90e00ac
Hide other symbols in libray_api.so only keep ray* (#26069) 2022-07-01 11:51:40 +08:00
Larry
de0ccc1dfc
add default_actor_lifetime param and add more exception info for cpp worker (#25929) 2022-06-29 18:01:01 +08:00
Tao Wang
49cafc6323
[Cpp worker][Java worker]Support Java call Cpp Actor (#25933) 2022-06-29 14:33:32 +08:00
Guyang Song
876fef0fcf
[C++ worker] optimize the log of dynamic library loading (#26120) 2022-06-28 14:08:47 +08:00
Guyang Song
a0fbd54753
[C++ worker] use dynamic library in C++ default_worker (#25720) 2022-06-22 15:11:15 +08:00
clarng
2b270fd9cb
apply isort uniformly for a subset of directories (#25824)
Simplify isort filters and move it into isort cfg file.

With this change, isort no longer applies to diffs other than files in whitelisted directories (isort only supports a blacklist, so we implement that instead). This is much simpler than building our own whitelist logic since our formatter runs multiple codepaths depending on whether it is formatting a single file / PR / entire repo in CI.
2022-06-17 13:40:32 -07:00
Guyang Song
974bbc0f43
[C++ worker] move xlang test to separate test file (#25756) 2022-06-17 11:05:24 +08:00
Tao Wang
2d9af5028e
[Cpp worker]Support cpp call java task (#25757) 2022-06-16 10:02:46 +08:00
Tao Wang
593a522abd
[Cpp worker]Support cpp call java actor (#25581) 2022-06-14 14:17:14 +08:00
SangBin Cho
d89c8aa9f9
[Core] Add more accurate worker exit (#24468)
This PR adds precise reason details regarding worker failures. All of this information is available through either:
- ray list workers
- exceptions from actor failures.

Here's an example when the actor is killed by a SIGKILL (e.g., OOM killer)
```
RayActorError: The actor died unexpectedly before finishing this task.
	class_name: G
	actor_id: e818d2f0521a334daf03540701000000
	pid: 61251
	namespace: 674a49b2-5b9b-4fcc-b6e1-5a1d4b9400d2
	ip: 127.0.0.1
The actor is dead because its worker process has died. Worker exit type: UNEXPECTED_SYSTEM_EXIT Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
```

## Design
Worker failures are reported by the raylet through 2 paths:
(1) When the core worker calls `Disconnect`.
(2) When the worker is unexpectedly killed and its socket is closed, the raylet reports the worker failure.

This PR ensures that all worker failures are reported through `Disconnect`, and includes more detailed information in its metadata.

## Exit types
Previously, the worker exit types were arbitrary and not correctly categorized. This PR reduces the number of worker exit types and includes details for each exit type so that users can easily figure out the root cause of worker crashes.

### Status quo
- SYSTEM ERROR EXIT
    - Failure from the connection (core worker dead)
    - Unexpected exception or exit with exit_code !=0 on core worker
    - Direct call failure
- INTENDED EXIT
    - Shutdown driver
    - Exit_actor
    - exit(0)
    - Actor kill request
    - Task cancel request
- UNUSED_RESOURCE_REMOVED
    - Upon GCS restart, it kills bundles that are not registered to GCS to synchronize the state
- PG_REMOVED
    - When pg is removed, all workers fate share
- CREATION_TASK (INIT ERROR)
    - When actor init has an error
- IDLE
    - When worker is idle and num workers > soft limit (by default num cpus)
- NODE DIED
    - Only can detect when the node of the owner is dead (need improvement)

### New proposal
Remove unnecessary states and add a “details” field. We can categorize failures into 4 types:

- UNEXPECTED_SYSTEM_ERROR_EXIT
    - When the worker crashes unexpectedly
    - Failure from the connection (core worker dead)
    - Unexpected exception or exit with exit_code !=0 on core worker
    - Node died
    - Direct call failure
- INTENDED_USER_EXIT
    - When users request that the worker be killed. No workflow required. Just correctly store the state.
    - Shutdown driver
    - Exit_actor
    - exit(0)
    - Actor kill request
    - Task cancel request
- INTENDED_SYSTEM_EXIT
    - When the system requests that the worker be killed (without an explicit user request)
    - Unused resource removed
    - Pg removed
    - Idle
- ACTOR_INIT_FAILURE (CREATION_TASK_FAILED)
    - When actor init fails, the process fate-shares with the actor.
    - Actor init failed
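
The proposed categorization can be written down as a small lookup table. This is a sketch under stated assumptions: the enum member names follow the proposal above, while the cause labels (for example `"pg_removed"`) and the helper `describe_exit` are hypothetical and are not Ray identifiers. The message format mirrors the "Worker exit type: ... Worker exit detail: ..." text in the example error.

```python
from enum import Enum


class WorkerExitType(Enum):
    """The four categories from the proposal."""

    UNEXPECTED_SYSTEM_ERROR_EXIT = "UNEXPECTED_SYSTEM_ERROR_EXIT"
    INTENDED_USER_EXIT = "INTENDED_USER_EXIT"
    INTENDED_SYSTEM_EXIT = "INTENDED_SYSTEM_EXIT"
    ACTOR_INIT_FAILURE = "ACTOR_INIT_FAILURE"


# Hypothetical cause labels derived from the bullet lists in the proposal.
CAUSE_TO_EXIT_TYPE = {
    "connection_failure": WorkerExitType.UNEXPECTED_SYSTEM_ERROR_EXIT,
    "nonzero_exit_code": WorkerExitType.UNEXPECTED_SYSTEM_ERROR_EXIT,
    "node_died": WorkerExitType.UNEXPECTED_SYSTEM_ERROR_EXIT,
    "direct_call_failure": WorkerExitType.UNEXPECTED_SYSTEM_ERROR_EXIT,
    "driver_shutdown": WorkerExitType.INTENDED_USER_EXIT,
    "exit_actor": WorkerExitType.INTENDED_USER_EXIT,
    "exit_zero": WorkerExitType.INTENDED_USER_EXIT,
    "actor_kill_request": WorkerExitType.INTENDED_USER_EXIT,
    "task_cancel_request": WorkerExitType.INTENDED_USER_EXIT,
    "unused_resource_removed": WorkerExitType.INTENDED_SYSTEM_EXIT,
    "pg_removed": WorkerExitType.INTENDED_SYSTEM_EXIT,
    "idle": WorkerExitType.INTENDED_SYSTEM_EXIT,
    "actor_init_failed": WorkerExitType.ACTOR_INIT_FAILURE,
}


def describe_exit(cause: str, detail: str) -> str:
    """Combine the coarse exit type with a free-form detail string."""
    exit_type = CAUSE_TO_EXIT_TYPE[cause]
    return f"Worker exit type: {exit_type.name} Worker exit detail: {detail}"
```

Keeping the type coarse and moving specifics into the detail string is the design choice the proposal argues for: the enum stays small, and the root cause remains readable.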

## Limitation (Follow up)
Worker failures are not reported under the following circumstances:
- The worker fails before it registers its information with GCS (this usually comes from a critical system bug and is extremely uncommon).
- The node fails. In this case, we should track the Node ID -> Worker ID mapping at GCS, and when the node fails, we should record the worker metadata.

I will create issues to track these problems.
2022-05-19 19:48:52 -07:00
Qing Wang
eb29895dbb
[Core] Remove multiple core workers in one process 1/n. (#24147)
This is the first PR to remove the code path for multiple core workers in one process. It removes the flags and APIs related to `num_workers`.
After this PR is checked in, we no longer need to consider multiple core workers.

Follow-up PRs will handle the deeper logic refactoring, such as eliminating the gap between core worker and core worker process and removing the logic related to multiple workers from the worker pool, GCS, etc.

**BREAKING CHANGE**
This PR removes these APIs:
- Ray.wrapRunnable();
- Ray.wrapCallable();
- Ray.setAsyncContext();
- Ray.getAsyncContext();

The following APIs can no longer be invoked from a user-created thread in local mode:
- Ray.getRuntimeContext().getCurrentActorId();
- Ray.getRuntimeContext().getCurrentTaskId()

Note that this PR shouldn't be merged to 1.x.
2022-05-19 00:36:22 +08:00
Yi Cheng
61c9186b59
[2][cleanup][gcs] Cleanup GCS client options. (#23519)
This PR cleans up the GCS client options.
2022-03-29 12:01:58 -07:00
Yi Cheng
7de751dbab
[1][core][cleanup] remove enable gcs bootstrap in cpp. (#23518)
This PR removes the enable_gcs_bootstrap flag in cpp.
2022-03-28 21:37:24 -07:00
Larry
81dcf9ff35
[Placement Group] Make PlacementGroupID generate from JobID (#23175) 2022-03-21 17:09:16 +08:00
qicosmos
d8de5a445a
[C++ Worker]Python call cpp actor (#23061)
The [last PR](https://github.com/ray-project/ray/pull/22820) added support for calling C++ normal tasks from Python; this PR adds support for calling C++ actor tasks from Python.
2022-03-15 19:54:10 -07:00
Kai Yang
e9755d87a6
[Lint] One parameter/argument per line for C++ code (#22725)
It's really annoying to deal with parameter/argument conflicts. It is even more frustrating when we merge code from the community into Ant's internal code base and hit hundreds of conflicts caused by parameters/arguments.

In this PR, I updated the clang-format style to make parameters/arguments stay on different lines if they can't fit into a single line.

There are several benefits:

* Conflict resolving is easier.
* Fewer potential human mistakes when resolving conflicts.
* Git history and Git blame are more straightforward.
* Better readability.
* Align with the new Python format style.
2022-03-13 17:05:44 +08:00
qicosmos
e4a9517739
[C++ Worker]Python call cpp worker (#22820) 2022-03-10 11:06:14 -08:00
qicosmos
b8fbec1212
[C++ Worker]fix cpp api test (#22232) 2022-02-10 16:06:38 +08:00
Stephanie Wang
dcd96ca348
[core] Increment ref count when creating an ObjectRef to prevent object from going out of scope (#22120)
When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope.

This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.

This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.

This is a re-merge for #21719 with a fix for removing the owned object ref if creation fails.
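
The race window can be illustrated with a toy reference counter. This is a sketch under stated assumptions, not Ray's reference counting code: `ToyRefCounter`, `create_object_ref`, and the `increment_on_create` switch are hypothetical names, and the "async protocol" is reduced to a single in-scope check between the backend and frontend steps.

```python
from typing import Dict


class ToyRefCounter:
    """Toy model of a backend ref counter (illustrative only)."""

    def __init__(self, increment_on_create: bool) -> None:
        self.increment_on_create = increment_on_create
        self.counts: Dict[str, int] = {}

    def add_owned_object(self, object_id: str) -> None:
        # Backend side: called for a put() or a task submission.
        self.counts[object_id] = 1 if self.increment_on_create else 0

    def add_local_ref(self, object_id: str) -> None:
        self.counts[object_id] += 1

    def in_scope(self, object_id: str) -> bool:
        return self.counts.get(object_id, 0) > 0


def create_object_ref(counter: ToyRefCounter, object_id: str) -> bool:
    """Create a ref and report what a concurrent observer would have seen."""
    counter.add_owned_object(object_id)
    # An async protocol may check the object here, before the language
    # frontend has wrapped the returned ID in an ObjectRef.
    seen_in_scope = counter.in_scope(object_id)
    if not counter.increment_on_create:
        # Old behaviour: the frontend performs the initial increment.
        counter.add_local_ref(object_id)
    # New behaviour: the frontend skips the initial increment.
    return seen_in_scope
```

Both variants end with a count of 1, but only the variant that increments in the backend is in scope at the moment of the intermediate check.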
2022-02-08 14:50:50 -08:00
Guyang Song
36ba514f9c
[Doc] Fix bad doc and recover doc of c++ api (#22213) 2022-02-08 19:04:37 +08:00
Guyang Song
8e1e783596
fix "team:xxx" tag of cpp tests #22163
Cpp worker tests should be part of ray core.
2022-02-07 11:33:55 -08:00
SangBin Cho
6dda196f47
Revert "[core] Increment ref count when creating an ObjectRef to prev… (#22106)
This reverts commit e3af828220.
2022-02-04 00:55:45 -08:00
Stephanie Wang
e3af828220
[core] Increment ref count when creating an ObjectRef to prevent object from going out of scope (#21719)
When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope.

This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.

This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.
2022-02-03 17:31:27 -08:00
Qing Wang
a37d9a2ec2
[Core] Support default actor lifetime. (#21283)
Add the ability to specify a default lifetime for actors that are created without an explicit lifetime. This is a job-level configuration item.
#### API Change
The Python API looks like:
```python
  ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
```

Java API looks like:
```java
  System.setProperty("ray.job.default-actor-lifetime", defaultActorLifetime.name());
  Ray.init();
```

One example usage is:
```python
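  import ray
  from ray.job_config import JobConfig
  # `A` is assumed to be a class decorated with @ray.remote.
  ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
  a1 = A.options(lifetime="non_detached").remote()   # a1 is a non-detached actor.
  a2 = A.remote()  # a2 is a detached actor (it inherits the job-level default).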
```
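
The precedence between a per-actor lifetime and the job-level default can be sketched in plain Python. The function `resolve_actor_lifetime` is hypothetical and is not part of Ray; the fallback to `"non_detached"` when neither value is set is an assumption made for this sketch.

```python
from typing import Optional

VALID_LIFETIMES = ("detached", "non_detached")


def resolve_actor_lifetime(
    explicit: Optional[str], job_default: Optional[str]
) -> str:
    """Pick the lifetime of an actor: explicit option first, then job default.

    Hypothetical helper for illustration; not part of Ray.
    """
    for value in (explicit, job_default):
        if value is not None:
            if value not in VALID_LIFETIMES:
                raise ValueError(f"unknown lifetime: {value!r}")
            return value
    # Assumed built-in fallback when nothing is configured.
    return "non_detached"
```

An explicitly specified lifetime always wins; the job-level default only applies to actors created without one.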

Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
2022-01-22 12:26:08 +08:00
qicosmos
7172802d8c
[C++ Worker][xlang] Support calling python worker (#21390)
The C++ API needs to call Python and Java workers; this PR adds support for calling Python workers. Calling a Python worker is similar to calling a C++ worker: you pass `PyFunction`, `PyActorClass`, and `PyActorMethod`.

## call python normal task
```python
#test_cross_language_invocation.py
import ray

@ray.remote
def py_return_input(v):
    return v
```

Calling the Python function from the C++ API:
```c++
  auto py_obj1 = ray::Task(ray::PyFunction</*ReturnType*/int>{/*module_name=*/"test_cross_language_invocation",
                                                     /*function_name=*/"py_return_input"})
                     .Remote(42);
  EXPECT_EQ(42, *py_obj1.Get());
```
The user needs to provide the Python module name and function name, then pass the arguments to `Remote`.
The user also needs to specify the return type and argument types of the Python function; these are used for static type checking and for retrieving the result.

## call python actor task
```python
#test_cross_language_invocation.py
@ray.remote
class Counter(object):
    def __init__(self, value):
        self.value = int(value)

    def increase(self, delta):
        self.value += int(delta)
        return str(self.value)
```
Calling the Python actor method from the C++ API:
```c++
  // Create  python actor
  auto py_actor_handle =
      ray::Actor(ray::PyActorClass{/*module_name=*/"test_cross_language_invocation",  /*class_name=*/"Counter"})
          .Remote(1);
  EXPECT_TRUE(!py_actor_handle.ID().empty());

  // Call python actor task
  auto py_actor_ret =
      py_actor_handle.Task(ray::PyActorMethod</*ReturnType*/std::string>{/*actor_function_name=*/"increase"}).Remote(1);
  EXPECT_EQ("2", *py_actor_ret.Get());
```
The user needs to provide the Python module name and class name when creating a Python actor.

`PyActorMethod` only needs the function name.

This is similar to calling a C++ actor task and also has compile-time safety checks.
2022-01-21 13:55:30 +08:00
Yi Cheng
82103bf7c1
[gcs/ha] Fix cpp tests related to redis removal (#21628)
This PR fixes the cpp tests and makes ray cpp pass as well.
2022-01-19 01:26:34 -08:00
jon-chuang
5f7224bd51
[C++ API] fix wrong arg handling for object references in TaskExecutor, TaskArgByReference (#21236)
Previously, reference arguments were handled incorrectly: the object ref itself was serialized, instead of the `RayObject`, and passed in the args buffer to the user function.

That's because CoreWorker is the component responsible for ensuring that all ObjectReferences are resolved and serialized into `RayObject`s at the time of the `task_execution_callback` invocation, not any component downstream of the callback. 

This resulted in the following error for large objects which are not turned into `TaskArg::value` due to being over 100KB.
```
C++ exception with description "Invalid: invalid arguments: std::bad_cast" thrown in the test body.
```
This was not caught due to lack of testing for large objects, which has now been added.
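
The distinction between by-value and by-reference arguments can be sketched in plain Python. This is a toy model under stated assumptions, not the C++ worker's code: `ArgByValue`, `ArgByReference`, `make_task_arg`, and `resolve_for_executor` are hypothetical names, a dictionary stands in for the object store, and the exact handling at the 100KB boundary is assumed.

```python
from dataclasses import dataclass
from typing import Dict, Union

# Threshold mentioned in the commit message; boundary handling is assumed.
INLINE_LIMIT_BYTES = 100 * 1024


@dataclass
class ArgByValue:
    data: bytes


@dataclass
class ArgByReference:
    object_id: str


TaskArg = Union[ArgByValue, ArgByReference]


def make_task_arg(data: bytes, store: Dict[str, bytes]) -> TaskArg:
    """Inline small payloads; put large ones in the store and pass a reference."""
    if len(data) <= INLINE_LIMIT_BYTES:
        return ArgByValue(data)
    object_id = f"obj-{len(store)}"
    store[object_id] = data
    return ArgByReference(object_id)


def resolve_for_executor(arg: TaskArg, store: Dict[str, bytes]) -> bytes:
    """Return what the user function should receive: always the object's bytes."""
    if isinstance(arg, ArgByReference):
        # The reference is resolved to the stored object, never forwarded as-is.
        return store[arg.object_id]
    return arg.data
```

In this model the bug corresponds to handing the user function the serialized reference instead of the resolved object, which only shows up once an argument is large enough to be passed by reference.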
2022-01-17 12:08:15 +08:00