Commit graph

158 commits

Author SHA1 Message Date
Guyang Song
cf7305a2c9
Revert "[Core] Add retry exception allowlist for user-defined filteri… (#26289)
Closes #26287.
2022-07-05 15:17:36 -07:00
Kai Yang
7ea9d91e1a
[C++ worker] Refine worker context and more (#26281)
* Avoid depending on `CoreWorkerProcess::GetCoreWorker()` in local mode.
* Fix bug in `LocalModeObjectStore::PutRaw`.
* Remove unused `TaskExecutor::Execute` method.
* Use `Process::Wait` instead of sleep when invoking `ray start` and `ray stop`.
2022-07-05 13:47:28 +08:00
Clark Zinzow
2a4d22fbd2
[Core] Add retry exception allowlist for user-defined filtering of retryable application-level errors. (#25896)
This PR adds supported for specifying an exception allowlist (List[Exception]) as the retry_exceptions argument, such that an application-level exception will only be retried if it is in the allowlist.
2022-07-01 20:06:02 -07:00
Larry
c8a90e00ac
Hide other symbols in libray_api.so only keep ray* (#26069) 2022-07-01 11:51:40 +08:00
Larry
de0ccc1dfc
add default_actor_lifetime param and add more exception info for cpp worker (#25929) 2022-06-29 18:01:01 +08:00
Tao Wang
49cafc6323
[Cpp worker][Java worker]Support Java call Cpp Actor (#25933) 2022-06-29 14:33:32 +08:00
Guyang Song
876fef0fcf
[C++ worker] optimize the log of dynamic library loading (#26120) 2022-06-28 14:08:47 +08:00
Guyang Song
a0fbd54753
[C++ worker] use dynamic library in C++ default_worker (#25720) 2022-06-22 15:11:15 +08:00
clarng
2b270fd9cb
apply isort uniformly for a subset of directories (#25824)
Simplify isort filters and move it into isort cfg file.

With this change, isort will not longer apply to diffs other than to files that are in whitelisted directory (isort only supports blacklist so we implement that instead) This is much simpler than building our own whitelist logic since our formatter runs multiple codepaths depending on whether it is formatting a single file / PR / entire repo in CI.
2022-06-17 13:40:32 -07:00
Guyang Song
974bbc0f43
[C++ worker] move xlang test to separate test file (#25756) 2022-06-17 11:05:24 +08:00
Tao Wang
2d9af5028e
[Cpp worker]Support cpp call java task (#25757) 2022-06-16 10:02:46 +08:00
Tao Wang
593a522abd
[Cpp worker]Support cpp call java actor (#25581) 2022-06-14 14:17:14 +08:00
SangBin Cho
d89c8aa9f9
[Core] Add more accurate worker exit (#24468)
This PR adds precise reason details regarding worker failures. All information is available either by 
- ray list workers
- exceptions from actor failures.

Here's an example when the actor is killed by a SIGKILL (e.g., OOM killer)
```
RayActorError: The actor died unexpectedly before finishing this task.
	class_name: G
	actor_id: e818d2f0521a334daf03540701000000
	pid: 61251
	namespace: 674a49b2-5b9b-4fcc-b6e1-5a1d4b9400d2
	ip: 127.0.0.1
The actor is dead because its worker process has died. Worker exit type: UNEXPECTED_SYSTEM_EXIT Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
```

## Design
Worker failures are reported by Raylet from 2 paths.
(1) When the core worker calls `Disconnect`.
(2) When the worker is unexpectedly killed, the socket is closed, raylet reports the worker failures.

The PR ensures all worker failures are reported through Disconnect while it includes more detailed information to its metadata.

## Exit types
Previously, the worker exit types are arbitrary and not correctly categorized. This PR reduces the number of worker exit types while it includes details of each exit type so that users can easily figure out the root cause of worker crashes. 

### Status quo
- SYSTEM ERROR EXIT
    - Failure from the connection (core worker dead)
    - Unexpected exception or exit with exit_code !=0 on core worker
    - Direct call failure
- INTENDED EXIT
    - Shutdown driver
    - Exit_actor
    - exit(0)
    - Actor kill request
    - Task cancel request
- UNUSED_RESOURCE_REMOVED
     - Upon GCS restart, it kills bundles that are not registered to GCS to synchronize the state
- PG_REMOVED
    - When pg is removed, all workers fate share
- CREATION_TASK (INIT ERROR)
    - When actor init has an error
- IDLE
    - When worker is idle and num workers > soft limit (by default num cpus)
- NODE DIED
    - Only can detect when the node of the owner is dead (need improvement)

### New proposal
Remove unnecessary states and add “details” field. We can categorize failures by 4 types

- UNEXPECTED_SYSTEM_ERROR_EXIT
     - When the worker is crashed unexpectedly
    - Failure from the connection (core worker dead)
    - Unexpected exception or exit with exit_code !=0 on core worker
    - Node died
    - Direct call failure
- INTENDED_USER_EXIT. 
    - When the worker is requested to be killed by users. No workflow required. Just correctly store the state.
    - Shutdown driver
    - Exit_actor
    - exit(0)
    - Actor kill request
    - Task cancel request
- INTENDED_SYSTEM_EXIT
    - When the worker is requested to be killed by system (without explicit user request)
    - Unused resource removed
    - Pg removed
    - Idle
- ACTOR_INIT_FAILURE (CREATION_TASK_FAILED)
     - When the actor init is failed, we fate share the process with the actor. 
     - Actor init failed

## Limitation (Follow up)
Worker failures are not reported under following circumstances
- Worker is failed before it registers its information to GCS (it is usually from critical system bug, and extremely uncommon).
- Node is failed. In this case, we should track Node ID -> Worker ID mapping at GCS and when the node is failed, we should record worker metadata. 

I will create issues to track these problems.
2022-05-19 19:48:52 -07:00
Qing Wang
eb29895dbb
[Core] Remove multiple core workers in one process 1/n. (#24147)
This is the 1st PR to remove the code path of multiple core workers in one process. This PR is aiming to remove the flags and APIs related to `num_workers`.
After this PR checking in, we needn't to consider the multiple core workers any longer.

The further following PRs are related to the deeper logic refactor, like eliminating the gap between core worker and core worker process,  removing the logic related to multiple workers from workerpool, gcs and etc.

**BREAK CHANGE**
This PR removes these APIs:
- Ray.wrapRunnable();
- Ray.wrapCallable();
- Ray.setAsyncContext();
- Ray.getAsyncContext();

And the following APIs are not allowed to invoke in a user-created thread in local mode:
- Ray.getRuntimeContext().getCurrentActorId();
- Ray.getRuntimeContext().getCurrentTaskId()

Note that this PR shouldn't be merged to 1.x.
2022-05-19 00:36:22 +08:00
Yi Cheng
61c9186b59
[2][cleanup][gcs] Cleanup GCS client options. (#23519)
This PR cleanup GCS client options.
2022-03-29 12:01:58 -07:00
Yi Cheng
7de751dbab
[1][core][cleanup] remove enable gcs bootstrap in cpp. (#23518)
This PR remove enable_gcs_bootstrap flag in cpp.
2022-03-28 21:37:24 -07:00
Larry
81dcf9ff35
[Placement Group] Make PlacementGroupID generate from JobID (#23175) 2022-03-21 17:09:16 +08:00
qicosmos
d8de5a445a
[C++ Worker]Python call cpp actor (#23061)
[Last PR](https://github.com/ray-project/ray/pull/22820) has supported python call c++ normal task, this PR supports python call c++ actor task.
2022-03-15 19:54:10 -07:00
Kai Yang
e9755d87a6
[Lint] One parameter/argument per line for C++ code (#22725)
It's really annoying to deal with parameter/argument conflicts. This is even frustrating when we merge code from the community to Ant's internal code base with hundreds of conflicts caused by parameters/arguments.

In this PR, I updated the clang-format style to make parameters/arguments stay on different lines if they can't fit into a single line.

There are several benefits:

* Conflict resolving is easier.
* Less potential human mistakes when resolving conflicts.
* Git history and Git blame are more straightforward.
* Better readability.
* Align with the new Python format style.
2022-03-13 17:05:44 +08:00
qicosmos
e4a9517739
[C++ Worker]Python call cpp worker (#22820) 2022-03-10 11:06:14 -08:00
qicosmos
b8fbec1212
[C++ Worker]fix cpp api test (#22232) 2022-02-10 16:06:38 +08:00
Stephanie Wang
dcd96ca348
[core] Increment ref count when creating an ObjectRef to prevent object from going out of scope (#22120)
When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope.

This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.

This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.

This is a re-merge for #21719 with a fix for removing the owned object ref if creation fails.
2022-02-08 14:50:50 -08:00
Guyang Song
36ba514f9c
[Doc] Fix bad doc and recover doc of c++ api (#22213) 2022-02-08 19:04:37 +08:00
Guyang Song
8e1e783596
fix "team:xxx" tag of cpp tests #22163
Cpp worker tests should be part of ray core.
2022-02-07 11:33:55 -08:00
SangBin Cho
6dda196f47
Revert "[core] Increment ref count when creating an ObjectRef to prev… (#22106)
This reverts commit e3af828220.
2022-02-04 00:55:45 -08:00
Stephanie Wang
e3af828220
[core] Increment ref count when creating an ObjectRef to prevent object from going out of scope (#21719)
When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope.

This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.

This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.
2022-02-03 17:31:27 -08:00
Qing Wang
a37d9a2ec2
[Core] Support default actor lifetime. (#21283)
Support the ability to specify a default lifetime for actors which are not specified lifetime when creating. This is a job level configuration item.
#### API Change
The Python API looks like:
```python
  ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
```

Java API looks like:
```java
  System.setProperty("ray.job.default-actor-lifetime", defaultActorLifetime.name());
  Ray.init();
```

One example usage is:
```python
  ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
  a1 = A.options(lifetime="non_detached").remote()   # a1 is a non-detached actor.
  a2 = A.remote()  # a2 is a non-detached actor.
```

Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
2022-01-22 12:26:08 +08:00
qicosmos
7172802d8c
[C++ Worker][xlang] Support calling python worker (#21390)
C++ API need to call python and java worker, this pr support call python worker. Call python worker is similar with call c++ worker, need to pass PyFunction, PyActorClass and PyActorMethod.

## call python normal task
```python
#test_cross_language_invocation.py
import ray

@ray.remote
def py_return_input(v):
    return v
```

c++ api call python function
```c++
  auto py_obj1 = ray::Task(ray::PyFunction</*ReturnType*/int>{/*module_name=*/"test_cross_language_invocation",
                                                     /*function_name=*/"py_return_input"})
                     .Remote(42);
  EXPECT_EQ(42, *py_obj1.Get());
```
The user need to fill python module name and function name, then pass arguments into the remote.
The user also need to assign the return type and arguments types of the python function, it used to do static safe checking and get result.

## call python actor task
```python
#test_cross_language_invocation.py
@ray.remote
class Counter(object):
    def __init__(self, value):
        self.value = int(value)

    def increase(self, delta):
        self.value += int(delta)
        return str(self.value)
```
c++ api call python actor function
```c++
  // Create  python actor
  auto py_actor_handle =
      ray::Actor(ray::PyActorClass{/*module_name=*/"test_cross_language_invocation",  /*class_name=*/"Counter"})
          .Remote(1);
  EXPECT_TRUE(!py_actor_handle.ID().empty());

  // Call python actor task
  auto py_actor_ret =
      py_actor_handle.Task(ray::PyActorMethod</*ReturnType*/std::string>{/*actor_function_name=*/"increase"}).Remote(1);
  EXPECT_EQ("2", *py_actor_ret.Get());
```
The user need to fill python module name and class name when creating python actor.

PyActorMethod only need to fill the function name.

It's also similar with calling c++ actor task, also has compile-time safe checking.
2022-01-21 13:55:30 +08:00
Yi Cheng
82103bf7c1
[gcs/ha] Fix cpp tests related to redis removal (#21628)
This PR fixed cpp tests and also make ray cpp able to pass.
2022-01-19 01:26:34 -08:00
jon-chuang
5f7224bd51
[C++ API] fix wrong arg handling for object references in TaskExecutor, TaskArgByReference (#21236)
Previously, ref arg is handled wrongly, serializing the object ref, instead of RayObject to be passed as args buffer to the user function. 

That's because CoreWorker is the component responsible for ensuring that all ObjectReferences are resolved and serialized into `RayObject`s at the time of the `task_execution_callback` invocation, not any component downstream of the callback. 

This resulted in the following error for large objects which are not turned into `TaskArg::value` due to being over 100KB.
```
C++ exception with description "Invalid: invalid arguments: std::bad_cast" thrown in the test body.
```
This was not caught due to lack of testing for large objects, which has now been added.
2022-01-17 12:08:15 +08:00
qicosmos
f8244a4cc0
[C++ Worker]fix uninit worker context (#21371) 2022-01-10 17:17:41 +08:00
Yi Cheng
8fa9fddaa0
[1/3][kv] move some internal kv py logic into cpp (#21386)
This PR moves the internal kv namespace logic into cpp to reduce logic in python for the following reasons:

- internal kv is used in x-lang so we have to move it to cpp so that all langs can benefit.
- for https://github.com/ray-project/ray/issues/8822 we need to delete resource when job finished in gcs

One extra field about del is also added so that when delete, we are able to delete by prefix instead of just a key
2022-01-07 17:35:06 -08:00
qicosmos
d1a27487a3
[C++ Worker] fix uninit ray runtime instance (#21125)
In some compiler, the static ray runtime in ray runtime holder maybe a new un-init instance in dynamic library, 
so we need to init ray time holder in dynamic library to make sure the new instance valid.
2021-12-21 12:07:59 +08:00
Yi Cheng
a778741db6
[gcs] Update constructor of gcs client (#21025)
GcsClient accepts only redis before. To make it work without redis, we need to be able to pass gcs address to gcs client as well.

In this PR, we add GCS related into into GcsClientOptions so that we can connect to the gcs directly with gcs address.

This  PR is part of GCS bootstrap. In the following PR, we'll add functionality to set the correct GcsClientOptions based on flags.
2021-12-16 00:19:37 -08:00
WanXing Wang
72bd2d7e09
[Core] Support back pressure for actor tasks. (#20894)
Resubmit the PR https://github.com/ray-project/ray/pull/19936

I've figure out that the test case `//rllib:tests/test_gpus::test_gpus_in_local_mode` failed due to deadlock in local mode.
In local mode, if the user code submits another task during the executing of current task, the `CoreWorker::actor_task_mutex_` may cause deadlock.
The solution is quite simple, release the lock before executing task in local mode.

In the commit 7c2f61c76c:
1. Release the lock in local mode to fix the bug. @scv119 
2. `test_local_mode_deadlock` added to cover the case. @rkooo567 
3. Left a trivial change in `rllib/tests/test_gpus.py` to make the `RAY_CI_RLLIB_DIRECTLY_AFFECTED ` to take effect.
2021-12-13 23:56:07 -08:00
Stephanie Wang
3a5dd9a10b
[core] Pin object if it already exists (#20447)
A worker can crash right after putting its return values into the object store. Then, the owner will receive the worker crashed error, but the return objects will still be in the remote object store. Later, if the task is retried, the worker will crash on [this line](https://github.com/ray-project/ray/blob/master/src/ray/core_worker/transport/direct_actor_transport.cc#L105) because the object already exists.

Another way this can happen is if a task has multiple return values, and one of those return values is transferred to another node. If the task is later re-executed on that node, the task will fail because of the same error.

This PR fixes the crash so that:
1. If an object already exists, we try to pin that copy. Ideally, we should destroy the old copy and create the new one to make sure that metadata like the owner address is in sync, but this is pretty complicated to do right now.
2. If the pinning fails, we store an OBJECT_LOST error to throw to the application.
3. On the raylet, we check whether we already have the object pinned, and only subscribe to the owner's eviction message if the object is not pinned.
4. Also fixes bugs in the analogous case for `ray.put` (previously this would hang, now the application will receive an error if a `ray.put` object already exists).
2021-12-10 15:56:43 -08:00
Jiajun Yao
5b168a1515
[Scheduler] Support per task/actor PlacementGroupSchedulingStrategy (#20507)
This PR adds per task/actor scheduling strategy and currently the only strategy are PlacementGroupSchedulingStrategy and DefaultSchedulingStrategy.

Going forward, people should use `scheduling_strategy=PlacementGroupSchedulingStrategy` to define placement group for actor/task. The old way will be deprecated.
2021-12-07 23:11:31 -08:00
Kai Fricke
d4413299c0
Revert "[Core] Support back pressure for actor tasks (#19936)" (#20880)
This reverts commit a4495941c2.
2021-12-03 17:48:47 -08:00
WanXing Wang
a4495941c2
[Core] Support back pressure for actor tasks (#19936)
Support back pressure in core worker.
Job config added for python worker and java worker.
2021-12-02 14:41:30 -08:00
qicosmos
a49c1d5f55
[C++] Deprecated global named actor and global PGs. (#20468)
Why are these changes needed?
This PR removes global named actor and global PGs.

Related issue number
#20460
2021-11-18 23:21:59 +08:00
Yi Cheng
a4e187c0e7
[gcs] Update function table to use internal kv (#20152)
## Why are these changes needed?
This is a part of redis removal. This PR remove redis kv in function table. 
rpush related code is not updated in this PR.

## Related issue number
2021-11-15 23:34:41 -08:00
Qing Wang
1172195571
[Java] Remove global named actor and global pg (#20135)
This PR removes global named actor and global PGs.

I believe these APIs are not used widely in OSS.
CPP part is not included in this PR.
@kfstorm @clay4444 @raulchen Please take a look if this change is reasonable.


IMPORTANT NOTE: This is a Java API change and will lead backward incompatibility in Java global named actor and global PG usage.

CPP part is not included in this PR.
INCLUDES:

 Remove setGlobalName() and getGlobalActor() APIs.
 Remove getGlobalPlacementGroup() and setGlobalPG
 Add getActor(name, namespace) API
 Add getPlacementGroup(name, namespace) API
 Update doc pages.
2021-11-15 16:28:53 +08:00
Yi Cheng
e54d3117a4
[gcs] Update all redis kv usage in python except function table (#20014)
## Why are these changes needed?
This is part of redis removal project. In this PR all direct usage of redis got removed except function table.
Function table will be migrated in the next PR

## Related issue number
#19443
2021-11-10 20:24:53 -08:00
Tao Wang
60df705b4e
[Cpp]Get next job id globally instead of random selecting (#20102)
## Why are these changes needed?

## Related issue number
Final part of #13984

## Checks

- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
2021-11-09 15:46:57 +08:00
Kai Yang
e84391d1d3
[Core] Encode job ID in randomized task IDs for user-created threads (#19320)
## Why are these changes needed?

Currently, when `WorkerContext::GetCurrentTaskID()` returns a random task ID in user-created threads, and the returned task ID doesn't include the job ID. In this case, subsequent non-actor tasks and return values, and objects created by `ray.put()` don't include the job ID neither. This makes us hard to find the correct job ID from a task or object ID.

This PR updates the task ID generation code to always encode the job ID.

A side-effect of this PR is the change of possibility of task ID collision in user-created threads due to the fixed job ID part. w/o this PR: `sqrt(pi * 256 ^ 12 / 2)` ~= 352 trillion tasks. w/ this PR: `sqrt(pi * 256 ^ 8 / 2)` ~= 5 billion tasks. But this should be OK because the job ID part of task IDs in non-user-created threads are always fixed, so it won't be worse than non-user-created threads.

## Related issue number

## Checks

- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
2021-11-08 21:00:40 +08:00
Alex Wu
146b3d6bcc
[scheduler] Include depth and function descriptor in scheduling class (#20004) 2021-11-05 08:19:48 -07:00
qicosmos
246a901aea
[C++ API] Support object ref args (#19550) 2021-10-29 17:36:53 +08:00
qicosmos
efef38f240
[C++ Worker] Add basic ref counting test cases (#17768) 2021-10-29 11:22:19 +08:00
Qing Wang
048e7f7d5d
[Core] Port concurrency groups with asyncio (#18567)
## Why are these changes needed?
This PR aims to port concurrency groups functionality with asyncio for Python.

### API
```python
@ray.remote(concurrency_groups={"io": 2, "compute": 4})
class AsyncActor:
    def __init__(self):
        pass

    @ray.method(concurrency_group="io")
    async def f1(self):
        pass

    @ray.method(concurrency_group="io")
    def f2(self):
        pass

    @ray.method(concurrency_group="compute")
    def f3(self):
        pass

    @ray.method(concurrency_group="compute")
    def f4(self):
        pass

    def f5(self):
        pass
```
The annotation above the actor class `AsyncActor` defines this actor will have 2 concurrency groups and defines their max concurrencies, and it has a default concurrency group.  Every concurrency group has an async eventloop and a pythread to execute the methods which is defined on them.

Method `f1` will be invoked in the `io` concurrency group. `f2` in `io`, `f3` in `compute` and etc.
TO BE NOTICED, `f5` and `__init__` will be invoked in the default concurrency.

The following method `f2` will be invoked in the concurrency group `compute` since the dynamic specifying has a higher priority.
```python
a.f2.options(concurrency_group="compute").remote()
```

### Implementation
The straightforward implementation details are:
 - Before we only have 1 eventloop binding 1 pythread for an asyncio actor. Now we create 1 eventloop binding 1 pythread for every concurrency group of the asyncio actor.
- Before we have 1 fiber state for every caller in the asyncio actor. Now we create a FiberStateManager for every caller in the asyncio actor. And the FiberStateManager manages the fiber states for concurrency groups.


## Related issue number
#16047
2021-10-21 21:46:56 +08:00
Guyang Song
c04fb62f1d
[C++ worker] set native library path for shared library search (#19376) 2021-10-18 16:03:49 +08:00