Commit graph

37 commits

Author SHA1 Message Date
Yi Cheng
b729d458e2
[client] Move Client implementation of ObjectRef/ActorRef to python (#22148)
`__dealloc__` is not allowed to call Python code, and doing so leads to two problems:

- The data has already been cleaned up.
- A deadlock can occur if locks are used.

This PR moves the implementation to the Python layer to avoid this.
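
A minimal Python sketch of the idea, not the PR's actual code: cleanup that has to call Python code is placed in a pure-Python `__del__`, which normally runs while the instance's attributes are still intact. The class name `ClientRefSketch` and the `release` callback are hypothetical.

```python
import threading


class ClientRefSketch:
    """Hypothetical stand-in for a client-side object reference."""

    def __init__(self, ref_id, release):
        self._ref_id = ref_id
        self._release = release  # hypothetical callback that drops the remote ref
        self._lock = threading.Lock()
        self._released = False

    def __del__(self):
        # A Python-level finalizer can still read the instance's attributes,
        # which a Cython __dealloc__ cannot safely rely on.
        with self._lock:
            if self._released:
                return
            self._released = True
        # Call out only after dropping the lock, so the callback cannot
        # deadlock on it.
        self._release(self._ref_id)


released = []
ref = ClientRefSketch(b"id-1", released.append)
del ref  # under CPython, the refcount hits zero and __del__ runs here
```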
2022-02-06 13:03:51 -08:00
Yi Cheng
5ae8d5b8af
Revert "Revert "[client] Fix ray client object ref releasing in wrong context."" (#22091)
Reverts ray-project/ray#22090
2022-02-04 14:50:23 -08:00
Yi Cheng
7ff1cbbb12
Revert "[client] Fix ray client object ref releasing in wrong context." (#22090)
Reverts ray-project/ray#22025
2022-02-03 13:59:52 -08:00
Yi Cheng
588d540b68
[client] Fix ray client object ref releasing in wrong context. (#22025) 2022-02-01 22:42:39 -08:00
Qing Wang
a37d9a2ec2
[Core] Support default actor lifetime. (#21283)
Support specifying a default lifetime for actors that are created without an explicit lifetime. This is a job-level configuration item.
#### API Change
The Python API looks like:
```python
  ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
```

Java API looks like:
```java
  System.setProperty("ray.job.default-actor-lifetime", defaultActorLifetime.name());
  Ray.init();
```

One example usage is:
```python
  ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
  a1 = A.options(lifetime="non_detached").remote()   # a1 is a non-detached actor.
  a2 = A.remote()  # a2 is a detached actor (the job-level default applies).
```
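
The precedence described above can be sketched in plain Python. This is an illustration only: `resolve_lifetime` is a hypothetical helper, not a Ray API, and the fallback to `"non_detached"` when neither value is set is an assumption.

```python
from typing import Optional

VALID_LIFETIMES = ("detached", "non_detached")


def resolve_lifetime(explicit: Optional[str], job_default: Optional[str]) -> str:
    """Pick the lifetime for a new actor (hypothetical helper)."""
    for value in (explicit, job_default):
        if value is not None and value not in VALID_LIFETIMES:
            raise ValueError(f"unknown lifetime: {value!r}")
    if explicit is not None:
        # A lifetime passed at creation time always wins.
        return explicit
    if job_default is not None:
        # Otherwise the job-level default applies.
        return job_default
    # Assumption: with nothing configured, actors are non-detached.
    return "non_detached"
```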

Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
2022-01-22 12:26:08 +08:00
Stephanie Wang
3a5dd9a10b
[core] Pin object if it already exists (#20447)
A worker can crash right after putting its return values into the object store. Then, the owner will receive the worker crashed error, but the return objects will still be in the remote object store. Later, if the task is retried, the worker will crash on [this line](https://github.com/ray-project/ray/blob/master/src/ray/core_worker/transport/direct_actor_transport.cc#L105) because the object already exists.

Another way this can happen is if a task has multiple return values, and one of those return values is transferred to another node. If the task is later re-executed on that node, the task will fail because of the same error.

This PR fixes the crash so that:
1. If an object already exists, we try to pin that copy. Ideally, we should destroy the old copy and create the new one to make sure that metadata like the owner address is in sync, but this is pretty complicated to do right now.
2. If the pinning fails, we store an OBJECT_LOST error to throw to the application.
3. On the raylet, we check whether we already have the object pinned, and only subscribe to the owner's eviction message if the object is not pinned.
4. Also fixes bugs in the analogous case for `ray.put` (previously this would hang, now the application will receive an error if a `ray.put` object already exists).
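
Steps 1 and 2 can be sketched as follows. This is a simplified illustration, not the actual core worker code: the dict standing in for the object store, the `pin_existing` callable, and the `PutResult` enum are all hypothetical.

```python
from enum import Enum


class PutResult(Enum):
    CREATED = "created"
    PINNED_EXISTING = "pinned_existing"
    OBJECT_LOST = "object_lost"


def store_return_value(store, object_id, value, pin_existing):
    """Store a task return value, tolerating a copy that already exists.

    `store` is a dict standing in for the object store, and `pin_existing`
    is a callable that returns True when the existing copy was pinned.
    Both are hypothetical stand-ins.
    """
    if object_id not in store:
        store[object_id] = value
        return PutResult.CREATED
    # The object already exists: keep the old copy and try to pin it
    # instead of crashing.
    if pin_existing(object_id):
        return PutResult.PINNED_EXISTING
    # Pinning failed: report an error for the application to handle.
    return PutResult.OBJECT_LOST
```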
2021-12-10 15:56:43 -08:00
Jiajun Yao
5b168a1515
[Scheduler] Support per task/actor PlacementGroupSchedulingStrategy (#20507)
This PR adds a per-task/actor scheduling strategy. Currently the only strategies are PlacementGroupSchedulingStrategy and DefaultSchedulingStrategy.

Going forward, people should use `scheduling_strategy=PlacementGroupSchedulingStrategy` to define the placement group for an actor or task. The old way will be deprecated.
2021-12-07 23:11:31 -08:00
Qing Wang
048e7f7d5d
[Core] Port concurrency groups with asyncio (#18567)
## Why are these changes needed?
This PR ports the concurrency groups functionality to asyncio actors in Python.

### API
```python
@ray.remote(concurrency_groups={"io": 2, "compute": 4})
class AsyncActor:
    def __init__(self):
        pass

    @ray.method(concurrency_group="io")
    async def f1(self):
        pass

    @ray.method(concurrency_group="io")
    def f2(self):
        pass

    @ray.method(concurrency_group="compute")
    def f3(self):
        pass

    @ray.method(concurrency_group="compute")
    def f4(self):
        pass

    def f5(self):
        pass
```
The decorator on the actor class `AsyncActor` declares two concurrency groups, `io` and `compute`, along with their maximum concurrencies; the actor also has a default concurrency group. Every concurrency group has its own asyncio event loop and a pythread to execute the methods defined on it.

Methods `f1` and `f2` will be invoked in the `io` concurrency group, and `f3` and `f4` in `compute`.
Note that `f5` and `__init__` will be invoked in the default concurrency group.

The following call invokes `f2` in the concurrency group `compute`, because a group specified at call time takes priority over the one declared on the method.
```python
a.f2.options(concurrency_group="compute").remote()
```

### Implementation
The implementation details are straightforward:
- Before, we had only 1 event loop bound to 1 pythread for an asyncio actor. Now we create 1 event loop bound to 1 pythread for every concurrency group of the asyncio actor.
- Before, we had 1 fiber state for every caller in the asyncio actor. Now we create a FiberStateManager for every caller in the asyncio actor, and the FiberStateManager manages the fiber states for the concurrency groups.
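
The first point can be illustrated with a pure-Python analogy that uses only the standard library. This is a sketch, not Ray's implementation: `GroupLoops` and its methods are hypothetical names, and the fiber states handled on the C++ side are not modeled.

```python
import asyncio
import threading


async def _make_semaphore(limit):
    # Created from inside the loop's thread so it binds to that loop.
    return asyncio.Semaphore(limit)


class GroupLoops:
    """One event loop, running in its own thread, per concurrency group."""

    def __init__(self, groups):
        self._loops = {}
        self._limits = {}
        for name, limit in groups.items():
            loop = asyncio.new_event_loop()
            threading.Thread(target=loop.run_forever, daemon=True).start()
            self._loops[name] = loop
            self._limits[name] = asyncio.run_coroutine_threadsafe(
                _make_semaphore(limit), loop
            ).result()

    def submit(self, group, coro_fn, *args):
        """Schedule coro_fn(*args) on the group's loop; returns a concurrent future."""
        limit = self._limits[group]

        async def run_limited():
            # The semaphore caps how many calls run at once in this group.
            async with limit:
                return await coro_fn(*args)

        return asyncio.run_coroutine_threadsafe(run_limited(), self._loops[group])

    def shutdown(self):
        for loop in self._loops.values():
            loop.call_soon_threadsafe(loop.stop)


async def current_thread_name():
    return threading.current_thread().name


loops = GroupLoops({"io": 2, "compute": 4})
io_thread = loops.submit("io", current_thread_name).result()
compute_thread = loops.submit("compute", current_thread_name).result()
loops.shutdown()
```

Here calls submitted to `io` and `compute` run on different threads, so a slow call in one group does not block the event loop of the other.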


## Related issue number
#16047
2021-10-21 21:46:56 +08:00
Edward Oakes
1fa81673bd
[runtime_env] Clean up validation logic (#18984)
Splits the runtime_env parsing/validation and overriding into two separate codepaths. Adds unit testing for both.
2021-10-07 14:24:41 -05:00
mwtian
e41109a5e7
[Client] Use async rpc for remote call and actor creation (#18298)
* Use async rpc for remote calls, task and actor creations.

* fix

* check placement

* check placement group. wait for id in destructor

* fix

* fix exception in destructor

* Add test

* revert change

* Fix comment

* fix
2021-09-22 18:30:50 -07:00
mwtian
32f71765e9
[Client] Allow Client{Object,Actor}Ref to accept a future. (#18677)
* Allow Client{Object,Actor}Ref to accept a future. Check number of args and returns synchronously.

* rename callback, fix
2021-09-18 16:32:02 -07:00
Stephanie Wang
284dee493e
[core][usability] Disambiguate ObjectLostErrors for better understandability (#18292)
* Define error types, throw error for ObjectReleased

* x

* Disambiguate OBJECT_UNRECONSTRUCTABLE and OBJECT_LOST

* OwnerDiedError

* fix test

* x

* ObjectReconstructionFailed

* ObjectReconstructionFailed

* x

* x

* print owner addr

* str

* doc

* rename

* x
2021-09-13 16:16:17 -07:00
Stephanie Wang
d43d297d9a
[core] Attach call site to ObjectRefs, print on error (#17971)
* Attach call site to ObjectRef

* flag

* Fix build

* build

* build

* build

* x

* x

* skip on windows

* lint
2021-09-01 15:29:05 -07:00
Chen Shen
9565fa549e
[Core][RFC] limit the total number of inlined bytes in task request rpc
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2021-08-12 13:55:54 -07:00
Jialing He
492076806d
[object store] Assign the object owner in ray.put() (#16833) 2021-07-20 11:06:00 -07:00
Stephanie Wang
ce25d4e896
[core] Record Plasma object sources and dump on out of memory (#16179)
* debug

* lint, build

* clean up logs

* fix build
2021-06-02 10:04:15 -07:00
architkulkarni
a0c1cfe034
[Core] Pass RuntimeEnv as opaque string in the task spec (#15658) 2021-05-13 10:32:00 -05:00
fyrestone
52cfa1cdd7
Fix load code from local (#12102) 2021-03-24 11:49:58 +08:00
Clark Zinzow
cd7e567a57
[Core] Ownership-based Object Directory - Added support for object spilling in the ownership-based object directory. (#13948)
* Add support for object spilling in the ownership-based object directory.

* Move owner address hashmap into pinned_objects_ and objects_pending_spill_.

* Update local object manager tests.

* Feedback and misc. fixes.

* Move spilled unpin callback lambda to std::binded private method.

* Skip test_delete_objects_multi_node test on MacOS for now.
2021-02-11 10:36:22 -08:00
Hao Chen
77cd0d5a21
Fix a crash problem caused by GetActorHandle in ActorManager (#13164) 2021-01-08 12:11:08 +08:00
mehrdadn
fb5280f21b
Fix some Windows CI issues (#9708)
Co-authored-by: Mehrdad <noreply@github.com>
2020-07-28 18:10:23 -07:00
Hao Chen
d49dadf891
Change Python's ObjectID to ObjectRef (#9353) 2020-07-10 17:49:04 +08:00
SangBin Cho
8f19f1eafb
[Core] Actor handle refactoring (#8895)
* Marking needed changes.

* Resolve basic dependencies.

* In progress.

* linting.

* In progress 2.

* Linting.

* Refactor done. Cleanup needed.

* Linting.

* Recover kill actor in core worker because it is used inside raylet

* Cleanup.

* Use unique pointer instead. Unit tests are broken now.

* Fix the upstream change.

* Addressed code review 1.

* Lint.

* Addressed code review 2.

* Fix weird github history.

* Lint.

* Linting using clang 7.0.

* Use a better check message.

* Revert cpp stuff.

* Fix weird linting errors.

* Manually fix all lint issues.

* Update a newline.

* Refactor some interface.

* Addressed all code review.

* Addressed code review
2020-07-07 11:11:41 -07:00
mehrdadn
92f67cd2ae
Add Optional Fast Build Configuration (#8925)
* Fast builds by default

* Update doc/source/development.rst

Co-authored-by: Simon Mo <xmo@berkeley.edu>

Co-authored-by: Mehrdad <noreply@github.com>
Co-authored-by: Simon Mo <xmo@berkeley.edu>
2020-06-18 14:12:12 -07:00
Edward Oakes
2677b71003
Implement named actors using the GCS service (#8328) 2020-05-09 08:58:10 -05:00
Kai Yang
48b48cc8c2
Support multiple core workers in one process (#7623) 2020-04-07 11:01:47 +08:00
ijrsvt
9bfc2c4b54
Moving Local Mode to C++ (#7670) 2020-04-01 15:50:57 -05:00
Simon Mo
b804d40c04
Stop vendoring pyarrow (#7233) 2020-02-19 19:01:26 -08:00
Simon Mo
7bef7031c2
Revert "Revert "Revert "Removing Pyarrow dependency (#7146)" (#7209) (#7214)" (#7232) 2020-02-19 13:35:29 -08:00
Simon Mo
e8941b1b79
Revert "Revert "Removing Pyarrow dependency (#7146)" (#7209) (#7214) 2020-02-19 10:08:52 -08:00
Eric Liang
0aa9373d62
Revert "Removing Pyarrow dependency (#7146)" (#7209)
This reverts commit 2116fd3bca.
2020-02-18 14:12:06 -08:00
ijrsvt
2116fd3bca
Removing Pyarrow dependency (#7146) 2020-02-17 18:00:13 -08:00
Simon Mo
0e94e1dc2a
[Asyncio] Increase recursion limit manually (#7142) 2020-02-12 14:15:36 -08:00
fyrestone
0648bd28ef
[xlang] Cross language Python support (#6709) 2020-02-08 13:01:28 +08:00
Edward Oakes
984490d2be
Collect object IDs during serialization (#6946) 2020-02-03 18:38:11 -08:00
Edward Oakes
2a4d2c6e9e
Basic reference counting & pinning (#6554) 2020-01-06 17:30:26 -06:00
Chaokun Yang
6272907a57
[Streaming] Streaming data transfer and python integration (#6185) 2019-12-10 20:33:24 +08:00