Commit graph

1419 commits

Author SHA1 Message Date
Stephanie Wang
53549314c5
[core] Option to fallback to LRU on OutOfMemory (#7410)
* Add a test for LRU fallback

* Update error message

* Upgrade arrow to master

* Integrate with arrow

* Revert "Bazel mirrors (#7385)"

This reverts commit 44aded5272.

* Don't LRU evict

* Revert "Revert "Bazel mirrors (#7385)""

This reverts commit b6359fea78d1bd3925452ca88ac71e0c9e5c7dd3.

* Add lru_evict flag

* fix internal config

* Fix

* upgrade arrow

* debug

* Set free period in config for lru_evict, override max retries to fix
test

* Fix test?

* fix test

* Revert "debug"

This reverts commit 98f01c63a267f38218f5047b1866e4c1c8280017.

* fix exception str

* Fix ref count test

* Shorten travis test?
2020-03-14 11:28:43 -07:00
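Note on the entry above: the idea of falling back to LRU eviction when a store runs out of memory can be sketched as follows. This is a toy illustration only, not the plasma or Ray implementation. `TinyStore`, `StoreFullError`, and the constructor argument `lru_evict` are hypothetical names chosen to echo the commit message, and sizes are counted as plain byte lengths.

```python
from collections import OrderedDict


class StoreFullError(Exception):
    """Hypothetical error raised when an object cannot be stored."""


class TinyStore:
    """Toy byte store with an optional LRU-eviction fallback (illustrative only)."""

    def __init__(self, capacity, lru_evict=False):
        self.capacity = capacity      # total bytes available
        self.lru_evict = lru_evict    # assumption: echoes the idea of an lru_evict flag
        self.used = 0                 # bytes currently stored
        self.objects = OrderedDict()  # least recently used entry comes first

    def get(self, key):
        data = self.objects[key]       # KeyError if evicted or never stored
        self.objects.move_to_end(key)  # mark as most recently used
        return data

    def put(self, key, data):
        if key in self.objects:
            raise ValueError(f"{key!r} is already stored")
        size = len(data)
        if size > self.capacity:
            raise StoreFullError(f"{key!r} ({size} bytes) exceeds capacity")
        # Out of memory: either fail immediately or fall back to LRU eviction.
        while self.used + size > self.capacity:
            if not self.lru_evict:
                raise StoreFullError(f"no room for {key!r} and eviction is disabled")
            _, evicted = self.objects.popitem(last=False)  # drop the LRU entry
            self.used -= len(evicted)
        self.objects[key] = data
        self.used += size
```

With `lru_evict=False` a full store raises instead of silently dropping data, which is the stricter behavior; the flag turns eviction into an explicit opt-in.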
Kai Yang
d6e8f47065
Add a flag to disable reconstruction for a killed actor (#7346) 2020-03-13 19:10:21 +08:00
Qing Wang
f4656d8cc3
[Java] Enable direct call by default. (#7408)
* WIP

* Address comments.

* Linting

* Fix

* Fix

* Fix test

* Fix

* Fix single process ci

* Fix ut

* Update java/test/src/main/java/org/ray/api/test/PlasmaFreeTest.java

* Address comments

* Fix linting

* Minor update comments.

* Fix streaming CI
2020-03-13 12:25:30 +08:00
micafan
cc91ed57dc
[core] Fix losing task state when giving up forward task. (#7525)
* fix NodeManager::Forward task bug on error

* fix lint

* revert spillback task forward
2020-03-13 11:49:44 +08:00
Edward Oakes
768d0b3b3f
Allocate a buffer of 100 calls for each RPC handler (#7573) 2020-03-12 12:05:30 -07:00
ZhuSenlin
b663bc6d67
Use gcs server to replace raylet monitor when RAY_GCS_SERVICE_ENABLED=true (#7166) 2020-03-12 22:13:56 +08:00
fangfengbin
4c834b9d68
Fix the issue that gcs service client ignores error status code (#7539)
* add gcs reply status

* rebase master

* use macro to simplify

* convert status in gcs rpc client

* define a Status message in protobuf

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-03-12 15:08:29 +08:00
Stephanie Wang
fdb528514b
[core] Ref counting for actor handles (#7434)
* tmp

* Move Exit handler into CoreWorker, exit once owner's ref count goes to 0

* fix build

* Remove __ray_terminate__ and add test case for distributed ref counting

* lint

* Remove unused

* Fixes for detached actor, duplicate actor handles

* Remove unused

* Remove creation return ID

* Remove ObjectIDs from python, set references in CoreWorker

* Fix crash

* Fix memory crash

* Fix tests

* fix

* fixes

* fix tests

* fix java build

* fix build

* fix

* check status

* check status
2020-03-10 17:45:07 -07:00
Edward Oakes
119a303ea0
Remove static concurrency limit from gRPC server (#7544) 2020-03-10 16:27:02 -07:00
Edward Oakes
dbbf0c0e70
Add Apache 2 license to C++ files (#7520) 2020-03-10 16:07:17 -07:00
fangfengbin
fa785a2ad2
ServiceBasedGcsClient support detect gcs server availability and retry (#7292) 2020-03-10 21:01:07 +08:00
mehrdadn
fc76586518
Redis on Windows (#7509)
* Switch hiredis on Windows to that of the Windows port of Redis

* Use boost::asio::ip::tcp::socket::native_handle_type

* Use normal hiredis instead of Windows-specific one

* Finish up using normal hiredis

Co-authored-by: Mehrdad <noreply@github.com>
2020-03-09 18:49:54 -07:00
Edward Oakes
b4e2d5317e
Remove experimental.NoReturn (#7475) 2020-03-09 11:09:36 -07:00
Stephanie Wang
95bb0c5357
Upgrade plasma to latest version, use synchronous Seal (#7470)
* Upgrade arrow to master

* fix build

* todo

* lint

* Fix hanging test
2020-03-09 10:30:44 -07:00
Edward Oakes
0abcca258f
Add entries to in-memory store on Put() (#7085) 2020-03-04 10:17:27 -08:00
ijrsvt
fb76092d75
Re-route asyncio plasma code path through raylet instead of direct plasma connection (#7234) 2020-03-03 15:43:46 -05:00
fangfengbin
f5b1062ed9
Fix TwoNodeTest.TestActorTaskCrossNodes testcase when enable gcs service (#7416) 2020-03-03 19:37:38 +08:00
ijrsvt
584645cc7d
Fix Experimental Async API (#7391) 2020-03-02 22:24:20 -06:00
Qing Wang
2771af1036
Fix the bug of unregistered workers in worker pool (#7343)
* Fix

* Fix

* Fix compile

* Fix lint

* Fix linting

* Fix testDeleteObject

* Fix linting

* Update src/ray/raylet/worker_pool.cc

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Update src/ray/raylet/worker_pool.cc

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Update src/ray/raylet/worker_pool.h

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Update src/ray/raylet/worker_pool.cc

Co-Authored-By: Hao Chen <chenh1024@gmail.com>

* Address comments.

* Fix linting

Co-authored-by: Hao Chen <chenh1024@gmail.com>
2020-03-02 16:30:39 +08:00
mehrdadn
5fb5be0ba5
Some bug fixes for Windows (#7374)
* Fix MAP_SHARED check in sys/mman.h

* Fix missing :platform_shims dependency for ray_util

* dlmalloc patch for Arrow
2020-02-28 10:22:32 -08:00
mehrdadn
0efaa9b310
Use Redis for Windows (#7364) 2020-02-28 10:18:56 -08:00
micafan
3f8b1d2756
Fix ServiceBasedGcsGcsClientTest timing bug (#7365) 2020-02-28 12:01:02 -06:00
Edward Oakes
bd9411f849
Call TriggerGlobalGC when the plasma store is full (#7337) 2020-02-27 11:01:49 -08:00
Edward Oakes
55ccfb6089
Fix asyncio actor race condition (#7335) 2020-02-27 10:16:04 -08:00
Edward Oakes
2ad9bc5684
Move plasma retry logic into plasma store provider (#7328) 2020-02-26 16:57:02 -08:00
Eric Liang
b310661338
Add internal_api.global_gc() method, which triggers gc.collect() on all workers (#7327) 2020-02-26 14:09:29 -08:00
fangfengbin
ba494b5281
Fix gcs client rpc operation disorder bug (#7283) 2020-02-26 19:24:24 +08:00
Stephanie Wang
9964657815
Fix plasma bug (#7322) 2020-02-25 18:15:28 -08:00
Edward Oakes
44b4394afa
Remove unused AddContainedObjectIDs (#7323) 2020-02-25 16:42:20 -08:00
mehrdadn
57b33f1bed
Upgrade Boost (#6899) 2020-02-25 14:33:12 -08:00
Eric Liang
f14b6e477b
Raise gRPC message size limit to 100MB (#7269) 2020-02-24 23:22:49 -08:00
Edward Oakes
f2faf8d26e
Fix passing duplicate by-reference arguments (#7306) 2020-02-24 19:18:16 -08:00
Stephanie Wang
2c1f4fd82c
[core] Add long running regression test for distributed ref counting and fix memory leak (#7302)
* Add long running test for serialized IDs and fix mem leak

* comment
2020-02-24 17:58:42 -08:00
Stephanie Wang
2583949637
fix build (#7286) 2020-02-23 13:12:36 -08:00
Stephanie Wang
4c2de7be54
[core] Ref counting for returning object IDs created by a different process (#7221)
* Add regression tests

* Refactor, split RemoveSubmittedTaskReferences into submitted and finished paths

* Add nested return IDs to UpdateFinishedTaskRefs, rename WrapObjectIds

* Basic unit tests pass

* Fix unit test and add an out-of-order regression test

* Add stored_in_objects to ObjectReferenceCount, regression test now passes

* Add an Address to the ReferenceCounter so we can determine ownership

* Set the nested return IDs from the TaskManager

* Add another test

* Simplify

* Update src/ray/core_worker/reference_count_test.cc

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* comments

* Add python test

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-02-22 13:29:48 -08:00
Eric Liang
01dd520797
Remove misleading error message (#7265) 2020-02-21 21:20:40 -08:00
Edward Oakes
d190e73727
Use our own implementation of parallel_memcopy (#7254) 2020-02-21 11:03:50 -08:00
Kai Yang
007333b960
[Java] Support direct call for normal tasks (#7193) 2020-02-21 10:03:34 +08:00
Stephanie Wang
f27bb6eb47
Only hold the RefCount lock if needed (#7249) 2020-02-20 17:10:06 -08:00
Edward Oakes
16e37416cd
Fix raylet pinning race condition (#7235) 2020-02-20 10:41:36 -08:00
Stephanie Wang
7e3819a27a
[core] Eagerly evict objects that are no longer in scope (#7220)
* Batch free requests, and free when object is unpinned

* rename

* note
2020-02-19 20:51:38 -08:00
Simon Mo
b804d40c04
Stop vendoring pyarrow (#7233) 2020-02-19 19:01:26 -08:00
Simon Mo
7bef7031c2
Revert "Revert "Revert "Removing Pyarrow dependency (#7146)" (#7209) (#7214)" (#7232) 2020-02-19 13:35:29 -08:00
Simon Mo
e8941b1b79
Revert "Revert "Removing Pyarrow dependency (#7146)" (#7209) (#7214) 2020-02-19 10:08:52 -08:00
Stephanie Wang
f76ce836b2
Distributed ref counting for serialized ObjectIDs (#6945)
* Skeleton plus a unit test for simple borrower case

* First unit test passes - forward an ID and task returns with 1 submitted task pending on the inner ID

* Invariant for contained_in

* Unit test passes for testing task return without creating a borrower

* Wrap ref count functionality in test case

* Fix bad delete

* Unit test and fix for borrowers creating more borrowers

* Unit test and fix for simple borrowing, but owner sends call after borrower's ref count goes to 0

* Refactor:
- keep a sentinel ref count for task argument IDs
- keep contained_in_borrowed in addition to contained_in_owned

* Unit test for nested IDs passes

* Refactor so that an object ID can only be contained in 1 borrowed ID at a time

* Add check

* Fix

* Unit test (passes) to test nesting object IDs but no borrowers created

* Unit test for nested objects from different owners passes, refactor to unset contained_in when popping refs

* Unit tests for borrowers receiving an ObjectID from multiple sources,
skip adding ownership info if we already have it to handle duplicate
refs

* Unit test for returning object ID passes

* More unit tests for returning object IDs pass

* Add serialized ID tests

* fix serialization issue

* remove swap

* It builds!

* debugging and some fixes:
- register handler for WaitForRefRemoved
- don't create a python reference for arg IDs
- pass in client factory into ReferenceCounter
- fix bad decrement in PopBorrowerRefs

* Fix accounting for serialized IDs:
- don't decrement for IDs on dependency resolution, wait until task finished
- add object IDs that were inlined when building the arguments to the task spec, pin these on the task executor until task finishes

* mu_ -> mutex_

* lint

* fix build

* clear outer_object_id

* add direct call type check

* Fix test for direct call IDs and return IDs for actor calls

* Fix CoreWorkerClient.Addr()

* Remove unneeded lock

* Remove unnecessary ObjectID refs

* Fix worker holding serialized refs test

* Fix hex IDs

* fix

* fix tests

* fix tests

* refactor and cleanups

* lint

* Put inlined Ids in task args and some cleanup

* Add back gc.collect() line for test case

* Refactor and fixes:
- store inlined IDs in RayObject
- allow storing objects with inlined IDs in memory store
- pin objects that were promoted to plasma

* oops

* make sure worker ID is set in address, pass in rpc::Address to CoreWorkerClient

* todos

* cleanups and test builds

* Fix tests

* Add feature flag

* cleanups

* address comments and some cleanups

* cleanup

* fix recursive test

* Comments for tests

* Turn off ref counting by default

* Skip tests

* Fix some bugs for test_array.py, java build

* Don't include nested objects in the ref count when the feature flag is off

* C++ feature flag does not work...

* Remove

* Turn on python tests and add a warning when plasma objects are evicted before being pinned

* Fix build and remove irrelevant test

* Fix for java

* Revert "Fix build and remove irrelevant test"

This reverts commit 056cca9b263ed05b0f9ab2250907338edcbca2d5.

* Fix ray.internal.free

* Fixes and skip some flaky tests

* fix java build

* fix windows build

* Add IDs contained in owned objects

* Update src/ray/protobuf/core_worker.proto

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.cc

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/protobuf/core_worker.proto

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/protobuf/core_worker.proto

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.h

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.h

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.cc

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* update

* Try to fix ::test_direct_call_serialized_id_eviction

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-02-18 18:21:34 -08:00
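Note on the entry above: the borrower bookkeeping that the commit message describes (local references, pending submitted tasks, and borrowers that report back when they are done) can be sketched roughly as below. This is a simplified single-process model under stated assumptions, not the `ReferenceCounter` from the repository. `Owner`, `OwnedObject`, and every method name are hypothetical, and the RPC that tells the owner a borrower has dropped its reference is replaced by a direct method call.

```python
class OwnedObject:
    """Counts kept by the owner for one object ID (illustrative only)."""

    def __init__(self):
        self.local_refs = 0       # references held by the owner itself
        self.submitted_tasks = 0  # pending tasks that take the ID as an argument
        self.borrowers = set()    # other workers still holding the ID

    def in_scope(self):
        return bool(self.local_refs or self.submitted_tasks or self.borrowers)


class Owner:
    """Hypothetical owner that frees an object once nothing refers to it."""

    def __init__(self):
        self.objects = {}
        self.freed = []  # IDs released so far, in order

    def create(self, object_id):
        entry = OwnedObject()
        entry.local_refs = 1
        self.objects[object_id] = entry

    def remove_local_ref(self, object_id):
        self.objects[object_id].local_refs -= 1
        self._free_if_unused(object_id)

    def submit_task(self, object_id):
        self.objects[object_id].submitted_tasks += 1

    def finish_task(self, object_id, borrowers=()):
        entry = self.objects[object_id]
        # Record new borrowers before dropping the task's count, so the
        # object never looks unused while a borrower still holds the ID.
        entry.borrowers.update(borrowers)
        entry.submitted_tasks -= 1
        self._free_if_unused(object_id)

    def borrower_released(self, object_id, borrower):
        # Stand-in for the message a borrower sends when its count hits 0.
        self.objects[object_id].borrowers.discard(borrower)
        self._free_if_unused(object_id)

    def _free_if_unused(self, object_id):
        if not self.objects[object_id].in_scope():
            del self.objects[object_id]
            self.freed.append(object_id)
```

In this model an object outlives the owner's own reference for as long as a task is pending or a borrower remains, which is the case the "simple borrower" unit tests in the message appear to target.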
mehrdadn
4a12243336
Use Process instead of pid_t (round 2) (#6882)
* Revert "Revert "Use Boost.Process instead of pid_t (#6510)" (#6909)"

This reverts commit bde575b8dd.

* Process wrapper, using Boost.Process on Windows

- Reverts bde575b8dd.
- Re-applies fb8e3615d5 after some refactoring.

* Remove Boost.Process dependency

* Don't open /proc file on Linux

* Change FATAL to ERROR and modify error message when process doesn't exist
2020-02-18 17:44:46 -08:00
Eric Liang
0aa9373d62
Revert "Removing Pyarrow dependency (#7146)" (#7209)
This reverts commit 2116fd3bca.
2020-02-18 14:12:06 -08:00
Eric Liang
fae99ecb8e
[core] Make sure to unsubscribe get dependencies for direct task calls. (#7201)
* fix

* remove assert
2020-02-17 18:35:25 -08:00
ijrsvt
2116fd3bca
Removing Pyarrow dependency (#7146) 2020-02-17 18:00:13 -08:00
fyrestone
a6b8bd47b0
[xlang] Cross language serialize ActorHandle (#7134) 2020-02-17 20:44:56 +08:00