Commit graph

1045 commits

Author SHA1 Message Date
Edward Oakes
2ad9bc5684
Move plasma retry logic into plasma store provider (#7328) 2020-02-26 16:57:02 -08:00
Eric Liang
b310661338
Add internal_api.global_gc() method, which triggers gc.collect() on all workers (#7327) 2020-02-26 14:09:29 -08:00
fangfengbin
ba494b5281
Fix gcs client rpc operation disorder bug (#7283) 2020-02-26 19:24:24 +08:00
Stephanie Wang
9964657815
Fix plasma bug (#7322) 2020-02-25 18:15:28 -08:00
Edward Oakes
44b4394afa
Remove unused AddContainedObjectIDs (#7323) 2020-02-25 16:42:20 -08:00
mehrdadn
57b33f1bed
Upgrade Boost (#6899) 2020-02-25 14:33:12 -08:00
Eric Liang
f14b6e477b
Raise gRPC message size limit to 100MB (#7269) 2020-02-24 23:22:49 -08:00
Edward Oakes
f2faf8d26e
Fix passing duplicate by-reference arguments (#7306) 2020-02-24 19:18:16 -08:00
Stephanie Wang
2c1f4fd82c
[core] Add long running regression test for distributed ref counting and fix memory leak (#7302)
* Add long running test for serialized IDs and fix mem leak

* comment
2020-02-24 17:58:42 -08:00
Stephanie Wang
2583949637
fix build (#7286) 2020-02-23 13:12:36 -08:00
Stephanie Wang
4c2de7be54
[core] Ref counting for returning object IDs created by a different process (#7221)
* Add regression tests

* Refactor, split RemoveSubmittedTaskReferences into submitted and finished paths

* Add nested return IDs to UpdateFinishedTaskRefs, rename WrapObjectIds

* Basic unit tests pass

* Fix unit test and add an out-of-order regression test

* Add stored_in_objects to ObjectReferenceCount, regression test now passes

* Add an Address to the ReferenceCounter so we can determine ownership

* Set the nested return IDs from the TaskManager

* Add another test

* Simplify

* Update src/ray/core_worker/reference_count_test.cc

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* comments

* Add python test

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-02-22 13:29:48 -08:00
Eric Liang
01dd520797
Remove misleading error message (#7265) 2020-02-21 21:20:40 -08:00
Edward Oakes
d190e73727
Use our own implementation of parallel_memcopy (#7254) 2020-02-21 11:03:50 -08:00
Kai Yang
007333b960
[Java] Support direct call for normal tasks (#7193) 2020-02-21 10:03:34 +08:00
Stephanie Wang
f27bb6eb47
Only hold the RefCount lock if needed (#7249) 2020-02-20 17:10:06 -08:00
Edward Oakes
16e37416cd
Fix raylet pinning race condition (#7235) 2020-02-20 10:41:36 -08:00
Stephanie Wang
7e3819a27a
[core] Eagerly evict objects that are no longer in scope (#7220)
* Batch free requests, and free when object is unpinned

* rename

* note
2020-02-19 20:51:38 -08:00
Simon Mo
b804d40c04
Stop vendoring pyarrow (#7233) 2020-02-19 19:01:26 -08:00
Simon Mo
7bef7031c2
Revert "Revert "Revert "Removing Pyarrow dependency (#7146)" (#7209) (#7214)" (#7232) 2020-02-19 13:35:29 -08:00
Simon Mo
e8941b1b79
Revert "Revert "Removing Pyarrow dependency (#7146)" (#7209) (#7214) 2020-02-19 10:08:52 -08:00
Stephanie Wang
f76ce836b2
Distributed ref counting for serialized ObjectIDs (#6945)
* Skeleton plus a unit test for simple borrower case

* First unit test passes - forward an ID and task returns with 1 submitted task pending on the inner ID

* Invariant for contained_in

* Unit test passes for testing task return without creating a borrower

* Wrap ref count functionality in test case

* Fix bad delete

* Unit test and fix for borrowers creating more borrowers

* Unit test and fix for simple borrowing, but owner sends call after borrower's ref count goes to 0

* Refactor:
- keep a sentinel ref count for task argument IDs
- keep contained_in_borrowed in addition to contained_in_owned

* Unit test for nested IDs passes

* Refactor so that an object ID can only be contained in 1 borrowed ID at a time

* Add check

* Fix

* Unit test (passes) to test nesting object IDs but no borrowers created

* Unit test for nested objects from different owners passes, refactor to unset contained_in when popping refs

* Unit tests for borrowers receiving an ObjectID from multiple sources,
skip adding ownership info if we already have it to handle duplicate
refs

* Unit test for returning object ID passes

* More unit tests for returning object IDs pass

* Add serialized ID tests

* fix serialization issue

* remove swap

* It builds!

* debugging and some fixes:
- register handler for WaitForRefRemoved
- don't create a python reference for arg IDs
- pass in client factory into ReferenceCounter
- fix bad decrement in PopBorrowerRefs

* Fix accounting for serialized IDs:
- don't decrement for IDs on dependency resolution, wait until task finished
- add object IDs that were inlined when building the arguments to the task spec, pin these on the task executor until task finishes

* mu_ -> mutex_

* lint

* fix build

* clear outer_object_id

* add direct call type check

* Fix test for direct call IDs and return IDs for actor calls

* Fix CoreWorkerClient.Addr()

* Remove unneeded lock

* Remove unnecessary ObjectID refs

* Fix worker holding serialized refs test

* Fix hex IDs

* fix

* fix tests

* fix tests

* refactor and cleanups

* lint

* Put inlined Ids in task args and some cleanup

* Add back gc.collect() line for test case

* Refactor and fixes:
- store inlined IDs in RayObject
- allow storing objects with inlined IDs in memory store
- pin objects that were promoted to plasma

* oops

* make sure worker ID is set in address, pass in rpc::Address to CoreWorkerClient

* todos

* cleanups and test builds

* Fix tests

* Add feature flag

* cleanups

* address comments and some cleanups

* cleanup

* fix recursive test

* Comments for tests

* Turn off ref counting by default

* Skip tests

* Fix some bugs for test_array.py, java build

* Don't include nested objects in the ref count when the feature flag is off

* C++ feature flag does not work...

* Remove

* Turn on python tests and add a warning when plasma objects are evicted before being pinned

* Fix build and remove irrelevant test

* Fix for java

* Revert "Fix build and remove irrelevant test"

This reverts commit 056cca9b263ed05b0f9ab2250907338edcbca2d5.

* Fix ray.internal.free

* Fixes and skip some flaky tests

* fix java build

* fix windows build

* Add IDs contained in owned objects

* Update src/ray/protobuf/core_worker.proto

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.cc

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/protobuf/core_worker.proto

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/protobuf/core_worker.proto

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.h

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.h

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/core_worker/reference_count.cc

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* Apply suggestions from code review

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

* update

* Try to fix ::test_direct_call_serialized_id_eviction

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-02-18 18:21:34 -08:00
mehrdadn
4a12243336
Use Process instead of pid_t (round 2) (#6882)
* Revert "Revert "Use Boost.Process instead of pid_t (#6510)" (#6909)"

This reverts commit bde575b8dd.

* Process wrapper, using Boost.Process on Windows

- Reverts bde575b8dd.
- Re-applies fb8e3615d5 after some refactoring.

* Remove Boost.Process dependency

* Don't open /proc file on Linux

* Change FATAL to ERROR and modify error message when process doesn't exist
2020-02-18 17:44:46 -08:00
Eric Liang
0aa9373d62
Revert "Removing Pyarrow dependency (#7146)" (#7209)
This reverts commit 2116fd3bca.
2020-02-18 14:12:06 -08:00
Eric Liang
fae99ecb8e
[core] Make sure to unsubscribe get dependencies for direct task calls. (#7201)
* fix

* remove assert
2020-02-17 18:35:25 -08:00
ijrsvt
2116fd3bca
Removing Pyarrow dependency (#7146) 2020-02-17 18:00:13 -08:00
fyrestone
a6b8bd47b0
[xlang] Cross language serialize ActorHandle (#7134) 2020-02-17 20:44:56 +08:00
Qing Wang
f3703bafa3
[Java] Support concurrent actor calls API. (#7022)
* WIP

Temp change

Attach native thread to jvm

* Fix run mode

* Address comments.
2020-02-14 13:02:39 +08:00
Eric Liang
305eaaabe9
Fix hang if actor object id is returned from a task that exits (#6885) 2020-02-11 20:28:13 -08:00
mehrdadn
e09f63ad65
Fix build errors and add more targets to Windows builds (#6811)
* Fix common.fbs rename (due to apache/arrow/commit/bef9a1c251397311a6415d3dc362ef419d154caa)

* Add missing COPTS

* Use socketpair(AF_INET) if boost::asio::local is unavailable (e.g. on Windows)

* Fix compile bug in service_based_gcs_client_test.cc (fix build breakage in #6686)

* Work around googletest/gmock inability to specify override to avoid -Werror,-Winconsistent-missing-override

* Fix missing override on IsPlasmaBuffer()

* Fix missing libraries for streaming

* Factor out install-toolchains.sh

* Put some Bazel flags into .bazelrc

* Fix jni_md.h missing inclusion

* Add ~/bin to PATH for Bazel

* Change echo $$(date) > $@ to date > $@

* Fix lots of unquoted paths

* Add system() call checks for Windows

Co-authored-by: GitHub Web Flow <noreply@github.com>
2020-02-11 16:49:33 -08:00
Eric Liang
58c94f6381
[core] Delete() should never remove objects from in-memory store (#7117) 2020-02-10 22:40:09 -08:00
SangBin Cho
1e690673d8
Render tasks that are not schedulable on the dashboard. (#7034) 2020-02-10 14:23:06 -08:00
chaokunyang
247a4d022a
Fix passing empty bytes in python tasks (#7045)
* ensure data_ won't be null_ptr when size == 0

* when data_sizes[i] == 0, we should Allocate an empty buffer

* work around for pyarrow.py_buffer

* fix comments

* add null ptr check

* add test for bytes

* lint
2020-02-10 12:07:29 +08:00
fangfengbin
694c0f2867
[Java] Enable GCS server when running java unit tests (#7041)
* enable gcs service when run java testcase

* fix ci bug

* fix windows compile bug

* fix ci bug

* restart ci job

* enable java testcase

* restart ci job

* restart ci job

* add debug log

* add debug log

* restart ci job

* add debug log

* restart ci

* add debug log

* fix java testcase bug

* restart ci job

* restart ci job

* restart ci job

* restart ci job

* restart ci job

* restart ci job

* restart ci job

* restart ci job
2020-02-10 09:39:14 +08:00
fyrestone
0648bd28ef
[xlang] Cross language Python support (#6709) 2020-02-08 13:01:28 +08:00
Ion
f9885b8710
New scheduler local node (#6913)
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-02-07 12:57:59 -08:00
Edward Oakes
580314bf81
Fix Ctrl-C hanging in in-memory store ray.get/ray.wait (#7033) 2020-02-05 10:17:22 -08:00
fangfengbin
ade7ebfc0c
Add service based gcs client (#6686) 2020-02-05 12:06:25 +08:00
Edward Oakes
844f607c93
Collect contained ObjectIDs during deserialization (#7029) 2020-02-03 22:49:14 -08:00
Edward Oakes
984490d2be
Collect object IDs during serialization (#6946) 2020-02-03 18:38:11 -08:00
Edward Oakes
77436c2e32
Use getppid() to check if the raylet has failed (#6963) 2020-02-02 22:05:21 -08:00
Edward Oakes
92525f35d1
Remove raylet client from Python worker (#6018) 2020-01-31 18:23:01 -08:00
Edward Oakes
341a921d81
Remove vanilla pickle serialization for task arguments (#6948) 2020-01-31 16:52:43 -08:00
Simon Mo
396d7fafc8
UI improvement for asyncio (#6905) 2020-01-27 12:45:51 -08:00
mehrdadn
bde575b8dd
Revert "Use Boost.Process instead of pid_t (#6510)" (#6909)
This reverts commit fb8e3615d5.
2020-01-26 10:26:44 -06:00
Yunzhi Zhang
0834bda8c1
[Dashboard] Display actor task execution info (#6705)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2020-01-22 22:33:55 -08:00
Simon Mo
5f527816fe
Fix async actor high cpu utilization when idle (#6877) 2020-01-22 16:07:08 -08:00
mehrdadn
139bf8908e
Replace UNIX sockets with TCP sockets in Ray on Windows (#6823)
* Replace UNIX sockets with TCP sockets in Ray
2020-01-20 17:28:11 -08:00
Stephanie Wang
815cd0e39a
Task and actor fate sharing with the owner process (#6818)
* Add test

* Kill workers leased by failed workers

* merge

* shorten test

* Add node failure test case

* Fix FromBinary for nil IDs, add assertions

* Test

* Fate sharing on node removal, fix owner address bug

* lint

* Update src/ray/raylet/node_manager.cc

Co-Authored-By: Zhijun Fu <37800433+zhijunfu@users.noreply.github.com>

* fix

* Remove unneeded test

* fix IDs

Co-authored-by: Zhijun Fu <37800433+zhijunfu@users.noreply.github.com>
2020-01-20 16:44:04 -08:00
Yunzhi Zhang
3acf3c7675
[Dashboard] Add actor task counter (#6820) 2020-01-17 15:43:56 -08:00
Zhijun Fu
92380dd4e6
Fix crash in HandleObjectMissing when direct actor creation task is not found in local_queues_ (#6817) 2020-01-17 13:29:13 -06:00