2123 commits

Author SHA1 Message Date
architkulkarni
c3d06697bb
[Core] Add dynamic conda env install in shim process (#15881) 2021-05-19 15:46:42 -05:00
Eric Liang
836c739fe5
Revert "[client] One Driver per RayClient Server (#15875)" (#15922)
This reverts commit 97d1414f23.
2021-05-19 11:58:29 -07:00
Ian Rodney
97d1414f23
[client] One Driver per RayClient Server (#15875) 2021-05-19 09:03:09 -07:00
qicosmos
8790bb465b
[C++ worker] Remove func ptr offset (#15809) 2021-05-19 18:03:39 +08:00
architkulkarni
194c5e3a96
[Core] Cache workers by runtime_env in worker pool (#15782)
* pass RuntimeEnv in task spec as opaque string

* lint

* set correct empty value for json: "{}" not ""

* add comment for field in proto

* fix worker pool test by checking both "" and "{}"

* add RAY_CHECK todo

* make dict empty if all values null

* remove unnecessary ser/de

* fix

* address comments

* add WorkerCacheKey with hash function

* clean up

* add naive impl., dedicated workers never killed

* put dedicated workers in idle_of_all_languages

* pipe env hash from worker.py -> Worker

* fully pipe through hash, basic cache test passing

* use int type for runtime env hash

* convert Worker env hash type from size_t to int

* fix

* add method to MockWorker to fix cpp tests

* make compatible with java streaming test

* restore old dynamic_options code to fix java test

* address comments

* add comment about sorting before hash

* add comments for private members of WorkerCacheKey
2021-05-18 00:19:27 -07:00
Alex Wu
69f228d22d
[core] Record actor+job start/end times and metadata (#15803) 2021-05-17 21:38:39 -07:00
Frank Luan
0dc34566fe
Refactor raylet to allocate+write+seal one return object at a time (#15757)
* Refactor raylet to allocate+write+seal one return object at a time

* Fix build

* Fix C++ and Java runtime

* Skip Windows testing

* Fix java and cpp runtime

* Fix warnings

* Fix cpp and java tests

* Fix cpp and java runtime

Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
2021-05-17 20:06:08 -07:00
SangBin Cho
ff461634b0
[Core] Improved bad error message. (#15663)
* Improved bad error message.

* Update src/ray/raylet/node_manager.cc

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

* lint.

* Add a pid

Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2021-05-17 19:38:05 -07:00
Alex Wu
3e94114336
Namespaces (#15774) 2021-05-17 10:04:22 -07:00
SangBin Cho
259fcbd5bd
[Pubsub] Generalize the pubsub interface and adapt it for ref counting protocol (#15446)
* Add mock code first

* In the initial progress.

* Fix the number error

* In progress.

* in more progress.

* in progress.

* lint.

* Prototype done.

* Fix compilation bug.

* Now it is working with reference counting.

* Remove template.

* lint.

* Fixed issues.

* Fix reference count test.

* Reference count test passes now.

* Fixed the test array problem

* Addressed code review.

* lint.

* Addressed half of code review.

* Fix tests.

* Addressed the most critical issue.

* Make subscriber thread-safe.

* Revert "Make subscriber thread-safe."

This reverts commit 9a6a52197cfa8463ab60dfaae9530ad3c0ed8790.

* Fixed test failures. The only failure now is the asan failure.

* Reset test suites and see if it fixes the issue.

* Fix a flaky test

* Addressed code review.
2021-05-13 09:29:02 -07:00
architkulkarni
a0c1cfe034
[Core] Pass RuntimeEnv as opaque string in the task spec (#15658) 2021-05-13 10:32:00 -05:00
SongGuyang
40b2face74
Fix std::atomic compiling error (#15781) 2021-05-13 10:27:45 -05:00
Tao Wang
19462e43d6
[large scale]use proxy to track gcs server address in core worker (#15714) 2021-05-13 19:26:01 +08:00
fcardoso75
c877da4c19
create_and_mmap_buffer() - In case CreateFileMapping() fails, GetLastError() return code is printed (#15773)
* Enabling all test cases on test_client.py

* Moving test_client.py to a large CI py_test_module_list

* Disabling test_client::test_remote_functions

* Divide Run CI script action into separate Build action and Test action

* Reverting test_client.py to separate work for different tickets

* Reverting python\ray\tests\BUILD to separate work for different tickets

* create_and_mmap_buffer() - In case CreateFileMapping() fails, GetLastError() return code is printed

* Addressed lint comments

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2021-05-13 00:31:33 -07:00
Ian Rodney
cdf93930f3
Revert "[Core] Fix event loop instrumentation causing Java segfaults in tests. (#15349)" (#15727)
This reverts commit edb0d1b376.
2021-05-12 15:49:06 -07:00
mwtian
6a044f4f30
[Test] Ensure output params are initialized before calling IsPlasmaObjectPinnedOrSpilled() (#15758) 2021-05-12 10:22:35 -07:00
fyrestone
56c309416e
[Job submission] Basic job submission structure (#15103) 2021-05-12 15:08:20 +08:00
Clark Zinzow
c1b7d6f115
Don't consider a worker to be idle if it has in-flight object pinning RPCs. (#15686) 2021-05-11 19:21:52 -07:00
Eric Liang
82d5b67521
Remove placement group log spam (#15747) 2021-05-11 17:08:06 -07:00
Eric Liang
cb59d30917
Drop profiling events if the GCS becomes backlogged (#15726) 2021-05-11 14:10:34 -07:00
Eric Liang
996a002b00
Add prepopulate plasma memory flag for debugging (#15669)
* add prepopulate flag

* fix build

* warn
2021-05-07 15:17:31 -07:00
Clark Zinzow
edb0d1b376
[Core] Fix event loop instrumentation causing Java segfaults in tests. (#15349)
* Reenable event loop instrumentation.

* Take stats handle by copy in post() handler closure.

* Revert "Take stats handle by copy in post() handler closure."

This reverts commit e46777939bcc3bb4bb101e136e9d3348ea4ae1a1.
2021-05-07 15:01:00 -07:00
Yi Cheng
d5379ba99e
[core] RuntimeEnv GC in gcs (#14833) 2021-05-06 11:31:33 -05:00
Alex Wu
18d85d2de9
Grpc based resource broadcast (#15466) 2021-05-05 11:20:08 -07:00
architkulkarni
e5c5dde847
[Core] Prevent dedicated workers from being returned to general idle pool (#15545) 2021-04-29 15:45:25 -05:00
Alex Wu
40a6ced996
[core] Handle blocked worker crashes edge case (#15083) 2021-04-27 10:14:12 -07:00
Ian Rodney
4db696d365
[Client] Asyncio Client, Sync gRPC Server (#15488) 2021-04-27 08:41:10 -07:00
Ian Rodney
360b053254
[client] Add support for ray.timeline() (#15448) 2021-04-26 18:32:22 -07:00
architkulkarni
b08b2c5103
[Core] Add "shim process" setup_worker.py that calls "conda activate" for runtime_env (#15361) 2021-04-23 15:29:52 -05:00
Eric Liang
93a1ecba4b
Unhandled error messages aren't printed until next interaction with shell (#15432) 2021-04-23 11:00:34 -07:00
fangfengbin
d9780761a3
[GCS]Revert ping_gcs_rpc_server_max_retries to 600 (#14443) 2021-04-23 10:02:38 +08:00
Jialing He
5403021430
Fix incorrect call function WorkerID::FromBinary (#15449) 2021-04-22 15:44:49 +08:00
Yi Cheng
dbba3a456f
[core] Fixing of actor creation failure (#15411)
* Fix

* fix

* format

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* format

* fix comments
2021-04-20 15:27:45 -07:00
Yi Cheng
9b3ea7c32b
[core] Take care of object spilling failure (#14703)
* fix spilling failure

* format

* unittests added

* format

* format

* format

* fix

* add comment

* fix some comments

* add test cases

* format

* format
2021-04-20 10:28:48 -07:00
fangfengbin
ade684ac03
[Test] Fix gcs flaky testcase (#15391)
Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2021-04-19 10:21:39 -07:00
SangBin Cho
5f74d0e40d
[Test] Fix flaky test failure (#15326)
* Fix trial.

* unskip test.

* Mock commit
2021-04-16 18:09:02 -07:00
fangfengbin
0e3bbbeba3
[Test] Try deflaking gcs server test by adding log (#15332)
Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2021-04-15 21:16:09 -07:00
Stephanie Wang
6b2da7eda8
[core] Log warning on bad max task args value (#15314) 2021-04-14 20:34:08 -07:00
Yi Cheng
0caf96be94
Take care of failed killing request (#15313) 2021-04-14 18:07:10 -07:00
wanxing
0ad0839265
Optimize lambda copy to improve direct call performance. (#15036) 2021-04-14 11:02:49 +08:00
Ian Rodney
ec3d5f2ef1
[client] Handle ray.put failures (#15229) 2021-04-13 11:23:16 -07:00
Clark Zinzow
95659987a4
[Core] Event loop instrumentation - manual instrumentation hooks, instrumentation for deadline timer and local stream socket. (#15144)
* Added manual hooks in event loop instrumentation.

* Added instrumentation of the deadline timer in the periodical runner.

* Added instrumentation of the local stream socket in the ClientConnection.

* Addressed feedback except for opaque handle.

* Switch to opaque stats handle API.

* Add opaque stats handle destructor check to ensure that RecordExecution is called.

* Revert "Add opaque stats handle destructor check to ensure that RecordExecution is called."

This reverts commit 62cf8fca670d78c1160f0a9526b6cbe6e3a25725.

* Apply suggestions from code review

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* Other feedback, fixes for code suggestions.

* Prevent handler stats from leaking queueing stats when handler execution is never recorded.

* Enable event loop instrumentation.

* Revert "Enable event loop instrumentation."

This reverts commit df90c504e45e1963dc2ef6c3197dc5c965bc19e7.

* Reorg GCS client and IO context member fields to prevent use-after-free.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2021-04-12 13:39:00 -07:00
Tao Wang
4c9eee609c
Revert "Revert "[GCS]Increase heartbeat interval to reduce pressure o… (#15207)
* Revert "Revert "[GCS]Increase heartbeat interval to reduce pressure on gcs server (#14203)" (#15194)"

This reverts commit a9ac4ad890.

* optimize wait condition to avoid flakey test

* remove unnecessary sleep
2021-04-12 10:45:42 -07:00
Hao Chen
10ff2f3b4a
Fix duplicate destruction of CoreWorkerProcess instance (#15245) 2021-04-12 21:01:21 +08:00
chenk008
6709560ef6
fix setproctitle break /proc/PID/environ (#15056)
* fix setproctitle break /proc/PID/environ

* bugfix

* add ut

* fix ut

* fix ut

* fix ut

* improve comment

* improve comment

* fix ut lint

* fix ut lint

* revert init.py

Co-authored-by: wuhua.ck <wuhua.ck@alibaba-inc.com>
2021-04-09 15:45:19 -07:00
Stephanie Wang
94e592004e
Prioritize worker requests for objects over queued task arguments (#15157) 2021-04-08 14:51:21 -07:00
SangBin Cho
a9ac4ad890
Revert "[GCS]Increase heartbeat interval to reduce pressure on gcs server (#14203)" (#15194)
This reverts commit ef195e5108.
2021-04-08 09:29:13 -07:00
SangBin Cho
bd58a9a9ff
[Build] Fix symbol problems (#15187) 2021-04-08 09:11:15 -07:00
Alex Wu
e5feaee95a
[core worker] Disable async connections (#15161) 2021-04-07 22:32:04 -07:00
SangBin Cho
61d120557d
[Pubsub] Generalize pubsub, Move pubsub code to pubsub_lib module (#15164)
* cherry-pick-1

* cherry-pick-2

* cherry-pick-part-3

* Should work.

* Lint fix.

* Fix lint 2.
2021-04-07 20:40:39 -07:00