1250 commits

Author SHA1 Message Date
Edward Oakes
4955d14878
Remove transport type remnants (#8673) 2020-05-29 15:47:08 -05:00
mehrdadn
cb91fe2fc4
SetErrorMode for all Ray processes (#8656) 2020-05-29 10:18:20 -05:00
fangfengbin
35eeec5647
Add C++ global state for actor table (#8501)
* add global state actors

* fix code style

* fix GcsActorManagerTest bug

* rebase master

* add jni code

* add get checkpoint id code

* add debug code

* add debug code

* change log level

* fix compile bug

* return null in jni

* fix crash bug

* change import seq

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
Co-authored-by: Hao Chen <chenh1024@gmail.com>
2020-05-29 21:10:42 +08:00
Hao Chen
08fee00bc8
Increase raylet client connect timeout to fix test_debug_tools (#8605) 2020-05-28 20:57:30 +08:00
Lingxuan Zuo
e594524ed3
[GCS] global state query node info table from GCS. (#8498) 2020-05-28 16:39:13 +08:00
Tao Wang
675ccbc799
Resubscribe worker table info when gcs service restart (#8606) 2020-05-28 10:27:38 +08:00
Edward Oakes
442ada0fcd
Remove shutdown prints to the console (#8626) 2020-05-27 10:52:31 -05:00
Lingxuan Zuo
bd4fbcd7fc
Global state accessor jni (#8637) 2020-05-27 17:43:47 +08:00
Tao Wang
a1298686d7
[TEST]Use manager class to start/stop components instead of spreading duplicated codes everywhere (#8500) 2020-05-27 16:51:51 +08:00
fangfengbin
b0cf781152
fix resubscribe miss callback index bug (#8604) 2020-05-27 11:55:17 +08:00
fangfengbin
99dd6a581d
fix testActorRestart failure bug (#8613) 2020-05-27 11:10:45 +08:00
fangfengbin
01f4a6eca0
Add task table subscribe retry when gcs service restart (#8601) 2020-05-26 17:47:03 +08:00
fangfengbin
c41976938d
Add node table subscribe retry when gcs service restart (#8591) 2020-05-26 14:42:48 +08:00
Tao Wang
7e5b3dc0d9
GCS server task info handler use storage instead of redis accessor (#8584) 2020-05-26 10:38:31 +08:00
fangfengbin
765d470c40
Add gcs object manager (#8298) 2020-05-25 17:21:35 +08:00
fangfengbin
f22d12d2fc
fix TestGetUncommittedLineage npe bug (#8585) 2020-05-25 15:48:58 +08:00
fangfengbin
229af662c6
Add job table&actor table subscribe retry when gcs service restart (#8442) 2020-05-25 14:38:25 +08:00
Tao Wang
92c2e41dfd
[GCS]profile info getting implementation based gcs service (#8536) 2020-05-24 22:23:01 +08:00
fangfengbin
2ab1b773d4
GCS server worker info handler use storage instead of redis accessor (#8543) 2020-05-23 23:17:36 +08:00
Eric Liang
351839bf69
Revert "GCS server task info handler use storage instead of redis accessor (#8531)" (#8562)
This reverts commit 9823e15311.
2020-05-22 19:16:43 -07:00
Kai Yang
2e5e789294
Allow enabling logging in core worker with empty log_dir (#8529) 2020-05-22 18:02:37 +08:00
fangfengbin
9823e15311
GCS server task info handler use storage instead of redis accessor (#8531) 2020-05-22 12:04:03 +08:00
Eric Liang
bb8d3c5cd0
ASAN build for ray core tests (#8431) 2020-05-21 15:11:03 -07:00
Edward Oakes
a76434ccde
Add ability to specify worker and driver ports (#8071) 2020-05-20 15:31:13 -05:00
mehrdadn
ebf060d484
Make more tests run on Windows (#8446)
* Remove worker Wait() call due to SIGCHLD being ignored

* Port _pid_alive to Windows

* Show PID as well as TID in glog

* Update TensorFlow version for Python 3.8 on Windows

* Handle missing Pillow on Windows

* Work around dm-tree PermissionError on Windows

* Fix some lint errors on Windows with Python 3.8

* Simplify torch requirements

* Quiet git clean

* Handle finalizer issues

* Exit with the signal number

* Get rid of wget

* Fix some Windows compatibility issues with tests

Co-authored-by: Mehrdad <noreply@github.com>
2020-05-20 12:25:04 -07:00
Lingxuan Zuo
cd706f40c4
[Stats] add nodeaddress tag for stats test (#8423) 2020-05-20 12:30:01 -05:00
Max Fitton
0fadc11437
[dashboard] Only show workers from the correct cluster (#8434) 2020-05-18 13:30:41 -05:00
fangfengbin
9347a5d10c
Add global state accessor of jobs (#8401) 2020-05-18 20:32:05 +08:00
Edward Oakes
16f48078d9
Remove use of ObjectID transport flag (#7699) 2020-05-17 11:29:49 -05:00
Tao Wang
acffdb2349
[TEST]use cc_test to run core_worker_test, enforce/reuse RedisServiceManagerForTest (#8443) 2020-05-17 18:43:00 +08:00
Stephanie Wang
bd169749e0
Option to retry failed actor tasks (#8330)
* Python

* Consolidate state in the direct actor transport, set the caller starts at

* todo

* Remove unused

* Update and unit tests

* Doc

* Remove unused

* doc

* Remove debug

* Update src/ray/core_worker/transport/direct_actor_transport.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/core_worker/transport/direct_actor_transport.cc

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* lint and fix build

* Update

* Fix build

* Fix tests

* Unit test for max_task_retries=0

* Fix java?

* Fix bad test

* Cross language fix

* fix java

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-05-15 20:15:15 -07:00
Max Fitton
00325eb2b2
Rename max_reconstructions to max_restarts and use -1 for infinite (#8274)
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-05-14 10:30:29 -05:00
fangfengbin
08b612052b
Add redis store client AsyncGetAll/AsyncBatchDelete/AsyncDeleteByIndex API (#8390) 2020-05-14 14:38:25 +08:00
Hao Chen
a593fde606
Fix core dumps in ExitActor (#8382) 2020-05-12 20:06:04 +08:00
fangfengbin
515afa6809
Fix AsyncGetAll miss override bug (#8402)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-05-11 11:08:16 -05:00
fangfengbin
8d0c1b5e06
GCS adapts to actor table pub sub (#8347) 2020-05-11 13:53:53 +08:00
Stephanie Wang
3a25f5f5b4
Clean up actor state from the GCS (#8261)
* parametrize test

* Regression test and logging

* Test no restart after actor deletion

* Unit tests

* Refactor to subscribe to and lookup from worker failure table

* Refactor ActorManager to remove dependencies

* Revert "Regression test and logging"

This reverts commit 835e1a9091b51ca8efb00392d4cc4a665145de24.

* Revert "parametrize test"

This reverts commit f31272082831ba1a494816dd5511d87b24eca4c9.

* Revert "Test no restart after actor deletion"

This reverts commit 114a83de14329aa6ab787c80cd5757cf074a9072.

* doc

* merge

* Revert "Refactor to subscribe to and lookup from worker failure table"

This reverts commit 6aa13a05178d0b9aa1db9dee5c978c911b74fa3a.

* Revert "Revert "Test no restart after actor deletion""

This reverts commit 1bd92d09172aa8ab42632551cf9c56463f9598fe.

* Revert "Revert "parametrize test""

This reverts commit 639ba4d3b02167fb2b05e9878f9aa600bcec95b3.

* Revert "Revert "Regression test and logging""

This reverts commit f18b5f0db699a23cbccde32789e3639425e99ca4.

* Clean up actors that have gone out of scope

* Use actor ID instead of shared_ptr

* Clean up actors owned by dead workers

* Use actor ID instead of shared_ptr

* TODO and lint

* Fix unit tests

* Add unit tests for supervision and docs

* xx

* Fix tests

* Fix tests

* fix build
2020-05-09 18:43:49 -07:00
Edward Oakes
2677b71003
Implement named actors using the GCS service (#8328) 2020-05-09 08:58:10 -05:00
Hao Chen
93138e617a
Fix a bad usage of std::move (#8364) 2020-05-09 14:24:24 +08:00
fangfengbin
7fec602f2e
GCS adapts to node resource table pub sub (#8305) 2020-05-09 10:31:35 +08:00
Eric Liang
413db0902d
Trigger global GC when resources may be occupied by deleted actors 2020-05-07 14:57:21 -07:00
fangfengbin
dd3c050168
GCS adapts to batch heartbeat table pub sub (#8346) 2020-05-07 20:33:36 +08:00
fangfengbin
620ea94873
Fix node manager miss object info bug (#8337) 2020-05-07 20:16:42 +08:00
SangBin Cho
e631827a9f
[Core] Show_webui segfault fix. (#8323) 2020-05-06 11:45:07 -05:00
fangfengbin
97430b2d0f
GCS adapts to node table pub sub (#8209) 2020-05-05 18:34:41 +08:00
fangfengbin
14d03a0869
GCS adapts to task lease table pub sub (#8299) 2020-05-05 10:16:56 +08:00
ijrsvt
cc7bd6650a
[core] Enabling Remote Task Cancelation (#8225) 2020-05-04 15:24:22 -07:00
Stephanie Wang
8625e09067
Actor manager refactor and unit tests (#8224)
* parametrize test

* Regression test and logging

* Test no restart after actor deletion

* Unit tests

* Refactor to subscribe to and lookup from worker failure table

* Refactor ActorManager to remove dependencies

* Revert "Regression test and logging"

This reverts commit 835e1a9091b51ca8efb00392d4cc4a665145de24.

* Revert "parametrize test"

This reverts commit f31272082831ba1a494816dd5511d87b24eca4c9.

* Revert "Test no restart after actor deletion"

This reverts commit 114a83de14329aa6ab787c80cd5757cf074a9072.

* doc

* merge

* Revert "Refactor to subscribe to and lookup from worker failure table"

This reverts commit 6aa13a05178d0b9aa1db9dee5c978c911b74fa3a.

* Use actor ID instead of shared_ptr

* TODO and lint

* Update src/ray/gcs/gcs_server/gcs_actor_scheduler.h

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

* Fix build

* doc

* Build

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-05-04 10:16:52 -07:00
fangfengbin
5f351a05fe
GCS adapts to task table pub sub (#8210) 2020-05-04 10:23:55 +08:00
Ion
e24276e3c1
New scheduler int capacities (#8192)
* working version

* working version

* done

* done

* done

* addressing most of Philipp comments

* addressing most of Philipp comments
2020-05-03 18:47:30 -07:00