Commit graph

1323 commits

Author SHA1 Message Date
fangfengbin
9347a5d10c
Add global state accessor of jobs (#8401) 2020-05-18 20:32:05 +08:00
Edward Oakes
16f48078d9
Remove use of ObjectID transport flag (#7699) 2020-05-17 11:29:49 -05:00
Tao Wang
acffdb2349
[TEST]use cc_test to run core_worker_test, enforce/reuse RedisServiceManagerForTest (#8443) 2020-05-17 18:43:00 +08:00
Stephanie Wang
bd169749e0
Option to retry failed actor tasks (#8330)
* Python

* Consolidate state in the direct actor transport, set the caller starts at

* todo

* Remove unused

* Update and unit tests

* Doc

* Remove unused

* doc

* Remove debug

* Update src/ray/core_worker/transport/direct_actor_transport.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/core_worker/transport/direct_actor_transport.cc

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* lint and fix build

* Update

* Fix build

* Fix tests

* Unit test for max_task_retries=0

* Fix java?

* Fix bad test

* Cross language fix

* fix java

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-05-15 20:15:15 -07:00
Max Fitton
00325eb2b2
Rename max_reconstructions to max_restarts and use -1 for infinite (#8274)
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-05-14 10:30:29 -05:00
fangfengbin
08b612052b
Add redis store client AsyncGetAll/AsyncBatchDelete/AsyncDeleteByIndex API (#8390) 2020-05-14 14:38:25 +08:00
Hao Chen
a593fde606
Fix core dumps in ExitActor (#8382) 2020-05-12 20:06:04 +08:00
fangfengbin
515afa6809
Fix AsyncGetAll miss override bug (#8402)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-05-11 11:08:16 -05:00
fangfengbin
8d0c1b5e06
GCS adapts to actor table pub sub (#8347) 2020-05-11 13:53:53 +08:00
Stephanie Wang
3a25f5f5b4
Clean up actor state from the GCS (#8261)
* parametrize test

* Regression test and logging

* Test no restart after actor deletion

* Unit tests

* Refactor to subscribe to and lookup from worker failure table

* Refactor ActorManager to remove dependencies

* Revert "Regression test and logging"

This reverts commit 835e1a9091b51ca8efb00392d4cc4a665145de24.

* Revert "parametrize test"

This reverts commit f31272082831ba1a494816dd5511d87b24eca4c9.

* Revert "Test no restart after actor deletion"

This reverts commit 114a83de14329aa6ab787c80cd5757cf074a9072.

* doc

* merge

* Revert "Refactor to subscribe to and lookup from worker failure table"

This reverts commit 6aa13a05178d0b9aa1db9dee5c978c911b74fa3a.

* Revert "Revert "Test no restart after actor deletion""

This reverts commit 1bd92d09172aa8ab42632551cf9c56463f9598fe.

* Revert "Revert "parametrize test""

This reverts commit 639ba4d3b02167fb2b05e9878f9aa600bcec95b3.

* Revert "Revert "Regression test and logging""

This reverts commit f18b5f0db699a23cbccde32789e3639425e99ca4.

* Clean up actors that have gone out of scope

* Use actor ID instead of shared_ptr

* Clean up actors owned by dead workers

* Use actor ID instead of shared_ptr

* TODO and lint

* Fix unit tests

* Add unit tests for supervision and docs

* xx

* Fix tests

* Fix tests

* fix build
2020-05-09 18:43:49 -07:00
Edward Oakes
2677b71003
Implement named actors using the GCS service (#8328) 2020-05-09 08:58:10 -05:00
Hao Chen
93138e617a
Fix a bad usage of std::move (#8364) 2020-05-09 14:24:24 +08:00
fangfengbin
7fec602f2e
GCS adapts to node resource table pub sub (#8305) 2020-05-09 10:31:35 +08:00
Eric Liang
413db0902d
Trigger global GC when resources may be occupied by deleted actors 2020-05-07 14:57:21 -07:00
fangfengbin
dd3c050168
GCS adapts to batch heartbeat table pub sub (#8346) 2020-05-07 20:33:36 +08:00
fangfengbin
620ea94873
Fix node manager miss object info bug (#8337) 2020-05-07 20:16:42 +08:00
SangBin Cho
e631827a9f
[Core] Show_webui segfault fix. (#8323) 2020-05-06 11:45:07 -05:00
fangfengbin
97430b2d0f
GCS adapts to node table pub sub (#8209) 2020-05-05 18:34:41 +08:00
fangfengbin
14d03a0869
GCS adapts to task lease table pub sub (#8299) 2020-05-05 10:16:56 +08:00
ijrsvt
cc7bd6650a
[core] Enabling Remote Task Cancelation (#8225) 2020-05-04 15:24:22 -07:00
Stephanie Wang
8625e09067
Actor manager refactor and unit tests (#8224)
* parametrize test

* Regression test and logging

* Test no restart after actor deletion

* Unit tests

* Refactor to subscribe to and lookup from worker failure table

* Refactor ActorManager to remove dependencies

* Revert "Regression test and logging"

This reverts commit 835e1a9091b51ca8efb00392d4cc4a665145de24.

* Revert "parametrize test"

This reverts commit f31272082831ba1a494816dd5511d87b24eca4c9.

* Revert "Test no restart after actor deletion"

This reverts commit 114a83de14329aa6ab787c80cd5757cf074a9072.

* doc

* merge

* Revert "Refactor to subscribe to and lookup from worker failure table"

This reverts commit 6aa13a05178d0b9aa1db9dee5c978c911b74fa3a.

* Use actor ID instead of shared_ptr

* TODO and lint

* Update src/ray/gcs/gcs_server/gcs_actor_scheduler.h

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

* Fix build

* doc

* Build

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2020-05-04 10:16:52 -07:00
fangfengbin
5f351a05fe
GCS adapts to task table pub sub (#8210) 2020-05-04 10:23:55 +08:00
Ion
e24276e3c1
New scheduler int capacities (#8192)
* working version

* working version

* done

* done

* done

* addressing most of Philipp comments

* addressing most of Philipp comments
2020-05-03 18:47:30 -07:00
fangfengbin
b7bbc3bc83
[GCS]GCS adapts to object table pub sub (#8180) 2020-05-03 21:44:33 +08:00
Stephanie Wang
b4ef500675
[core] Disable GCS actor management (#8271) 2020-05-02 22:02:56 -07:00
Edward Oakes
3aec683f61
Avoid fate sharing with owner for detached actors (#8267) 2020-05-01 11:58:47 -05:00
Edward Oakes
484f68765c
Fix resource_ids_ data race (#8253) 2020-04-30 18:55:54 -05:00
mehrdadn
254b1ec370
Set up testing and wheels for Windows on GitHub Actions (#8131)
* Move some Java tests into ci.sh

* Move C++ worker tests into ci.sh

* Define run()

* Prepare to move Python tests into ci.sh

* Fix issues in install-dependencies.sh

* Reload environment for GitHub Actions

* Move wheels to ci.sh and fix related issues

* Don't bypass failures in install-ray.sh anymore

* Make CI a little quieter

* Move linting into ci.sh

* Add vitals test right after build

* Fix os.uname() unavailability on Windows

Co-authored-by: Mehrdad <noreply@github.com>
2020-04-29 21:19:02 -07:00
Edward Oakes
ebdccde030
Fetch internal config from raylet (#8195) 2020-04-28 13:12:11 -05:00
fangfengbin
deffc340ea
[GCS]Add in-memory gcs table storage (#8184) 2020-04-28 17:19:46 +08:00
mehrdadn
b9de9dadd7
Fix Windows build (#8186)
Co-authored-by: Mehrdad <noreply@github.com>
2020-04-26 13:07:25 -07:00
fangfengbin
5bff707d20
[GCS]Add in-memory store client (#8144) 2020-04-26 19:09:26 +08:00
ZhuSenlin
9255fcd516
[GCS] Add node failure detector (#8119) 2020-04-26 19:08:27 +08:00
fangfengbin
c5d181e3d9
gcs adapts to worker table pub sub (#8182)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-04-26 17:58:55 +08:00
fangfengbin
f17bea2de5
Fix get gcs server address block bug (#8126) 2020-04-26 10:01:06 +08:00
ijrsvt
69ff7e3e35
TaskCancellation (#7669)
* Smol comment

* WIP, not passing ray.init

* Fixed small problem

* wip

* Pseudo interrupt things

* Basic prototype operational

* correct proc title

* Mostly done

* Cleanup

* cleaner raylet error

* Cleaning up a few loose ends

* Fixing Race Conds

* Prelim testing

* Fixing comments and adding second_check for kill

* Working_new_impl

* demo_ready

* Fixing my english

* Fixing a few problems

* Small problems

* Cleaning up

* Response to changes

* Fixing error passing

* Merged to master

* fixing lock

* Cleaning up print statements

* Format

* Fixing Unit test build failure

* mock_worker fix

* java_fix

* Cancel

* Switching to Cancel

* Responding to Review

* FixFormatting

* Lease cancellation

* Final comments?

* Moving exist check to CoreWorker

* Fix Actor Transport Test

* Fixing task manager test

* changing clock repr

* Fix build

* fix white space

* lint fix

* Updating to medium size

* Fixing Java test compilation issue

* lengthen bad timeouts
2020-04-25 16:04:52 -07:00
fangfengbin
38dfe5db86
remove store client template (#8160)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-04-24 21:19:12 +08:00
fangfengbin
713e375d50
[GCS]GCS adapts to job table pub sub (#8145) 2020-04-24 16:33:25 +08:00
Qing Wang
d66d12661b
Improve the perf of constructing actor task specs. (#8093) 2020-04-21 11:54:09 +08:00
Stephanie Wang
eefea4e29c
[core] Post task submission to IO loop (#8090)
* Post to IO loop

* Unused

* Fix build
2020-04-20 19:13:50 -07:00
Stephanie Wang
1323e1753d
[core] When reconstruction is enabled, pin objects created by ray.put() (#8021)
* Unit test and pin ray.put objects until they have no more lineage references

* c++ tests

* lint

* Mark ray.put objects as pinned
2020-04-20 13:09:54 -07:00
ZhuSenlin
3f28a8a229
[GCS] reply to the owner only after the actor has been successfully created. (#8079)
* reply to the owner only after the actor is successfully created.

* reply immediately if the actor is already created

* fix comment

* add test_actor_creation_task provided by @Stephanie Wang

Co-authored-by: senlin.zsl <senlin.zsl@antfin.com>
2020-04-19 09:53:02 -07:00
Edward Oakes
90ef585fd5
Revert "Add ability to specify worker and driver ports (#7833)" (#8069)
This reverts commit 9f751ff8c4.
2020-04-17 12:32:22 -05:00
Eric Liang
55ce2bba10
Record num plasma errs in map (#8034) 2020-04-16 13:16:40 -07:00
Edward Oakes
9f751ff8c4
Add ability to specify worker and driver ports (#7833) 2020-04-16 13:49:25 -05:00
Clark Zinzow
d4cae5f632
[Core] Added ability to specify different IP addresses for a core worker and its raylet. (#7985) 2020-04-16 10:32:24 -05:00
fangfengbin
5a7882bb44
Fix gcs_server get invalid local address (#7842) 2020-04-16 14:58:19 +08:00
mehrdadn
ba00c29b67
Factor out Travis 'install' sections for use with GitHub Actions (#7988) 2020-04-15 08:10:22 -07:00
fangfengbin
efbaf155b2
[GCS]Add publish and subscribe function of gcs table (#7909) 2020-04-15 04:24:52 -07:00
fangfengbin
c17404918c
[GCS]Add gcs table storage interface (#7949) 2020-04-15 10:48:12 +08:00