1650 commits

Author SHA1 Message Date
fangfengbin
55a090fb16
[GCS]Optimize gcs client nodes get function (#11424)
* [GCS]Optimize gcs client nodes get function

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-27 21:13:19 -07:00
Tao Wang
273a712786
[GCS]Decouple node failure detector with resoure related operations (#11465) 2020-10-27 15:52:42 -07:00
fangfengbin
ebe9a8865c
[GCS]Fix a bug that creates invalid connection (#11590)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-27 10:08:06 -07:00
Ian Rodney
2da6ad2176
[core] Better error message for named actor not found (#11604) 2020-10-26 09:46:02 -07:00
Tao Wang
0fbee4da0c
[GCS] Remove unused ReportBatchHeartbeat/SubscribeHeartbeat (#11567)
* Remove unused message ReportBatchHeartbeat

* add up
2020-10-25 21:06:28 -07:00
Eric Liang
d3ee83205b
Remove crashing assert in actor creation for old scheduler (#11577)
* remove assert

* warn log
2020-10-24 00:05:26 -07:00
DK.Pino
9f804ade5f
[Placement Group]Add get all placement group api (#11460)
* add get all interface for placement group

* add get all interface for placement group

* make it work

* fix lint

* fix lint

* fix comment

* add cpp test

* fix python lint
2020-10-23 11:46:48 -07:00
Alex Wu
e02f4c0157
[New scheduler] queue by shape (#11381) 2020-10-21 15:56:06 -07:00
Edward Oakes
5d7f271e7d
Add --worker-port-list option to ray start (#11481) 2020-10-21 14:46:45 -05:00
Tao Wang
da2d3fbcfc
Remove unused field in heartbeat message (#11459) 2020-10-21 10:49:16 -07:00
Kai Yang
078a22d676
[Core] Allow creating tasks/actors in a detached actor when driver has exited (#11493)
* Allow creating tasks/actors in a detached actor when driver has exited

* lint

* Address comment
2020-10-21 10:45:29 -07:00
Xuxue1
7200ddb72d
Fix code_search_path failed in java (#11406)
Co-authored-by: xujiqiang eigen <xujiqiang@hpc1.ipa.aidigger.com>
2020-10-21 18:10:48 +08:00
fangfengbin
a075e37695
[GCS]Fix TestActorTableResubscribe bug (#11463)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-20 22:32:41 -07:00
Lingxuan Zuo
aed739fbf4
[Log] Ignore callstacktrace test for windows (#11413) 2020-10-20 15:21:29 +08:00
DK.Pino
1b3b009f7a
[PlacementGroup]Add guarded by in placement group scheduler ut (#11306)
* add GUARDED_BY for success_placement_groups_ and failure_placement_groups_ vector

* update lint

* update lint

* update logical

* update lint

* change int to unsigned int

* update lint

* rename vector_mutex_ to placement_group_requests_mutex_

* resolve comment

* add int() for windows
2020-10-19 18:54:35 -07:00
fangfengbin
da89cb19eb
[GCS]Fix node info idempotent bug (#11423)
* [GCS]Fix node info idempotent bug

* Fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-19 10:23:33 +08:00
SangBin Cho
666fcde8ca
[Placement group] Input validation (#11152)
* Add a basic input validation.

* Addressed code review.
2020-10-14 13:56:41 -07:00
SangBin Cho
b1481c6acf
Revert "[PlacementGroup]Add node manager test framework (#11174)" (#11398)
This reverts commit 241e765d3a.
2020-10-14 11:09:20 -07:00
Lingxuan Zuo
149ec5f6bf
[Log] dump stacktrace from glog lib (#11360)
* dump stacktrace from glog lib

* fix windows compile

* add comments for getcallstack
2020-10-14 10:52:12 -07:00
Kai Yang
abc6126814
[Java] Release actor instance reference when Ray.exitActor() is invoked (#11324) 2020-10-14 13:12:59 +08:00
fangfengbin
c926838411
[GCS]Fix GcsActorManagerTest multithreading bug (#11361)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-13 21:36:40 -07:00
fangfengbin
241e765d3a
[PlacementGroup]Add node manager test framework (#11174)
* add part code

* add part code

* add part code

* add part code

* add part code

* add part code

* fix ut bug

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-13 19:27:11 -07:00
fangfengbin
0c02427da2
[GCS]Eviction of destroyed actors cached in GCS (#11338) 2020-10-13 15:34:35 +08:00
SangBin Cho
c107eea551
[Core] Do not report stats when worker is already dead. (#11167)
* Fix.

* Addressed code review.

* Done.
2020-10-12 11:57:04 -07:00
Alex Wu
175fc41fbc
[Autoscaler] Account for resource backlog size (#11261) 2020-10-12 09:43:48 -07:00
fangfengbin
d1579819e9
[GCS]Eviction of dead nodes cached in GCS (#11323) 2020-10-12 15:54:32 +08:00
fangfengbin
31117b5e96
[GCS]Add job id to log (#11331) 2020-10-12 13:53:08 +08:00
SangBin Cho
9dd4561d1b
[Placement Group] Fix stress tests to pass when actors are scheduled. (#11151)
* Fix stress tests to pass when actors are created.

* Addressed code review.
2020-10-09 21:52:26 -07:00
fangfengbin
3eb2b9e216
[GCS]Random eviction of destroyed actors cached in GCS (#11189)
* add part code

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-09 11:54:47 -07:00
fangfengbin
ca36105d77
[TEST]Fix TestActorSubscribeAll bug (#11297)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-09 11:54:27 -07:00
Alex Wu
a6f91664c1
[New Scheduler] Multi tenancy edge case (#11164)
* .

* refactor

* .

* .

* done?

* .

* .

* .

* lint

* no light heartbeat, no tests, fields 2,3

* .

* manually clang format :(

* .

* .

* test

* .

* .

* task manager heartbeat

* lint

* .

* add reminder

* CR

* CR

* cleanup

* CR

* comment

* lint

* .

* .
2020-10-08 13:19:01 -07:00
SangBin Cho
37fa86f9a0
[Placement Group] Fix placement group bugs that happen when rescheduling. (#11263)
* Fix placement group bugs while autoscaling.

* Addressed code review.
2020-10-08 08:58:59 -07:00
Sumanth Ratna
14d8826e43
Fix overriden typo (#11227) 2020-10-07 19:11:07 -07:00
Alex Wu
d2a0d23b0e
[Core] Fix master build failure (#11217)
Co-authored-by: Alex Wu <alex@Alexs-MacBook-Pro.local>
2020-10-06 10:23:34 -07:00
Alex Wu
dc7c2a70b8
[Core] Report worker backlog in GCS heartbeat (#11039) 2020-10-05 22:00:44 -07:00
SangBin Cho
80cc161f3e
[Placement Group] Report placement group load through heartbeat. (#11129)
* In progress.

* Fix a minor issue.

* Removed unnecessary comments.

* Addressed code review.

* Fix build failure.

* remove stray logs.

* Move global state to a med size test to avoid windows CI breakage.
2020-10-04 16:47:22 -07:00
fangfengbin
1244dafad3
[GCS]Optimization: Clear task_spec of destroyed actors (#11149)
* Clear task_spec of destroyed actors

* fix comment

* disable ut

* fix windows compile bug

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-03 00:00:41 -07:00
SangBin Cho
6974cea0cd
[Core] Use optional return instead of nullptr for the GetNode method. 2020-10-02 20:54:26 -07:00
Stephanie Wang
ada58abcd9
[Object spilling] Update object directory and reload spilled objects automatically (#11021)
* Fix pytest...

* Release objects that have been spilled

* GCS object table interface refactor

* Add spilled URL to object location info

* refactor to include spilled URL in notifications

* improve tests

* Add spilled URL to object directory results

* Remove force restore call

* Merge spilled URL and location

* fix

* CI

* build

* osx

* Fix multitenancy issues

* Skip windows tests
2020-10-02 15:52:42 -07:00
fangfengbin
180c259702
[GCS]Remove unused api(ServiceBasedActorInfoAccessor::AsyncRegister/ServiceBasedActorInfoAccessor::AsyncUpdate) (#11099)
* remove unused gcs api

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-02 00:54:28 -07:00
Alex Wu
a866be381c
[New Scheduler] Heartbeat (#11024)
* .

* refactor

* .

* .

* done?

* .

* .

* .

* lint

* no light heartbeat, no tests, fields 2,3

* .

* manually clang format :(

* .

* .

* test

* .

* .

* task manager heartbeat

* lint

* .

* add reminder

* CR

* CR

* cleanup

* CR

* comment

* lint

* .
2020-10-01 15:54:53 -07:00
fangfengbin
138d6cced9
[GCS]Optimizing actor info query interface (#11067)
* add part code

* add part code

* fix review comment

* fix review comment

* fix review comment

* fix crash bug

* fix ut bug

* fix bug

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-09-30 11:34:42 -07:00
Kai Yang
3504391fd2
[Core] Multi-tenancy: enable multi-tenancy by default (#10570)
* Add new job in Travis to enable multi-tenancy

* fix

* Update .bazelrc

* Update .travis.yml

* fix test_job_gc_with_detached_actor

* fix test_multiple_downstream_tasks

* fix lint

* Enable multi-tenancy by default

* Kill idle workers in FIFO order

* Update test

* minor update

* Address comments

* fix some cases

* fix test_remote_cancel

* Address comments

* fix after merge

* remove kill

* fix worker_pool_test

* fix java test timeout

* fix test_two_custom_resources

* Add a delay when killing idle workers

* fix test_worker_failure

* fix test_worker_failed again

* fix DisconnectWorker

* update test_worker_failed

* Revert some python tests

* lint

* address comments
2020-09-29 23:54:53 -07:00
Tao Wang
15ae8816f7
[GCS]Remove useless / heavy heartbeat pub (#11132) 2020-09-29 23:38:17 -07:00
Tao Wang
1db83764bf
[GCS]Use new getting all available resources interface instead of pub-sub … (#10914)
* Use new all available resources getting interface instead of pub-sub in state.py

* add missing server handler and test cases, fix comments

* add fine grained test assert

* per comments

* involve new added function _available_resources_per_node

* change ClientID to NodeID

* fix compile

* fix client id and lint

* robust tests check

* robust tests
2020-09-29 09:41:10 -07:00
SangBin Cho
0a6164ab15
[Core] Improve logging messages. (#11082) 2020-09-28 21:07:45 -07:00
fangfengbin
872219940b
[GCS]Fix miss PollOwnerForActorOutOfScope after gcs restarts bug (#11054)
* fix_RemoveActorFromOwner_crash_bug

* fix review comment

* fix review comment

* rm unused ut

* add testcase

* fix review comment

* rm unused import

* fix code style

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-09-28 10:06:40 -07:00
Lingxuan Zuo
27e1f513e3
[Log] make glog flush and RAY_LOG thread-safe (#11002)
* make glog flush and RAY_LOG thread-safe

* dump error log to console

* mapping all levels to destination

* hack glog for exporting message to stdout if no base name given

* patch lint

* use stdout logger by default

* add raylet std/err pytest checker

* add worker logs file check

* fix asan check

* loop in glog enums

* fix python lint

* lint for autoindent

* fix indent lint

* make raylet.err is not empty
2020-09-28 22:15:15 +08:00
Tao Wang
25ac8f9aa5
[GCS]Use new flag to indicate whether resources are updated and update realtime resources view (#10906)
* Handle resources turning empty and update realtime view

* add up missing flag

* per comments

* use flag instead of special key to represent if resource changed

* Update src/ray/protobuf/gcs.proto

Co-authored-by: fangfengbin <869218239a@zju.edu.cn>

* fix lint in gcs.proto

* fix embarrassed mistake

Co-authored-by: fangfengbin <869218239a@zju.edu.cn>
2020-09-28 01:57:27 -07:00
fangfengbin
2e41a29c8f
[Placement Group]Support placement group request processing idempotent in raylet (#10998)
* add part code

* fix review comment

* fix review comment

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-09-28 01:56:43 -07:00