fangfengbin
55a090fb16
[GCS]Optimize gcs client nodes get function ( #11424 )
...
* [GCS]Optimize gcs client nodes get function
* fix review comment
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-27 21:13:19 -07:00
Tao Wang
273a712786
[GCS]Decouple node failure detector with resource related operations ( #11465 )
2020-10-27 15:52:42 -07:00
fangfengbin
ebe9a8865c
[GCS]Fix a bug that creates invalid connection ( #11590 )
...
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-27 10:08:06 -07:00
Ian Rodney
2da6ad2176
[core] Better error message for named actor not found ( #11604 )
2020-10-26 09:46:02 -07:00
Tao Wang
0fbee4da0c
[GCS] Remove unused ReportBatchHeartbeat/SubscribeHeartbeat ( #11567 )
...
* Remove unused message ReportBatchHeartbeat
* add up
2020-10-25 21:06:28 -07:00
Eric Liang
d3ee83205b
Remove crashing assert in actor creation for old scheduler ( #11577 )
...
* remove assert
* warn log
2020-10-24 00:05:26 -07:00
DK.Pino
9f804ade5f
[Placement Group]Add get all placement group api ( #11460 )
...
* add get all interface for placement group
* add get all interface for placement group
* make it work
* fix lint
* fix lint
* fix comment
* add cpp test
* fix python lint
2020-10-23 11:46:48 -07:00
Alex Wu
e02f4c0157
[New scheduler] queue by shape ( #11381 )
2020-10-21 15:56:06 -07:00
Edward Oakes
5d7f271e7d
Add --worker-port-list option to ray start ( #11481 )
2020-10-21 14:46:45 -05:00
Tao Wang
da2d3fbcfc
Remove unused field in heartbeat message ( #11459 )
2020-10-21 10:49:16 -07:00
Kai Yang
078a22d676
[Core] Allow creating tasks/actors in a detached actor when driver has exited ( #11493 )
...
* Allow creating tasks/actors in a detached actor when driver has exited
* lint
* Address comment
2020-10-21 10:45:29 -07:00
Xuxue1
7200ddb72d
Fix code_search_path failed in java ( #11406 )
...
Co-authored-by: xujiqiang eigen <xujiqiang@hpc1.ipa.aidigger.com>
2020-10-21 18:10:48 +08:00
fangfengbin
a075e37695
[GCS]Fix TestActorTableResubscribe bug ( #11463 )
...
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-20 22:32:41 -07:00
Lingxuan Zuo
aed739fbf4
[Log] Ignore callstacktrace test for windows ( #11413 )
2020-10-20 15:21:29 +08:00
DK.Pino
1b3b009f7a
[PlacementGroup]Add guarded by in placement group scheduler ut ( #11306 )
...
* add GUARDED_BY for success_placement_groups_ and failure_placement_groups_ vector
* update lint
* update lint
* update logical
* update lint
* change int to unsigned int
* update lint
* rename vector_mutex_ to placement_group_requests_mutex_
* resolve comment
* add int() for windows
2020-10-19 18:54:35 -07:00
fangfengbin
da89cb19eb
[GCS]Fix node info idempotent bug ( #11423 )
...
* [GCS]Fix node info idempotent bug
* Fix review comment
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-19 10:23:33 +08:00
SangBin Cho
666fcde8ca
[Placement group] Input validation ( #11152 )
...
* Add a basic input validation.
* Addressed code review.
2020-10-14 13:56:41 -07:00
SangBin Cho
b1481c6acf
Revert "[PlacementGroup]Add node manager test framework ( #11174 )" ( #11398 )
...
This reverts commit 241e765d3a.
2020-10-14 11:09:20 -07:00
Lingxuan Zuo
149ec5f6bf
[Log] dump stacktrace from glog lib ( #11360 )
...
* dump stacktrace from glog lib
* fix windows compile
* add comments for getcallstack
2020-10-14 10:52:12 -07:00
Kai Yang
abc6126814
[Java] Release actor instance reference when Ray.exitActor() is invoked ( #11324 )
2020-10-14 13:12:59 +08:00
fangfengbin
c926838411
[GCS]Fix GcsActorManagerTest multithreading bug ( #11361 )
...
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-13 21:36:40 -07:00
fangfengbin
241e765d3a
[PlacementGroup]Add node manager test framework ( #11174 )
...
* add part code
* add part code
* add part code
* add part code
* add part code
* add part code
* fix ut bug
* fix ut bug
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-13 19:27:11 -07:00
fangfengbin
0c02427da2
[GCS]Eviction of destroyed actors cached in GCS ( #11338 )
2020-10-13 15:34:35 +08:00
SangBin Cho
c107eea551
[Core] Do not report stats when worker is already dead. ( #11167 )
...
* Fix.
* Addressed code review.
* Done.
2020-10-12 11:57:04 -07:00
Alex Wu
175fc41fbc
[Autoscaler] Account for resource backlog size ( #11261 )
2020-10-12 09:43:48 -07:00
fangfengbin
d1579819e9
[GCS]Eviction of dead nodes cached in GCS ( #11323 )
2020-10-12 15:54:32 +08:00
fangfengbin
31117b5e96
[GCS]Add job id to log ( #11331 )
2020-10-12 13:53:08 +08:00
SangBin Cho
9dd4561d1b
[Placement Group] Fix stress tests to pass when actors are scheduled. ( #11151 )
...
* Fix stress tests to pass when actors are created.
* Addressed code review.
2020-10-09 21:52:26 -07:00
fangfengbin
3eb2b9e216
[GCS]Random eviction of destroyed actors cached in GCS ( #11189 )
...
* add part code
* fix lint error
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-09 11:54:47 -07:00
fangfengbin
ca36105d77
[TEST]Fix TestActorSubscribeAll bug ( #11297 )
...
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-09 11:54:27 -07:00
Alex Wu
a6f91664c1
[New Scheduler] Multi tenancy edge case ( #11164 )
...
* .
* refactor
* .
* .
* done?
* .
* .
* .
* lint
* no light heartbeat, no tests, fields 2,3
* .
* manually clang format :(
* .
* .
* test
* .
* .
* task manager heartbeat
* lint
* .
* add reminder
* CR
* CR
* cleanup
* CR
* comment
* lint
* .
* .
2020-10-08 13:19:01 -07:00
SangBin Cho
37fa86f9a0
[Placement Group] Fix placement group bugs that happen when rescheduling. ( #11263 )
...
* Fix placement group bugs while autoscaling.
* Addressed code review.
2020-10-08 08:58:59 -07:00
Sumanth Ratna
14d8826e43
Fix overriden typo ( #11227 )
2020-10-07 19:11:07 -07:00
Alex Wu
d2a0d23b0e
[Core] Fix master build failure ( #11217 )
...
Co-authored-by: Alex Wu <alex@Alexs-MacBook-Pro.local>
2020-10-06 10:23:34 -07:00
Alex Wu
dc7c2a70b8
[Core] Report worker backlog in GCS heartbeat ( #11039 )
2020-10-05 22:00:44 -07:00
SangBin Cho
80cc161f3e
[Placement Group] Report placement group load through heartbeat. ( #11129 )
...
* In progress.
* Fix a minor issue.
* Removed unnecessary comments.
* Addressed code review.
* Fix build failure.
* remove stray logs.
* Move global state to a med size test to avoid windows CI breakage.
2020-10-04 16:47:22 -07:00
fangfengbin
1244dafad3
[GCS]Optimization: Clear task_spec of destroyed actors ( #11149 )
...
* Clear task_spec of destroyed actors
* fix comment
* disable ut
* fix windows compile bug
* fix ut bug
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-03 00:00:41 -07:00
SangBin Cho
6974cea0cd
[Core] Use optional return instead of nullptr for the GetNode method.
2020-10-02 20:54:26 -07:00
Stephanie Wang
ada58abcd9
[Object spilling] Update object directory and reload spilled objects automatically ( #11021 )
...
* Fix pytest...
* Release objects that have been spilled
* GCS object table interface refactor
* Add spilled URL to object location info
* refactor to include spilled URL in notifications
* improve tests
* Add spilled URL to object directory results
* Remove force restore call
* Merge spilled URL and location
* fix
* CI
* build
* osx
* Fix multitenancy issues
* Skip windows tests
2020-10-02 15:52:42 -07:00
fangfengbin
180c259702
[GCS]Remove unused api(ServiceBasedActorInfoAccessor::AsyncRegister/ServiceBasedActorInfoAccessor::AsyncUpdate) ( #11099 )
...
* remove unused gcs api
* fix ut bug
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-02 00:54:28 -07:00
Alex Wu
a866be381c
[New Scheduler] Heartbeat ( #11024 )
...
* .
* refactor
* .
* .
* done?
* .
* .
* .
* lint
* no light heartbeat, no tests, fields 2,3
* .
* manually clang format :(
* .
* .
* test
* .
* .
* task manager heartbeat
* lint
* .
* add reminder
* CR
* CR
* cleanup
* CR
* comment
* lint
* .
2020-10-01 15:54:53 -07:00
fangfengbin
138d6cced9
[GCS]Optimizing actor info query interface ( #11067 )
...
* add part code
* add part code
* fix review comment
* fix review comment
* fix review comment
* fix crash bug
* fix ut bug
* fix bug
* fix ut bug
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-09-30 11:34:42 -07:00
Kai Yang
3504391fd2
[Core] Multi-tenancy: enable multi-tenancy by default ( #10570 )
...
* Add new job in Travis to enable multi-tenancy
* fix
* Update .bazelrc
* Update .travis.yml
* fix test_job_gc_with_detached_actor
* fix test_multiple_downstream_tasks
* fix lint
* Enable multi-tenancy by default
* Kill idle workers in FIFO order
* Update test
* minor update
* Address comments
* fix some cases
* fix test_remote_cancel
* Address comments
* fix after merge
* remove kill
* fix worker_pool_test
* fix java test timeout
* fix test_two_custom_resources
* Add a delay when killing idle workers
* fix test_worker_failure
* fix test_worker_failed again
* fix DisconnectWorker
* update test_worker_failed
* Revert some python tests
* lint
* address comments
2020-09-29 23:54:53 -07:00
Tao Wang
15ae8816f7
[GCS]Remove useless / heavy heartbeat pub ( #11132 )
2020-09-29 23:38:17 -07:00
Tao Wang
1db83764bf
[GCS]Use new getting all available resources interface instead of pub-sub … ( #10914 )
...
* Use new all available resources getting interface instead of pub-sub in state.py
* add missing server handler and test cases, fix comments
* add fine grained test assert
* per comments
* involve new added function _available_resources_per_node
* change ClientID to NodeID
* fix compile
* fix client id and lint
* robust tests check
* robust tests
2020-09-29 09:41:10 -07:00
SangBin Cho
0a6164ab15
[Core] Improve logging messages. ( #11082 )
2020-09-28 21:07:45 -07:00
fangfengbin
872219940b
[GCS]Fix miss PollOwnerForActorOutOfScope after gcs restarts bug ( #11054 )
...
* fix_RemoveActorFromOwner_crash_bug
* fix review comment
* fix review comment
* rm unused ut
* add testcase
* fix review comment
* rm unused import
* fix code style
* fix ut bug
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-09-28 10:06:40 -07:00
Lingxuan Zuo
27e1f513e3
[Log] make glog flush and RAY_LOG thread-safe ( #11002 )
...
* make glog flush and RAY_LOG thread-safe
* dump error log to console
* mapping all levels to destination
* hack glog for exporting message to stdout if no base name given
* patch lint
* use stdout logger by default
* add raylet std/err pytest checker
* add worker logs file check
* fix asan check
* loop in glog enums
* fix python lint
* lint for autoindent
* fix indent lint
* make raylet.err is not empty
2020-09-28 22:15:15 +08:00
Tao Wang
25ac8f9aa5
[GCS]Use new flag to indicate whether resources are updated and update realtime resources view ( #10906 )
...
* Handle resources turning empty and update realtime view
* add up missing flag
* per comments
* use flag instead of special key to represent if resource changed
* Update src/ray/protobuf/gcs.proto
Co-authored-by: fangfengbin <869218239a@zju.edu.cn>
* fix lint in gcs.proto
* fix embarrassing mistake
Co-authored-by: fangfengbin <869218239a@zju.edu.cn>
2020-09-28 01:57:27 -07:00
fangfengbin
2e41a29c8f
[Placement Group]Support placement group request processing idempotent in raylet ( #10998 )
...
* add part code
* fix review comment
* fix review comment
* fix review comment
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-09-28 01:56:43 -07:00