1759 commits

Author SHA1 Message Date
SangBin Cho
6e2a1eac36
[Placement Group] Placement group automatic cleanup. (#11546)
* In progress. Done with all placement group manager code.

* It is working with job.

* Finished detached actor implementation.

* Fix minor issue.

* In progress.

* Addressed code review.

* Addressed code review.

* Addressed code review.

* Fix a build error.
2020-10-30 10:55:43 -07:00
Alex Wu
e022d12dc3
[New scheduler] Deflake test heartbeat (#11586)
* deflaked

* lint

* .

* Update cluster_task_manager_test.cc

Co-authored-by: Alex Wu <alex@anyscale.com>
2020-10-29 23:10:19 -07:00
architkulkarni
4175569d96
[Core] Add option to override environment variables for tasks and actors (#11619) 2020-10-29 14:22:44 -05:00
Simon Mo
e82ff08b0c
Fix asyncio plasma integration in cluster mode (#11665) 2020-10-29 11:53:10 -07:00
Lingxuan Zuo
0b7a3d9e02
[Log] new spdlog tool for ray (#10967)
* spdlog support

* fatal abort for spdlog

* print all logs in stderr if no logger given

* fix log test

* install signal handler for spdlog by reusing glog lib

* fix lint

* Avoid duplicated dump

* log rotation and fmt comments

* fix
2020-10-29 11:37:13 -07:00
Tao Wang
1d5694ddea
[GCS]Use direct getting instead of pub-sub to update load metrics in monitor.py (#11339) 2020-10-28 11:23:18 -07:00
Eric Liang
c933477915
[new scheduler] Pass test_basic and add CI builds with flag on (#11635) 2020-10-28 11:02:43 -07:00
Stephanie Wang
427b5af0ae
[Object spilling] Refactor raylet to add a local object manager class (#11647)
* Fix pytest...

* Release objects that have been spilled

* GCS object table interface refactor

* Add spilled URL to object location info

* refactor to include spilled URL in notifications

* improve tests

* Add spilled URL to object directory results

* Remove force restore call

* Merge spilled URL and location

* fix

* tmp

* refactor

* unit test skeleton

* unit testing

* unit test fixes

* cleanup

* cleanup

* update

* Separate pinning from waiting for object free, fixes pytest

* Update src/ray/raylet/local_object_manager.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

Co-authored-by: Tyler Westenbroek <westenbroekt@berkeley.edu>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-10-28 10:38:42 -04:00
fyrestone
05ad4c7499
[Dashboard] Optimize dashboard datacenter (#11391)
* Optimize dashboard datacenter

* Fix tests

* Fix tests

* Fix

* Fix CI

* python/build-wheel-macos.sh

Co-authored-by: 刘宝 <po.lb@antfin.com>
Co-authored-by: Max Fitton <maxfitton@anyscale.com>
2020-10-27 23:49:31 -07:00
fangfengbin
55a090fb16
[GCS]Optimize gcs client nodes get function (#11424)
* [GCS]Optimize gcs client nodes get function

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-27 21:13:19 -07:00
Tao Wang
273a712786
[GCS]Decouple node failure detector with resource related operations (#11465) 2020-10-27 15:52:42 -07:00
fangfengbin
ebe9a8865c
[GCS]Fix a bug that creates invalid connection (#11590)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-27 10:08:06 -07:00
Ian Rodney
2da6ad2176
[core] Better error message for named actor not found (#11604) 2020-10-26 09:46:02 -07:00
Tao Wang
0fbee4da0c
[GCS] Remove unused ReportBatchHeartbeat/SubscribeHeartbeat (#11567)
* Remove unused message ReportBatchHeartbeat

* add up
2020-10-25 21:06:28 -07:00
Eric Liang
d3ee83205b
Remove crashing assert in actor creation for old scheduler (#11577)
* remove assert

* warn log
2020-10-24 00:05:26 -07:00
DK.Pino
9f804ade5f
[Placement Group]Add get all placement group api (#11460)
* add get all interface for placement group

* add get all interface for placement group

* make it work

* fix lint

* fix lint

* fix comment

* add cpp test

* fix python lint
2020-10-23 11:46:48 -07:00
Alex Wu
e02f4c0157
[New scheduler] queue by shape (#11381) 2020-10-21 15:56:06 -07:00
Edward Oakes
5d7f271e7d
Add --worker-port-list option to ray start (#11481) 2020-10-21 14:46:45 -05:00
Tao Wang
da2d3fbcfc
Remove unused field in heartbeat message (#11459) 2020-10-21 10:49:16 -07:00
Kai Yang
078a22d676
[Core] Allow creating tasks/actors in a detached actor when driver has exited (#11493)
* Allow creating tasks/actors in a detached actor when driver has exited

* lint

* Address comment
2020-10-21 10:45:29 -07:00
Xuxue1
7200ddb72d
Fix code_search_path failed in java (#11406)
Co-authored-by: xujiqiang eigen <xujiqiang@hpc1.ipa.aidigger.com>
2020-10-21 18:10:48 +08:00
fangfengbin
a075e37695
[GCS]Fix TestActorTableResubscribe bug (#11463)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-20 22:32:41 -07:00
Lingxuan Zuo
aed739fbf4
[Log] Ignore callstacktrace test for windows (#11413) 2020-10-20 15:21:29 +08:00
DK.Pino
1b3b009f7a
[PlacementGroup]Add guarded by in placement group scheduler ut (#11306)
* add GUARDED_BY for success_placement_groups_ and failure_placement_groups_ vector

* update lint

* update lint

* update logical

* update lint

* change int to unsigned int

* update lint

* rename vector_mutex_ to placement_group_requests_mutex_

* resolve comment

* add int() for windows
2020-10-19 18:54:35 -07:00
fangfengbin
da89cb19eb
[GCS]Fix node info idempotent bug (#11423)
* [GCS]Fix node info idempotent bug

* Fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-19 10:23:33 +08:00
SangBin Cho
666fcde8ca
[Placement group] Input validation (#11152)
* Add a basic input validation.

* Addressed code review.
2020-10-14 13:56:41 -07:00
SangBin Cho
b1481c6acf
Revert "[PlacementGroup]Add node manager test framework (#11174)" (#11398)
This reverts commit 241e765d3a.
2020-10-14 11:09:20 -07:00
Lingxuan Zuo
149ec5f6bf
[Log] dump stacktrace from glog lib (#11360)
* dump stacktrace from glog lib

* fix windows compile

* add comments for getcallstack
2020-10-14 10:52:12 -07:00
Kai Yang
abc6126814
[Java] Release actor instance reference when Ray.exitActor() is invoked (#11324) 2020-10-14 13:12:59 +08:00
fangfengbin
c926838411
[GCS]Fix GcsActorManagerTest multithreading bug (#11361)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-13 21:36:40 -07:00
fangfengbin
241e765d3a
[PlacementGroup]Add node manager test framework (#11174)
* add part code

* add part code

* add part code

* add part code

* add part code

* add part code

* fix ut bug

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-13 19:27:11 -07:00
fangfengbin
0c02427da2
[GCS]Eviction of destroyed actors cached in GCS (#11338) 2020-10-13 15:34:35 +08:00
SangBin Cho
c107eea551
[Core] Do not report stats when worker is already dead. (#11167)
* Fix.

* Addressed code review.

* Done.
2020-10-12 11:57:04 -07:00
Alex Wu
175fc41fbc
[Autoscaler] Account for resource backlog size (#11261) 2020-10-12 09:43:48 -07:00
fangfengbin
d1579819e9
[GCS]Eviction of dead nodes cached in GCS (#11323) 2020-10-12 15:54:32 +08:00
fangfengbin
31117b5e96
[GCS]Add job id to log (#11331) 2020-10-12 13:53:08 +08:00
SangBin Cho
9dd4561d1b
[Placement Group] Fix stress tests to pass when actors are scheduled. (#11151)
* Fix stress tests to pass when actors are created.

* Addressed code review.
2020-10-09 21:52:26 -07:00
fangfengbin
3eb2b9e216
[GCS]Random eviction of destroyed actors cached in GCS (#11189)
* add part code

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-09 11:54:47 -07:00
fangfengbin
ca36105d77
[TEST]Fix TestActorSubscribeAll bug (#11297)
Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-09 11:54:27 -07:00
Alex Wu
a6f91664c1
[New Scheduler] Multi tenancy edge case (#11164)
* .

* refactor

* .

* .

* done?

* .

* .

* .

* lint

* no light heartbeat, no tests, fields 2,3

* .

* manually clang format :(

* .

* .

* test

* .

* .

* task manager heartbeat

* lint

* .

* add reminder

* CR

* CR

* cleanup

* CR

* comment

* lint

* .

* .
2020-10-08 13:19:01 -07:00
SangBin Cho
37fa86f9a0
[Placement Group] Fix placement group bugs that happen when rescheduling. (#11263)
* Fix placement group bugs while autoscaling.

* Addressed code review.
2020-10-08 08:58:59 -07:00
Sumanth Ratna
14d8826e43
Fix overriden typo (#11227) 2020-10-07 19:11:07 -07:00
Alex Wu
d2a0d23b0e
[Core] Fix master build failure (#11217)
Co-authored-by: Alex Wu <alex@Alexs-MacBook-Pro.local>
2020-10-06 10:23:34 -07:00
Alex Wu
dc7c2a70b8
[Core] Report worker backlog in GCS heartbeat (#11039) 2020-10-05 22:00:44 -07:00
SangBin Cho
80cc161f3e
[Placement Group] Report placement group load through heartbeat. (#11129)
* In progress.

* Fix a minor issue.

* Removed unnecessary comments.

* Addressed code review.

* Fix build failure.

* remove stray logs.

* Move global state to a med size test to avoid windows CI breakage.
2020-10-04 16:47:22 -07:00
fangfengbin
1244dafad3
[GCS]Optimization: Clear task_spec of destroyed actors (#11149)
* Clear task_spec of destroyed actors

* fix comment

* disable ut

* fix windows compile bug

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-03 00:00:41 -07:00
SangBin Cho
6974cea0cd
[Core] Use optional return instead of nullptr for the GetNode method. 2020-10-02 20:54:26 -07:00
Stephanie Wang
ada58abcd9
[Object spilling] Update object directory and reload spilled objects automatically (#11021)
* Fix pytest...

* Release objects that have been spilled

* GCS object table interface refactor

* Add spilled URL to object location info

* refactor to include spilled URL in notifications

* improve tests

* Add spilled URL to object directory results

* Remove force restore call

* Merge spilled URL and location

* fix

* CI

* build

* osx

* Fix multitenancy issues

* Skip windows tests
2020-10-02 15:52:42 -07:00
fangfengbin
180c259702
[GCS]Remove unused api(ServiceBasedActorInfoAccessor::AsyncRegister/ServiceBasedActorInfoAccessor::AsyncUpdate) (#11099)
* remove unused gcs api

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-10-02 00:54:28 -07:00
Alex Wu
a866be381c
[New Scheduler] Heartbeat (#11024)
* .

* refactor

* .

* .

* done?

* .

* .

* .

* lint

* no light heartbeat, no tests, fields 2,3

* .

* manually clang format :(

* .

* .

* test

* .

* .

* task manager heartbeat

* lint

* .

* add reminder

* CR

* CR

* cleanup

* CR

* comment

* lint

* .
2020-10-01 15:54:53 -07:00