1962 commits

Author SHA1 Message Date
Simon Mo
b291dd4486
[Metrics] Call GetMeasureDoubleByName to prevent override (#12860) 2020-12-15 09:39:39 -08:00
fangfengbin
43b9259d40
[GCS]GCS resource manager support scheduling resource (#12780)
* add part code

* add part code

* fix review comments

* rebase master

* add part code

* add part code

* fix review comments

* add part code

* fix code style

* fix ut bug

* fix ut bug

* fix review comments

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-15 10:27:55 +08:00
Tao Wang
ac53e2f857
[GCS]Tell dead nodes to commit suicide (#12792)
* [GCS]Tell dead nodes to commit suicide

* fix comment, add ut
2020-12-14 11:42:00 -08:00
Tao Wang
35f7d84dbe
Revert heartbeat interval to keep ci stable (#12836)
* Revert heartbeat interval to keep ci stable

* fix missing one
2020-12-14 16:58:40 +08:00
fangfengbin
1e02b28abe
[GCS]Move node resource info to gcs resource manager (#12775)
* add part code

* add part code

* fix review comments

* fix ut bug

* rebase master

* add part code

* fix ut bug

* fix ut bug

* fix review comments

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-13 20:37:34 +08:00
DK.Pino
153b24746c
[Placement Group] Refactor pg resource constrain in node manager (#12538)
* first version by pointer

* second version reference

* clean up

* add cpp ut

* lint

* extract LocalPlacementGroupManagerInterface

* lint

* fix comment

* add idempotency test

* lint

* fix pg ut

* fix pg ut

* python lint

* fix pg ut timeout

* python lint

* fix comment

* lint

* lint
2020-12-12 23:32:15 -08:00
Eric Liang
b73d4831d4
Add grace period before warning of resource deadlock 2020-12-12 12:02:13 -08:00
fangfengbin
c22990a537
[GCS]GCS node manager rename GetNode to GetAliveNode (#12781) 2020-12-12 20:34:43 +08:00
Alex Wu
aa64cd4534
[New scheduler] Fix test_global_state (#12586) 2020-12-11 21:47:01 -08:00
Eric Liang
1ce745cf44
Add automatic local GC and plasma debug logs every 10 minutes by default (#12804) 2020-12-11 17:09:58 -08:00
Alex Wu
676ec363f6
[Object Manager] Pull Manager refactor (#12335) 2020-12-11 11:56:23 -08:00
Eric Liang
4ad4463be6
Add comments to clarify purpose of new scheduler queues (#12730)
* update

* clarify

* update
2020-12-11 11:53:09 -08:00
Tao Wang
295b6e5ce4
Split heartbeat message (#12535)
* first

* xxx

* Split heartbeat message

* only report resource usage when changed

* Fix GetAllResourceUsage

* Fix report resource usage

* Increase default heartbeat interval

* regularize heartbeat interval in test case
2020-12-11 21:19:57 +08:00
Stephanie Wang
86b0741026
[new scheduler] Allocate resources for spilled back task to a local view of the remote node (#12711)
* Force report heartbeats if remote resources may be dirty

* lint

* typo

* typo

* unit test

* debug

* Revert "lint"

This reverts commit 6dc7e982ffee98185665eb7c3c8fde0d91938919.

* Revert "Force report heartbeats if remote resources may be dirty"

This reverts commit cbfa9405197df62f874107d55b46715ceae2abd2.

* Local view of resources

* debug travis

* debug

* debug

* debug

* weaken test

* cleanups

* lint

* Revert "debug travis"

This reverts commit 11ff5f4f84e64e9fbd4eecda5b3c7fd07a7130a4.

* revert

* const view, remove unused
2020-12-10 22:43:29 -05:00
Barak Michener
b7f246c451
[ray_client] Include multiple facets of the Ray API (#12736) 2020-12-10 19:09:34 -08:00
Edward Oakes
62d6b0a558
Fix max_task_retries for named actors (#12762) 2020-12-10 18:24:55 -06:00
Kai Yang
e3b5deb741
[Multi-tenancy] Delete flag enable_multi_tenancy and remove old code path (#10573) 2020-12-10 19:01:40 +08:00
Stephanie Wang
a776209aec
Revert "Fix dashboard agent check ppid is raylet pid (#12256)" (#12729)
This reverts commit 3ce9286977.
2020-12-09 17:20:38 -05:00
dHannasch
d455cae036
Add period to error message. (#12716) 2020-12-09 15:58:21 -06:00
Keqiu Hu
ee012532fb
[core] Use node manager client pool for GCS service #10398 (#12368)
* raylet client pool

* Fix merging conflict

* Fix documentation typo

* fix linting

* address comments

* fix typo

* remove unintended logging

* address comments

* fix bazel file lint error
2020-12-09 12:44:40 -08:00
Alex Wu
0b6e44efb8
[New scheduler] Cluster Resource Scheduler dynamic resources (for placement groups) (#12518)
* prepare implemented

* dynamic resources

* .

* commit

* .

* .

* Still needs to be cleaned up

* Passes basic tests + cleanup

* .

* .

* .

* Apply suggestions from code review

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* fix

* lint

Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2020-12-09 12:05:31 -08:00
fangfengbin
ef9ebbc636
[GCS]GCS based Actor Scheduling support actor colocation (#12707)
* [GCS]GCS based Actor Scheduling support actor colocation

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-09 11:54:23 -08:00
fyrestone
3ce9286977
Fix dashboard agent check ppid is raylet pid (#12256)
* Dashboard agent check ppid is raylet pid

* Improve implementation

* Refine code

* Make the RAY_NODE_PID environment required for dashboard agent

Co-authored-by: 刘宝 <po.lb@antfin.com>
2020-12-09 09:12:34 -05:00
Stephanie Wang
840de49161
Fix race condition between failure detection and references going out of scope (#12573)
* fix

* lint

* fix initialization
2020-12-08 23:49:55 -08:00
Stephanie Wang
50f28811ac
[new scheduler] Always spill back to a feasible node if the local node is not feasible (#12557)
* fix

lint

* feasible nodes

* Enable test, cleanup

* Revert "fix"

This reverts commit aef81d04c0b4560b758f846e1afdafbdb5552efe.

* unit test

* doc
2020-12-08 13:46:58 -05:00
fangfengbin
93c0eb249c
[PlacementGroup]Support acquire and return bundle resource from gcs resource manager (#12349) 2020-12-08 10:29:57 +08:00
fangfengbin
7e1422e925
[PlacementGroup]Fix placement group strict spread bug when node dead (#12647)
* [PlacementGroup]Fix strict spread bug when node dead

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-07 21:50:28 +08:00
Philipp Moritz
73a1a232b9
Ray debugger stepping between tasks (#12075) 2020-12-06 21:50:18 -08:00
fangfengbin
260b07cf0c
[PlacementGroup]Add PlacementGroup wait java api (#12499)
* add part code

* add part code

* add part code

* add part code

* fix review comments

* fix compile bug

* fix compile bug

* fix review comments

* fix review comments

* fix code style

* add part code

* fix review comments

* fix review comments

* fix code style

* rebase master

* fix bug

* fix lint error

* fix compile bug

* fix newline issue

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-05 16:40:04 +08:00
SangBin Cho
0138c2dbb4
[Metrics] Remove redundant unit specification. (#12595) 2020-12-04 00:06:21 -08:00
Kai Yang
21fcee28f9
[Java] Simplify Ray.init() by invoking ray start internally (#10762) 2020-12-04 14:33:45 +08:00
fangfengbin
ff34563539
[PlacementGroup]Fix bug that kill workers mistakenly when gcs restarts (#12568) 2020-12-03 17:50:48 +08:00
Stephanie Wang
443339ab19
[core] Move out-of-memory handling into the plasma store and support async object creation (#12186)
* Refactor to extract creation request queue

* timer on oom

* move timer out

* Move evict_if_full and on_store_full into plasma store

* Remove client-side code

* revert

* Distinguish between transient and permanent OOM delays

* update

* Move out create request queue, unit test

* unit test

* Fix max retries

* test

* Do not pin restored objects

* First pass to add polling requests, unit test passes

* worker plasma client retries plasma requests

* cleanup

* Clean up after disconnected clients, check memory leaks

* Support immediate requests in request queue

* Option to try creating immediately

* lint

* Fix build, address comments

* doc

* fixes

* debug travis

* debug

* debug

* debug

* debug

* Revert "debug"

This reverts commit 6bf2f6ee5640e71630c4aecdb7ebf54911ea32db.

Revert "debug"

This reverts commit 73017099c9b06cdaae1217bf0e0f4d23ed68a9e5.

Revert "debug"

This reverts commit 5a155529e28cee9461a598b0cdf7b6a3cc194c93.

Revert "debug"

This reverts commit b50c2101afd45d4cf663daae857bfe1b40387703.

Revert "debug travis"

This reverts commit 012b8721dedf9bca46294ae75eee2815b160368b.

* Skip if new scheduler enabled

* error message

* merge
2020-12-02 13:25:54 -05:00
Ian Rodney
786f839ff3
[Windows] Fix windows build (#12555)
* fix remote watch

* remove const

* unfix remote-watch

* format
2020-12-02 09:37:40 -08:00
Kai Fricke
0a12eba603
Revert "Fix race condition between failure detection and references going out of scope (#12548)" (#12570)
This reverts commit 8801e87a
2020-12-02 10:20:17 -05:00
Stephanie Wang
8801e87afd
Fix race condition between failure detection and references going out of scope (#12548)
* fix

* lint
2020-12-01 20:52:30 -05:00
Barak Michener
6412dfaf38
[ray_client] actors v0 (#12388) 2020-12-01 13:12:08 -08:00
SangBin Cho
0e892908f7
[Object Spilling] Delete spilled objects when references are gone out of scope. (#12341) 2020-12-01 13:10:39 -08:00
Simon Mo
f596113fc7
[Core] Actor Retries Out of Order Tasks on Restart (#12338) 2020-12-01 09:35:54 -08:00
SangBin Cho
f6f3cc9af1
[Core]Remove checkpoint table (#12235)
* Delete an actor entry from node manager.

* Remove checkpoint table

* remote checkpoint interface

* remove checkpoint interface

* fix ExitActorTest

Co-authored-by: chaokunyang <shawn.ck.yang@gmail.com>
2020-12-01 08:58:36 -08:00
Tao Wang
b85c6abc3e
Rename fields/variables from client id to node id (#12457) 2020-11-30 14:33:36 +08:00
Alex Wu
f1cc33a6a6
Actor resource backlog hotfix (#12471)
* prepare implemented

* works?

* deflek

* git

* deflek round 2

* .

* improve the test

Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-11-29 20:55:50 -08:00
Eric Liang
9ad0f173d6
Prestart workers to avoid slow start when multi-tenancy is enabled (#12430) 2020-11-27 21:47:46 -08:00
Eric Liang
569eee5e71
Enable more new scheduler tests (#12421) 2020-11-27 16:10:38 -08:00
fangfengbin
d5215745e4
[PlacementGroup] Introduce GcsResourceManager and avoid copying resources when scheduling placement groups (#12253) 2020-11-26 11:21:58 +08:00
SangBin Cho
2e4e285ef0
[Object Spilling] Fusion small objects (#12087) 2020-11-25 10:13:32 -08:00
Tao Wang
4dd0aa7822
[GCS]make thread number of gcs rpc server configurable (#12257) 2020-11-25 11:40:29 +08:00
Tao Wang
5d47d02f81
[GCS]add callback for RegisterSelf api, make it done first (#12252) 2020-11-25 11:36:44 +08:00
Tao Wang
e025b9e788
[TEST]Move all WaitReady together (#12254) 2020-11-25 11:21:24 +08:00
Tao Wang
2af10c1b78
[GCS]Add new message ReportResourceUsage (#11848) 2020-11-25 11:18:26 +08:00