Commit graph

1638 commits

Author SHA1 Message Date
Stephanie Wang
50f28811ac
[new scheduler] Always spill back to a feasible node if the local node is not feasible (#12557)
* fix

lint

* feasible nodes

* Enable test, cleanup

* Revert "fix"

This reverts commit aef81d04c0b4560b758f846e1afdafbdb5552efe.

* unit test

* doc
2020-12-08 13:46:58 -05:00
fangfengbin
93c0eb249c
[PlacementGroup]Support acquire and return bundle resource from gcs resource manager (#12349) 2020-12-08 10:29:57 +08:00
fangfengbin
7e1422e925
[PlacementGroup]Fix placement group strict spread bug when node dead (#12647)
* [PlacementGroup]Fix strict spread bug when node dead

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-07 21:50:28 +08:00
Philipp Moritz
73a1a232b9
Ray debugger stepping between tasks (#12075) 2020-12-06 21:50:18 -08:00
fangfengbin
260b07cf0c
[PlacementGroup]Add PlacementGroup wait java api (#12499)
* add part code

* add part code

* add part code

* add part code

* fix review comments

* fix compile bug

* fix compile bug

* fix review comments

* fix review comments

* fix code style

* add part code

* fix review comments

* fix review comments

* fix code style

* rebase master

* fix bug

* fix lint error

* fix compile bug

* fix newline issue

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-05 16:40:04 +08:00
SangBin Cho
0138c2dbb4
[Metrics] Remove redundant unit specification. (#12595) 2020-12-04 00:06:21 -08:00
Kai Yang
21fcee28f9
[Java] Simplify Ray.init() by invoking ray start internally (#10762) 2020-12-04 14:33:45 +08:00
fangfengbin
ff34563539
[PlacementGroup]Fix bug that kill workers mistakenly when gcs restarts (#12568) 2020-12-03 17:50:48 +08:00
Stephanie Wang
443339ab19
[core] Move out-of-memory handling into the plasma store and support async object creation (#12186)
* Refactor to extract creation request queue

* timer on oom

* move timer out

* Move evict_if_full and on_store_full into plasma store

* Remove client-side code

* revert

* Distinguish between transient and permanent OOM delays

* update

* Move out create request queue, unit test

* unit test

* Fix max retries

* test

* Do not pin restored objects

* First pass to add polling requests, unit test passes

* worker plasma client retries plasma requests

* cleanup

* Clean up after disconnected clients, check memory leaks

* Support immediate requests in request queue

* Option to try creating immediately

* lint

* Fix build, address comments

* doc

* fixes

* debug travis

* debug

* debug

* debug

* debug

* Revert "debug"

This reverts commit 6bf2f6ee5640e71630c4aecdb7ebf54911ea32db.

Revert "debug"

This reverts commit 73017099c9b06cdaae1217bf0e0f4d23ed68a9e5.

Revert "debug"

This reverts commit 5a155529e28cee9461a598b0cdf7b6a3cc194c93.

Revert "debug"

This reverts commit b50c2101afd45d4cf663daae857bfe1b40387703.

Revert "debug travis"

This reverts commit 012b8721dedf9bca46294ae75eee2815b160368b.

* Skip if new scheduler enabled

* error message

* merge
2020-12-02 13:25:54 -05:00
Ian Rodney
786f839ff3
[Windows] Fix windows build (#12555)
* fix remote watch

* remove const

* unfix remote-watch

* format
2020-12-02 09:37:40 -08:00
Kai Fricke
0a12eba603
Revert "Fix race condition between failure detection and references going out of scope (#12548)" (#12570)
This reverts commit 8801e87a
2020-12-02 10:20:17 -05:00
Stephanie Wang
8801e87afd
Fix race condition between failure detection and references going out of scope (#12548)
* fix

* lint
2020-12-01 20:52:30 -05:00
Barak Michener
6412dfaf38
[ray_client] actors v0 (#12388) 2020-12-01 13:12:08 -08:00
SangBin Cho
0e892908f7
[Object Spilling] Delete spilled objects when references are gone out of scope. (#12341) 2020-12-01 13:10:39 -08:00
Simon Mo
f596113fc7
[Core] Actor Retries Out of Order Tasks on Restart (#12338) 2020-12-01 09:35:54 -08:00
SangBin Cho
f6f3cc9af1
[Core]Remove checkpoint table (#12235)
* Delete an actor entry from node manager.

* Remove checkpoint table

* remote checkpoint interface

* remove checkpoint interface

* fix ExitActorTest

Co-authored-by: chaokunyang <shawn.ck.yang@gmail.com>
2020-12-01 08:58:36 -08:00
Tao Wang
b85c6abc3e
Rename fields/variables from client id to node id (#12457) 2020-11-30 14:33:36 +08:00
Alex Wu
f1cc33a6a6
Actor resource backlog hotfix (#12471)
* prepare implemented

* works?

* deflek

* git

* deflek round 2

* .

* improve the test

Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-11-29 20:55:50 -08:00
Eric Liang
9ad0f173d6
Prestart workers to avoid slow start when multi-tenancy is enabled (#12430) 2020-11-27 21:47:46 -08:00
Eric Liang
569eee5e71
Enable more new scheduler tests (#12421) 2020-11-27 16:10:38 -08:00
fangfengbin
d5215745e4
[PlacementGroup] Introduce GcsResourceManager and avoid copying resources when scheduling placement groups (#12253) 2020-11-26 11:21:58 +08:00
SangBin Cho
2e4e285ef0
[Object Spilling] Fusion small objects (#12087) 2020-11-25 10:13:32 -08:00
Tao Wang
4dd0aa7822
[GCS]make thread number of gcs rpc server configurable (#12257) 2020-11-25 11:40:29 +08:00
Tao Wang
5d47d02f81
[GCS]add callback for RegisterSelf api, make it done first (#12252) 2020-11-25 11:36:44 +08:00
Tao Wang
e025b9e788
[TEST]Move all WaitReady together (#12254) 2020-11-25 11:21:24 +08:00
Tao Wang
2af10c1b78
[GCS]Add new message ReportResourceUsage (#11848) 2020-11-25 11:18:26 +08:00
Tao Wang
e1075c0a82
[GCS]Fill resource fields when re-report heartbeat after gcs restarted (#12097) 2020-11-25 11:07:02 +08:00
fangfengbin
1d909321c9
[PlacementGroup]Fix node manager release unused bundles bug (#12346) 2020-11-25 11:02:43 +08:00
fangfengbin
5934b20b96
[PlacementGroup]Fix destroy bundle resources bug (#12336)
* [PlacementGroup]Fix destroy bundle resources bug

* revert AddBundleLocations code change

* add comment

* fix review comments

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-11-25 09:45:26 +08:00
Lixin Wei
462c7fb575
[streaming] export aligned_ symbols from raylet.so (#12345) 2020-11-24 10:16:12 -06:00
ZhuSenlin
1ae4d2873a
[GCS] refactor gcs initialization (#11890) 2020-11-24 21:11:18 +08:00
fangfengbin
be7938ee09
[PlacementGroup]Fix AddBundleLocations bug (#12330)
Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-11-24 16:57:17 +08:00
dHannasch
2c4514a2c0
[minor] Refactor to expose RedisContext::PingPort (#12022) 2020-11-23 20:39:50 -08:00
fangfengbin
084f03797b
[Placement Group]Placement Group supports gcs failover(Part3) (#12036) 2020-11-23 16:57:58 +08:00
Eric Liang
dac09bd569
Fix actor_registry_ copied on each heartbeat; Improve receive object chunk debug messages (#12187) 2020-11-19 16:45:37 -08:00
Stephanie Wang
7bf5145d36
Lint plasma source files (#12171) 2020-11-19 19:08:18 -05:00
Eric Liang
de86d5aff7
ActorStatisticalData() debug metrics bog down raylet with 100% CPU (#12148)
* comment out bad

* update
2020-11-19 11:38:44 -08:00
SangBin Cho
7d67af6c2a
[Metrics] Add stats to measure process startup time + scheduling stats. (#12100)
* Add new stats.

* Fix issues.
2020-11-19 11:04:26 -08:00
Ian Rodney
7fcce785ed
[hotfix] Fix windows build (#12146)
* [hotfix] fix windows

* remove debug logs
2020-11-19 11:00:19 -08:00
Ian Rodney
e086ddc18f
[core] Add Recursive task cancelation (#11923) 2020-11-18 15:18:40 -08:00
Alex Wu
e9c9ba9c9f
[New Scheduler] Don't start tasks if the owner is dead (#12050) 2020-11-18 11:34:19 -08:00
Ameer Haj Ali
eef624750c
[ray client] ray wait() implementation (#12072) 2020-11-18 11:33:57 -08:00
dHannasch
b41f4fdec2
Extract the connection logic to reduce duplication. (#12016) 2020-11-18 00:12:58 -08:00
fangfengbin
d87af0da88
[PlacementGroup]Add gcs placement group manager debug info (#12061)
Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-11-18 11:15:38 +08:00
fangfengbin
f400333841
[Placement Group]Placement Group supports gcs failover(Part2) (#12003)
* add testcase

* fix ut

* fix review comment

* fix review comment

* fix review comments

* fix ut bug

* add part code

* add part code

* add part code

* add testcase

* add part code

* fix ut bug

* fix ut timeout bug

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-11-18 10:59:26 +08:00
Stephanie Wang
f6bdd5ab17
[New Scheduler] Spillback from the queue of tasks assigned to the local node (#12084) 2020-11-17 16:13:59 -08:00
dHannasch
b5dfdb2a21
Log the Redis shard addresses as originally received from the head GCS. (#12011) 2020-11-17 13:11:17 -08:00
dHannasch
010e6cef3f
Allow setting the RAY_BACKEND_LOG_LEVEL to trace. (#12012) 2020-11-17 13:10:23 -08:00
dHannasch
f0dcf01807
Clarify that Ray is not yet retrying to connect. (#12013) 2020-11-17 13:01:42 -08:00
DK.Pino
0f9e2fec12
[Placement Group] Add get / get all / remove interface for Placement Group Java api. (#11821)
* add placement group java get/get all interface

* add remove placement group api

* fix some issue like: Placement Group -> placement group

* extract duplicate code to placement group utils

* specify running mode for placement group ut

* update checkGlobalStateAccessorPointerValid -> validateGlobalStateAccessorPointer

* use THROW_EXCEPTION_AND_RETURN_IF_NOT_OK

* update pg log print
2020-11-17 12:32:39 +08:00