Commit graph

1930 commits

Author SHA1 Message Date
Eric Liang
3e492a79ec
Increase the number of unique bits for actors to avoid handle collisions (#12894) 2020-12-18 15:59:03 -08:00
Eric Liang
92812f2e8a
Implement resource deadlock detection for new scheduler (#12961) 2020-12-18 12:17:54 -08:00
Barak Michener
5cfa1934e4
[ray_client]: Implement object retain/release and Data Streaming API (#12818) 2020-12-18 11:47:38 -08:00
fangfengbin
a442cd17e0
[GCS]Optimize gcs client reconnection (#12878)
* [GCS]Optimize gcs client reconnection

* fix review comment

* fix review comment

* add part code

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-17 21:57:37 -08:00
dHannasch
cfefd7c70e
Test PingPort (#12954)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-12-17 21:15:42 -08:00
DK.Pino
6404f1e609
[Placement Group][New scheduler] New scheduler pg implementation (#12910) 2020-12-18 11:56:45 +08:00
Tao Wang
17152c84a7
[Tiny]Print raylet info after register (#12566) 2020-12-18 11:22:13 +08:00
dHannasch
d747071dd9
Test shard_context on already-created boost::asio::io_service. (#12917) 2020-12-17 14:26:30 -08:00
Allen
e6cb4f4bd7
[Core] Add log of address and port (#12908)
Co-authored-by: Allen Yin <allenyin@anyscale.io>
2020-12-17 00:25:29 -08:00
Yi Cheng
40032541dc
[core] Introduce fetch_local to ray.wait (#12526) 2020-12-16 23:44:28 -08:00
Tao Wang
12231ec2a6
Optimize heartbeat manager initialization (#12911) 2020-12-17 14:24:23 +08:00
SangBin Cho
057687e534
[New Scheduler] Fix test_failure.py by supporting infeasible tasks (#12738)
* Fix the first issue.

* ip

* In Progress.

* In progress.

* done.

* Remove unnecessary logs.

* Addressed code review + fix some test failures.

* Try fixing issues.

* Fix issues.

* Fix test issues.

* Fix issues.

* done.
2020-12-16 21:27:50 -08:00
Alex Wu
8b783ecafa
Fix pull manager retry (#12907) 2020-12-16 14:18:43 -08:00
fangfengbin
91878d18b5
[PlacementGroup]Fix placement group wait api disorder bug (#12827)
* [PlacementGroup]Fix placement group wait api disorder bug

* fix review comment

* fix review comment

* fix review comment

* fix review comments

* increase num_heartbeats_timeout

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-16 18:45:53 +08:00
Eric Liang
7ff314a5df
[New scheduler] Also unsubscribe get dependencies on unblock 2020-12-15 20:29:44 -08:00
Alex Wu
0031723ace
[New scheduler] Object spilling (#12857) 2020-12-15 11:05:38 -08:00
Edward Oakes
261b2f9053
Check for raylet PID as ppid in dashboard agent fate-sharing (#12867) 2020-12-15 12:13:11 -06:00
Max Fitton
e077bc4206
[Release] Bump master to 1.2.0 for 1.1.0 release (#12856) 2020-12-15 09:40:26 -08:00
Simon Mo
b291dd4486
[Metrics] Call GetMeasureDoubleByName to prevent override (#12860) 2020-12-15 09:39:39 -08:00
fangfengbin
43b9259d40
[GCS]GCS resource manager support scheduling resource (#12780)
* add part code

* add part code

* fix review comments

* rebase master

* add part code

* add part code

* fix review comments

* add part code

* fix code style

* fix ut bug

* fix ut bug

* fix review comments

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-15 10:27:55 +08:00
Tao Wang
ac53e2f857
[GCS]Tell dead nodes to commit suicide (#12792)
* [GCS]Tell dead nodes to commit suicide

* fix comment, add ut
2020-12-14 11:42:00 -08:00
Tao Wang
35f7d84dbe
Revert heartbeat interval to keep ci stable (#12836)
* Revert heartbeat interval to keep ci stable

* fix missing one
2020-12-14 16:58:40 +08:00
fangfengbin
1e02b28abe
[GCS]Move node resource info to gcs resource manager (#12775)
* add part code

* add part code

* fix review comments

* fix ut bug

* rebase master

* add part code

* fix ut bug

* fix ut bug

* fix review comments

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-13 20:37:34 +08:00
DK.Pino
153b24746c
[Placement Group] Refactor pg resource constrain in node manager (#12538)
* first version by pointer

* second version reference

* clean up

* add cpp ut

* lint

* extract LocalPlacementGroupManagerInterface

* lint

* fix comment

* add idempotency test

* lint

* fix pg ut

* fix pg ut

* python lint

* fix pg ut timeout

* python lint

* fix comment

* lint

* lint
2020-12-12 23:32:15 -08:00
Eric Liang
b73d4831d4
Add grace period before warning of resource deadlock 2020-12-12 12:02:13 -08:00
fangfengbin
c22990a537
[GCS]GCS node manager rename GetNode to GetAliveNode (#12781) 2020-12-12 20:34:43 +08:00
Alex Wu
aa64cd4534
[New scheduler] Fix test_global_state (#12586) 2020-12-11 21:47:01 -08:00
Eric Liang
1ce745cf44
Add automatic local GC and plasma debug logs every 10 minutes by default (#12804) 2020-12-11 17:09:58 -08:00
Alex Wu
676ec363f6
[Object Manager] Pull Manager refactor (#12335) 2020-12-11 11:56:23 -08:00
Eric Liang
4ad4463be6
Add comments to clarify purpose of new scheduler queues (#12730)
* update

* clarify

* update
2020-12-11 11:53:09 -08:00
Tao Wang
295b6e5ce4
Split heartbeat message (#12535)
* first

* xxx

* Split heartbeat message

* only report resource usage when changed

* Fix GetAllResourceUsage

* Fix report resource usage

* Increase default heartbeat interval

* regularize heartbeat interval in test case
2020-12-11 21:19:57 +08:00
Stephanie Wang
86b0741026
[new scheduler] Allocate resources for spilled back task to a local view of the remote node (#12711)
* Force report heartbeats if remote resources may be dirty

* lint

* typo

* typo

* unit test

* debug

* Revert "lint"

This reverts commit 6dc7e982ffee98185665eb7c3c8fde0d91938919.

* Revert "Force report heartbeats if remote resources may be dirty"

This reverts commit cbfa9405197df62f874107d55b46715ceae2abd2.

* Local view of resources

* debug travis

* debug

* debug

* debug

* weaken test

* cleanups

* lint

* Revert "debug travis"

This reverts commit 11ff5f4f84e64e9fbd4eecda5b3c7fd07a7130a4.

* revert

* const view, remove unused
2020-12-10 22:43:29 -05:00
Barak Michener
b7f246c451
[ray_client] Include multiple facets of the Ray API (#12736) 2020-12-10 19:09:34 -08:00
Edward Oakes
62d6b0a558
Fix max_task_retries for named actors (#12762) 2020-12-10 18:24:55 -06:00
Kai Yang
e3b5deb741
[Multi-tenancy] Delete flag enable_multi_tenancy and remove old code path (#10573) 2020-12-10 19:01:40 +08:00
Stephanie Wang
a776209aec
Revert "Fix dashboard agent check ppid is raylet pid (#12256)" (#12729)
This reverts commit 3ce9286977.
2020-12-09 17:20:38 -05:00
dHannasch
d455cae036
Add period to error message. (#12716) 2020-12-09 15:58:21 -06:00
Keqiu Hu
ee012532fb
[core] Use node manager client pool for GCS service #10398 (#12368)
* raylet client pool

* Fix merging conflict

* Fix documentation typo

* fix linting

* address comments

* fix typo

* remove unintended logging

* address comments

* fix bazel file lint error
2020-12-09 12:44:40 -08:00
Alex Wu
0b6e44efb8
[New scheduler] Cluster Resource Scheduler dynamic resources (for placement groups) (#12518)
* prepare implemented

* dynamic resources

* .

* commit

* .

* .

* Still needs to be cleaned up

* Passes basic tests + cleanup

* .

* .

* .

* Apply suggestions from code review

Co-authored-by: SangBin Cho <rkooo567@gmail.com>

* fix

* lint

Co-authored-by: Alex <alex@anyscale.com>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2020-12-09 12:05:31 -08:00
fangfengbin
ef9ebbc636
[GCS]GCS based Actor Scheduling support actor colocation (#12707)
* [GCS]GCS based Actor Scheduling support actor colocation

* fix review comment

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-09 11:54:23 -08:00
fyrestone
3ce9286977
Fix dashboard agent check ppid is raylet pid (#12256)
* Dashboard agent check ppid is raylet pid

* Improve implementation

* Refine code

* Make the RAY_NODE_PID environment required for dashboard agent

Co-authored-by: 刘宝 <po.lb@antfin.com>
2020-12-09 09:12:34 -05:00
Stephanie Wang
840de49161
Fix race condition between failure detection and references going out of scope (#12573)
* fix

* lint

* fix initialization
2020-12-08 23:49:55 -08:00
Stephanie Wang
50f28811ac
[new scheduler] Always spill back to a feasible node if the local node is not feasible (#12557)
* fix

lint

* feasible nodes

* Enable test, cleanup

* Revert "fix"

This reverts commit aef81d04c0b4560b758f846e1afdafbdb5552efe.

* unit test

* doc
2020-12-08 13:46:58 -05:00
fangfengbin
93c0eb249c
[PlacementGroup]Support acquire and return bundle resource from gcs resource manager (#12349) 2020-12-08 10:29:57 +08:00
fangfengbin
7e1422e925
[PlacementGroup]Fix placement group strict spread bug when node dead (#12647)
* [PlacementGroup]Fix strict spread bug when node dead

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-07 21:50:28 +08:00
Philipp Moritz
73a1a232b9
Ray debugger stepping between tasks (#12075) 2020-12-06 21:50:18 -08:00
fangfengbin
260b07cf0c
[PlacementGroup]Add PlacementGroup wait java api (#12499)
* add part code

* add part code

* add part code

* add part code

* fix review comments

* fix compile bug

* fix compile bug

* fix review comments

* fix review comments

* fix code style

* add part code

* fix review comments

* fix review comments

* fix code style

* rebase master

* fix bug

* fix lint error

* fix compile bug

* fix newline issue

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-05 16:40:04 +08:00
SangBin Cho
0138c2dbb4
[Metrics] Remove redundant unit specification. (#12595) 2020-12-04 00:06:21 -08:00
Kai Yang
21fcee28f9
[Java] Simplify Ray.init() by invoking ray start internally (#10762) 2020-12-04 14:33:45 +08:00
fangfengbin
ff34563539
[PlacementGroup]Fix bug that kill workers mistakenly when gcs restarts (#12568) 2020-12-03 17:50:48 +08:00