1419 commits

Author SHA1 Message Date
Alex Wu
136c8ff19e
[NewScheduler] Pass test_basic.py (#10059)
* .

* .

* Cleanup

* .

* whoops

* Update src/ray/raylet/scheduling/cluster_task_manager.h

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

* Update src/ray/raylet/scheduling/cluster_task_manager.h

Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>

* CR

* .

* .

* done

* .

* Unit tests

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
2020-08-21 15:00:08 -07:00
Barak Michener
f03caa4532
rpc: Follow-up by sharing the core worker client pool within the core worker. (#10206)
* Share CoreWorkerClientPool

* Format
2020-08-21 11:01:22 -07:00
Stephanie Wang
85e57a7a98
[Object spilling] Look up the location of the primary raylet from the owner's metadata (#10197)
* Get the primary copy from the owner, python test, some node manager fixes

* fixes and todo

* update

* lint

* fix build
2020-08-20 14:46:59 -07:00
fangfengbin
a462ae2747
[Placement Group]Add strict spread strategy (#10174)
* support STRICT_SPREAD strategy

* fix review comments

* rebase master

* fix lint error

* fix lint error

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-08-20 10:18:58 -07:00
SangBin Cho
224933b5e4
[Placement Group] Remove API part 2 (#10215)
* Initial progress done.

* Fix mistake.

* Addressed code review.

* Fix cpp build issue.

* Addressed code review.
2020-08-20 09:50:13 -07:00
fangfengbin
9734dbca3e
[Placement Group]Reschedule bundles when the node of bundles is dead (#10021) 2020-08-19 13:24:42 -07:00
SangBin Cho
263df6163c
[Placement Group] Placement group remove api part 1 (#10063)
* Added basic rpc calls.

* fix issues.

* Fix the gcs server not getting request issue.

* In Progress.

* Basic logic done. Tests are required.

* In progress.

* In progress in refactoring context.

* Revert "In progress in refactoring context."

This reverts commit 38236256cf1306c60dd203e75d45ceb4509c8106.

* Working now.

* Python test works.

* Lint.

* Addressed code review.

* Addressed code review.

* Lint.

* Added unit tests.

* Done, but one of unit tests fail

* Addressed code review.

* Addressed the last code review.

* Fix the wrong test case.
2020-08-18 12:44:00 -07:00
Simon Mo
bedc2c24c8
Export Metrics in OpenCensus Protobuf Format (#10080) 2020-08-18 11:32:42 -07:00
SangBin Cho
053188dfbe
[Placement Group] Support Placement Group state table. (#10090)
* Done.

* Addressed code review.

* Linting.

* Fix lint.

* Fix lint.

* Fix a test.

* Lint.

* Add a lint sleep to test.

* Fix the lint issue.

* Fixed doc build error.
2020-08-17 09:24:50 -07:00
fangfengbin
edd783bc32
[Placement Group]Add soft pack strategy (#10099) 2020-08-17 12:01:34 +08:00
Tao Wang
fba5906ce3
[GCS] Re-report heartbeat when gcs server restarts (#10040)
* Retry to send failed heartbeat when light heartbeat enabled

* Re-report heartbeat when gcs server restarts

* remove is_pubsub_server_restarted

* add lock per comment

* minor change, name related
2020-08-14 17:37:20 -07:00
Siyuan (Ryans) Zhuang
17ca1d8ff4
[Core] Object spilling prototype (#9818) 2020-08-14 15:39:10 -07:00
Robert Nishihara
36e626e95d
Revert "[Dashboard] Start the new dashboard (#9860)" (#10116)
This reverts commit 739933e5b8.
2020-08-14 14:06:57 -07:00
fangfengbin
3a6fa7d622
[Placement Group]Optimize placement group strict pack strategy (#9924)
* add part code

* add code

* add part code

* rm used import

* add part code

* add part code

* add part code

* add part code

* add part code

* add part code

* fix review comment

* add testcase

* use ResourceSet

* fix review comment

* fix ut bug

Co-authored-by: 灵洵 <fengbin.ffb@antfin.com>
2020-08-13 23:58:52 -07:00
Simon Mo
01f38bc5d1
CoreWorker correctly push metrics to agent (#10031) 2020-08-13 16:44:53 -07:00
Ícaro Aragão
b77d6bf87d
[GCS] Improve fallback for getting local valid IP for GCS server (#10004) 2020-08-13 16:29:47 -05:00
SangBin Cho
86b1db3f11
[Stats] Make metrics report time configurable (#10036)
* Done.

* Lint.

* Address code review.

* Address code review.

* Remove wrong commit.

* Fix a test error.
2020-08-13 00:30:24 -07:00
fyrestone
739933e5b8
[Dashboard] Start the new dashboard (#9860) 2020-08-13 11:01:46 +08:00
fangfengbin
701e26e0af
[GCS]Add node realtime resource view (#10043) 2020-08-12 10:52:17 +08:00
Zhuohan Li
a6fed4820e
[Core] Preliminary implementation of ownership-based object directory (#9735) 2020-08-11 15:04:13 -07:00
SangBin Cho
946ae74817
[GCS Actor Management] Race condition around creating -> created phase. (#10035)
* Fix the issue.

* Address a code review.
2020-08-11 12:31:27 -07:00
Basasuya
0400a88bf1
[EVENT] Basic Function and Definition (#9657) 2020-08-11 17:36:07 +08:00
Kai Yang
3bc17fa62a
[Core] Multi-tenancy: Pass env variables from job config to worker processes (#10022) 2020-08-10 14:31:37 -07:00
Alex Wu
2ebf76c7a3
[New Scheduler] Additional unit tests (#9990) 2020-08-10 11:44:06 -07:00
SangBin Cho
eb6b10221e
Increase the num of trials to reduce the probability of failing sample_test (#10007) 2020-08-10 10:05:33 -07:00
Kai Yang
37821f0b4c
Support unlimited JVM options (#9910) 2020-08-10 16:08:33 +08:00
fangfengbin
26b36a1982
Optimize node register&worker failure log (#9833) 2020-08-10 11:41:45 +08:00
fangfengbin
a2bfdcbf24
[Placement Group]Trigger placement group scheduling when a new node is added (#9905) 2020-08-10 10:56:17 +08:00
Barak Michener
8e76796fd0
ci: Redo format.sh --all script & backfill lint fixes (#9956) 2020-08-07 16:49:49 -07:00
Barak Michener
1d01c668f0
rpc: Core Worker client pool (#9934) 2020-08-07 16:34:29 -07:00
Tao Wang
8bea875673
[TEST]Check if port is free before start up redis (#9974)
* [TEST]Check if port is free before start up redis

* per comment
2020-08-07 10:15:12 -07:00
SangBin Cho
44826878ff
[Core] Remove Legacy Raylet Code (#9936)
* Remove a flag and some methods in node manager including HandleDisconnectedActor, ResubmitTask, and HandleTaskReconstruction

* Make actor creator always required + remove raylet transport

* Remove actor reporter + remove FinishAssignedActorCreationTask

* Remove actor tasks.

* Remove finishactortask and switched it to finishactorcreation task

* Remove reconstruction policy.

* Remove lineage cache.

* Formatting.

* Remove actor frontier code.

* Removed build error.

* Revert "Remove reconstruction policy."

This reverts commit 9d25c9bced4da5fbcac5d484d51013345f16513b.

* Recover HandleReconstruction to mark expired objects as failed.
2020-08-06 16:37:50 -07:00
SangBin Cho
ec2f1a225e
[Stats] Metrics Export User Interface Part 1 (#9913)
* Metrics export port expose done.

* Support exposing metrics port + metrics agent service discovery through ray.nodes()

* Formatting.

* Added a doc.

* Linting.

* Change the location of metrics agent port.

* Addressed code review.

* Addressed code review.
2020-08-06 16:16:29 -07:00
Eric Liang
7d4f204aa8
[Placement Group] Allow scheduling a task on any bundle (-1, default) (#9885)
* wip

* wip

* fix tests

* wip

* wip

* wip

* wip

* wip

* add test

* update

* update

* remove debug

* comments
2020-08-06 00:05:21 -07:00
Tao Wang
1760586628
[GCS]Use an asynchronous PING to avoid blocking other operations (#9871)
* Use separate redis client to avoid its sync command blocking other operations

* use redis_failure_detector_client_

* use async command to ping redis

* format log
2020-08-05 19:10:53 -07:00
SangBin Cho
68899e2f8e
[GCS Actor Management] Fix race condition for DEPENDENCIES_UNREADY states. (#9883)
* Fix issues.

* Address code review.

* Addressed code review 2.

* Fix formatting.

* Addressed code review 3.

* Addressed code review.
2020-08-05 12:22:12 -07:00
SangBin Cho
685182923c
[Core] Fix detached actor local mode when gcs actor management is on. (#9839)
* Fix local mode detached actor.

* Revert changes.
2020-08-05 09:04:24 -07:00
kisuke95
ddc1e483fb
Fix actor table Delete bug (#9499) 2020-08-05 18:05:51 +08:00
kisuke95
80d2544f6b
Fix vector<bool> for loop (#9907) 2020-08-05 17:49:37 +08:00
fangfengbin
193d11ab8b
Optimize placement group log (#9891) 2020-08-05 14:41:32 +08:00
chaokunyang
3323ad9d59
[HOTFIX] Fix master build with missing placement group argument (#9868)
* fix common task submit default placement group

* fix java_function
2020-08-04 11:19:15 -07:00
Barak Michener
c16e1b9524
src/ray/protobuf: Break proto rules into a proper BUILD file (#9792) 2020-08-04 11:12:45 -07:00
Kai Yang
27cd323ce1
[Core] Multi-tenancy: Job isolation & implement per job config (except for env variables) (#9500) 2020-08-04 15:51:29 +08:00
kisuke95
28b1f7710c
[Core] Error info pubsub (Remove ray.errors API) (#9665) 2020-08-04 14:04:29 +08:00
fangfengbin
8c3fc1db76
Optimize actor creation log (#9781) 2020-08-04 10:29:30 +08:00
Zhijun Fu
4f2e4f31dd
async grpc calls should always return void (#9533) 2020-08-03 12:44:02 -07:00
Stephanie Wang
37a9c5783c
[core] Report resource load by shape (#9806)
* Report and aggregate resource load by shape

* python test

* python test

* x

* update
2020-07-31 16:57:30 -07:00
Eric Liang
b73080c85f
Allow tasks to be used with placement groups (#9738) 2020-07-31 10:51:37 -07:00
fangfengbin
3900643948
Add actor states definitions & transition diagram doc (#9754) 2020-07-31 15:35:25 +08:00
Kai Yang
02fd950252
[Java] Local and distributed ref counting in Java (#9371) 2020-07-31 11:49:31 +08:00