Commit graph

1720 commits

Author SHA1 Message Date
fangfengbin
9ae5bba7cf
[GCS]Fix gcs table storage GetAll and GetByJobId api bug (#13195) 2021-01-07 10:37:00 +08:00
Siyuan (Ryans) Zhuang
02ae6c5a9a
[Core] Fix incorrect comment (#13228) 2021-01-06 11:37:29 -08:00
Lingxuan Zuo
01d4638b49
[Log] fix spdlog init race (#12973)
* fix spdlog init race

* use global logger

* refine logger name and constructor
2021-01-06 11:02:54 -08:00
dHannasch
695833082d
[Redis] Note that each Redis Connect retry takes two minutes (#12183)
* Slightly alter error message so it's the same in both cases.

* Each retry takes about two minutes.
2021-01-06 11:00:58 -08:00
SangBin Cho
32dc5676b4
[Metrics] Record per node and raylet cpu / mem usage (#12982)
* Record per node and raylet cpu / mem usage

* Add comments.

* Addressed code review.
2021-01-05 21:57:21 -08:00
fangfengbin
779b3876f6
[GCS]Fix TestActorSubscribeAll bug (#13193) 2021-01-06 13:52:39 +08:00
fangfengbin
dd14e5a3b3
[BugFix][GCS]Fix gcs_actor_manager_test multithreading bug (#13158) 2021-01-06 10:47:06 +08:00
Tao Wang
a0bbf2bfc2
Notify listeners after registered node stored (#13069) 2021-01-05 11:18:03 +08:00
fangfengbin
88eaa87e3a
Remove unused file(object_manager_integration_test.cc) (#12989) 2021-01-05 11:09:36 +08:00
Eric Liang
dfb326d4b5
Surface object store spilling statistics in ray memory (#13124) 2021-01-04 17:35:39 -08:00
Stephanie Wang
b765914a1b
Revert "Enabling the cancellation of non-actor tasks in a worker's queue (#12117)" (#13178)
This reverts commit b4d688b4a6.
2021-01-04 17:27:48 -08:00
Siyuan (Ryans) Zhuang
46cf433f0e
[Core] Remove Arrow dependencies (#13157)
* remove arrow ubsan

* remove arrow build depend

* remove arrow buffer
2021-01-04 11:19:09 -08:00
Gabriele Oliaro
b4d688b4a6
Enabling the cancellation of non-actor tasks in a worker's queue (#12117)
* wrote code to enable cancellation of queued non-actor tasks

* minor changes

* bug fixes

* added comments

* rev1

* linting

* making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error

* bug fix

* added two unit tests

* linting

* iterating through pending_normal_tasks starting from end

* fixup! iterating through pending_normal_tasks starting from end

* fixup! fixup! iterating through pending_normal_tasks starting from end

* post merge fixes

* added debugging instructions, pulled Accept() out of guarded loop

* removed debugging instructions, linting
2021-01-04 09:52:29 -08:00
Clark Zinzow
c2bff64699
[Core] Locality-aware leasing: Milestone 1 - Owned refs, pinned location (#12817)
* Locality-aware leasing for owned refs (pinned locations).

* LessorPicker --> LeasePolicy.

* Consolidate GetBestNodeIdForTask and GetBestNodeIdForObjects.

* Update comments.

* Turn on locality-aware leasing feature flag by default.

* Move local fallback logic to LeasePolicy, move feature flag check to CoreWorker constructor, add local-only lease policy.

* Add lease policy consulting assertions to the direct task submitter tests.

* Add lease policy tests.

* LocalityLeasePolicy --> LocalityAwareLeasePolicy.

* Add missing const declarations.

* Add RAY_CHECK for raylet address nullptr when creating lease client.

* Make the fact that LocalLeasePolicy always returns the local node more explicit.

* Flatten GetLocalityData conditionals to make it more readable.

* Add ReferenceCounter::GetLocalityData() unit test.

* Add data-intensive microbenchmarks for single-node perf testing.

* Add data-intensive microbenchmarks for simulated cluster perf testing.

* Remove redundant comment.

* Remove data-intensive benchmarks.

* Add locality-aware leasing Python test.

* Formatting changes in ray_perf.py.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2021-01-04 09:49:08 -08:00
fangfengbin
25f9f0d781
[GCS] Move resource usage info to gcs resource manager (#13059) 2020-12-25 15:17:45 +08:00
Siyuan (Ryans) Zhuang
cf9952a028
[Core] Remove outdated external store (#13080)
* remove outdated external store
2020-12-24 17:30:06 -08:00
Siyuan (Ryans) Zhuang
bf7f6a7de3
[Core] Remove cuda support in plasma store (#13070)
* remove cuda support in plasma store
2020-12-24 13:24:56 -08:00
Stephanie Wang
4461f9980a
Refactor TaskDependencyManager, allow passing bundles of objects to ObjectManager (#13006)
* New dependency manager

* Switch raylet to new DependencyManager

* PullManager accepts bundles

* Cleanup, remove old task dependency manager

* x

* PullManager unit tests

* lint

* Unit tests

* Rename

* lint

* test

* Update src/ray/raylet/dependency_manager.cc

* Update src/ray/raylet/dependency_manager.cc

* x

* lint

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2020-12-23 18:36:00 -08:00
Stephanie Wang
d95c8b8a41
[core][new scheduler] Move tasks from ready to dispatch to waiting on argument eviction (#13048)
* Add index for tasks to dispatch

* Task dependency manager interface

* Unsubscribe dependencies and tests

* NodeManager

* Revert "Add index for tasks to dispatch"

This reverts commit c6ccb9aa306e00f80d34b991055e4e83872595ea.

* tmp

* Move back to waiting if args not ready

* update
2020-12-23 09:33:43 -08:00
DK.Pino
6e19facc7f
[GCS] Delete redis gcs client and redis_xxx_accessor (#12996) 2020-12-23 20:31:46 +08:00
fangfengbin
646c4201ac
[GCS]Decouple gcs resource manager and gcs node manager (#13012) 2020-12-23 11:25:01 +08:00
fyrestone
62a5832007
[Dashboard] Add GET /logical/actors API (#12913) 2020-12-23 11:14:23 +08:00
Alex Wu
ea8d782be1
[core] Pull Manager exponential backoff (#13024) 2020-12-21 19:17:51 -08:00
Eric Liang
8068041006
Don't release resources during plasma fetch (#13025) 2020-12-21 18:32:40 -08:00
Eric Liang
03a5b90ed6
Revert "Revert "Increase the number of unique bits for actors to avoi… (#12990) 2020-12-21 15:16:42 -08:00
Kai Yang
5a6801dde7
[Core] Remove delete_creating_tasks (#12962) 2020-12-22 00:01:27 +08:00
fangfengbin
85a4435ba0
[GCS]Fix redis store client AsyncPutWithIndex unordered bug (#13002) 2020-12-21 20:02:50 +08:00
Barak Michener
c576f0b073
[ray_client] Implement a gRPC streaming logs API for the client (#13001) 2020-12-20 19:35:34 -08:00
fangfengbin
4caa6c6d78
[GCS]GCS resource manager remove cluster_resources_ (#12972) 2020-12-21 11:00:25 +08:00
Barak Michener
e715ade2d1
Support retrieval of named actor handles (#13000)
Change-Id: I05d31c9c67943d2a0230782cbdaa98341584cbc7
2020-12-20 16:34:50 -08:00
Barak Michener
80f6dd16b2
[ray_client] Implement optional arguments to ray.remote() and f.options() (#12985) 2020-12-20 15:43:48 -08:00
Barak Michener
7ab9164f1b
[ray_client] Integrate with test_basic, test_basic_2 and test_actor (#12964) 2020-12-20 14:54:18 -08:00
fangfengbin
3fab93b61b
Fix scheduling_resources comment errors (#12991)
* Fix scheduling_resources comment error

* add part code

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-20 20:20:07 +08:00
Eric Liang
64c97d25d3
Enable by default new scheduler (#12735) 2020-12-19 13:22:24 -08:00
Eric Liang
5d987f5988
Revert "Increase the number of unique bits for actors to avoid handle collisions (#12894)" (#12988)
This reverts commit 3e492a79ec.
2020-12-18 23:51:44 -08:00
dHannasch
a092433bc8
[core] Use the ConnectWithoutRetries error message (#12732) 2020-12-18 22:34:34 -08:00
SangBin Cho
9d939e6674
[Object Spilling] Implement level triggered logic to make streaming shuffle work + additional cleanup (#12773) 2020-12-18 19:31:14 -08:00
Alex Wu
404161a3ff
[Autoscaler/Core] Remove autoscaler spam (#12952) 2020-12-18 18:22:45 -08:00
Kai Yang
ac5ea2c13d
[Java] Fix output parsing in RunManager (#12968)
* Fix output parsing in RunManager

* change log level

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-18 18:22:12 -08:00
Eric Liang
6ece291f35
Clean up block/unblock handling of resources in new scheduler (#12963) 2020-12-18 16:00:54 -08:00
Eric Liang
3e492a79ec
Increase the number of unique bits for actors to avoid handle collisions (#12894) 2020-12-18 15:59:03 -08:00
Eric Liang
92812f2e8a
Implement resource deadlock detection for new scheduler (#12961) 2020-12-18 12:17:54 -08:00
Barak Michener
5cfa1934e4
[ray_client]: Implement object retain/release and Data Streaming API (#12818) 2020-12-18 11:47:38 -08:00
fangfengbin
a442cd17e0
[GCS]Optimize gcs client reconnection (#12878)
* [GCS]Optimize gcs client reconnection

* fix review comment

* fix review comment

* add part code

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-17 21:57:37 -08:00
dHannasch
cfefd7c70e
Test PingPort (#12954)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-12-17 21:15:42 -08:00
DK.Pino
6404f1e609
[Placement Group][New scheduler] New scheduler pg implementation (#12910) 2020-12-18 11:56:45 +08:00
Tao Wang
17152c84a7
[Tiny]Print raylet info after register (#12566) 2020-12-18 11:22:13 +08:00
dHannasch
d747071dd9
Test shard_context on already-created boost::asio::io_service. (#12917) 2020-12-17 14:26:30 -08:00
Allen
e6cb4f4bd7
[Core] Add log of address and port (#12908)
Co-authored-by: Allen Yin <allenyin@anyscale.io>
2020-12-17 00:25:29 -08:00
Yi Cheng
40032541dc
[core] Introduce fetch_local to ray.wait (#12526) 2020-12-16 23:44:28 -08:00