Commit graph

2383 commits

Author SHA1 Message Date
SangBin Cho
3b865b463a
[Core] Fix GPU first scheduling that is not working with placement group (#19141)
* done

* Revert "done"

This reverts commit 56b18f0a7d14c5466d726c3ed1264f3e1506771e.

* ip

* Revert "Revert "done""

This reverts commit a34c90b0920893f4efbf171b8159f0d08a10dca0.

* Done

* Remove unnecessary log message

* skip test on windows

* Handle the code review.
2021-10-11 00:12:25 -07:00
liuyang-my
5353c5c2f1
Define Java Proxy and RayServeHandle (#18630) 2021-10-10 23:39:04 -07:00
gjoliver
635010d460
Update build rules and patches for darwin_arm64 platform. (#19037)
* Update build rules and patches for darwin_arm64 platform.

Changes include:

Update nelhage/rules_boost package from current version (08/5/2020) to 5/27/2021 version.
Remove rules_boost-undefine-boost_fallthrough.patch, since BOOST_FALLTHROUGH seems to be defined now.
Minor changes to rules_boost-windows-linkopts.patch to use default condition to add -lpthread flag for all platforms.
Add darwin_arm64 config to BUILD files for lib civetweb pulled in via prometheus dependency.

* upgrade boost to 1.74.0 from 1.71.0 to match the updated build file for windows.

* Fix ray_cpp_pkg

* Use boost/bind/bind.hpp

boost/bind.hpp and global namespace placeholders are deprecated.

* lint

* Use absl::bind_front when possible. Otherwise, NOLINT

* lint

* lint

* lint

* lint

* more lint

* final lint

* trigger build
2021-10-09 18:48:35 -07:00
Guyang Song
bae543c956
[runtime env] support eager_install in runtime env (#17949) 2021-10-09 17:59:57 +08:00
mwtian
b066627539
[Object manager] don't abort entire pull request on race condition from concurrent chunk receive - #2 (#19216) 2021-10-08 12:58:18 -07:00
chenk008
3780a73b45
[Core] Add worker resource info to runtime env (#18804) 2021-10-08 10:37:29 -07:00
Edward Oakes
9cf19b67cc
[serve] Remove long poll client from replicas (#19145)
In general, broadcasting changes to the replicas via the LongPollClient is hard to reason about (it circumvents our versioning semantics as there's no rolling update). Ideally we would only be using the LongPollClient to broadcast replica membership and nothing else.
2021-10-08 12:32:42 -05:00
Guyang Song
c4bc05bbab
set event_log_reporter_enabled True by default (#18112) 2021-10-07 23:09:36 -07:00
SangBin Cho
afaee05e1e
[Placement Group] Fix placement group removal leak (#19138) 2021-10-07 22:04:12 -07:00
Stephanie Wang
940f84cedb
[core] Remove unused plasma promotion path (#19122)
* remove unused

* lint

* lint

* lint
2021-10-07 10:55:50 -07:00
SangBin Cho
0ef0d9a77d
Revert "[core] Assign tasks to the first available worker (#18167)" (#19180)
This reverts commit 545db13800.
2021-10-07 10:38:37 -07:00
SangBin Cho
22f4ffed08
Disable cpu-only-nodes preferred scheduling that breaks placement groups. (#19129)
* Add a regression test for the short term

* done

* address code review

* lint
2021-10-07 05:34:04 -07:00
Chen Shen
1ed5f622c2
[Core] QuickExit CoreWorker when GetCoreWorker is called after shutdown 2021-10-06 15:07:57 -07:00
Stephanie Wang
545db13800
[core] Assign tasks to the first available worker (#18167)
* Convert worker pool to queue

* Start up to backlog size more workers

* fixes

* Prestart workers according to num available CPUs

* lint

* x

* Update src/ray/raylet/worker_pool.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/raylet/worker_pool.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* dedicated workers

* Fix tests

* x

* fix

* asan

* asan

* Workers can only exec tasks with same job ID

* size_t for runtime env hash, fix unit tests

* include job ID in runtime env hash, remove from worker registration msg

* x

* conflict

* debug

* Schedule and dispatch periodically, skip if no new tasks

* Update src/ray/common/task/task_spec.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/raylet/scheduling/cluster_task_manager.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Update src/ray/raylet/worker_pool.h

Co-authored-by: Eric Liang <ekhliang@gmail.com>

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-10-05 13:45:50 -07:00
Chen Shen
1efcf5c3d5
[Core][CoreWorker ThreadSafety 1/n] Ensure global_worker_ is protected by mutex #19073 2021-10-05 05:32:28 -07:00
Yi Cheng
2cff293810
fix (#19094) 2021-10-05 01:53:05 -07:00
Yi Cheng
056c3af699
[core] Update placement group retry implementation (#18842)
* exp backoff

* up

* format

* up

* up

* up

* up

* up

* format

* fix

* up

* format

* adjust ordering

* up

* Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)"

This reverts commit 2e99fb215f.

* up

* update

* format

* up

* format

* fix

* Revert "Revert "[tune] Cache unstaged placement groups for potential re-use (#18706)""

This reverts commit 93425fdb986059e53699623a0fc8590c062e139b.

* up

* format

* fix lint

* up

* up

* up

* up

* check

* add test1

* format

* up

* add test

* up

* up

* up

* fix

* up

* up

* up

* add test

* format

* up

* up

* fix lint

* format

* fix

* format

* fix

* up
2021-10-04 21:31:56 -07:00
SangBin Cho
83cb992d5b
Revert pull retry (#19068)
* Revert "[Object manager] fix comments"

This reverts commit 56debfc063.

* Revert "[Object manager] don't abort entire pull request on race condition in concurrent chunk receive (#18955)"

This reverts commit d12e35ce53.

* Fix a lint issue
2021-10-04 11:20:43 -07:00
SangBin Cho
7fcf1bf57e
[Dashboard] Refine the dashboard restart logic. (#18973)
* in progress

* Refine the dashboard agent retry logic

* refine

* done

* lint
2021-10-04 05:01:51 -07:00
mwtian
56debfc063
[Object manager] fix comments 2021-10-01 11:42:07 -07:00
Stephanie Wang
c052395f4e
[core] Remove "plasma promotion" for serialized ObjectRefs 2021-10-01 10:39:55 -07:00
architkulkarni
b0a5564f4e
[Serve] Integrate metrics with minimal autoscaling algorithm and add e2e test (#18793) 2021-10-01 10:21:12 -07:00
Tom Birch
aa0cab5cae
Don't export absl symbols as they collide with tensorflow (#18870)
Co-authored-by: Tom Birch <tom@powerlinespro.com>
2021-10-01 13:20:59 +08:00
mwtian
49a57aa477
[Scheduling] Report resource demand for infeasible 1-CPU tasks (#19000) 2021-09-30 22:03:02 -07:00
Edward Oakes
8e5d48d668
[runtime_env] Remove deprecated override_environment_variables and worker_env fields (#18213) 2021-09-30 18:55:24 -05:00
mwtian
d12e35ce53
[Object manager] don't abort entire pull request on race condition in concurrent chunk receive (#18955) 2021-09-30 10:19:54 -07:00
Simon Mo
910553c3bb
[Core] Add private method to retrieve current task queue length (#18964) 2021-09-30 09:20:04 -07:00
Stephanie Wang
5eddaabd11
[core] Fix bug in dependency resolution for actor handles (#18862)
* x

* lint
2021-09-29 13:25:31 -07:00
Jiajun Yao
ed9118393c
Listen to 127.0.0.1 by default on mac osx (#18904) 2021-09-29 11:40:19 -07:00
Dmitri Gekhtman
944309c017
Revert "[nightly] Deflaky nightly test many_nodes_actor_test (#18582)" (#18954)
* Revert "[nightly] Deflaky nightly test many_nodes_actor_test (#18582)"

This reverts commit fc6a739e4b.

* move to large test

Co-authored-by: Yi Cheng <chengyidna@gmail.com>
2021-09-29 11:02:14 -04:00
Chong-Li
42744f29ee
[GCS] Make Gcs-based actor scheduler's bookkeeping consistent (#18546)
* Make Gcs-based scheduler's bookkeeping consistent

* Remove this from lambda function

* Fix lambda function

* Trigger SchedulePendingActors

* Test for acquiring/releasing resources

* Reorganize structure

* Avoid overloading post

* Fix gcs_actor_manager_test

* Fix post counter and rename some func

* Fix unique_ptr

* Fix unique_ptr

* Fix book lint error

* Lint

Co-authored-by: Chong-Li <lc300133@antgroup.com>
2021-09-29 05:53:34 -07:00
Lixin Wei
a6a02779fe
[Core] remove verbose log from task execution (#18736) 2021-09-29 00:31:33 -07:00
Yi Cheng
96dff6e46d
[core] fix implicit merge conflict (#18961) 2021-09-28 19:18:54 -07:00
Edward Oakes
73b8936aa8
[runtime_env] Unify rpc::RuntimeEnv with serialized_runtime_env field (#18641) 2021-09-28 15:13:15 -05:00
Yi Cheng
4af07a8917
[rpc] cpu improvement of protobuf in gcs (#17933) 2021-09-28 11:47:19 -07:00
SangBin Cho
a0a02f4982
[Placement Group] Fix placement group high cpu usage part 1 (#18652) 2021-09-28 11:14:59 -07:00
Yi Cheng
e3dd1e3751
Revert "Revert "[test] add unit test for PR #17634 (#18585)" (#18830)" (#18871)
* Revert "Revert "[test] add unit test for PR #17634 (#18585)" (#18830)"

This reverts commit 8dd3057644.

* up
2021-09-28 05:53:52 -07:00
Chen Shen
057c425122
[Core][CoreWorker] call shutdown in the correct thread (#18910) 2021-09-28 01:29:47 -07:00
Chen Shen
25d14cb4de
Ensure task_execution_service_ is destructed first (#18913) 2021-09-27 18:31:56 -07:00
Chen Shen
cbd7dc749c
[Core][CoreWorker] fix data race of exiting_ 2021-09-27 10:55:03 -07:00
mwtian
66aac2e219
[C++] Use RayConfig to read internal environment variables only once (#18869)
* store environ on first access

* fix

* Use RayConfig

* fix

* fix

* Revert removal of headers. They are actually used.

* rename

* fix lint

* format

* use std::getenv()

* fix
2021-09-25 12:27:42 -07:00
Jiajun Yao
e79f271b05
Fix nil redis array element (#18813) 2021-09-24 20:11:43 -07:00
Eric Liang
11a2dfcaab
Improve unschedulable task warning messages by integrating with the autoscaler (#18724) 2021-09-24 12:19:58 -07:00
Stephanie Wang
7b1e594412
[core] Fix bug in ref counting protocol for nested objects (#18821)
* Fix assertion crash

* test, lint

* todo

* tests

* protocol

* test

* fix

* lint

* header

* recursive

* note

* forward test

* lock

* lint

* unneeded check
2021-09-23 09:45:12 -07:00
mwtian
e41109a5e7
[Client] Use async rpc for remote call and actor creation (#18298)
* Use async rpc for remote calls, task and actor creations.

* fix

* check placement

* check placement group. wait for id in destructor

* fix

* fix exception in destructor

* Add test

* revert change

* Fix comment

* fix
2021-09-22 18:30:50 -07:00
Yi Cheng
8dd3057644
Revert "[test] add unit test for PR #17634 (#18585)" (#18830)
This reverts commit 73c3cff18b.
2021-09-22 16:51:02 -07:00
Yi Cheng
73c3cff18b
[test] add unit test for PR #17634 (#18585) 2021-09-22 14:39:30 -07:00
Yi Cheng
fc6a739e4b
[nightly] Deflaky nightly test many_nodes_actor_test (#18582) 2021-09-20 22:43:48 -07:00
DK.Pino
d329101469
Revert Revert "[Placement Group] Support infeasible placement groups for Placement Group." (#18735)
* fix conflict

* cxx lint
2021-09-20 20:18:12 -07:00
Yi Cheng
07babd807c
Revert "Revert "[core] Async submitting actor registerring (#18009)" (#18719)" (#18722) 2021-09-20 19:17:00 -07:00