2363 commits

Author SHA1 Message Date
Kai Yang
81be461ba2
[Core] Limit starting workers with maximum_startup_concurrency per worker type (#16214) 2021-06-09 13:11:53 +08:00
Eric Liang
deda35fb4a
Batch the AddSpilledURLs RPC (#16303) 2021-06-08 12:10:35 -07:00
Alex Wu
ae1cb12221
Revert "[GCS] Bookkeeping normal task resources in GCS (#16185)" (#16315)
This reverts commit f2384a9743.
2021-06-08 11:02:28 -07:00
Chong-Li
f2384a9743
[GCS] Bookkeeping normal task resources in GCS (#16185) 2021-06-08 19:58:15 +08:00
Lixin Wei
870a0c16a3
[Logging] Change std::exit to std::_Exit (#16280)
* change abort to exit

* change to std::_Exit
2021-06-08 00:14:17 -07:00
Lixin Wei
75196cf7f4
[scheduler] Clean up TaskRequest (#16288) 2021-06-07 11:38:34 -07:00
SangBin Cho
f867c27eda
[Object spilling] Fix race condition that deletes files at the wrong timing. (#16153)
* Error fix.

* remove debug code

* Add unit test

* Fix a test failure
2021-06-07 09:56:55 -07:00
Eric Liang
1d8cb2d19e
Add event stats documentation, fix misc race condition (#16236)
* update

* stats

* update

* fix
2021-06-06 12:44:30 -07:00
Stephanie Wang
dd73e8d31b
[core] Add object store debug information (#16232)
* debug

* todo

* periodic dump

* Build and debug

* x

* debug

* more debug
2021-06-04 19:42:00 -07:00
yncxcw
e13509075d
[Core] Make the exit type explicit for workers being killed TryKillingIdleWorkers (#16211) 2021-06-04 18:23:36 -07:00
Lixin Wei
59a2879216
[New Scheduler] Remove Useless Fields in Cluster Resource Data (#16254)
* non-tests done

* test modified
2021-06-04 18:00:13 -07:00
Eric Liang
527d51b83a
Allow configuring internal config with RAY_{name} env vars. 2021-06-04 15:37:31 -07:00
Lixin Wei
cf58cd76c7
[Logging] Disable Core Dumps in Fatal Logging (#16189) 2021-06-04 11:44:08 -07:00
Eric Liang
608991999c
Fix release resources race that leads to extra worker launches (#16184) 2021-06-03 18:35:45 -07:00
Eric Liang
a9db4e62cb
Unlimited plasma allocations by falling back to a filesystem allocator (off by default) (#16097) 2021-06-03 18:35:09 -07:00
SangBin Cho
611da62739
Fix atof bug (#16140) 2021-06-02 10:25:25 -07:00
Stephanie Wang
ce25d4e896
[core] Record Plasma object sources and dump on out of memory (#16179)
* debug

* lint, build

* clean up logs

* fix build
2021-06-02 10:04:15 -07:00
DK.Pino
9497a65a57
commit (#16183) 2021-06-02 06:50:04 -07:00
Lixin Wei
113c7fdecc
[core] Fix ResourceMapToTaskRequest (#16172) 2021-06-01 12:20:03 -07:00
Alex Wu
de0f856b68
[namespaces] Isolation for named placement groups (#16000) 2021-06-01 05:50:19 -07:00
Chong-Li
d5d0072635
Refactor RayletBasedActorScheduler (#16018) 2021-05-31 15:28:00 +08:00
Lixin Wei
3d37e3a315
[Refactor] Replace FractionalResourceQuantity with FixedPoint (#16052)
* refactor

* fix

* fix compilation

* fix

* fix cross-platform compilation

* lint

* fix test

* Revert "fix test"

This reverts commit 0ff23b125ce4159b91cc170dbc17b5ed70c9ab11.

* change rounding to truncating

* Update BUILD.bazel

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-05-28 09:32:51 -07:00
SangBin Cho
d0dc9abdfc
[Plasma store] Improve the OOM logging message. (#16051) 2021-05-27 10:09:58 -07:00
Yi Cheng
5d0b302121
[core] Trigger global gc when plasma store is under pressure. (#15775) 2021-05-27 10:07:59 -07:00
Tao Wang
881e4913f1
Don't broadcast empty resources data (#16104) 2021-05-27 10:06:32 -07:00
DK.Pino
ea0ee86063
[Placement Group]Fix actor scheduling with Placement Group bug. (#16006) 2021-05-26 22:16:38 -07:00
Eric Liang
2f4628fdb4
Fix CHECK_FAIL when scheduling task with duplicate object requests (#16063) 2021-05-26 15:13:16 -07:00
Stephanie Wang
55bb1e93b4
[core] Wait for objects to be sealed before throwing OutOfMemory (#15955)
* Wait for objects to seal

* x

* comments

* error code
2021-05-26 14:18:32 -07:00
Eric Liang
3d1ba4a70e
Add feature flag for plasma overcommit (#16061) 2021-05-26 10:53:57 -07:00
Kai Yang
853d650e29
Revert "Revert "[Object spilling] Avoid worker crash when an object is spille… (#15964)" (#16012)
This reverts commit 29aa336a4d.
2021-05-25 23:48:24 -07:00
Eric Liang
ea6bdfb9c1
Prevent object store from allocating over the specified limit even if there is memory fragmentation (#15951) 2021-05-24 17:56:11 -07:00
Yi Cheng
7c45480542
[runtime env] Introduce OS envs to skip GC for runtime env in local node; (#15984) 2021-05-21 12:49:22 -07:00
Eric Liang
29aa336a4d
Revert "[Object spilling] Avoid worker crash when an object is spille… (#15964)
This reverts commit 061e3fbde3.
2021-05-20 21:17:59 -07:00
SangBin Cho
a1375a955b
Pubsub registration / unregistration idempotency (#15896)
* Make AddEntry idempotent.

* Done.
2021-05-20 18:40:06 -07:00
Kai Yang
061e3fbde3
[Object spilling] Avoid worker crash when an object is spilled right after being restored (#15903)
* Fix check failure when memory pressure is high

* Add test

* lint
2021-05-20 18:36:11 -07:00
Frank Luan
c87b76632d
[plasma] Reset OOM timer as objects are being spilled (#15431)
* Fix deserializer in metrics.Counter

* Fix restore_spilled_objects() for external object spilling

* WIP reset OOM timer

* Add test

* Revert style change

* pytest

* Simplify test

* Fix test

* Make tests faster
2021-05-20 13:13:54 -07:00
Alex Wu
ec997c0145
[client] Client builder API namespace support (#15934)
* add namespace to client

* done?

* address comments

Co-authored-by: Alex <alex@anyscale.com>
2021-05-20 12:36:05 -07:00
Alex Wu
cd2fc7792f
[dashboard] Snapshot of cluster state (#15868) 2021-05-20 08:10:32 -07:00
Yi Cheng
874558e813
[runtime env] Put runtime env into runtime context; (#15895) 2021-05-20 08:08:45 -07:00
Ian Rodney
4825f1b2a5
[client] One Driver per RayClient Server (#15923) 2021-05-19 15:40:49 -07:00
architkulkarni
c3d06697bb
[Core] Add dynamic conda env install in shim process (#15881) 2021-05-19 15:46:42 -05:00
Eric Liang
836c739fe5
Revert "[client] One Driver per RayClient Server (#15875)" (#15922)
This reverts commit 97d1414f23.
2021-05-19 11:58:29 -07:00
Ian Rodney
97d1414f23
[client] One Driver per RayClient Server (#15875) 2021-05-19 09:03:09 -07:00
qicosmos
8790bb465b
[C++ worker] Remove func ptr offset (#15809) 2021-05-19 18:03:39 +08:00
architkulkarni
194c5e3a96
[Core] Cache workers by runtime_env in worker pool (#15782)
* pass RuntimeEnv in task spec as opaque string

* lint

* set correct empty value for json: "{}" not ""

* add comment for field in proto

* fix worker pool test by checking both "" and "{}"

* add RAY_CHECK todo

* make dict empty if all values null

* remove unnecessary ser/de

* fix

* address comments

* add WorkerCacheKey with hash function

* clean up

* add naive impl., dedicated workers never killed

* put dedicated workers in idle_of_all_languages

* pipe env hash from worker.py -> Worker

* fully pipe through hash, basic cache test passing

* use int type for runtime env hash

* convert Worker env hash type from size_t to int

* fix

* add method to MockWorker to fix cpp tests

* make compatible with java streaming test

* restore old dynamic_options code to fix java test

* address comments

* add comment about sorting before hash

* add comments for private members of WorkerCacheKey
2021-05-18 00:19:27 -07:00
Alex Wu
69f228d22d
[core] Record actor+job start/end times and metadata (#15803) 2021-05-17 21:38:39 -07:00
Frank Luan
0dc34566fe
Refactor raylet to allocate+write+seal one return object at a time (#15757)
* Refactor raylet to allocate+write+seal one return object at a time

* Fix build

* Fix C++ and Java runtime

* Skip Windows testing

* Fix java and cpp runtime

* Fix warnings

* Fix cpp and java tests

* Fix cpp and java runtime

Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
2021-05-17 20:06:08 -07:00
SangBin Cho
ff461634b0
[Core] Improved bad error message. (#15663)
* Improved bad error message.

* Update src/ray/raylet/node_manager.cc

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

* lint.

* Add a pid

Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2021-05-17 19:38:05 -07:00
Alex Wu
3e94114336
Namespaces (#15774) 2021-05-17 10:04:22 -07:00
SangBin Cho
259fcbd5bd
[Pubsub] Generalize the pubsub interface and adapt it for ref counting protocol (#15446)
* Add mock code first

* In the initial progress.

* Fix the number error

* In progress.

* in more progress.

* in progress.

* lint.

* Prototype done.

* Fix compilation bug.

* Now it is working with reference counting.

* Remove template.

* lint.

* Fixed issues.

* Fix reference count test.

* Reference count test passes now.

* Fixed the test array problem

* Addressed code review.

* lint.

* Addressed half of code review.

* Fix tests.

* Addressed the most critical issue.

* Make subscriber thread-safe.

* Revert "Make subscriber thread-safe."

This reverts commit 9a6a52197cfa8463ab60dfaae9530ad3c0ed8790.

* Fixed test failures. The only failure now is the asan failure.

* Reset test suites and see if it fixes the issue.

* Fix a flaky test

* Addressed code review.
2021-05-13 09:29:02 -07:00