Kai Yang
81be461ba2
[Core] Limit starting workers with maximum_startup_concurrency per worker type ( #16214 )
2021-06-09 13:11:53 +08:00
Eric Liang
deda35fb4a
Batch the AddSpilledURLs RPC ( #16303 )
2021-06-08 12:10:35 -07:00
Alex Wu
ae1cb12221
Revert "[GCS] Bookkeeping normal task resources in GCS ( #16185 )" ( #16315 )
...
This reverts commit f2384a9743.
2021-06-08 11:02:28 -07:00
Chong-Li
f2384a9743
[GCS] Bookkeeping normal task resources in GCS ( #16185 )
2021-06-08 19:58:15 +08:00
Lixin Wei
870a0c16a3
[Logging] Change std::exit to std::_Exit ( #16280 )
...
* change abort to exit
* change to std::_Exit
2021-06-08 00:14:17 -07:00
Lixin Wei
75196cf7f4
[scheduler] Clean up TaskRequest ( #16288 )
2021-06-07 11:38:34 -07:00
SangBin Cho
f867c27eda
[Object spilling] Fix race condition that deletes files at the wrong time. ( #16153 )
...
* Error fix.
* remove debug code
* Add unit test
* Fix a test failure
2021-06-07 09:56:55 -07:00
Eric Liang
1d8cb2d19e
Add event stats documentation, fix misc race condition ( #16236 )
...
* update
* stats
* update
* fix
2021-06-06 12:44:30 -07:00
Stephanie Wang
dd73e8d31b
[core] Add object store debug information ( #16232 )
...
* debug
* todo
* periodic dump
* Build and debug
* x
* debug
* more debug
2021-06-04 19:42:00 -07:00
yncxcw
e13509075d
[Core] Make the exit type explicit for workers being killed in TryKillingIdleWorkers ( #16211 )
2021-06-04 18:23:36 -07:00
Lixin Wei
59a2879216
[New Scheduler] Remove Useless Fields in Cluster Resource Data ( #16254 )
...
* non-tests done
* test modified
2021-06-04 18:00:13 -07:00
Eric Liang
527d51b83a
Allow configuring internal config with RAY_{name} env vars.
2021-06-04 15:37:31 -07:00
Lixin Wei
cf58cd76c7
[Logging] Disable Core Dumps in Fatal Logging ( #16189 )
2021-06-04 11:44:08 -07:00
Eric Liang
608991999c
Fix release resources race that leads to extra worker launches ( #16184 )
2021-06-03 18:35:45 -07:00
Eric Liang
a9db4e62cb
Unlimited plasma allocations by falling back to a filesystem allocator (off by default) ( #16097 )
2021-06-03 18:35:09 -07:00
SangBin Cho
611da62739
Fix atof bug ( #16140 )
2021-06-02 10:25:25 -07:00
Stephanie Wang
ce25d4e896
[core] Record Plasma object sources and dump on out of memory ( #16179 )
...
* debug
* lint, build
* clean up logs
* fix build
2021-06-02 10:04:15 -07:00
DK.Pino
9497a65a57
commit ( #16183 )
2021-06-02 06:50:04 -07:00
Lixin Wei
113c7fdecc
[core] Fix ResourceMapToTaskRequest ( #16172 )
2021-06-01 12:20:03 -07:00
Alex Wu
de0f856b68
[namespaces] Isolation for named placement groups ( #16000 )
2021-06-01 05:50:19 -07:00
Chong-Li
d5d0072635
Refactor RayletBasedActorScheduler ( #16018 )
2021-05-31 15:28:00 +08:00
Lixin Wei
3d37e3a315
[Refactor] Replace FractionalResourceQuantity with FixedPoint ( #16052 )
...
* refactor
* fix
* fix compilation
* fix
* fix cross-platform compilation
* lint
* fix test
* Revert "fix test"
This reverts commit 0ff23b125ce4159b91cc170dbc17b5ed70c9ab11.
* change rounding to truncating
* Update BUILD.bazel
Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-05-28 09:32:51 -07:00
SangBin Cho
d0dc9abdfc
[Plasma store] Improve the OOM logging message. ( #16051 )
2021-05-27 10:09:58 -07:00
Yi Cheng
5d0b302121
[core] Trigger global gc when plasma store is under pressure. ( #15775 )
2021-05-27 10:07:59 -07:00
Tao Wang
881e4913f1
Don't broadcast empty resources data ( #16104 )
2021-05-27 10:06:32 -07:00
DK.Pino
ea0ee86063
[Placement Group] Fix actor scheduling with Placement Group bug. ( #16006 )
2021-05-26 22:16:38 -07:00
Eric Liang
2f4628fdb4
Fix CHECK_FAIL when scheduling task with duplicate object requests ( #16063 )
2021-05-26 15:13:16 -07:00
Stephanie Wang
55bb1e93b4
[core] Wait for objects to be sealed before throwing OutOfMemory ( #15955 )
...
* Wait for objects to seal
* x
* comments
* error code
2021-05-26 14:18:32 -07:00
Eric Liang
3d1ba4a70e
Add feature flag for plasma overcommit ( #16061 )
2021-05-26 10:53:57 -07:00
Kai Yang
853d650e29
Revert "Revert "[Object spilling] Avoid worker crash when an object is spille… ( #15964 )" ( #16012 )
...
This reverts commit 29aa336a4d.
2021-05-25 23:48:24 -07:00
Eric Liang
ea6bdfb9c1
Prevent object store from allocating over the specified limit even if there is memory fragmentation ( #15951 )
2021-05-24 17:56:11 -07:00
Yi Cheng
7c45480542
[runtime env] Introduce OS envs to skip GC for runtime env in local node; ( #15984 )
2021-05-21 12:49:22 -07:00
Eric Liang
29aa336a4d
Revert "[Object spilling] Avoid worker crash when an object is spille… ( #15964 )
...
This reverts commit 061e3fbde3.
2021-05-20 21:17:59 -07:00
SangBin Cho
a1375a955b
Pubsub registration / unregistration idempotency ( #15896 )
...
* Make AddEntry idempotent.
* Done.
2021-05-20 18:40:06 -07:00
Kai Yang
061e3fbde3
[Object spilling] Avoid worker crash when an object is spilled right after being restored ( #15903 )
...
* Fix check failure when memory pressure is high
* Add test
* lint
2021-05-20 18:36:11 -07:00
Frank Luan
c87b76632d
[plasma] Reset OOM timer as objects are being spilled ( #15431 )
...
* Fix deserializer in metrics.Counter
* Fix restore_spilled_objects() for external object spilling
* WIP reset OOM timer
* Add test
* Revert style change
* pytest
* Simplify test
* Fix test
* Make tests faster
2021-05-20 13:13:54 -07:00
Alex Wu
ec997c0145
[client] Client builder API namespace support ( #15934 )
...
* add namespace to client
* done?
* address comments
Co-authored-by: Alex <alex@anyscale.com>
2021-05-20 12:36:05 -07:00
Alex Wu
cd2fc7792f
[dashboard] Snapshot of cluster state ( #15868 )
2021-05-20 08:10:32 -07:00
Yi Cheng
874558e813
[runtime env] Put runtime env into runtime context; ( #15895 )
2021-05-20 08:08:45 -07:00
Ian Rodney
4825f1b2a5
[client] One Driver per RayClient Server ( #15923 )
2021-05-19 15:40:49 -07:00
architkulkarni
c3d06697bb
[Core] Add dynamic conda env install in shim process ( #15881 )
2021-05-19 15:46:42 -05:00
Eric Liang
836c739fe5
Revert "[client] One Driver per RayClient Server ( #15875 )" ( #15922 )
...
This reverts commit 97d1414f23.
2021-05-19 11:58:29 -07:00
Ian Rodney
97d1414f23
[client] One Driver per RayClient Server ( #15875 )
2021-05-19 09:03:09 -07:00
qicosmos
8790bb465b
[C++ worker] Remove func ptr offset ( #15809 )
2021-05-19 18:03:39 +08:00
architkulkarni
194c5e3a96
[Core] Cache workers by runtime_env in worker pool ( #15782 )
...
* pass RuntimeEnv in task spec as opaque string
* lint
* set correct empty value for json: "{}" not ""
* add comment for field in proto
* fix worker pool test by checking both "" and "{}"
* add RAY_CHECK todo
* make dict empty if all values null
* remove unnecessary ser/de
* fix
* address comments
* add WorkerCacheKey with hash function
* clean up
* add naive impl., dedicated workers never killed
* put dedicated workers in idle_of_all_languages
* pipe env hash from worker.py -> Worker
* fully pipe through hash, basic cache test passing
* use int type for runtime env hash
* convert Worker env hash type from size_t to int
* fix
* add method to MockWorker to fix cpp tests
* make compatible with java streaming test
* restore old dynamic_options code to fix java test
* address comments
* add comment about sorting before hash
* add comments for private members of WorkerCacheKey
2021-05-18 00:19:27 -07:00
Alex Wu
69f228d22d
[core] Record actor+job start/end times and metadata ( #15803 )
2021-05-17 21:38:39 -07:00
Frank Luan
0dc34566fe
Refactor raylet to allocate+write+seal one return object at a time ( #15757 )
...
* Refactor raylet to allocate+write+seal one return object at a time
* Fix build
* Fix C++ and Java runtime
* Skip Windows testing
* Fix java and cpp runtime
* Fix warnings
* Fix cpp and java tests
* Fix cpp and java runtime
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu>
2021-05-17 20:06:08 -07:00
SangBin Cho
ff461634b0
[Core] Improved bad error message. ( #15663 )
...
* Improved bad error message.
* Update src/ray/raylet/node_manager.cc
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
* lint.
* Add a pid
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2021-05-17 19:38:05 -07:00
Alex Wu
3e94114336
Namespaces ( #15774 )
2021-05-17 10:04:22 -07:00
SangBin Cho
259fcbd5bd
[Pubsub] Generalize the pubsub interface and adapt it for ref counting protocol ( #15446 )
...
* Add mock code first
* In the initial progress.
* Fix the number error
* In progress.
* in more progress.
* in progress.
* lint.
* Prototype done.
* Fix compilation bug.
* Now it is working with reference counting.
* Remove template.
* lint.
* Fixed issues.
* Fix reference count test.
* Reference count test passes now.
* Fixed the test array problem
* Addressed code review.
* lint.
* Addressed half of code review.
* Fix tests.
* Addressed the most critical issue.
* Make subscriber thread-safe.
* Revert "Make subscriber thread-safe."
This reverts commit 9a6a52197cfa8463ab60dfaae9530ad3c0ed8790.
* Fixed test failures. The only failure now is the asan failure.
* Reset test suites and see if it fixes the issue.
* Fix a flaky test
* Addressed code review.
2021-05-13 09:29:02 -07:00