Eric Liang
9b17c35bee
Fix PullManager handling of get requests and liveness issues ( #16394 )
2021-06-25 13:01:46 -07:00
architkulkarni
06dfd8dddb
Revert "[Dashboard][event] Basic event module ( #16283 )" ( #16676 )
...
This reverts commit 5afa53aa64.
2021-06-25 09:38:18 -07:00
Lixin Wei
a9d6e93977
[scheduler] Rename TaskRequest to ResourceRequest ( #16649 )
2021-06-25 08:50:20 -07:00
architkulkarni
503641c2c2
[Core] [runtime env] add C++ test for caching workers by runtime env hash ( #16664 )
2021-06-25 09:38:37 -05:00
SongGuyang
e74d9d3ded
[runtime env] Download runtime env(conda) in agent instead of setup_worker ( #16525 )
2021-06-25 19:39:05 +08:00
fyrestone
5afa53aa64
[Dashboard][event] Basic event module ( #16283 )
2021-06-25 13:59:02 +08:00
mwtian
49b8b86488
Remove empty ClusterTaskManager::ScheduleInfeasibleTasks() ( #16665 )
2021-06-24 22:34:57 -07:00
Alex Wu
bfe85326f2
[core] Cleanup dead pubsub related code ( #16629 )
2021-06-24 19:36:56 -07:00
Alex Wu
8ffaa8d3fa
Refactor pubsub to support GCS publisher/raylet client ( #16624 )
...
* .
* .
* .
* .
* .
* import error :(
* boop
* .
* fix tests
* fix tests
* .
* cleanup
Co-authored-by: Alex Wu <alex@anyscale.com>
2021-06-24 15:30:42 -07:00
Gabriele Oliaro
3e2f608145
Work stealing! ( #15475 )
...
* work_stealing one commit squash
* using random task id to request workers
* inlining methods in direct_task_transport.h
* faster checking for presence of stealable tasks in RequestNewWorkerIfNeeded
* linting
* fixup! using random task id to request workers
* estimating number of tasks to steal based only on tasks in flight
* linting
* fixup! linting
* backup of changes
* fixed issue in scheduling queue test after merge
* linting
* redesigned work stealing. compiles but not tested
* all tests passing locally
* fixup! all tests passing locally
* fixup! fixup! all tests passing locally
* fixed big bug in StealTasksIfNeeded
* rev1
* rev2 (before removing the work_stealing param)
* removed work_stealing flag, fixed existing unit tests
* added unit tests; need to figure out how to assign distinct worker ids in GrantWorkerLease
* fixed work stealing test
* revisions, added two more unit/regression tests
* test
2021-06-23 17:08:28 -07:00
Frank Luan
9249287a36
Object spilling threshold ( #16558 )
...
* Object spilling threshold
* clang-format
* Make tests more lenient
* Fix tests
* Fix tests
* Address comments
* Fix tests lint
* Refactor
* Fix tests
* Fix cpp tests
* Address comments
2021-06-23 16:54:41 -07:00
Eric Liang
29afaa34b6
FetchOrReconstruct message can get re-ordered until after task finishes, leaking get bundles
2021-06-23 14:02:05 -07:00
chenk008
82d92d0d61
[Core]Use worker shim PID to check worker registration ( #16398 )
2021-06-22 21:12:53 -07:00
Eric Liang
dd439dd108
fix seg ( #16620 )
2021-06-22 17:45:06 -07:00
Tao Wang
d1db4744e3
[large scale]Get next job id from gcs instead of redis - python part ( #16528 )
2021-06-22 14:06:30 +08:00
Eric Liang
21b22da3dd
Fix race condition in using CreateRequestQueue for inbound chunks
2021-06-21 22:35:54 -07:00
Stephanie Wang
e7b752cf33
[core] Fix bug in task dependency management for duplicate args ( #16365 )
...
* Pytest
* Skip on windows
* C++
2021-06-21 22:32:04 -07:00
SangBin Cho
5efeb5334b
Revert "Same worker id in python and c++ ( #16568 )" ( #16600 )
...
This reverts commit 9b5c0c32da.
2021-06-21 18:58:31 -07:00
Tao Wang
2affe97f1a
[Core][Minor]Remove the hard check when disconnect GCS client ( #16572 )
2021-06-22 09:29:25 +08:00
Alex Wu
9b5c0c32da
Same worker id in python and c++ ( #16568 )
...
* .
* .
* test
Co-authored-by: Alex Wu <alex@anyscale.com>
2021-06-21 13:22:52 -07:00
Eric Liang
a0da009645
Allocate inbound object chunks using CreateRequestQueue instead of immediate allocation ( #16523 )
2021-06-20 09:22:12 -07:00
Alex Wu
319d4fb164
Job timestamp should always be in milliseconds (fixed) ( #16548 )
...
* .
* Revert "Revert "Job timestamp should always be in milliseconds (#16455 )" (#16545 )"
This reverts commit 5030ed8588.
* .
* .
* .
Co-authored-by: Alex Wu <alex@anyscale.com>
2021-06-18 17:07:21 -07:00
Amog Kamsetty
416cf3a2e7
Revert "Revert "Enable TryCreateImmediately to use the fallback allocation" ( #16542 )" ( #16544 )
...
This reverts commit 36fd741e6f.
2021-06-18 15:39:37 -07:00
Alex Wu
5030ed8588
Revert "Job timestamp should always be in milliseconds ( #16455 )" ( #16545 )
...
This reverts commit 1df19a04fe.
2021-06-18 12:37:05 -07:00
Amog Kamsetty
36fd741e6f
Revert "Enable TryCreateImmediately to use the fallback allocation" ( #16542 )
...
This reverts commit 41cf2e3d50.
2021-06-18 12:22:18 -07:00
architkulkarni
54d66ac637
[Core] iterate over entire dispatch queue instead of returning when worker unavailable ( #16535 )
2021-06-18 13:25:45 -05:00
Eric Liang
41cf2e3d50
Enable TryCreateImmediately to use the fallback allocation
2021-06-18 10:49:34 -07:00
architkulkarni
6498ca3995
[Core] [runtime env] Don't delete working_dir from runtime env ( #16475 )
2021-06-18 10:15:20 -05:00
Stephanie Wang
5eb51c8b26
[core] Make object directory robust to out-of-order updates ( #16314 )
...
* Sequence ops
* id
* fix
* lint
2021-06-17 20:40:35 -07:00
Alex Wu
6696c0c165
Revert "[Placement Group] Support infeasible placement groups for Placement Group. ( #16188 )" ( #16509 )
...
This reverts commit 7f91cfedd5.
2021-06-17 11:04:01 -07:00
Alex Wu
1df19a04fe
Job timestamp should always be in milliseconds ( #16455 )
...
* .
* .
* .
* .
Co-authored-by: Alex Wu <alex@anyscale.com>
2021-06-17 00:05:55 -07:00
Tao Wang
2523072a3d
[large scale]Use gcs client instead of redis client to increase job id ( #16190 )
...
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
2021-06-17 15:01:32 +08:00
DK.Pino
7f91cfedd5
[Placement Group] Support infeasible placement groups for Placement Group. ( #16188 )
...
* init
* update comment
* update logical
* ut failing
* compile passing
* add ut
* lint
* fix comment
* lint
* fix ut and typo
* fix ut and typo
* lint
* typo
2021-06-16 21:48:39 -07:00
Alex Wu
45357ff590
[core] Fix multi-node placement group/job config bugs ( #16345 )
...
* .
* .
* seems to work?
* seems to work?
* .
* implement delete
* implement delete
* .
* tests
* .
* .
* .
* fix
* .
* .
* .
* .
* fix
* fix
* bump timeout
* bump timeout
* .
Co-authored-by: Alex Wu <alex@anyscale.com>
2021-06-16 21:12:20 -07:00
Eric Liang
3209084213
Fix fd reuse errors with plasma fallback allocation ( #16451 )
2021-06-16 19:28:23 -07:00
Amog Kamsetty
b986938f0f
Revert "[Pubsub] Use a pubsub module for Ownership based object directory ( #16407 )" ( #16486 )
...
This reverts commit 90599d3562.
2021-06-16 15:38:11 -07:00
Tao Wang
1a1b0da8c9
Run fn in specified io service completely ( #15539 )
2021-06-16 14:53:17 -07:00
Clark Zinzow
00eb833de2
[Core] Stopgap fix for async actor lost object bug, and adds reproduction as test. ( #16414 )
...
* Support asyncio with max_concurrency == 1.
* Added test that reproduces lost object error.
* Create a fiber thread per caller instead of sharing a fiber thread among all callers.
* Formatting.
* Remove debug print statement.
* Try to accommodate dumb stupid linter that apparently doesn't know that async list comprehensions landed in Python 3.6, let alone await in list literals.
2021-06-16 12:39:45 -07:00
SangBin Cho
90599d3562
[Pubsub] Use a pubsub module for Ownership based object directory ( #16407 )
...
* in progress
* In progress 2
* progress
* OBOD pubsub done
* Fix
* Fix a bug.
* Clean up getObjectLocationOwner
* Fix a build issue.
* Lint issue.
* test fix in progress
* continue debugging
* in progress
* Fix issues again.
* Formatting
* formatting
* fix issues.
* Revert "fix issues."
This reverts commit 2da577e68abc6278e03d64a60e8b96c3136145bf.
* Fix a critical bug.
* Revert "Revert "fix issues.""
This reverts commit 6546ecbd1eb9798de0bf990b30b85a3ca3e5b4ad.
* Addressed code review.
2021-06-16 09:15:13 -07:00
Eric Liang
1ef207abb6
Call Unblockifneeded ( #16422 )
2021-06-15 08:40:23 -07:00
Chong-Li
500248163f
[GCS] Fix: bookkeeping normal task resources in GCS ( #16371 )
2021-06-15 21:13:25 +08:00
Eric Liang
992437eafe
Yield plasma lock to other threads during long-running gets ( #16408 )
2021-06-14 16:23:05 -07:00
Simon Mo
5f4495108e
Fix macOS compilation ( #16412 )
2021-06-14 13:30:38 -07:00
SangBin Cho
b4e2ca39f9
[Pubsub] Using OBOD command batch for both reference counting and wait for object eviction ( #16334 )
...
* In progress/
* Basic implementation for wait for object eviction done
* Port ref count
* Fixing tests.
* Fix unit tests and remove unnecessary code
* In progress with ref count test
* Command batch done.
* done.
* Add an implementation note
* Fix all issues.
* Addressed the first batch of code review.
* one last thing; fix unit test
* Fix all issues.
* Fix a type issue.
* Fix the type issue
2021-06-14 10:10:35 -07:00
Eric Liang
f93ca2b673
Make it much simpler to turn on event stats ( #16401 )
2021-06-14 09:51:24 -07:00
Eric Liang
acb439e8f2
Prioritize get requests over wait request, and disallow overcommit of wait requests in unlimited allocation mode ( #16351 )
2021-06-12 14:06:43 -07:00
Chen Shen
24e409f948
[spilled object push optimization 3/3] ObjectManager Push from Spilled Object ( #16364 )
2021-06-11 15:57:51 -07:00
Eric Liang
47bbca04be
Add fallback allocator stats to "ray memory" ( #16362 )
2021-06-10 18:33:59 -07:00
Chen Shen
dd677f367e
[spilled object push optimization 2/3] Refactor ObjectManager's Push for integrating with SpilledObject ( #16352 )
2021-06-10 16:29:19 -07:00
Eric Liang
b0b160b701
Make fallback directory for plasma configurable based on tempdir ( #16361 )
2021-06-10 14:55:10 -07:00