Commit graph

1807 commits

Author SHA1 Message Date
architkulkarni
3ce03a52bc
Revert "Revert "Revert "Unhandled exception handler based on local ref counti… (#14113)" (#14136)
This reverts commit e457872fe1.
2021-02-16 11:47:09 -08:00
Barak Michener
c43a64230e
[ray_client]: Fix mutual recursion (#14122) 2021-02-16 10:37:58 -08:00
SangBin Cho
4ad79ca963
[Object Spilling] Remove LRU eviction (#13977)
* done.

* formatting.

* done.

* done.
2021-02-15 14:24:53 -08:00
Eric Liang
e457872fe1
Revert "Revert "Unhandled exception handler based on local ref counti… (#14113)
* Revert "Revert "Unhandled exception handler based on local ref counting (#14049)" (#14099)"

This reverts commit b45ae76765.

* remove test

* fix

* fix
2021-02-15 14:11:11 -08:00
SangBin Cho
b45ae76765
Revert "Unhandled exception handler based on local ref counting (#14049)" (#14099)
This reverts commit 9dc671ae02.
2021-02-14 22:08:32 -08:00
Alex Wu
5636af8084
[hotfix] Fix mac build (#14075)
* .

* done?

* .

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-02-14 14:26:51 -08:00
Eric Liang
9dc671ae02
Unhandled exception handler based on local ref counting (#14049) 2021-02-12 22:58:38 -08:00
Clark Zinzow
c7ff69f4bf
[OBOD] Add support for ownership-based object directory object recovery. (#14066) 2021-02-12 11:58:31 -08:00
Clark Zinzow
cd7e567a57
[Core] Ownership-based Object Directory - Added support for object spilling in the ownership-based object directory. (#13948)
* Add support for object spilling in the ownership-based object directory.

* Move owner address hashmap into pinned_objects_ and objects_pending_spill_.

* Update local object manager tests.

* Feedback and misc. fixes.

* Move spilled unpin callback lambda to std::binded private method.

* Skip test_delete_objects_multi_node test on MacOS for now.
2021-02-11 10:36:22 -08:00
Ameer Haj Ali
d87a82e891
Revert "Revert "[Autoscaler] Monitor refactor for backward compatability. (#13970)" (#14046)" (#14050)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* Revert "Revert "[Autoscaler] Monitor refactor for backward compatability. (#13970)" (#14046)"

This reverts commit 6f9d39fb3e.

* fake news

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2021-02-10 17:59:08 -08:00
Stephanie Wang
fc89984162
Subtract from num bytes in use (#13944) 2021-02-10 12:22:08 -08:00
architkulkarni
6f9d39fb3e
Revert "[Autoscaler] Monitor refactor for backward compatability. (#13970)" (#14046)
This reverts commit 7a6f8054d1.
2021-02-10 12:16:52 -08:00
fangfengbin
1754359281
[Core]Fix ray.kill doesn't cancel pending actor bug (#14025) 2021-02-10 15:30:21 +08:00
Ameer Haj Ali
7a6f8054d1
[Autoscaler] Monitor refactor for backward compatability. (#13970) 2021-02-09 21:41:50 -08:00
Kai Yang
e0b81796c5
Revert "Revert "[Java] fix test hang occasionally when running FailureTest (#13934)" (#13992)" (#14008) 2021-02-09 12:43:26 -08:00
Simon Mo
f51c26bae6
Revert "[Core]Fix ray.kill doesn't cancel pending actor bug (#13254)" (#14013)
This reverts commit 2092b097ea.
2021-02-09 11:36:38 -08:00
fangfengbin
2092b097ea
[Core]Fix ray.kill doesn't cancel pending actor bug (#13254) 2021-02-09 10:59:14 +08:00
Simon Mo
ec94214957
Revert "[Java] fix test hang occasionally when running FailureTest (#13934)" (#13992)
This reverts commit bcf9457abb.
2021-02-08 11:30:30 -08:00
Kai Yang
bcf9457abb
[Java] fix test hang occasionally when running FailureTest (#13934) 2021-02-08 18:21:50 +08:00
Kai Yang
4b4941435d
[Java] fix actor restart failure when multi-worker is turned on (#13793) 2021-02-07 21:12:54 +08:00
Simon Mo
ea4154df80
[Hotfix] Master compilation error on MacOS. (#13946) 2021-02-05 16:07:45 -08:00
fyrestone
eee624cf5f
Revert "Fix passing env on windows (#13253)" (#13828) 2021-02-05 13:03:16 +08:00
fangfengbin
8a5999c12a
[GCS]Fix bug that gcs client does not set last_resource_usage_ (#13856) 2021-02-05 11:51:25 +08:00
DK.Pino
fb89f9c2c8
[Placement Group] Support named placement group (#13755) 2021-02-05 11:04:51 +08:00
Tao Wang
44aa9c173f
Rename timeout to period with heartbeat interval (#13872) 2021-02-04 10:37:28 +08:00
Tao Wang
e0d9c8f0a8
Always replace DEL with UNLINK (#13832) 2021-02-04 10:30:00 +08:00
Clark Zinzow
407302f93a
[Core] Ownership-based Object Directory - Changed infinite short-poll location subscription to long-poll. (#13841) 2021-02-03 14:16:42 -08:00
SangBin Cho
cb9fa90203
[Object Spilling] Add consumed bytes to detect thrashing. (#13853) 2021-02-03 14:16:26 -08:00
Alex Wu
f14171ced9
[Core] Put raylet ip's in resource usage report (#13871)
* .

* done?

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-02-03 11:28:56 -08:00
Gabriele Oliaro
79310452e7
Enabling the cancellation of non-actor tasks in a worker's queue 2 (#13244)
* wrote code to enable cancellation of queued non-actor tasks

* minor changes

* bug fixes

* added comments

* rev1

* linting

* making ActorSchedulingQueue::CancelTaskIfFound raise a fatal error

* bug fix

* added two unit tests

* linting

* iterating through pending_normal_tasks starting from end

* fixup! iterating through pending_normal_tasks starting from end

* fixup! fixup! iterating through pending_normal_tasks starting from end

* post merge fixes

* added debugging instructions, pulled Accept() out of guarded loop

* removed debugging instructions, linting

* first commit

* lint

* lint

* added hack to avoid race condition in test stress

* moved hack

* fix test cancel

* removed hack (hopefully no longer needed)

* Revert "removed hack (hopefully no longer needed)"

This reverts commit 99d0e7c91539f290700f50aaaed805dcde04a5ee.

* added sleep in mock_worker.cc

* sleep function fixup to work on windows

* sleep in test_fast both for force=true and force=false

* linting

Co-authored-by: Ian <ian.rodney@gmail.com>
2021-02-03 10:20:12 -08:00
fangfengbin
b4684cf37a
Fix bug that total_commands_queued_ is not initialized (#13852) 2021-02-03 10:00:15 +08:00
Eric Liang
fa4290090d
Add Ray client protocol version (#13846) 2021-02-02 00:19:08 -08:00
SangBin Cho
886217c333
[Object Spilling] Skip normal ray.get path when spilling objects. (#13831) 2021-02-01 16:03:34 -08:00
Stephanie Wang
754bee9282
[core][object spilling] Fix bugs in admission control (#13781) 2021-02-01 10:48:21 -08:00
Tao Wang
1d2ab018b0
Use right reserve size (#13829) 2021-02-01 15:49:34 +08:00
Lingxuan Zuo
b5f0aed974
[Log] use default stderr logger if no raylog starting (#13762) 2021-02-01 11:13:06 +08:00
Stephanie Wang
30f82329e3
[core] Add debug information for the PullManager and LocalObjectManager (#13782)
* Add debug info

* Formatting.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2021-01-29 17:55:46 -08:00
Hao Chen
0f3a3e14aa
Only delete local object in CoreWorkerPlasmaStoreProvider:::WarmupStore (#13788) 2021-01-29 20:24:09 +08:00
Stephanie Wang
42d501d747
[core] Pin arguments during task execution (#13737)
* tmp

* Pin task args

* unit tests

* update

* test

* Fix
2021-01-28 19:07:10 -08:00
Tao Wang
56ee6ef55f
[GCS]only update states related fields when publish actor table data (#13448) 2021-01-28 11:12:57 +08:00
Simon Mo
4f1f558802
[Core] Hotfix Windows Compilation Error for ClusterTaskManager (#13754)
* [Core] Hotfix Windows Compilation Error for ClusterTaskManager

* fix
2021-01-27 19:01:56 -08:00
Alex Wu
c0fe816466
[Core/Autoscaler] Properly clean up resource backlog from (#13727) 2021-01-27 15:30:58 -08:00
Eric Liang
56a9523020
Fix high CPU usage in object manager due to O(n^2) iteration over active pulls list (#13724) 2021-01-27 14:02:22 -08:00
DK.Pino
7f6d326ad8
[Placement Group]Add detached support for placement group. (#13582) 2021-01-27 18:51:26 +08:00
SangBin Cho
8baafacb1e
[Logging] Log rotation config (#13375)
* In Progress.

* formatting.

* in progress.

* linting.

* Done.

* Fix typo.

* Fixed the issue.
2021-01-26 20:15:55 -08:00
Lingxuan Zuo
f9f2bfa778
[Metric] Fix crashed when register metric view in multithread (#13485)
* Fix crashed when register metric view in multithread

* fix comments

* fix
2021-01-25 20:32:08 +08:00
SangBin Cho
edbb2937d3
[Object Spilling] Multi node file spilling V2. (#13542)
* done.

* done.

* Fix a mistake.

* Ready.

* Fix issues.

* fix.

* Finished the first round of code review.

* formatting.

* In progress.

* Formatting.

* Addressed code review.

* Formatting

* Fix tests.

* fix bugs.

* Skip flaky tests for now.
2021-01-23 23:15:32 -08:00
Qing Wang
8ef835ff03
Remove idle actor from worker pool. (#13523) 2021-01-23 13:57:30 +08:00
Kai Yang
90f1e408de
[Java] Add fetchLocal parameter in Ray.wait() (#13604) 2021-01-22 17:55:00 +08:00
Stephanie Wang
0998d69968
[core] Admission control for pulling objects to the local node (#13514)
* Admission control, TODO: tests, object size

* Unit tests for admission control and some bug fixes

* Add object size to object table, only activate pull if object size is known

* Some fixes, reset timer on eviction

* doc

* update

* Trigger OOM from the pull manager

* don't spam

* doc

* Update src/ray/object_manager/pull_manager.cc

Co-authored-by: Eric Liang <ekhliang@gmail.com>

* Remove useless tests

* Fix test

* osx build

* Skip broken test

* tests

* Skip failing tests

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2021-01-21 16:46:42 -08:00