3623 commits

Author SHA1 Message Date
Hao Zhang
0b1fbc5e83
[PR 1/6] Collective in Ray (#12637)
Co-authored-by: YLJALDC <dal177@ucsd.edu>
2020-12-12 01:26:36 -08:00
Alex Wu
aa64cd4534
[New scheduler] Fix test_global_state (#12586) 2020-12-11 21:47:01 -08:00
Edward Oakes
03d869d51c
Hold GIL while submitting (actor) tasks (#12803) 2020-12-11 21:47:16 -06:00
Edward Oakes
aec5c9879e
Add tests for atexit handler behavior (#12808) 2020-12-11 21:47:05 -06:00
Edward Oakes
6262ee1f76
Clarify docs for atexit behavior when using ray.kill (#12807) 2020-12-11 21:45:39 -06:00
Eric Liang
1ce745cf44
Add automatic local GC and plasma debug logs every 10 minutes by default (#12804) 2020-12-11 17:09:58 -08:00
Simon Mo
3d8c1cbae6
[Serve] Fix Serve Release Tests (#12777) 2020-12-11 11:53:47 -08:00
fangfengbin
9ded69fdaa
[Hotfix] Fix python client lint error (#12783) 2020-12-11 10:15:53 -08:00
Simon Mo
68d7fa2137
Fix exit_actor in asyncio mode (#12693) 2020-12-11 09:35:17 -08:00
Edward Oakes
699ded5328
[serve] Initial commit for CLI (#12770) 2020-12-11 10:31:29 -06:00
Tao Wang
295b6e5ce4
Split heartbeat message (#12535)
* first

* xxx

* Split heartbeat message

* only report resource usage when changed

* Fix GetAllResourceUsage

* Fix report resource usage

* Increase default heartbeat interval

* regularize heartbeat interval in test case
2020-12-11 21:19:57 +08:00
Stephanie Wang
86b0741026
[new scheduler] Allocate resources for spilled back task to a local view of the remote node (#12711)
* Force report heartbeats if remote resources may be dirty

* lint

* typo

* typo

* unit test

* debug

* Revert "lint"

This reverts commit 6dc7e982ffee98185665eb7c3c8fde0d91938919.

* Revert "Force report heartbeats if remote resources may be dirty"

This reverts commit cbfa9405197df62f874107d55b46715ceae2abd2.

* Local view of resources

* debug travis

* debug

* debug

* debug

* weaken test

* cleanups

* lint

* Revert "debug travis"

This reverts commit 11ff5f4f84e64e9fbd4eecda5b3c7fd07a7130a4.

* revert

* const view, remove unused
2020-12-10 22:43:29 -05:00
Barak Michener
b7f246c451
[ray_client] Include multiple facets of the Ray API (#12736) 2020-12-10 19:09:34 -08:00
Edward Oakes
62d6b0a558
Fix max_task_retries for named actors (#12762) 2020-12-10 18:24:55 -06:00
Edward Oakes
c7b6ec88ef
[serve] Make serve __del__ log DEBUG level (#12766) 2020-12-10 18:14:55 -06:00
Edward Oakes
3c44c0d3e4
[serve] Long polling for routes in http server (#12724) 2020-12-10 18:02:02 -06:00
Eric Squires
9f70293700
Remove debug extras from setup.py (#12751) 2020-12-10 16:23:11 -06:00
architkulkarni
3fd3cb96ed
[Utils] Add Queue async and batch methods (#12578) 2020-12-10 10:49:18 -06:00
Ian Rodney
38ba238606
[serve] Create FutureResults from ControllerAPI (#12577) 2020-12-10 10:44:08 -06:00
Kai Yang
e3b5deb741
[Multi-tenancy] Delete flag enable_multi_tenancy and remove old code path (#10573) 2020-12-10 19:01:40 +08:00
Ameer Haj Ali
2f8e308444
[autoscaler] LoadMetrics missed logger.debug (#12714) 2020-12-09 17:19:36 -08:00
Richard Liaw
974570b4fb
oops (#12728)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2020-12-09 13:38:10 -08:00
Edward Oakes
c9873cdbc3
[Serve] Remove unused assign_request wrapper (#12721) 2020-12-09 12:22:43 -08:00
Ian Rodney
19542c5eb0
[docker] Default to ray-ml image (#12703) 2020-12-09 11:49:16 -08:00
Alex Wu
bd7e26b768
[Autoscaler] Temporarily suppress "Removed stale ip mappings" message. (#12689) 2020-12-08 21:55:10 -08:00
Barak Michener
dc4b5c7aa3
[ray_client] Passing actors to actors (#12585)
* start building tests around passing handles to handles

Change-Id: Ie8c3de5c8ce789c3ec8d29f0702df80ba598279f

* clean up the switch statements by moving to a method, implement state tranfer, extend test

Change-Id: Ie7b6493db3a6c203d3a0b262b8fbacb90e5cdbc5

* passing

Change-Id: Id88dc0a41da1c9d5ba68f754c5b57141aae47beb

* flush out tests

Change-Id: If77c0f586e9e99449d494be4e85f854e4a7a4952

* formatting

Change-Id: I497c07cee70b52453b221ed4393f04f6f560061e

* fix python3.6 and other attributes

Change-Id: I5a2c5231e8a021184d9dfc3e346df7f71fc93257

* address documentation

Change-Id: I049d841ed1f85b7350c17c05da4a4d81d5cb03df

* formatting

Change-Id: I6a2b32a2466ffc9f03fc91ac17901b9c1a49505c

* use the pickled handle as the id bytes for actors

Change-Id: I9ddcb41d614de65d42d6f0382fe0faa7ad2c2ade

* pydoc

Change-Id: I9b32a0f383d5ff5ac052e61929b7ae3e42a89fc5

* format

Change-Id: Iac0010bb990a4025a98139ab88700030b2e9e7f5

* todos

Change-Id: I7b550800cf7499403e8a17b77484bc46f20f0afc

* tests

Change-Id: If8ebf6a335baeb113c1332acc930c41a6b4f5384

* fix lint

Change-Id: I019f41e0ec341d39bbbbd39aa43d9fb5f8b57cf0

* nits

Change-Id: I2e6813d8db34f4ce008326faa095d414c10eee95

* add some tricky, python3.6-troublesome type checking

Change-Id: Ib887fc943a6e7084002bc13dfbe113b69b4d9317
2020-12-08 21:54:55 -08:00
Ameer Haj Ali
a4dbb271bd
[hotfix][autoscaler] Request resources refactor2 (#12661)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* request_resources -> min workers

* test fixes

* add race condition tests

* Eric

* fixes

* semi final

* semi final

* lint

* lint

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2020-12-08 18:41:30 -08:00
Philipp Moritz
343b479ae2
[TEST] Fix Ray windows build for debugger (#12671)
* Fix Ray windows build for debugger

* update
2020-12-08 18:12:48 -08:00
Stephanie Wang
50f28811ac
[new scheduler] Always spill back to a feasible node if the local node is not feasible (#12557)
* fix

lint

* feasible nodes

* Enable test, cleanup

* Revert "fix"

This reverts commit aef81d04c0b4560b758f846e1afdafbdb5552efe.

* unit test

* doc
2020-12-08 13:46:58 -05:00
Kai Fricke
df10b84113
[Release] release tests yamls for Tune & GPU (#12496) 2020-12-08 10:15:07 -08:00
Gekho457
f61bc79a87
Dmitri/k8s command runner home try again (#12609) 2020-12-08 11:44:22 -06:00
Keqiu Hu
2a9079aef9
[grpc]'ray memory' fails if there are many objects in scope #8502 (#12673) 2020-12-08 09:36:53 -08:00
SangBin Cho
162f361dab
[Logging] Fix log monitor issue (#12588)
* Try fixing issues.

* Verficiation.
2020-12-07 22:01:18 -08:00
SangBin Cho
b1f2b142d5
[Core] Ensure global state is connected when exception hook is called from the driver. (#12655) 2020-12-07 18:28:32 -08:00
fangfengbin
401d342602
[PlacementGroup]Add PlacementGroup wait python api (#12601) 2020-12-07 13:53:49 +08:00
Philipp Moritz
73a1a232b9
Ray debugger stepping between tasks (#12075) 2020-12-06 21:50:18 -08:00
fangfengbin
260b07cf0c
[PlacementGroup]Add PlacementGroup wait java api (#12499)
* add part code

* add part code

* add part code

* add part code

* fix review comments

* fix compile bug

* fix compile bug

* fix review comments

* fix review comments

* fix code style

* add part code

* fix review comments

* fix review comments

* fix code style

* rebase master

* fix bug

* fix lint error

* fix compile bug

* fix newline issue

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-05 16:40:04 +08:00
Kai Fricke
1c0d10f67e
[tune] Add xgboost_ray integration (#12572) 2020-12-04 13:59:20 -08:00
Kai Fricke
219c445648
[tune] verbosity refactor second attempt (#12571)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-12-04 13:56:26 -08:00
Xianyang Liu
7cad648370
[SGD] Fixes TorchTrainer scales up (#12563) 2020-12-04 13:55:15 -08:00
Marci
f965537ae9
[tune] Callable accepted for register_env (#12618) 2020-12-04 12:21:25 -08:00
Kai Yang
21fcee28f9
[Java] Simplify Ray.init() by invoking ray start internally (#10762) 2020-12-04 14:33:45 +08:00
Eric Liang
8cebe1e79c
[autoscaler] Fix worker capping fifo test in new scheduler (#12512) 2020-12-03 17:21:35 -08:00
Richard Liaw
1ce5e0e99f
[tune] Fix file descriptor leak by syncer (#12590) 2020-12-03 13:39:04 -08:00
Eric Liang
36e46ed923
Revert "[autoscaler/k8s] Use ray node's HOME in Kubernetes command runner. (#12417)" (#12607)
This reverts commit f669830de6.
2020-12-03 12:57:59 -08:00
Simon Mo
1f7a4806ff
[Serve] Fix Flask Request self reference (#12560)
* [Serve] Fix Flask Request self reference

* Working flag

* Fix
2020-12-03 10:45:04 -06:00
Gekho457
f669830de6
[autoscaler/k8s] Use ray node's HOME in Kubernetes command runner. (#12417) 2020-12-03 10:43:16 -06:00
fangfengbin
ff34563539
[PlacementGroup]Fix bug that kill workers mistakenly when gcs restarts (#12568) 2020-12-03 17:50:48 +08:00
Richard Liaw
7c58a85fed
[tune] fix Tensorboard file descriptor leak (#12425) 2020-12-03 00:06:54 -08:00
Eric Liang
62fbe63f34
Disable flaky test test_delete_objects_multi_node (#12584)
* update

* fix

* update
2020-12-02 19:19:12 -08:00