Commit graph

3460 commits

Author SHA1 Message Date
Edward Oakes
62d6b0a558
Fix max_task_retries for named actors (#12762) 2020-12-10 18:24:55 -06:00
Edward Oakes
c7b6ec88ef
[serve] Make serve __del__ log DEBUG level (#12766) 2020-12-10 18:14:55 -06:00
Edward Oakes
3c44c0d3e4
[serve] Long polling for routes in http server (#12724) 2020-12-10 18:02:02 -06:00
Eric Squires
9f70293700
Remove debug extras from setup.py (#12751) 2020-12-10 16:23:11 -06:00
architkulkarni
3fd3cb96ed
[Utils] Add Queue async and batch methods (#12578) 2020-12-10 10:49:18 -06:00
Ian Rodney
38ba238606
[serve] Create FutureResults from ControllerAPI (#12577) 2020-12-10 10:44:08 -06:00
Kai Yang
e3b5deb741
[Multi-tenancy] Delete flag enable_multi_tenancy and remove old code path (#10573) 2020-12-10 19:01:40 +08:00
Ameer Haj Ali
2f8e308444
[autoscaler] LoadMetrics missed logger.debug (#12714) 2020-12-09 17:19:36 -08:00
Richard Liaw
974570b4fb
oops (#12728)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2020-12-09 13:38:10 -08:00
Edward Oakes
c9873cdbc3
[Serve] Remove unused assign_request wrapper (#12721) 2020-12-09 12:22:43 -08:00
Ian Rodney
19542c5eb0
[docker] Default to ray-ml image (#12703) 2020-12-09 11:49:16 -08:00
Alex Wu
bd7e26b768
[Autoscaler] Temporarily suppress "Removed stale ip mappings" message. (#12689) 2020-12-08 21:55:10 -08:00
Barak Michener
dc4b5c7aa3
[ray_client] Passing actors to actors (#12585)
* start building tests around passing handles to handles

Change-Id: Ie8c3de5c8ce789c3ec8d29f0702df80ba598279f

* clean up the switch statements by moving to a method, implement state transfer, extend test

Change-Id: Ie7b6493db3a6c203d3a0b262b8fbacb90e5cdbc5

* passing

Change-Id: Id88dc0a41da1c9d5ba68f754c5b57141aae47beb

* flush out tests

Change-Id: If77c0f586e9e99449d494be4e85f854e4a7a4952

* formatting

Change-Id: I497c07cee70b52453b221ed4393f04f6f560061e

* fix python3.6 and other attributes

Change-Id: I5a2c5231e8a021184d9dfc3e346df7f71fc93257

* address documentation

Change-Id: I049d841ed1f85b7350c17c05da4a4d81d5cb03df

* formatting

Change-Id: I6a2b32a2466ffc9f03fc91ac17901b9c1a49505c

* use the pickled handle as the id bytes for actors

Change-Id: I9ddcb41d614de65d42d6f0382fe0faa7ad2c2ade

* pydoc

Change-Id: I9b32a0f383d5ff5ac052e61929b7ae3e42a89fc5

* format

Change-Id: Iac0010bb990a4025a98139ab88700030b2e9e7f5

* todos

Change-Id: I7b550800cf7499403e8a17b77484bc46f20f0afc

* tests

Change-Id: If8ebf6a335baeb113c1332acc930c41a6b4f5384

* fix lint

Change-Id: I019f41e0ec341d39bbbbd39aa43d9fb5f8b57cf0

* nits

Change-Id: I2e6813d8db34f4ce008326faa095d414c10eee95

* add some tricky, python3.6-troublesome type checking

Change-Id: Ib887fc943a6e7084002bc13dfbe113b69b4d9317
2020-12-08 21:54:55 -08:00
Ameer Haj Ali
a4dbb271bd
[hotfix][autoscaler] Request resources refactor2 (#12661)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* request_resources -> min workers

* test fixes

* add race condition tests

* Eric

* fixes

* semi final

* semi final

* lint

* lint

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2020-12-08 18:41:30 -08:00
Philipp Moritz
343b479ae2
[TEST] Fix Ray windows build for debugger (#12671)
* Fix Ray windows build for debugger

* update
2020-12-08 18:12:48 -08:00
Stephanie Wang
50f28811ac
[new scheduler] Always spill back to a feasible node if the local node is not feasible (#12557)
* fix

lint

* feasible nodes

* Enable test, cleanup

* Revert "fix"

This reverts commit aef81d04c0b4560b758f846e1afdafbdb5552efe.

* unit test

* doc
2020-12-08 13:46:58 -05:00
Kai Fricke
df10b84113
[Release] release tests yamls for Tune & GPU (#12496) 2020-12-08 10:15:07 -08:00
Gekho457
f61bc79a87
Dmitri/k8s command runner home try again (#12609) 2020-12-08 11:44:22 -06:00
Keqiu Hu
2a9079aef9
[grpc]'ray memory' fails if there are many objects in scope #8502 (#12673) 2020-12-08 09:36:53 -08:00
SangBin Cho
162f361dab
[Logging] Fix log monitor issue (#12588)
* Try fixing issues.

* Verification.
2020-12-07 22:01:18 -08:00
SangBin Cho
b1f2b142d5
[Core] Ensure global state is connected when exception hook is called from the driver. (#12655) 2020-12-07 18:28:32 -08:00
fangfengbin
401d342602
[PlacementGroup]Add PlacementGroup wait python api (#12601) 2020-12-07 13:53:49 +08:00
Philipp Moritz
73a1a232b9
Ray debugger stepping between tasks (#12075) 2020-12-06 21:50:18 -08:00
fangfengbin
260b07cf0c
[PlacementGroup]Add PlacementGroup wait java api (#12499)
* add part code

* add part code

* add part code

* add part code

* fix review comments

* fix compile bug

* fix compile bug

* fix review comments

* fix review comments

* fix code style

* add part code

* fix review comments

* fix review comments

* fix code style

* rebase master

* fix bug

* fix lint error

* fix compile bug

* fix newline issue

Co-authored-by: 灵洵 <fengbin.ffb@antgroup.com>
2020-12-05 16:40:04 +08:00
Kai Fricke
1c0d10f67e
[tune] Add xgboost_ray integration (#12572) 2020-12-04 13:59:20 -08:00
Kai Fricke
219c445648
[tune] verbosity refactor second attempt (#12571)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2020-12-04 13:56:26 -08:00
Xianyang Liu
7cad648370
[SGD] Fixes TorchTrainer scales up (#12563) 2020-12-04 13:55:15 -08:00
Marci
f965537ae9
[tune] Callable accepted for register_env (#12618) 2020-12-04 12:21:25 -08:00
Kai Yang
21fcee28f9
[Java] Simplify Ray.init() by invoking ray start internally (#10762) 2020-12-04 14:33:45 +08:00
Eric Liang
8cebe1e79c
[autoscaler] Fix worker capping fifo test in new scheduler (#12512) 2020-12-03 17:21:35 -08:00
Richard Liaw
1ce5e0e99f
[tune] Fix file descriptor leak by syncer (#12590) 2020-12-03 13:39:04 -08:00
Eric Liang
36e46ed923
Revert "[autoscaler/k8s] Use ray node's HOME in Kubernetes command runner. (#12417)" (#12607)
This reverts commit f669830de6.
2020-12-03 12:57:59 -08:00
Simon Mo
1f7a4806ff
[Serve] Fix Flask Request self reference (#12560)
* [Serve] Fix Flask Request self reference

* Working flag

* Fix
2020-12-03 10:45:04 -06:00
Gekho457
f669830de6
[autoscaler/k8s] Use ray node's HOME in Kubernetes command runner. (#12417) 2020-12-03 10:43:16 -06:00
fangfengbin
ff34563539
[PlacementGroup]Fix bug that kill workers mistakenly when gcs restarts (#12568) 2020-12-03 17:50:48 +08:00
Richard Liaw
7c58a85fed
[tune] fix Tensorboard file descriptor leak (#12425) 2020-12-03 00:06:54 -08:00
Eric Liang
62fbe63f34
Disable flaky test test_delete_objects_multi_node (#12584)
* update

* fix

* update
2020-12-02 19:19:12 -08:00
Edward Oakes
8058c1eb54
[serve] Add option to not start HTTP servers (#11627) 2020-12-02 16:49:34 -06:00
Kaushik B
7422abddb4
[tune] trim kwargs in shim instantiation functions (#12544) 2020-12-02 12:07:00 -08:00
Richard Liaw
da42bf29d0
[tune] horovod release test (#12495) 2020-12-02 12:04:54 -08:00
Stephanie Wang
443339ab19
[core] Move out-of-memory handling into the plasma store and support async object creation (#12186)
* Refactor to extract creation request queue

* timer on oom

* move timer out

* Move evict_if_full and on_store_full into plasma store

* Remove client-side code

* revert

* Distinguish between transient and permanent OOM delays

* update

* Move out create request queue, unit test

* unit test

* Fix max retries

* test

* Do not pin restored objects

* First pass to add polling requests, unit test passes

* worker plasma client retries plasma requests

* cleanup

* Clean up after disconnected clients, check memory leaks

* Support immediate requests in request queue

* Option to try creating immediately

* lint

* Fix build, address comments

* doc

* fixes

* debug travis

* debug

* debug

* debug

* debug

* Revert "debug"

This reverts commit 6bf2f6ee5640e71630c4aecdb7ebf54911ea32db.

Revert "debug"

This reverts commit 73017099c9b06cdaae1217bf0e0f4d23ed68a9e5.

Revert "debug"

This reverts commit 5a155529e28cee9461a598b0cdf7b6a3cc194c93.

Revert "debug"

This reverts commit b50c2101afd45d4cf663daae857bfe1b40387703.

Revert "debug travis"

This reverts commit 012b8721dedf9bca46294ae75eee2815b160368b.

* Skip if new scheduler enabled

* error message

* merge
2020-12-02 13:25:54 -05:00
Richard Liaw
a21523c709
[tune/core] serialization debugging utility (#12142)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2020-12-02 00:52:17 -08:00
Kai Fricke
63b85df828
[xgb] update docs (#12549) 2020-12-01 23:17:23 -08:00
Simon Mo
e428134137
[Hotfix] Pin llvmlite for windows build (#12559) 2020-12-01 19:43:08 -08:00
Siyuan (Ryans) Zhuang
615f974313
Add context for "test_buffer_alignment" (#12519) 2020-12-01 19:27:14 -08:00
Sven Mika
19c8033df2
[RLlib] Fix most remaining RLlib algos for running with trajectory view API. (#12366)
* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* WIP.

* LINT and fixes.
MB-MPO and MAML not working yet.

* wip

* update

* update

* remove

* remove dep

* higher

* Update requirements_rllib.txt

* Update requirements_rllib.txt

* relpos

* no mbmpo

Co-authored-by: Eric Liang <ekhliang@gmail.com>
2020-12-01 17:41:10 -08:00
Richard Liaw
4dc16730a7
[tune] with-params fix (#12522) 2020-12-01 16:47:03 -08:00
Simon Mo
7022278ce9
Deflake Serve tests (#12542) 2020-12-01 13:42:21 -08:00
Barak Michener
6412dfaf38
[ray_client] actors v0 (#12388) 2020-12-01 13:12:08 -08:00
SangBin Cho
0e892908f7
[Object Spilling] Delete spilled objects when references are gone out of scope. (#12341) 2020-12-01 13:10:39 -08:00