Commit graph

3761 commits

Author SHA1 Message Date
architkulkarni
3ce03a52bc
Revert "Revert "Revert "Unhandled exception handler based on local ref counti… (#14113)" (#14136)
This reverts commit e457872fe1.
2021-02-16 11:47:09 -08:00
SangBin Cho
b05f87d7b2
[Object Spilling] Share the same S3 session for smart_open spilling. (#13904) 2021-02-16 10:40:55 -08:00
Barak Michener
c43a64230e
[ray_client]: Fix mutual recursion (#14122) 2021-02-16 10:37:58 -08:00
SangBin Cho
684bb32cdf
Fix assert get_outer_ref None failed + Support better traceback. (#14126)
* in progress.

* Better exception handling & stacktrace.

* done.
2021-02-16 10:09:01 -08:00
Richard Liaw
864956f817
fix-skopt (#14116)
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-16 14:36:19 +01:00
Eric Liang
e434ffe06c
[tune] Avoid crash in client mode when return results creating logdir (#14115) 2021-02-15 19:25:14 -08:00
Ian Rodney
350fb5b9d1
[autoscaler] Remove Hardcoded 8265 (#14112) 2021-02-15 18:04:00 -08:00
Patrick Ames
da0c2c99a0
[autoscaler] Fix bad reference error when specifying IamInstanceProfile by name in config. (#14083) 2021-02-15 16:29:36 -08:00
Jack Parker-Holder
ebb6e552d2
[tune] PB2 - add small constant (#14118) 2021-02-15 16:04:10 -08:00
Edward Oakes
5e763893ea
[serve] Don't overwrite self.handle in StarletteEndpoint (#14111) 2021-02-15 17:51:54 -06:00
SangBin Cho
4ad79ca963
[Object Spilling] Remove LRU eviction (#13977)
* done.

* formatting.

* done.

* done.
2021-02-15 14:24:53 -08:00
Eric Liang
e457872fe1
Revert "Revert "Unhandled exception handler based on local ref counti… (#14113)
* Revert "Revert "Unhandled exception handler based on local ref counting (#14049)" (#14099)"

This reverts commit b45ae76765.

* remove test

* fix

* fix
2021-02-15 14:11:11 -08:00
architkulkarni
496dd297e5
skip test_basic_reconstruction_actor_task on win (#14110) 2021-02-15 10:17:33 -08:00
architkulkarni
0fb96a61fc
[Serve] Add support for variable routes (#13968) 2021-02-15 11:42:42 -06:00
Richard Liaw
4d727e4cdf
[tune] enable more tests (#13969)
* try-this

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* fix

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* test

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* fix-tests

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* address

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* fix

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* real-ray

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* fix-client

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* fix-race-condition

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* revert-new-tune-tests

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* Revert "revert-new-tune-tests"

This reverts commit 3866b920bc47ac4b5cb9dab8f7b9d50e4acdb27a.

* format

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* update

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

* build

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-15 09:19:55 -08:00
SangBin Cho
b45ae76765
Revert "Unhandled exception handler based on local ref counting (#14049)" (#14099)
This reverts commit 9dc671ae02.
2021-02-14 22:08:32 -08:00
architkulkarni
75568f856c
skip restart and multi restart test on win (#14084) 2021-02-14 15:17:54 -08:00
Eric Liang
9dc671ae02
Unhandled exception handler based on local ref counting (#14049) 2021-02-12 22:58:38 -08:00
Erik Erlandson
ff1b26274e
[operator] expose RAY_CONFIG_DIR env var (fix #14074) (#14076) 2021-02-12 17:47:00 -08:00
architkulkarni
20f6cc2cb2
skip test_basic_reconstruction_put on win (#14082) 2021-02-12 15:47:00 -08:00
Clark Zinzow
c9a9d422c7
[OBOD] Disable the ownership-based object directory for all tests that use ray.objects(). (#14065) 2021-02-12 12:12:57 -08:00
Amog Kamsetty
a430ac2334
[Tune] Revert Pinning Tune Dependencies (#14059)
* remove lockfiles

* docker

* remove constraint file

* fix
2021-02-11 15:43:09 -08:00
Clark Zinzow
cd7e567a57
[Core] Ownership-based Object Directory - Added support for object spilling in the ownership-based object directory. (#13948)
* Add support for object spilling in the ownership-based object directory.

* Move owner address hashmap into pinned_objects_ and objects_pending_spill_.

* Update local object manager tests.

* Feedback and misc. fixes.

* Move spilled unpin callback lambda to std::binded private method.

* Skip test_delete_objects_multi_node test on MacOS for now.
2021-02-11 10:36:22 -08:00
Ian Rodney
f6cfc44dbd
[autoscaler] run setup commands with restart_only=True (#13836) 2021-02-10 20:17:20 -08:00
Ameer Haj Ali
d87a82e891
Revert "Revert "[Autoscaler] Monitor refactor for backward compatability. (#13970)" (#14046)" (#14050)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* Revert "Revert "[Autoscaler] Monitor refactor for backward compatability. (#13970)" (#14046)"

This reverts commit 6f9d39fb3e.

* fake news

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2021-02-10 17:59:08 -08:00
architkulkarni
6f9d39fb3e
Revert "[Autoscaler] Monitor refactor for backward compatability. (#13970)" (#14046)
This reverts commit 7a6f8054d1.
2021-02-10 12:16:52 -08:00
fangfengbin
1754359281
[Core]Fix ray.kill doesn't cancel pending actor bug (#14025) 2021-02-10 15:30:21 +08:00
Dmitri Gekhtman
8ca0a32819
HotFix k8s autoscaling (#14024) 2021-02-09 22:34:24 -08:00
Eric Liang
8b7cf7cab9
Add tip on how to disable Ray OOM handler (#14017) 2021-02-09 21:52:22 -08:00
Ameer Haj Ali
7a6f8054d1
[Autoscaler] Monitor refactor for backward compatability. (#13970) 2021-02-09 21:41:50 -08:00
Eric Liang
7f342eb371
Update example shuffle script (#14021) 2021-02-09 20:47:41 -08:00
Clark Zinzow
79c7c181f3
[dask-on-ray] Add multiple return DataFrame shuffle optimization. (#13951) 2021-02-09 15:39:48 -08:00
Simon Mo
f51c26bae6
Revert "[Core]Fix ray.kill doesn't cancel pending actor bug (#13254)" (#14013)
This reverts commit 2092b097ea.
2021-02-09 11:36:38 -08:00
Alex Wu
1dcdfe9101
[autoscaler/dashboard] Publish resource usage in units of bytes (#14002) 2021-02-09 10:27:26 -08:00
Crissman Loomis
43083b9653
[docs] optuna variable typo (#14006)
* fix variable name typo

* align
2021-02-09 09:51:29 -08:00
Kai Fricke
3c8b164882
[tune] pass trainable function name when using tune.with_parameters (#14009) 2021-02-09 08:51:14 -08:00
fangfengbin
2092b097ea
[Core]Fix ray.kill doesn't cancel pending actor bug (#13254) 2021-02-09 10:59:14 +08:00
Dmitri Gekhtman
081f3e5f07
[autoscaler][kubernetes] Ray client setup, example config simplification, example scripts. (#13920) 2021-02-08 20:00:34 -06:00
Ameer Haj Ali
1643bc5c4f
Fix autoscaler wrong parameter names (#13966)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* improve code readability

* lint

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2021-02-08 13:19:33 -08:00
Xianyang Liu
918ad84f08
[core] Java worker should respect the user provided node_ip_address (#13732) 2021-02-08 11:59:06 +08:00
Richard Liaw
7231b6b91c
[core/client] enable more tests (#13961) 2021-02-07 19:37:52 -08:00
Richard Liaw
3a230fa1a4
[ray_client] close ray connection upon client deactivation (#13919) 2021-02-07 13:11:38 -08:00
Clark Zinzow
f070b3c9a9
[dask-on-ray] Fix Dask-on-Ray test: Python 3 dictionary .values() is a view, and is not indexable (#13945) 2021-02-05 21:21:41 -08:00
Travis Addair
cbd3598970
[tune] Fixed wait_for_gpu to handle str representations of ordinal IDs (#13936)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-05 15:41:24 -08:00
Hao Chen
e1a5e5bad4
Fix test_actor_restart (#13901) 2021-02-05 14:08:43 -08:00
Amog Kamsetty
f44f368eae
[Tune] Add try-except to FailureInjectorCallback (#13939) 2021-02-05 11:02:42 -08:00
Eric Liang
f782ed59a0
Ray client version check strict eq (#13926) 2021-02-05 00:06:10 -08:00
DK.Pino
fb89f9c2c8
[Placement Group] Support named placement group (#13755) 2021-02-05 11:04:51 +08:00
Kathryn Zhou
982c606b86
Add more user-friendly error message upon async def remote task (#13915) 2021-02-04 18:33:33 -08:00
architkulkarni
e89bbcbd44
[Serve] Revert "Revert "[Serve] Fix ServeHandle serialization"" and disable failing Windows test (#13771) 2021-02-04 14:50:01 -08:00