Commit graph

7118 commits

Author SHA1 Message Date
Barak Michener
26ba95e96d
[python/ray]: add cloudpickle dependency (#13838)
Change-Id: I248a2174c27cacb84a1cf0fd1feaa99535a90b71
2021-02-01 15:27:39 -08:00
Ian Rodney
1ee5d5faff
[AWS] Fill-in AMI if not provided (#13808)
* fill in default ami if not provided

* lint fix

* quick test

* Update python/ray/tests/aws/test_autoscaler_aws.py

* Update python/ray/tests/aws/test_autoscaler_aws.py

* fix test

* fix tests

* fix lint

* remove bad test

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
2021-02-01 14:30:48 -08:00
Barak Michener
55566bc797
[ray_client]: Add python version check and test (and some minor fixes along the way) (#13722) 2021-02-01 13:04:38 -08:00
Stephanie Wang
754bee9282
[core][object spillin] Fix bugs in admission control (#13781) 2021-02-01 10:48:21 -08:00
SongGuyang
6e53a71978
bug fix for doc (#13834) 2021-02-01 21:13:43 +08:00
SongGuyang
361e5f0bef
support dynamic library loading in C++ worker (#13734) 2021-02-01 19:24:33 +08:00
Tao Wang
1d2ab018b0
Use right reserve size (#13829) 2021-02-01 15:49:34 +08:00
Ameer Haj Ali
9d7b8b58a2
[autoscaler] Remove min workers from multi node type examples (#13814)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* remove global min_workers from mult-node-type-examples

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
2021-01-31 23:29:57 -08:00
SangBin Cho
d1ec787d9d
[Object Spilling] Turn on by default. (#13745)
* Done.

* in progress.

* in progress.

* fixed tests.

* Fix.
2021-01-31 23:28:37 -08:00
Amog Kamsetty
2ba77ae3a2
[Release] Fix SGD+Tune long running distributed release test (#13812)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-01-31 21:05:50 -08:00
Lingxuan Zuo
b5f0aed974
[Log] use default stderr logger if no raylog starting (#13762) 2021-02-01 11:13:06 +08:00
Ameer Haj Ali
660857ffab
Fix windows test (#13811) 2021-01-29 21:10:59 -08:00
Dominic Ming
4b60c388ef
[Dashboard] fix new dashboard entrance and some table problem (#13790) 2021-01-30 10:42:16 +08:00
Stephanie Wang
30f82329e3
[core] Add debug information for the PullManager and LocalObjectManager (#13782)
* Add debug info

* Formatting.

Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2021-01-29 17:55:46 -08:00
Simon Mo
a3796b3ed5
[CI] Add other Travis Linux builds to buildkite (#13769) 2021-01-29 15:48:02 -08:00
Simon Mo
194656731d
[CI] Deflake test_basics and skip test_component_failures_3 (#13801) 2021-01-29 15:47:21 -08:00
Simon Mo
50808024eb
Revert "[autoscaler] Better validation for min_workers and max_workers (#13779)" (#13807)
This reverts commit 4d6817c683.
2021-01-29 15:43:01 -08:00
Barak Michener
9441f85e1a
[client] Hook runtime context (#13750)
Change-Id: I701d21e53900b5f3fb0e23e09f59e8316c7ba623
2021-01-29 12:58:41 -08:00
SangBin Cho
c21a79ae6e
[Object Spilling] 100GB shuffle release test (#13729) 2021-01-29 12:38:06 -08:00
Ian Rodney
1a9a0024d5
[Wheel] Build Py36 & Py38 in separate deploy (#13797) 2021-01-29 12:28:40 -08:00
Siyuan (Ryans) Zhuang
0b598c0f05
[Serialization] API for deregistering serializers; code & doc cleanup (#13471)
* make methods private, remove confusion brackets and usages

* unregister serializer; fix doc

* Cleanup doc

* rename unregister -> deregister
2021-01-29 10:27:05 -08:00
Eric Liang
b20a38febb
[autoscaler] Avoid launching GPU nodes when the workload only has CPU tasks. (#13776)
* wip

* avoid gpus

* update

* update
2021-01-29 09:50:28 -08:00
Ameer Haj Ali
4d6817c683
[autoscaler] Better validation for min_workers and max_workers (#13779)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first ocmmit

* lint

* exmaple

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

* joblib strikes again on windows

* add ability to not start autoscaler/monitor

* a

* remove worker_default

* Remove default pod type from operator

* Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types

* deprecate useless fields

* fix error msg

* validate sum min_workers < max_workers

* 1 more edge case test

* lint

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>
Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2021-01-29 09:41:56 -08:00
Kai Fricke
9a413144b1
[tune] dynamic global checkpointing interval (#13736)
* Add scalability tests

* Move experiment checkpointing into a manager class

* Dynamic global checkpointing

* Actually write checkpoints

* Remove debug message

* Pass `force`

* Pre-review

* Revert scalability commits

* Revert scalability commits

* Apply suggestions from code review
2021-01-29 17:14:46 +01:00
Hao Chen
0f3a3e14aa
Only delete local object in CoreWorkerPlasmaStoreProvider:::WarmupStore (#13788) 2021-01-29 20:24:09 +08:00
Dominic Ming
752da83bb7
[Dashboard] Add the new dashboard code and prompt users to try it (#11667) 2021-01-29 15:22:26 +08:00
Stephanie Wang
42d501d747
[core] Pin arguments during task execution (#13737)
* tmp

* Pin task args

* unit tests

* update

* test

* Fix
2021-01-28 19:07:10 -08:00
Ian Rodney
813a7ab0e2
[docker] Build Python3.6 & Python3.8 Docker Images (#13548) 2021-01-28 15:24:50 -08:00
Tanja Bayer
0c906a8b93
[Docker] usage of python-version (#13011)
Co-authored-by: Tanja Bayer <tanja.bayer@widas.de>
Co-authored-by: Ian Rodney <ian.rodney@gmail.com>
2021-01-28 14:27:54 -08:00
architkulkarni
cb771f263d
[Serve] Add ServeHandle metrics (#13640) 2021-01-28 14:40:47 -06:00
Sven Mika
4bc257f4fb
[RLlib] Fix custom multi action distr (#13681) 2021-01-28 19:28:48 +01:00
Lena Kashtelyan
c583113d66
[Ax] Align optimization mode and reported SEM with Ax (#13611)
* [Ax] Align optimization mode and reported SEM with Ax

Ensure that `mode` aligns with the mode set in Ax + report SEM as None rather than as 0.0 to make use of Ax noise inference

* Account for review

* Update ax.py

* Fix lint

* Fix tests, ad additional checks

* Fix tests for python 3.6

Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-01-28 19:01:51 +01:00
Yuri Rocha
b01b0f80aa
[RLlib] Fix multiple Unity3DEnvs trying to connect to the same custom port (#13519) 2021-01-28 13:28:08 +01:00
cathrinS
d4ef5c5993
[RLlib] Atari-RAM-Preprocessing, unsigned observation vector results in a false preprocessed observation (#13013) 2021-01-28 12:07:00 +01:00
Tao Wang
56ee6ef55f
[GCS]only update states related fields when publish actor table data (#13448) 2021-01-28 11:12:57 +08:00
architkulkarni
cb95ff1e56
[Serve] Add "endpoint registered" message to router log (#13752) 2021-01-27 19:03:15 -08:00
Simon Mo
4f1f558802
[Core] Hotfix Windows Compilation Error for ClusterTaskManager (#13754)
* [Core] Hotfix Windows Compilation Error for ClusterTaskManager

* fix
2021-01-27 19:01:56 -08:00
Simon Mo
c10abbb1bb
Revert "[Serve] Fix ServeHandle serialization (#13695)" (#13753)
This reverts commit 202fbdf38c.
2021-01-27 17:47:42 -08:00
Eric Liang
2e01d5d26e
Report failed deserialization of errors in Ray client 2021-01-27 17:37:50 -08:00
Zhe Zhang
0e7343ec19
[docs] Fix MLflow / Tune example in documentation (#13740)
Minor fixes to make it runnable
2021-01-27 17:16:29 -08:00
Dmitri Gekhtman
40234ad631
[autoscaler][AWS] Make sure subnets belong to same VPC as user-specified security groups (#13558)
* initial commit

* Filter subnets by security groups' VPCs

* fix stubs

* wip

* Fix inbound rule logic. Tests WIP.

* wip

* unit test

* example yaml

* Unit test tests for bug being fixed

* Update python/ray/tests/aws/utils/constants.py

Co-authored-by: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>

Co-authored-by: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com>
2021-01-27 17:00:52 -08:00
architkulkarni
28cf5f91e3
[docs] change MLFlow to MLflow in docs (#13739) 2021-01-27 16:53:15 -08:00
Simon Mo
25fa391193
[Core] Add private on_completed callback for ObjectRef (#13688) 2021-01-27 16:32:00 -08:00
SangBin Cho
32ec0d205f
[Object Spilling] Remove job id from the io worker log name. (#13746) 2021-01-27 16:26:32 -08:00
Ian Rodney
bdf0c00989
Revert "Revert "[CLI] Fix Ray Status with ENV Variable set (#13707) (#13726) 2021-01-27 15:33:33 -08:00
Alex Wu
c0fe816466
[Core/Autoscaler] Properly clean up resource backlog from (#13727) 2021-01-27 15:30:58 -08:00
Simon Mo
3644df415a
[CI] Add retry to java doc test (#13743) 2021-01-27 14:18:06 -08:00
Eric Liang
56a9523020
Fix high CPU usage in object manager due to O(n^2) iteration over active pulls list (#13724) 2021-01-27 14:02:22 -08:00
Ian Rodney
c5209e2dab
[Docker] default to /home/ray (#13738) 2021-01-27 13:46:07 -08:00
Ian Rodney
b4bcb9b60a
[Docker] Use Cuda 11 (#13691) 2021-01-27 13:45:30 -08:00