533 commits

Author SHA1 Message Date
SangBin Cho
c21a79ae6e
[Object Spilling] 100GB shuffle release test (#13729) 2021-01-29 12:38:06 -08:00
Ian Rodney
b4bcb9b60a
[Docker] Use Cuda 11 (#13691) 2021-01-27 13:45:30 -08:00
Alex Wu
840987c7af
Scalability Envelope Tests (#13464) 2021-01-25 18:48:31 -08:00
Simon Mo
fe8262afd0
Add K8s test to release process (#13694) 2021-01-25 16:53:52 -08:00
Ameer Haj Ali
b7dd7ddb52
deprecate useless fields in the cluster yaml. (#13637)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* smallf fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first ocmmit

* lint

* exmaple

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

* joblib strikes again on windows

* add ability to not start autoscaler/monitor

* a

* remove worker_default

* Remove default pod type from operator

* Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types

* deprecate useless fields

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>
Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2021-01-23 12:06:51 -08:00
Kai Fricke
8804758409
[xgboost] Add XGBoost release tests (#13456)
* Add XGBoost release tests

* Add more xgboost release tests

* Use failure state manager

* Add release test documentation

* Fix wording

* Automate fault tolerance tests
2021-01-20 18:40:23 +01:00
Simon Mo
c963cbc038
Fix Docker Permission for Serve release test again (#13543) 2021-01-19 12:23:30 -08:00
Sven Mika
93c0a5549b
[RLlib] Deprecate vf_share_layers in top-level PPO/MAML/MB-MPO configs. (#13397) 2021-01-19 09:51:35 +01:00
SangBin Cho
1179db1fc2
Remove an unnecessary file (#13499) 2021-01-15 18:29:12 -08:00
Eric Liang
ee6332dbb0
Bump dev branch to 2.0 to avoid endless version bump toil (#13497)
* wip

* fix

* fix
2021-01-15 17:41:17 -08:00
SangBin Cho
d09df55b14
Update ID specification doc (#13356) 2021-01-15 15:15:51 -08:00
Simon Mo
16e8c4a69f
[Release] Fix Serve release test (#13303)
The Docker image we were using now uses `ray` users so we have to call
sudo.
2021-01-14 12:23:53 -08:00
SangBin Cho
0428537d0b
[Object Spilling] Long running object spilling test (#13331)
* done.

* formatting.
2021-01-12 16:53:13 -08:00
Kai Fricke
518427627b
[tune] buffer trainable results (#13236)
* Working prototype

* Pass buffer length, fix tests

* Don't buffer per default

* Dispatch and process save in one go, added tests

* Fix tests

* Pass adaptive seconds to train_buffered, stop result processing after STOP decision

* Fix tests, add release test

* Update tests

* Added detailed logs for slow operations

* Update python/ray/tune/trial_runner.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Apply suggestions from code review

* Revert tests and go back to old tuning loop

* nit

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-01-12 18:52:47 +01:00
Simon Mo
c32ad2fef5
[Release] Use ray-ml image for logn running test (#13267) 2021-01-07 10:31:46 -08:00
Max Fitton
5094734205
Update autoscaler-cluster yaml files for release tests (#13114) 2021-01-07 11:44:57 -06:00
Simon Mo
01dcb993c7
[Serve] Rescale Serve's Long Running Test to Cluster Mode (#13247)
Now that `HeadOnly` becomes the new default HTTP location, we can
re-enable the long running tests to use local multi-clusters.
(also fixed the controller's API to match up to date, we should
have caught these, I will open issues for this.)
2021-01-07 08:57:24 -08:00
Max Fitton
0d61ea9b06
[Release] Add 1.1.0 release test logs (#13054)
* Add microbenchmark to release logs

* check in many_tasks stress test result

* Add results of placement group stress test for 1.1.0

* Add result for test_dead_actors test and correct the name of test_many_tasks.txt

* Add rllib regression test result

* Add pytorch test results for rllib

* remove extraneous log entries
2021-01-06 11:03:16 -08:00
Max Fitton
d018212db5
[Release] Update Release Process Documentation (#13123) 2021-01-04 11:09:43 -08:00
Alex Wu
a79c9fcac3
[release tests] test_many_tasks fix (#12984) 2020-12-22 11:05:33 -08:00
Max Fitton
e077bc4206
[Release] Bump master to 1.2.0 for 1.1.0 release (#12856) 2020-12-15 09:40:26 -08:00
Simon Mo
3d8c1cbae6
[Serve] Fix Serve Release Tests (#12777) 2020-12-11 11:53:47 -08:00
Eric Squires
9f70293700
Remove debug extras from setup.py (#12751) 2020-12-10 16:23:11 -06:00
Kai Fricke
df10b84113
[Release] release tests yamls for Tune & GPU (#12496) 2020-12-08 10:15:07 -08:00
SangBin Cho
3ee4612696
[Release] Fix cluster.yaml (#12589)
* Fix cluster.yaml

* Updated to use manylinux2014
2020-12-07 13:52:30 -08:00
Richard Liaw
da42bf29d0
[tune] horovod release test (#12495) 2020-12-02 12:04:54 -08:00
Eric Liang
9f322db71d
Add many_ppo long running test (#12364)
* add new tes

* update

* update
2020-11-24 16:00:33 -08:00
Sven Mika
4afaa46028
[RLlib] Increase the scope of RLlib's regression tests. (#12200) 2020-11-24 22:18:31 +01:00
Edward Oakes
32d159a2ed
Fix release directory & RELEASE_PROCESS.md (#12269) 2020-11-23 14:28:59 -06:00
Simon Mo
5df9f07ff3
[CI] Use Docker image for microbenchmarks (#12189)
* [CI] Use Docker image for microbenchmarks

* Update cluster.yaml
2020-11-19 17:54:40 -08:00
Edward Oakes
2feba4409c
[serve] Fix long running failure test (#11805) 2020-11-09 11:21:03 -06:00
Barak Michener
05c4e3fb2a
[build] Build wheels with manylinux2014 (#11621)
* necessary changes

* Split bazel install

* manylinux2014

* change references to manylinux2014

* Fix lint

* port alex's docker build changes

* fix config issue

* remove extra manylinux2010 requirement script

* revert SHA overwrite

* wip

* incompatible_linklibs

* fix nits
2020-11-03 19:36:32 -08:00
Barak Michener
4348ecf850
Clean up release tests (#11420) 2020-10-22 17:04:41 -07:00