Commit graph

114 commits

Author SHA1 Message Date
Kai Fricke
153a8b8fec
[release] convert tune release tests (#15913) 2021-06-01 11:19:15 -07:00
Sven Mika
c9d220bcda
[RLlib] Upgrade RLlib regression test scripts to new testing tool - RLlib release logs for 1.4. (#16080) 2021-06-01 17:39:18 +02:00
Amog Kamsetty
da6f28d777
[Release] Add multi-node, multi-GPU SGD release test (#16046) 2021-05-31 16:23:04 -07:00
SangBin Cho
9fa3b9f6f3
[Nightly test] Test non streaming shuffle (#16150) 2021-05-31 15:28:02 -07:00
SangBin Cho
94dc06d852
[Nightly test] improve error detection (#16102)
* improve error detection

* improve gitignore

* fix
2021-05-27 00:33:21 -07:00
SangBin Cho
ee1ccb569d
[Test] Nightly shuffle test (#15998)
* shuffle daily test update.

* lint

* Improve testing.

* Download the real nightly.

* Addressed code review.

* fix typo

* fix issue

* fix the broken release test

* Updated the test.
2021-05-24 15:33:31 -07:00
mwtian
5462c6e7de
Fix link to release checklist from release process doc. (#15793) 2021-05-13 13:34:54 -07:00
SangBin Cho
259fcbd5bd
[Pubsub] Generalize the pubsub interface and adapt it for ref counting protocol (#15446)
* Add mock code first

* In the initial progress.

* Fix the number error

* In progress.

* in more progress.

* in progress.

* lint.

* Prototype done.

* Fix compilation bug.

* Now it is working with reference counting.

* Remove template.

* lint.

* Fixed issues.

* Fix reference count test.

* Reference count test passes now.

* Fixed the test array problem

* Addressed code review.

* lint.

* Addressed half of code review.

* Fix tests.

* Addressed the most critical issue.

* Make subscriber thread-safe.

* Revert "Make subscriber thread-safe."

This reverts commit 9a6a52197cfa8463ab60dfaae9530ad3c0ed8790.

* Fixed test failures. The only failure now is the asan failure.

* Reset test suites and see if it fixes the issue.

* Fix a flaky test

* Addressed code review.
2021-05-13 09:29:02 -07:00
Eric Liang
0dfd43c61b
Add nightly release test directory and add shuffle release test (#15671)
* update

* update

* update

* update

* update

* Adjust script/release test json

* remove

* update

* lint

Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-05-08 14:21:55 -07:00
Kai Fricke
8db2e5c23a
[release] Move xgboost tune small + microbenchmark release test to new release automation (#15619) 2021-05-08 20:38:39 +01:00
Kai Fricke
1d52ab819f
[release] release 1.3.0 results and test updates (#15366)
Convert a number of release tests and add logs for release 1.3.0
2021-05-04 22:10:04 +01:00
Jenna Kwon
15da948214
Support object spilling mode and data load failure mode in dask_on_ra… (#15601)
* Support object spilling mode and data load failure mode in dask_on_ray_large_scale_test.py

* Remove freq and time decimation

Co-authored-by: Jenna Kwon <jkkwon@amazon.com>
2021-05-04 10:57:49 -07:00
Amog Kamsetty
ebc44c3d76
[CI] Upgrade flake8 to 3.9.1 (#15527)
* formatting

* format util

* format release

* format rllib/agents

* format rllib/env

* format rllib/execution

* format rllib/evaluation

* format rllib/examples

* format rllib/policy

* format rllib utils and tests

* format streaming

* more formatting

* update requirements files

* fix rllib type checking

* updates

* update

* fix circular import

* Update python/ray/tests/test_runtime_env.py

* noqa
2021-05-03 14:23:28 -07:00
SangBin Cho
df9329160e
[Tests] Dask on ray release test (#15256)
* done.

* Linting.

* Update readme

* Update.

* Fix issues.
2021-04-15 10:30:17 -07:00
SangBin Cho
d0e83c43ca
[Release Test] Modify parameter to reduce stress (#15048)
* Fix.

* Fix.
2021-04-14 18:27:20 -07:00
Richard Liaw
59bf3a7b22
ray[cluster] -> ray[default] (#15251) 2021-04-14 09:37:04 -07:00
Edward Oakes
0f9d1bb223
Serve failure release test fix (#15276)
This test is currently not tested in CI
2021-04-13 17:49:29 +01:00
Edward Oakes
e4ca337e16
[serve] Change remaining tests to use deployment API (#15167) 2021-04-08 08:15:38 -05:00
Richard Liaw
e72f6b0377
Fix ray[full] -> ray[cluster] #15112
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
2021-04-05 09:55:00 -07:00
Kai Fricke
b366500938
[tune] fix long running release test WIP (#14866)
- Use placement groups
- Introduce time between checks for failure testing
- Use gloo instead of nccl
2021-03-25 11:03:22 +01:00
Amog Kamsetty
233f174984
Update release instructions (#14882) 2021-03-24 12:41:50 -07:00
SangBin Cho
5f7ce293fe
[Test] Large scale dask on ray test (#14340)
* Add a test.

* Add a test.

* d

* Modify the release doc.

* Addressed code review.
2021-03-23 11:00:35 -07:00
Kai Fricke
7364a7a327
[tune] Move Optuna to ask(fixed_distributions) interface (#14731)
Adjusting to changes in Optuna 2.6.0. Old interface was marked as deprecated.
2021-03-22 12:25:37 +01:00
Ian Rodney
eb12033612
[Code Cleanup] Switch to use ray.util.get_node_ip_address() (#14741)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-03-18 13:10:57 -07:00
Kai Fricke
4014168928
[tune] Introduce durable() wrapper to convert trainables into durable trainables (#14306)
* [tune] Introduce `durable()` wrapper to convert trainables into durable trainables

* Fix wrong check

* Improve docs, add FAQ for tackling overhead

* Fix bugs in `tune.with_parameters`

* Update doc/source/tune/api_docs/trainable.rst

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Update doc/source/tune/_tutorials/_faq.rst

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-26 13:59:28 +01:00
SangBin Cho
5740b2391e
Add multi node data processing cluster.yaml (#14198) 2021-02-19 16:16:55 -08:00
Kai Fricke
a0f73cf3f7
[xgboost] Update XGBoost release test configs (#13941)
* Update XGBoost release test configs

* Use GPU containers

* Fix elastic check

* Use spot instances for GPU

* Add debugging output

* Fix success check, failure checking, outputs, sync behavior

* Update release checklist, rename mounts
2021-02-17 23:00:49 +01:00
Alex Wu
4846a6c2d0
Release process update (#13798) 2021-02-15 11:40:49 -08:00
Kai Fricke
1ef2a6790c
[tune] add scalability release tests (#13986)
* Add scalability tests

* Network overhead cluster

* Update xgboost tests

* Document release tests

* Don't raise on failed trial

* Update to multi node yamls

* Update yamls

* Revert xgboost test changes

* Fix import

* Update release/tune_tests/scalability_tests/workloads/test_bookkeeping_overhead.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Pass aws credentials (WIP)

* Update durable trainable example

* Update xgboost sweep

* Change xgboost scope, fix durable trainable stop condition

* Fix max depth to limit total test length

* Add cluster information to test descriptions. Update release checklist/process docs

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-02-10 17:16:31 +01:00
Kai Fricke
1e113d2e6e
[tune/xgboost] Update release test docs (#13880)
* Update release test docs

* Update
2021-02-04 13:10:56 +01:00
Amog Kamsetty
2ba77ae3a2
[Release] Fix SGD+Tune long running distributed release test (#13812)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-01-31 21:05:50 -08:00
SangBin Cho
c21a79ae6e
[Object Spilling] 100GB shuffle release test (#13729) 2021-01-29 12:38:06 -08:00
Ian Rodney
b4bcb9b60a
[Docker] Use Cuda 11 (#13691) 2021-01-27 13:45:30 -08:00
Alex Wu
840987c7af
Scalability Envelope Tests (#13464) 2021-01-25 18:48:31 -08:00
Simon Mo
fe8262afd0
Add K8s test to release process (#13694) 2021-01-25 16:53:52 -08:00
Ameer Haj Ali
b7dd7ddb52
deprecate useless fields in the cluster yaml. (#13637)
* prepare for head node

* move command runner interface outside _private

* remove space

* Eric

* flake

* min_workers in multi node type

* fixing edge cases

* eric not idle

* fix target_workers to consider min_workers of node types

* idle timeout

* minor

* minor fix

* test

* lint

* eric v2

* eric 3

* min_workers constraint before bin packing

* Update resource_demand_scheduler.py

* Revert "Update resource_demand_scheduler.py"

This reverts commit 818a63a2c86d8437b3ef21c5035d701c1d1127b5.

* reducing diff

* make get_nodes_to_launch return a dict

* merge

* weird merge fix

* auto fill instance types for AWS

* Alex/Eric

* Update doc/source/cluster/autoscaling.rst

* merge autofill and input from user

* logger.exception

* make the yaml use the default autofill

* docs Eric

* remove test_autoscaler_yaml from windows tests

* lets try changing the test a bit

* return test

* lets see

* edward

* Limit max launch concurrency

* commenting frac TODO

* move to resource demand scheduler

* use STATUS UP TO DATE

* Eric

* make logger of gc freed refs debug instead of info

* add cluster name to docker mount prefix directory

* grrR

* fix tests

* moving docker directory to sdk

* move the import to prevent circular dependency

* small fix

* ian

* fix max launch concurrency bug to assume failing nodes as pending and consider only load_metric's connected nodes as running

* small fix

* deflake test_joblib

* lint

* placement groups bypass

* remove space

* Eric

* first commit

* lint

* example

* documentation

* hmm

* file path fix

* fix test

* some format issue in docs

* modified docs

* joblib strikes again on windows

* add ability to not start autoscaler/monitor

* a

* remove worker_default

* Remove default pod type from operator

* Remove worker_default_node_type from rewrite_legacy_yaml_to_availble_node_types

* deprecate useless fields

Co-authored-by: Ameer Haj Ali <ameerhajali@ameers-mbp.lan>
Co-authored-by: Alex Wu <alex@anyscale.io>
Co-authored-by: Alex Wu <itswu.alex@gmail.com>
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Ameer Haj Ali <ameerhajali@Ameers-MacBook-Pro.local>
Co-authored-by: root <root@ip-172-31-56-188.us-west-2.compute.internal>
Co-authored-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2021-01-23 12:06:51 -08:00
Kai Fricke
8804758409
[xgboost] Add XGBoost release tests (#13456)
* Add XGBoost release tests

* Add more xgboost release tests

* Use failure state manager

* Add release test documentation

* Fix wording

* Automate fault tolerance tests
2021-01-20 18:40:23 +01:00
Simon Mo
c963cbc038
Fix Docker Permission for Serve release test again (#13543) 2021-01-19 12:23:30 -08:00
Sven Mika
93c0a5549b
[RLlib] Deprecate vf_share_layers in top-level PPO/MAML/MB-MPO configs. (#13397) 2021-01-19 09:51:35 +01:00
SangBin Cho
1179db1fc2
Remove an unnecessary file (#13499) 2021-01-15 18:29:12 -08:00
Eric Liang
ee6332dbb0
Bump dev branch to 2.0 to avoid endless version bump toil (#13497)
* wip

* fix

* fix
2021-01-15 17:41:17 -08:00
SangBin Cho
d09df55b14
Update ID specification doc (#13356) 2021-01-15 15:15:51 -08:00
Simon Mo
16e8c4a69f
[Release] Fix Serve release test (#13303)
The Docker image we were using now uses `ray` users so we have to call
sudo.
2021-01-14 12:23:53 -08:00
SangBin Cho
0428537d0b
[Object Spilling] Long running object spilling test (#13331)
* done.

* formatting.
2021-01-12 16:53:13 -08:00
Kai Fricke
518427627b
[tune] buffer trainable results (#13236)
* Working prototype

* Pass buffer length, fix tests

* Don't buffer per default

* Dispatch and process save in one go, added tests

* Fix tests

* Pass adaptive seconds to train_buffered, stop result processing after STOP decision

* Fix tests, add release test

* Update tests

* Added detailed logs for slow operations

* Update python/ray/tune/trial_runner.py

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>

* Apply suggestions from code review

* Revert tests and go back to old tuning loop

* nit

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-01-12 18:52:47 +01:00
Simon Mo
c32ad2fef5
[Release] Use ray-ml image for long running test (#13267) 2021-01-07 10:31:46 -08:00
Max Fitton
5094734205
Update autoscaler-cluster yaml files for release tests (#13114) 2021-01-07 11:44:57 -06:00
Simon Mo
01dcb993c7
[Serve] Rescale Serve's Long Running Test to Cluster Mode (#13247)
Now that `HeadOnly` becomes the new default HTTP location, we can
re-enable the long running tests to use local multi-clusters.
(also fixed the controller's API to match up to date, we should
have caught these, I will open issues for this.)
2021-01-07 08:57:24 -08:00
Max Fitton
0d61ea9b06
[Release] Add 1.1.0 release test logs (#13054)
* Add microbenchmark to release logs

* check in many_tasks stress test result

* Add results of placement group stress test for 1.1.0

* Add result for test_dead_actors test and correct the name of test_many_tasks.txt

* Add rllib regression test result

* Add pytorch test results for rllib

* remove extraneous log entries
2021-01-06 11:03:16 -08:00
Max Fitton
d018212db5
[Release] Update Release Process Documentation (#13123) 2021-01-04 11:09:43 -08:00