761 commits

Author SHA1 Message Date
Ricky Xu
8c0b0272ce
Make state API release tests stable (#28274)
Mark the state API release tests as stable; they have been passing for the last few days.

Signed-off-by: rickyyx <rickyx@anyscale.com>
2022-09-02 13:43:49 -07:00
Ricky Xu
5e0cf74377
remove env (#28218)
Try not to set special flags for nightly test.

Signed-off-by: rickyyx <rickyx@anyscale.com>
2022-09-01 11:58:13 -07:00
Stephanie Wang
213e24cafd
[tests] Remove unnecessary sleep time from pipelined ingest tests #28182
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
2022-08-31 17:43:58 -07:00
Ricky Xu
ed2929185c
[Core][State Observability] Wait for all nodes in release test (#28190)
Release tests fail in the Buildkite run but succeed reliably on manual retry.
The suspected cause is that not all nodes are available yet when running with a large number of actors.
2022-08-31 13:52:19 -07:00
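The "wait for all nodes" idea from #28190 can be sketched as a simple polling loop. This is a minimal illustration, not the code from the PR: `wait_for_nodes` and its parameters are hypothetical names, and the node count is supplied by a caller-provided function so the sketch runs without a cluster.

```python
import time
from typing import Callable


def wait_for_nodes(
    count_alive_nodes: Callable[[], int],
    expected: int,
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
) -> int:
    """Block until at least `expected` nodes are alive, or raise TimeoutError.

    Hypothetical helper: `count_alive_nodes` is any callable that returns the
    current number of alive nodes.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        alive = count_alive_nodes()
        if alive >= expected:
            return alive
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Only {alive}/{expected} nodes alive after {timeout_s}s"
            )
        time.sleep(poll_interval_s)
```

With Ray, `count_alive_nodes` could be something like `lambda: sum(1 for n in ray.nodes() if n["Alive"])`; that wiring is an assumption and is not taken from the PR.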
Artur Niederfahrenhorst
f420407b0d
[ML] Pin Pydantic <= 1.9.2 (#28205)
CI is red because of a dependency issue around dataclass_transform.

Signed-off-by: Artur Niederfahrenhorst <artur@anyscale.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-08-31 13:35:18 -07:00
Alex Wu
e643b75129
[release][ci] Update disk size on release tests (#28156)
The minimum size is 300GB

Signed-off-by: Alex Wu <alex@anyscale.io>
Signed-off-by: Kai Fricke <krfricke@users.noreply.github.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-08-30 09:29:11 -07:00
Yi Cheng
4d91f516ca
[nightly] Add serve ha chaos test into nightly test. (#27413)
This PR adds a serve ha test. The flow of the tests is:

1. check the kube ray build
2. start ray service
3. warm up the cluster
4. start killing nodes
5. get the stats and make sure it's good
2022-08-29 16:55:36 -07:00
Ricky Xu
7e560ad92c
[Core][State Observability] Release test app configs to bypass default limit (#27969)
This is needed because the release test stress-tests the State APIs and therefore needs a max limit larger than the system default; otherwise, the APIs would return an error.
2022-08-24 18:41:54 -07:00
Chen Shen
da79015be3
[2.0] update 2.0.0 benchmarks #27810
update 2.0.0 benchmarks
2022-08-19 10:34:33 -07:00
Jiajun Yao
7d981d6ced
Mark dataset_shuffle_push_based_random_shuffle_100tb as stable (#27963)
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-17 15:05:15 -07:00
Kai Fricke
b91246a093
[air/benchmarks] Measure local training time in torch/tf benchmarks (#27902)
We currently measure end-to-end training time in our benchmarks, which includes setup overhead. This is an unequal comparison, as setup overhead for vanilla training cannot be accurately expressed and was instead just disregarded.
By comparing the raw training times in the actual training loop, we will get a more accurate expression of any potential overhead or benefit in using Ray vs. vanilla tensorflow/torch.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-16 19:16:08 +02:00
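The distinction drawn in #27902 between end-to-end time and time spent in the training loop can be sketched as follows. This is an illustrative sketch only: `run_benchmark`, `setup`, and `train_loop` are hypothetical names, not functions from the benchmark scripts.

```python
import time
from typing import Any, Callable, Dict


def run_benchmark(
    setup: Callable[[], Any], train_loop: Callable[[Any], None]
) -> Dict[str, float]:
    """Time the whole run and, separately, only the training loop."""
    e2e_start = time.monotonic()
    state = setup()  # setup overhead: counted only in the end-to-end time
    loop_start = time.monotonic()
    train_loop(state)  # the part that is comparable across frameworks
    loop_end = time.monotonic()
    return {
        "end_to_end_s": loop_end - e2e_start,
        "training_loop_s": loop_end - loop_start,
    }
```

Comparing `training_loop_s` between the Ray and vanilla runs leaves setup overhead out of the comparison, which is the point the commit message makes.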
xwjiang2010
a3236b6225
[air] fix ptl release test (#27773)
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
2022-08-15 14:47:33 -07:00
xwjiang2010
68cc544da6
[release test] increase air tf gpu benchmark non smoke test timeout from 3600 to 4800. (#27869)
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
2022-08-15 19:03:40 +02:00
xwjiang2010
f77ec350fa
[release test] remove dask/modin_xgboost test completely. (#27865)
The original script was removed in https://github.com/ray-project/ray/pull/27816
This just cleans up some remnants.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
2022-08-15 16:55:33 +02:00
Jian Xiao
5a18b1fc45
Spread the actors in data ingest benchmark, which 2x the throughput (#27620)
The consuming actors were not spread; this PR fixes that, which improves throughput by 2x.
2022-08-11 11:47:54 -07:00
Ricky Xu
5ea4747448
[Core][State Observability] Nightly release test for state API (#26610)
* Initial

* Correctness test skeleton

* Added limit for listing

* Updated grpc config

* no more waiting

* metrics

* Updated constant and add test

* renamed

* actors

* actors

* actors

* dada

* actor dead?

* Script

* correct test name

* limit

* Added timeout

* release test /2

* Merged

* format+doc

* wip

Signed-off-by: rickyyx <ricky@anyscale.com>

* revert packag-lock

Signed-off-by: rickyyx <rickyx@anyscale.com>

* wip

* results

Signed-off-by: rickyx <rickyx@anyscale.com>

Signed-off-by: rickyyx <rickyx@anyscale.com>
Signed-off-by: rickyyx <ricky@anyscale.com>
Signed-off-by: rickyx <rickyx@anyscale.com>
Co-authored-by: rickyyx <ricky@anyscale.com>
2022-08-11 07:01:01 -07:00
Artur Niederfahrenhorst
0dceddb912
[RLlib] Move learning_starts logic from buffers into training_step(). (#26032)
2022-08-11 13:07:30 +02:00
matthewdeng
8eca6ae852
[rllib][release] mark long_running_many_ppo as unstable (#26874)
Per #26718 (comment)
2022-08-10 17:58:33 -07:00
Avnish Narayan
aee008ab49
[RLlib] PPO release tests tuned and re-enabled. (#27564)
2022-08-08 21:04:19 +02:00
Jian Xiao
30cf449807
Add data ingest benchmark (#27533)
Make sure Dataset/DatasetPipeline work performantly for data ingestion.
2022-08-05 12:31:06 -07:00
Avnish Narayan
6a31b61580
[RLlib] CQL change hparams and data reading strategy (#27451)
2022-08-04 18:55:32 -07:00
Avnish Narayan
55209692ee
[RLlib] Deflake MARWIL and BC and remove memory leak from torch MARWIL policy (#27406)
2022-08-03 16:53:12 -07:00
Jimmy Yao
1c1cca2736
[release/ray-lightning] adjust the release test of ray lightning master
First of all, sorry, I messed up the previous PR when syncing with master (#27374). This PR is a duplicate of the previous PR, plus an updated change (adding a version check for ray_lightning for compatibility). Apologies also for the massive number of review requests on the previous PR.
2022-08-03 16:01:32 +01:00
Simon Mo
8ac6d02502
[Serve][Nightly] Environment for Nightly K8s Tests (#27126)
2022-08-02 23:05:47 -07:00
kourosh hakhamaneshi
bda5026428
[RLlib] Fix A2C release tests (#27314)
2022-08-02 10:44:52 -07:00
Kai Fricke
d527c7b335
[air/benchmarks] Drop OMP_NUM_THREADS in vanilla torch/tf training (#27256)
Ray automatically sets OMP_NUM_THREADS=1, potentially limiting multithreading in native pytorch/tensorflow. If this leads to performance differences, we should address this either in Ray Train or in Ray core.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-02 13:38:01 +01:00
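Dropping the variable for the vanilla run, as described in #27256, might look like the sketch below. `drop_omp_num_threads` is a hypothetical helper, not code from the PR; it is assumed here that the variable has to be removed before the native framework initializes its thread pool.

```python
import os
from typing import MutableMapping, Optional


def drop_omp_num_threads(
    env: Optional[MutableMapping[str, str]] = None,
) -> Optional[str]:
    """Remove OMP_NUM_THREADS from `env` (default: os.environ).

    Returns the previous value, or None if the variable was not set.
    """
    target = os.environ if env is None else env
    return target.pop("OMP_NUM_THREADS", None)
```

Returning the old value lets a benchmark log what the limit was before it was dropped.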
Kai Fricke
149c031c4b
[tune/release] Do not use spot instances in k8s tests (#27250)
Spot instances are not being booted up, so let's go without them.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-02 11:30:41 +01:00
xwjiang2010
c9579fea1c
[air] update pytorch_training_e2e.py to use iter_torch_batches. (#27241)
update pytorch_training_e2e.py to use iter_torch_batches.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
2022-08-01 19:23:01 +01:00
Dmitri Gekhtman
8bdeb30510
[docs][ml][kuberay] Add a --disable-check flag to the XGBoost benchmark. (#27277)
This PR adds a flag --disable-check to the XGBoost benchmark script which disables the RuntimeError that comes up if training or prediction took too long. This is meant for non-CI exploratory use-cases.

Specifically, the reason is this:
We will include the XGBoost benchmark as an example workload for the KubeRay documentation.
The actual performance of the workload is highly sensitive to infrastructure environment, so we won't want to raise an alarming RuntimeError if the workload took too long on the user's infrastructure.
(When I tried the 100Gb benchmark on KubeRay, training ran just a couple of minutes longer than the 1000 second cutoff.)
2022-07-29 14:31:10 -07:00
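A flag like the `--disable-check` described in #27277 can be sketched with `argparse`. Only the flag name, the RuntimeError, and the 1000 second training cutoff come from the commit message; `parse_args`, `check_duration`, and `TRAINING_CUTOFF_S` are hypothetical names, and the real script may be structured differently.

```python
import argparse
from typing import List, Optional

TRAINING_CUTOFF_S = 1000  # training cutoff mentioned in the commit message


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--disable-check",
        action="store_true",
        help="Do not raise if training or prediction took too long.",
    )
    return parser.parse_args(argv)


def check_duration(
    name: str, elapsed_s: float, cutoff_s: float, disable_check: bool
) -> None:
    """Raise RuntimeError when a phase ran past its cutoff, unless disabled."""
    if disable_check:
        return
    if elapsed_s > cutoff_s:
        raise RuntimeError(
            f"{name} took {elapsed_s:.0f}s, longer than the {cutoff_s:.0f}s cutoff"
        )
```

In CI the flag would be left off so regressions still fail the run; for exploratory use on other infrastructure, passing `--disable-check` skips the error.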
Jun Gong
e6e10ce4cf
[RLlib] Revert 41c9ef70. (#27243)
Also:
Add validation to make sure multi-GPU and micro-batching are not used together.
Update A2C learning test to hit the microbatching branch.
Minor comment updates.
2022-07-29 11:05:15 -07:00
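The validation mentioned in #27243 (multi-GPU and micro-batching must not be combined) could be sketched as below. This is not the RLlib implementation: `validate_config` is a hypothetical function, and the config keys `microbatch_size` and `num_gpus` are assumptions made for illustration.

```python
from typing import Any, Dict


def validate_config(config: Dict[str, Any]) -> None:
    """Reject configs that enable micro-batching together with multiple GPUs."""
    microbatch_size = config.get("microbatch_size")  # assumed key name
    num_gpus = config.get("num_gpus", 0)  # assumed key name
    if microbatch_size is not None and num_gpus > 1:
        raise ValueError(
            "Micro-batching and multi-GPU training cannot be used together: "
            f"got microbatch_size={microbatch_size}, num_gpus={num_gpus}"
        )
```

Failing at config-validation time surfaces the conflict before any training resources are allocated.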
Kai Fricke
ee05fc94fe
[tune] Increase volume size for long running pbt failure (#27163)
Currently running into an issue:

Cluster startup Failed. Error: RuntimeError: botocore.exceptions.ClientError: An error occurred (InvalidBlockDeviceMapping) when calling the RunInstances operation: Volume of size 202GB is smaller than  snapshot 'snap-02c4e6a0ad06cf3d6', expect size >= 400GB
2022-07-28 22:57:26 -07:00
Clark Zinzow
3730ec8cc9
[AIR - Datasets] Fix AIR release tests dealing with tensor columns. (#27221)
This PR fixes some AIR release tests that deal with tensor columns.
2022-07-28 14:34:11 -07:00
Simon Mo
8beb887bbe
[Serve] Remove release tests for checkpoint_path (#27194)
2022-07-28 12:30:30 -07:00
Kai Fricke
3cd9a0446b
[tune/rllib/release] Load correct metadata file in rllib cloud tests (#27164)
Currently this tries to load a stale metadata file that doesn't exist anymore after internal refactoring.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-28 15:51:09 +01:00
Kai Fricke
1d3c167bfe
[rllib/release] Fix rllib connect test with Tuner() API (#27155)
Currently failing because the Tune framework example does not return fitting results.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-28 11:08:02 +01:00
matthewdeng
0319dcd889
[air] fix xgboost_benchmark script by passing in args (#27146)
2022-07-27 19:08:15 -07:00
xwjiang2010
eb69c1ca28
[air] Add annotation for Tune module. (#27060)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-07-27 13:53:46 -07:00
Malinda
1d789aee63
[RLlib/Serve/Release tests] Few code refactoring for better use of efficient NumPy functions. (#26284)
2022-07-27 22:38:35 +02:00
Simon Mo
e5a8b1dd55
[Serve] Add API Annotations And Move to _private (#27058)
2022-07-27 09:08:26 -07:00
SangBin Cho
a6fe2c1e87
[Release test] Add a memory monitor to nightly test long running actor death (#27083)
Add a memory monitor to the nightly long-running actor death test. It will be used to detect memory leaks in the test.
2022-07-27 07:32:10 -07:00
Amog Kamsetty
862d10c162
[AIR] Remove ML code from ray.util (#27005)
Removes all ML related code from `ray.util`

Removes:
- `ray.util.xgboost`
- `ray.util.lightgbm`
- `ray.util.horovod`
- `ray.util.ray_lightning`

Moves `ray.util.ml_utils` to other locations

Closes #23900

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-07-27 14:24:19 +01:00
xwjiang2010
4c30325172
[air] update xgboost test (catch test failures properly). (#27023)
- Update xgboost test (catch test failures properly)
- Remove `path` from `from_model` for XGBoostCheckpoint and LightGbmCheckpoint.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
2022-07-27 12:18:51 +01:00
Kai Fricke
ce5c5d858b
[ci/release/RLlib] Fix IMPALA long running release test. (#27086)
2022-07-27 12:38:32 +02:00
Avnish Narayan
f5a9a44b9c
[RLlib] Revert Revert Fix apex long running test (#26928)
2022-07-26 15:10:25 -07:00
Balaji Veeramani
89f7f2a567
[Datasets] Add size parameter to ImageFolderDatasource (#26975)
If you read a folder with differently-sized images, `ImageFolderDatasource` errors. This PR fixes the issue by resizing images to a user-specified size.
2022-07-26 14:57:38 -07:00
matthewdeng
1bb7651e95
[air] add smoke-test flag to tensorflow_benchmark (#26999)
Increase ratio from 1.15 to 1.2

Signed-off-by: Matthew Deng <matt@anyscale.com>
2022-07-26 15:47:37 +01:00
Sihan Wang
8ecd928c34
[Serve] Make the checkpoint and recover only from GCS (#26753)
2022-07-25 14:24:53 -07:00
Chen Shen
acbab51d3e
[Nightly] fix microbenchmark scripts (#26947)
Signed-off-by: scv119 <scv119@gmail.com>

Why are these changes needed?
The microbenchmarks failed with:

    raise ValueError(f"Malformed address: {address}")
    ValueError: Malformed address:

This is caused by 55a0f7b. The fix is to set RAY_ADDRESS="local".
2022-07-24 14:16:43 -07:00
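One way to apply the `RAY_ADDRESS="local"` fix from #26947 when launching a benchmark process is sketched below. How the PR actually sets the variable is not shown in the commit message, so `microbenchmark_env` is a hypothetical helper.

```python
import os
from typing import Dict, Mapping, Optional


def microbenchmark_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with RAY_ADDRESS forced to "local"."""
    env = dict(os.environ if base is None else base)
    env["RAY_ADDRESS"] = "local"
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run`, leaving the parent process environment untouched.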
Avnish Narayan
a50a81a13a
Revert "[RLlib] Fix apex breakout release test performance. (#26867)" (#26927)
2022-07-23 17:27:50 +02:00
Avnish Narayan
2cfd6c2e97
[RLlib] Fix apex breakout release test performance. (#26867)
2022-07-23 13:53:03 +02:00