751 commits

Author SHA1 Message Date
Kai Fricke
b91246a093
[air/benchmarks] Measure local training time in torch/tf benchmarks (#27902)
We currently measure end-to-end training time in our benchmarks, which includes setup overhead. This is an unequal comparison, as setup overhead for vanilla training cannot be accurately expressed and was instead just disregarded.
By comparing the raw training times in the actual training loop, we will get a more accurate expression of any potential overhead or benefit in using Ray vs. vanilla tensorflow/torch.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-16 19:16:08 +02:00
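The distinction that commit b91246a093 draws between end-to-end time and time spent in the training loop can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: `train_with_timing`, `fake_setup`, and `fake_train_loop` are hypothetical names, and the sleeps stand in for real setup and training work.

```python
import time


def train_with_timing(setup_fn, train_loop_fn):
    """Return (end_to_end_seconds, loop_seconds) for one training run."""
    start_total = time.monotonic()
    state = setup_fn()  # setup overhead: counted only in the end-to-end time
    start_loop = time.monotonic()
    train_loop_fn(state)  # the actual training loop
    end = time.monotonic()
    return end - start_total, end - start_loop


def fake_setup():
    # Hypothetical stand-in for worker startup, data loading, etc.
    time.sleep(0.05)
    return {"epochs": 3}


def fake_train_loop(state):
    # Hypothetical stand-in for the per-epoch training work.
    for _ in range(state["epochs"]):
        time.sleep(0.01)


end_to_end, loop_only = train_with_timing(fake_setup, fake_train_loop)
print(f"end-to-end: {end_to_end:.3f}s, training loop only: {loop_only:.3f}s")
```

Comparing `loop_only` between two setups leaves setup overhead out of the comparison, which is the point the commit message makes.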
xwjiang2010
a3236b6225
[air] fix ptl release test (#27773)
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
2022-08-15 14:47:33 -07:00
xwjiang2010
68cc544da6
[release test] increase air tf gpu benchmark non smoke test timeout from 3600 to 4800. (#27869)
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
2022-08-15 19:03:40 +02:00
xwjiang2010
f77ec350fa
[release test] remove dask/modin_xgboost test completely. (#27865)
The original script was removed in https://github.com/ray-project/ray/pull/27816
This just cleans up some leftovers.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
2022-08-15 16:55:33 +02:00
Jian Xiao
5a18b1fc45
Spread the actors in data ingest benchmark, which 2x the throughput (#27620)
The consuming actors were not spread and this PR fixed it, which improved throughput by 2x.
2022-08-11 11:47:54 -07:00
Ricky Xu
5ea4747448
[Core][State Observability] Nightly release test for state API (#26610)
* Initial

* Correctness test skeleton

* Added limit for listing

* Updated grpc config

* no more waiting

* metrics

* Updated constant and add test

* renamed

* actors

* actors

* actors

* dada

* actor dead?

* Script

* correct test name

* limit

* Added timeout

* release test /2

* Merged

* format+doc

* wip

Signed-off-by: rickyyx <ricky@anyscale.com>

* revert package-lock

Signed-off-by: rickyyx <rickyx@anyscale.com>

* wip

* results

Signed-off-by: rickyx <rickyx@anyscale.com>

Signed-off-by: rickyyx <rickyx@anyscale.com>
Signed-off-by: rickyyx <ricky@anyscale.com>
Signed-off-by: rickyx <rickyx@anyscale.com>
Co-authored-by: rickyyx <ricky@anyscale.com>
2022-08-11 07:01:01 -07:00
Artur Niederfahrenhorst
0dceddb912
[RLlib] Move learning_starts logic from buffers into training_step(). (#26032)
2022-08-11 13:07:30 +02:00
matthewdeng
8eca6ae852
[rllib][release] mark long_running_many_ppo as unstable (#26874)
Per #26718 (comment)
2022-08-10 17:58:33 -07:00
Avnish Narayan
aee008ab49
[RLlib] PPO release tests tuned and re-enabled. (#27564)
2022-08-08 21:04:19 +02:00
Jian Xiao
30cf449807
Add data ingest benchmark (#27533)
Make sure Dataset/DatasetPipeline work performantly for data ingestion.
2022-08-05 12:31:06 -07:00
Avnish Narayan
6a31b61580
[RLlib] CQL change hparams and data reading strategy (#27451)
2022-08-04 18:55:32 -07:00
Avnish Narayan
55209692ee
[RLlib] Deflake MARWIL and BC and remove memory leak from torch MARWIL policy (#27406)
2022-08-03 16:53:12 -07:00
Jimmy Yao
1c1cca2736
[release/ray-lightning] adjust the release test of ray lightning master
First of all, sorry that I messed up the previous PR when syncing with master (#27374). This PR duplicates the previous PR, with one change: it adds a version check for ray_lightning for compatibility. Apologies also for the massive number of review requests on the previous PR.
2022-08-03 16:01:32 +01:00
Simon Mo
8ac6d02502
[Serve][Nightly] Environment for Nightly K8s Tests (#27126)
2022-08-02 23:05:47 -07:00
kourosh hakhamaneshi
bda5026428
[RLlib] Fix A2C release tests (#27314)
2022-08-02 10:44:52 -07:00
Kai Fricke
d527c7b335
[air/benchmarks] Drop OMP_NUM_THREADS in vanilla torch/tf training (#27256)
Ray automatically sets OMP_NUM_THREADS=1, potentially limiting multithreading in native pytorch/tensorflow. If this leads to performance differences, we should address this either in Ray Train or in Ray core.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-02 13:38:01 +01:00
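As a rough sketch of what dropping `OMP_NUM_THREADS` for a vanilla run could look like (commit d527c7b335): the helper name `drop_omp_limit` is hypothetical, and the benchmark scripts may do this differently. The sketch assumes the variable has to be removed before the framework starts its native thread pool, so in practice the call would come before importing torch or tensorflow.

```python
import os


def drop_omp_limit(environ):
    """Remove OMP_NUM_THREADS from a mapping of environment variables.

    Returns the previous value, or None if the variable was not set.
    """
    return environ.pop("OMP_NUM_THREADS", None)


# Demonstrated on a plain dict; pass os.environ to affect the real process.
env = {"OMP_NUM_THREADS": "1", "PATH": "/usr/bin"}
previous = drop_omp_limit(env)
print(f"previous value: {previous!r}, still set: {'OMP_NUM_THREADS' in env}")
```

Passing `os.environ` instead of the demo dict applies the same change to the current process and to any subprocesses it starts afterwards.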
Kai Fricke
149c031c4b
[tune/release] Do not use spot instances in k8s tests (#27250)
Spot instances are not being booted up, so let's go without them.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-02 11:30:41 +01:00
xwjiang2010
c9579fea1c
[air] update pytorch_training_e2e.py to use iter_torch_batches. (#27241)
update pytorch_training_e2e.py to use iter_torch_batches.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
2022-08-01 19:23:01 +01:00
Dmitri Gekhtman
8bdeb30510
[docs][ml][kuberay] Add a --disable-check flag to the XGBoost benchmark. (#27277)
This PR adds a flag --disable-check to the XGBoost benchmark script which disables the RuntimeError that comes up if training or prediction took too long. This is meant for non-CI exploratory use-cases.

Specifically, the reason is this:
We will include the XGBoost benchmark as an example workload for the KubeRay documentation.
The actual performance of the workload is highly sensitive to infrastructure environment, so we won't want to raise an alarming RuntimeError if the workload took too long on the user's infrastructure.
(When I tried the 100Gb benchmark on KubeRay, training ran just a couple of minutes longer than the 1000 second cutoff.)
2022-07-29 14:31:10 -07:00
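The behaviour described for `--disable-check` in commit 8bdeb30510 can be sketched with `argparse`. Only the flag name, the `RuntimeError`, and the 1000 second training cutoff come from the commit message; `build_parser`, `check_duration`, and the constant name are hypothetical and not taken from the benchmark script.

```python
import argparse

# Cutoff quoted in the commit message for the training phase.
TRAINING_TIME_LIMIT_S = 1000


def build_parser():
    parser = argparse.ArgumentParser(description="Benchmark sketch")
    parser.add_argument(
        "--disable-check",
        action="store_true",
        help="Do not raise an error if a phase runs longer than its limit.",
    )
    return parser


def check_duration(name, elapsed_s, limit_s, disable_check):
    """Raise RuntimeError when a phase exceeds its limit, unless disabled."""
    if disable_check:
        return
    if elapsed_s > limit_s:
        raise RuntimeError(
            f"{name} took {elapsed_s:.0f}s, which exceeds the {limit_s}s limit"
        )


# With the flag, a slow run is tolerated (exploratory, non-CI use).
args = build_parser().parse_args(["--disable-check"])
check_duration("training", 1100.0, TRAINING_TIME_LIMIT_S, args.disable_check)
print("check skipped:", args.disable_check)
```

Without the flag, the same call raises `RuntimeError`, which is the behaviour a CI run would rely on.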
Jun Gong
e6e10ce4cf
[RLlib] Revert 41c9ef70. (#27243)
Why are these changes needed?
Also:
- Add validation to make sure multi-gpu and micro-batch are not used together.
- Update A2C learning test to hit the microbatching branch.
- Minor comment updates.
2022-07-29 11:05:15 -07:00
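The validation mentioned in commit e6e10ce4cf (multi-GPU and micro-batching must not be combined) might look roughly like this. The function and parameter names are assumptions chosen for illustration and are not taken from RLlib's code.

```python
def validate_config(num_gpus, microbatch_size):
    """Reject configurations that combine multi-GPU with micro-batching.

    `microbatch_size=None` means micro-batching is turned off (an assumption
    made for this sketch).
    """
    if microbatch_size is not None and num_gpus > 1:
        raise ValueError(
            "Micro-batching cannot be combined with multi-GPU training: "
            f"got num_gpus={num_gpus}, microbatch_size={microbatch_size}"
        )


validate_config(num_gpus=1, microbatch_size=20)    # accepted
validate_config(num_gpus=2, microbatch_size=None)  # accepted
try:
    validate_config(num_gpus=2, microbatch_size=20)
except ValueError as err:
    print(f"rejected: {err}")
```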
Kai Fricke
ee05fc94fe
[tune] Increase volume size for long running pbt failure (#27163)
Currently running into an issue:

Cluster startup Failed. Error: RuntimeError: botocore.exceptions.ClientError: An error occurred (InvalidBlockDeviceMapping) when calling the RunInstances operation: Volume of size 202GB is smaller than  snapshot 'snap-02c4e6a0ad06cf3d6', expect size >= 400GB
2022-07-28 22:57:26 -07:00
Clark Zinzow
3730ec8cc9
[AIR - Datasets] Fix AIR release tests dealing with tensor columns. (#27221)
This PR fixes some AIR release tests that deal with tensor columns.
2022-07-28 14:34:11 -07:00
Simon Mo
8beb887bbe
[Serve] Remove release tests for checkpoint_path (#27194)
2022-07-28 12:30:30 -07:00
Kai Fricke
3cd9a0446b
[tune/rllib/release] Load correct metadata file in rllib cloud tests (#27164)
Currently this tries to load a stale metadata file that doesn't exist anymore after internal refactoring.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-28 15:51:09 +01:00
Kai Fricke
1d3c167bfe
[rllib/release] Fix rllib connect test with Tuner() API (#27155)
Currently failing because the Tune framework example does not return fitting results.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-28 11:08:02 +01:00
matthewdeng
0319dcd889
[air] fix xgboost_benchmark script by passing in args (#27146)
2022-07-27 19:08:15 -07:00
xwjiang2010
eb69c1ca28
[air] Add annotation for Tune module. (#27060)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-07-27 13:53:46 -07:00
Malinda
1d789aee63
[RLlib/Serve/Release tests] Few code refactoring for better use of efficient NumPy functions. (#26284)
2022-07-27 22:38:35 +02:00
Simon Mo
e5a8b1dd55
[Serve] Add API Annotations And Move to _private (#27058)
2022-07-27 09:08:26 -07:00
SangBin Cho
a6fe2c1e87
[Release test] Add a memory monitor to nightly test long running actor death (#27083)
Add a memory monitor to the nightly long-running actor death test. It will be used to detect memory leaks in the test.
2022-07-27 07:32:10 -07:00
Amog Kamsetty
862d10c162
[AIR] Remove ML code from ray.util (#27005)
Removes all ML related code from `ray.util`

Removes:
- `ray.util.xgboost`
- `ray.util.lightgbm`
- `ray.util.horovod`
- `ray.util.ray_lightning`

Moves `ray.util.ml_utils` to other locations

Closes #23900

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-07-27 14:24:19 +01:00
xwjiang2010
4c30325172
[air] update xgboost test (catch test failures properly). (#27023)
- Update xgboost test (catch test failures properly)
- Remove `path` from `from_model` for XGBoostCheckpoint and LightGbmCheckpoint.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
2022-07-27 12:18:51 +01:00
Kai Fricke
ce5c5d858b
[ci/release/RLlib] Fix IMPALA long running release test. (#27086)
2022-07-27 12:38:32 +02:00
Avnish Narayan
f5a9a44b9c
[RLlib] Revert Revert Fix apex long running test (#26928)
2022-07-26 15:10:25 -07:00
Balaji Veeramani
89f7f2a567
[Datasets] Add size parameter to ImageFolderDatasource (#26975)
If you read a folder with differently-sized images, `ImageFolderDatasource` errors. This PR fixes the issue by resizing images to a user-specified size.
2022-07-26 14:57:38 -07:00
matthewdeng
1bb7651e95
[air] add smoke-test flag to tensorflow_benchmark (#26999)
Increase ratio from 1.15 to 1.2

Signed-off-by: Matthew Deng <matt@anyscale.com>
2022-07-26 15:47:37 +01:00
Sihan Wang
8ecd928c34
[Serve] Make the checkpoint and recover only from GCS (#26753)
2022-07-25 14:24:53 -07:00
Chen Shen
acbab51d3e
[Nightly] fix microbenchmark scripts (#26947)
Signed-off-by: scv119 <scv119@gmail.com>

Why are these changes needed?
The microbenchmarks failed with:

    raise ValueError(f"Malformed address: {address}")
ValueError: Malformed address: 

This is due to 55a0f7b; the fix is to set RAY_ADDRESS="local".
2022-07-24 14:16:43 -07:00
Avnish Narayan
a50a81a13a
Revert "[RLlib] Fix apex breakout release test performance. (#26867)" (#26927)
2022-07-23 17:27:50 +02:00
Avnish Narayan
2cfd6c2e97
[RLlib] Fix apex breakout release test performance. (#26867)
2022-07-23 13:53:03 +02:00
Richard Liaw
96e8027c7e
[air] large tune/torch benchmark (#26763)
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-07-23 01:17:25 -07:00
Jiao
840b0478aa
[AIR CUJ] Add wait_for_nodes for 4x4 gpu test
2022-07-22 16:04:54 -07:00
Steven Morad
259429bdc3
Bump gym dep to 0.24 (#26190)
Co-authored-by: Steven Morad <smorad@anyscale.com>
Co-authored-by: Avnish <avnishnarayan@gmail.com>
Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>
2022-07-22 12:37:16 -07:00
Avnish Narayan
82395c4646
[RLlib] Put learning test into own folders (#26862)
Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>
2022-07-22 11:20:47 -07:00
Avnish Narayan
67c0a69643
[Rllib] Fix broken cluster env launcher gym pinning (#26865)
2022-07-21 20:45:16 -07:00
matthewdeng
14e2b2548c
[air] update remaining dict scaling_configs (#26856)
2022-07-21 18:55:21 -07:00
Balaji Veeramani
ac1d21027d
[AIR] Add framework-specific checkpoints (#26777)
2022-07-20 19:33:27 -07:00
Archit Kulkarni
e043f49957
[Serve] [CI] Increase instance size and add debug log for autoscaling_multi_deployment release test (#26732)
2022-07-20 16:13:36 -07:00
Kai Fricke
2e35d47bd2
[air/train/benchmark] Add TF GPU 4x4 benchmark (#26776)
2022-07-20 14:07:51 -07:00
Avnish Narayan
5433c11650
[RLlib] Pin gym to 0.23.1 (#26752)
2022-07-20 11:49:01 -07:00