Commit graph

19 commits

Author · SHA1 · Message · Date
Richard Liaw
96e8027c7e
[air] large tune/torch benchmark (#26763)
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-07-23 01:17:25 -07:00
Balaji Veeramani
ac1d21027d
[AIR] Add framework-specific checkpoints (#26777) 2022-07-20 19:33:27 -07:00
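
A minimal sketch of how a framework-specific checkpoint like the ones added in #26777 might be used, assuming the Ray AIR API of this period; the `TorchCheckpoint.from_model`/`get_model` round trip shown here is illustrative, not taken from the commit itself.

```python
# Sketch (assumed API): create a framework-specific checkpoint from a live
# model and restore the model from it later, instead of hand-packing a
# generic checkpoint.
import torch.nn as nn
from ray.train.torch import TorchCheckpoint  # assumed import path

model = nn.Linear(4, 1)  # placeholder model

# Wrap the model in a Torch-aware checkpoint.
checkpoint = TorchCheckpoint.from_model(model)

# Later (e.g. for serving or batch prediction), recover the model object.
restored_model = checkpoint.get_model()
```
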
Kai Fricke
2e35d47bd2
[air/train/benchmark] Add TF GPU 4x4 benchmark (#26776) 2022-07-20 14:07:51 -07:00
matthewdeng
2a425b195c
[air] change default strategy to PACK (#26757) 2022-07-19 23:01:24 -07:00
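
The default placement strategy referenced above is controlled through the AIR `ScalingConfig`; a minimal sketch, assuming the `placement_strategy` field of that era, of relying on the PACK default versus explicitly requesting SPREAD.

```python
# Sketch (assumed API): placement strategy on the AIR ScalingConfig.
from ray.air.config import ScalingConfig  # assumed import path

# After this change, omitting placement_strategy is equivalent to PACK.
packed = ScalingConfig(num_workers=4)

# Benchmarks that want workers spread across nodes can still ask for it.
spread = ScalingConfig(num_workers=4, placement_strategy="SPREAD")
```
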
xwjiang2010
75027eb479
[air/benchmarks] train/tune benchmark (#26564)
Makes sure that tuning multiple trials in parallel is not significantly slower than training each individual trial.
Some overhead is expected.

Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Signed-off-by: Kai Fricke <kai@anyscale.com>

Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-07-19 18:24:39 +01:00
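
A minimal sketch of the tune-versus-train comparison described in #26564 above, assuming the Ray AIR `TorchTrainer`/`Tuner` APIs of this period; the training loop, worker counts, and sample count are placeholders, not the actual benchmark code.

```python
# Sketch (assumed API): run the same trainer once directly and then as
# several parallel trials under Tune, comparing end-to-end wall time.
import time

from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune import TuneConfig, Tuner


def train_loop_per_worker():
    # Placeholder standing in for the real benchmark workload.
    time.sleep(1)
    session.report({"loss": 0.0})


trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2),
)

# Single training run.
start = time.monotonic()
trainer.fit()
train_time = time.monotonic() - start

# The same trainer tuned with several trials in parallel.
start = time.monotonic()
Tuner(trainer, tune_config=TuneConfig(num_samples=4)).fit()
tune_time = time.monotonic() - start

print(f"train: {train_time:.1f}s, tune (4 trials): {tune_time:.1f}s")
```
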
Richard Liaw
7e62e1187c
[air/benchmark] Torch benchmarks for 4x4 (#26692)
Add benchmark data for a 4x4 GPU setup.

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

Co-authored-by: Jimmy Yao <jiahaoyao.math@gmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
2022-07-19 17:06:37 +01:00
Sumanth Ratna
759966781f
[air] Allow users to use instances of ScalingConfig (#25712)
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-07-18 15:46:58 -07:00
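
A minimal sketch of what #25712 enables, assuming the Ray AIR API of this era: passing a `ScalingConfig` instance to a trainer rather than a plain dict.

```python
# Sketch (assumed API): a ScalingConfig instance passed directly to a Trainer.
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    pass  # placeholder training loop


# Previously expressed as a dict, e.g. {"num_workers": 4, "use_gpu": False}.
scaling_config = ScalingConfig(num_workers=4, use_gpu=False)

trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=scaling_config,
)
```
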
Kai Fricke
00947fd949
[air/benchmarks] Add 4x1 GPU benchmark for Torch (#26562) 2022-07-18 12:14:10 -07:00
matthewdeng
6670708010
[air] add placement group max CPU to data benchmark (#26649)
Set experimental `_max_cpu_fraction_per_node` to prevent deadlock.

This should technically be a no-op with the SPREAD strategy.
2022-07-18 10:34:40 -07:00
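
A minimal sketch of what setting the experimental flag mentioned above might look like; `_max_cpu_fraction_per_node` as a `ScalingConfig` keyword is an assumption about the API of this era.

```python
# Sketch (assumed API): cap the fraction of each node's CPUs the trainer's
# placement group may reserve, leaving headroom for Datasets tasks so the
# ingest benchmark cannot deadlock on CPU starvation.
from ray.air.config import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=4,
    _max_cpu_fraction_per_node=0.8,  # experimental, assumed keyword
)
```
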
Jiao
98a07920d3
[AIR][CUJ] Make distributed training benchmark at silver tier (#26640) 2022-07-17 22:07:09 -07:00
Jiao
77e2ef2eb6
[AIR] Update Torch benchmarks with documentation (#26631)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-07-16 17:58:21 -07:00
Eric Liang
0855bcb77e
[air] Use SPREAD strategy by default and don't special case it in benchmarks (#26633) 2022-07-16 17:37:06 -07:00
Jiao
196e52ad7c
[AIR][CUJ] E2E Pytorch training (#26621) 2022-07-16 08:23:19 -07:00
Jiao
988ffd494b
[AIR][CUJ] Add GPU bench prediction benchmark (#26614) 2022-07-16 08:22:37 -07:00
matthewdeng
e3a096f412
[air] add bulk ingest benchmarks (#26618) 2022-07-15 22:01:23 -07:00
Richard Liaw
5ad4e75831
[air] Add initial benchmark section (#26608) 2022-07-15 15:33:48 -07:00
xwjiang2010
a241e6a0f5
[air] Add xgboost release test for silver tier (10-node case). (#26460)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-07-15 13:21:10 -07:00
Kai Fricke
213a96e239
[air/benchmarks] Add distributed Tensorflow benchmarks (CPU only) (#26519)
Following up from #26436, this PR adds a distributed benchmark test for TensorFlow FashionMNIST training. It compares training with Ray AIR to training with vanilla TensorFlow.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-14 22:08:43 +01:00
Kai Fricke
cf75cf7232
[air] Add AIR distributed training benchmark for Torch FashionMNIST (#26436)
This PR adds a distributed benchmark test for PyTorch FashionMNIST training. It compares training with Ray AIR to training with vanilla PyTorch.

In both cases, the same training loop is used. For Ray AIR, we use a TorchTrainer with 4 CPU workers. For vanilla PyTorch, we upload a training script and kick it off (using Ray tasks) in subprocesses on each node. In both setups, we collect the end-to-end runtime.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-13 10:53:24 +01:00
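
A minimal sketch of the Ray AIR side of the comparison described in #26436 above, assuming the `TorchTrainer` API of this period; the model and training loop are placeholders rather than the benchmark's actual FashionMNIST code.

```python
# Sketch (assumed API): the AIR side of the benchmark, a TorchTrainer with
# 4 CPU workers running a shared training loop, timed end to end.
import time

import torch.nn as nn
from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model


def train_loop_per_worker(config):
    # Placeholder for the shared FashionMNIST training loop.
    model = prepare_model(nn.Linear(28 * 28, 10))
    for epoch in range(config.get("epochs", 1)):
        # ... forward/backward passes over FashionMNIST batches go here ...
        session.report({"epoch": epoch})


trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"epochs": 1},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
)

start = time.monotonic()
trainer.fit()
print(f"Ray AIR end-to-end runtime: {time.monotonic() - start:.1f}s")
```
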