Benchmarks
==========

Below we document key performance benchmarks for common AIR tasks and workflows.
Bulk Ingest
-----------

This task uses the DummyTrainer module to ingest 200 GiB of synthetic data.

We test the performance across different cluster sizes.

- `Bulk Ingest Script`_
- `Bulk Ingest Cluster Configuration`_

For this benchmark, we configured the nodes to have a reasonable disk size and throughput to account for object spilling.
.. code-block:: yaml

    aws:
        BlockDeviceMappings:
            - DeviceName: /dev/sda1
              Ebs:
                  Iops: 5000
                  Throughput: 1000
                  VolumeSize: 1000
                  VolumeType: gp3

.. list-table::

   * - **Cluster Setup**
     - **Performance**
     - **Disk Spill**
     - **Command**
   * - 1 m5.4xlarge node (1 actor)
     - 390 s (0.51 GiB/s)
     - 205 GiB
     - `python data_benchmark.py --dataset-size-gb=200 --num-workers=1`
   * - 5 m5.4xlarge nodes (5 actors)
     - 70 s (2.85 GiB/s)
     - 206 GiB
     - `python data_benchmark.py --dataset-size-gb=200 --num-workers=5`
   * - 20 m5.4xlarge nodes (20 actors)
     - 3.8 s (52.6 GiB/s)
     - 0 GiB
     - `python data_benchmark.py --dataset-size-gb=200 --num-workers=20`

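To give a sense of what the benchmark runs, the sketch below drives an ingest-only trainer over a synthetic tensor dataset. It assumes the DummyTrainer utility used by the linked `Bulk Ingest Script`_ is importable as ``ray.air.util.check_ingest.DummyTrainer`` (the module path and constructor arguments may differ across Ray versions), and the dataset here is scaled down from the 200 GiB used in the benchmark.

.. code-block:: python

    import ray
    from ray.air.config import ScalingConfig
    from ray.air.util.check_ingest import DummyTrainer  # assumed module path

    # Synthetic dataset of fixed-size tensor rows, much smaller than the
    # 200 GiB used in the benchmark.
    ds = ray.data.range_tensor(5_000, shape=(80, 80, 4), parallelism=100)

    # DummyTrainer only iterates over its dataset shard, so the run time
    # measures ingest throughput rather than model computation.
    trainer = DummyTrainer(
        scaling_config=ScalingConfig(num_workers=1),
        datasets={"train": ds},
    )
    trainer.fit()
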
XGBoost Batch Prediction
------------------------

This task uses the BatchPredictor module to process different amounts of data
using an XGBoost model.

We test the performance across different cluster sizes and data sizes.

- `XGBoost Prediction Script`_
- `XGBoost Cluster Configuration`_

.. TODO: Add script for generating data and running the benchmark.
.. list-table::

   * - **Cluster Setup**
     - **Data Size**
     - **Performance**
     - **Command**
   * - 1 m5.4xlarge node (1 actor)
     - 10 GB (26M rows)
     - 275 s (94.5k rows/s)
     - `python xgboost_benchmark.py --size 10GB`
   * - 10 m5.4xlarge nodes (10 actors)
     - 100 GB (260M rows)
     - 331 s (786k rows/s)
     - `python xgboost_benchmark.py --size 100GB`

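The benchmark logic lives in the linked `XGBoost Prediction Script`_; a minimal sketch of driving batch prediction with an XGBoost model through ``BatchPredictor`` looks roughly like the following. The checkpoint directory and dataset path are placeholders.

.. code-block:: python

    import ray
    from ray.air.checkpoint import Checkpoint
    from ray.train.batch_predictor import BatchPredictor
    from ray.train.xgboost import XGBoostPredictor

    # Tabular dataset to score; the benchmark generates synthetic data instead.
    ds = ray.data.read_parquet("s3://my-bucket/features/")  # placeholder path

    # Checkpoint produced by a previous XGBoostTrainer run.
    checkpoint = Checkpoint.from_directory("/tmp/xgboost_checkpoint")  # placeholder path

    predictor = BatchPredictor.from_checkpoint(checkpoint, XGBoostPredictor)
    predictions = predictor.predict(ds)
    predictions.show(5)
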
.. _xgboost-benchmark:

XGBoost training
----------------

This task uses the XGBoostTrainer module to train on different sizes of data
with different amounts of parallelism.

XGBoost parameters were kept at the defaults of xgboost==1.6.1 for this task.

- `XGBoost Training Script`_
- `XGBoost Cluster Configuration`_
.. list-table::

   * - **Cluster Setup**
     - **Data Size**
     - **Performance**
     - **Command**
   * - 1 m5.4xlarge node (1 actor)
     - 10 GB (26M rows)
     - 692 s
     - `python xgboost_benchmark.py --size 10GB`
   * - 10 m5.4xlarge nodes (10 actors)
     - 100 GB (260M rows)
     - 693 s
     - `python xgboost_benchmark.py --size 100GB`

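For reference, a distributed training run with ``XGBoostTrainer`` is set up roughly as below; the benchmark's exact data generation and parameters are in the linked `XGBoost Training Script`_. The dataset path and label column here are placeholders.

.. code-block:: python

    import ray
    from ray.air.config import ScalingConfig
    from ray.train.xgboost import XGBoostTrainer

    # Tabular training data with a "target" label column (placeholder path).
    train_ds = ray.data.read_parquet("s3://my-bucket/train/")

    trainer = XGBoostTrainer(
        scaling_config=ScalingConfig(num_workers=10),
        label_column="target",
        params={"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
        datasets={"train": train_ds},
    )
    result = trainer.fit()
    print(result.metrics)
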
GPU image batch prediction
--------------------------

This task uses the BatchPredictor module to process different amounts of data
using a Pytorch pre-trained ResNet model.

We test the performance across different cluster sizes and data sizes.

- `GPU image batch prediction script`_
- `GPU training small cluster configuration`_
- `GPU training large cluster configuration`_
.. list-table::

   * - **Cluster Setup**
     - **Data Size**
     - **Performance**
     - **Command**
   * - 1 g3.8xlarge node
     - 1 GB (1623 images)
     - 72.59 s (22.3 images/sec)
     - `python gpu_batch_prediction.py --data-size-gb=1`
   * - 1 g3.8xlarge node
     - 20 GB (32460 images)
     - 1213.48 s (26.76 images/sec)
     - `python gpu_batch_prediction.py --data-size-gb=20`
   * - 4 g3.16xlarge nodes
     - 100 GB (162300 images)
     - 885.98 s (183.19 images/sec)
     - `python gpu_batch_prediction.py --data-size-gb=100`

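A rough sketch of the pattern used by the linked `GPU image batch prediction script`_: wrap a pre-trained ResNet in an AIR checkpoint and score an image dataset with ``BatchPredictor`` on GPUs. The image path is a placeholder, ``ray.data.read_images`` is assumed to be available in your Ray version, and the real script also applies a resizing/normalization preprocessor, which is omitted here.

.. code-block:: python

    import ray
    from torchvision import models
    from ray.train.batch_predictor import BatchPredictor
    from ray.train.torch import TorchCheckpoint, TorchPredictor

    # Images to score (placeholder path); the benchmark reads synthetic image data.
    ds = ray.data.read_images("s3://my-bucket/images/")

    # Wrap a pre-trained ResNet in an AIR checkpoint.
    model = models.resnet50(pretrained=True)
    checkpoint = TorchCheckpoint.from_model(model)

    predictor = BatchPredictor.from_checkpoint(checkpoint, TorchPredictor)
    predictions = predictor.predict(ds, num_gpus_per_worker=1)
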
GPU image training
------------------

This task uses the TorchTrainer module to train on different amounts of data
using a Pytorch ResNet model.

We test the performance across different cluster sizes and data sizes.

- `GPU image training script`_
- `GPU training small cluster configuration`_
- `GPU training large cluster configuration`_

.. note::

    For multi-host distributed training on AWS, make sure the EC2 instances are in the same VPC
    and that all ports are open in the security group.

.. list-table::

   * - **Cluster Setup**
     - **Data Size**
     - **Performance**
     - **Command**
   * - 1 g3.8xlarge node (1 worker)
     - 1 GB (1623 images)
     - 79.76 s (2 epochs, 40.7 images/sec)
     - `python pytorch_training_e2e.py --data-size-gb=1`
   * - 1 g3.8xlarge node (1 worker)
     - 20 GB (32460 images)
     - 1388.33 s (2 epochs, 46.76 images/sec)
     - `python pytorch_training_e2e.py --data-size-gb=20`
   * - 4 g3.16xlarge nodes (16 workers)
     - 100 GB (162300 images)
     - 434.95 s (2 epochs, 746.29 images/sec)
     - `python pytorch_training_e2e.py --data-size-gb=100 --num-workers=16`

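The full training loop is in the linked `GPU image training script`_; the skeleton below shows the general shape of a ``TorchTrainer`` run over a Ray Dataset. The column names, model, and dataset path are placeholders, ``iter_torch_batches`` is assumed to be available in your Ray version, and details such as metrics and checkpointing are omitted.

.. code-block:: python

    import torch
    import torchvision
    import ray
    from ray.air import session
    from ray.air.config import ScalingConfig
    from ray.train.torch import TorchTrainer, get_device, prepare_model


    def train_loop_per_worker(config):
        # prepare_model moves the model to the right device and wraps it in DDP.
        model = prepare_model(torchvision.models.resnet18())
        loss_fn = torch.nn.CrossEntropyLoss()
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        device = get_device()

        # Each worker receives its own shard of the "train" dataset.
        shard = session.get_dataset_shard("train")
        for epoch in range(config["num_epochs"]):
            for batch in shard.iter_torch_batches(batch_size=32):
                images = batch["image"].to(device)
                labels = batch["label"].to(device)
                loss = loss_fn(model(images), labels)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            session.report({"epoch": epoch, "loss": loss.item()})


    # Placeholder dataset with "image" and "label" columns.
    train_ds = ray.data.read_parquet("s3://my-bucket/images/")

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"num_epochs": 2},
        scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
        datasets={"train": train_ds},
    )
    result = trainer.fit()
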
.. _pytorch-training-parity:

Pytorch Training Parity
-----------------------

This task checks the performance parity between native Pytorch Distributed and
Ray Train's distributed TorchTrainer.

We demonstrate that the performance is similar (within 10%) between the two frameworks.
Performance may vary greatly across different models, hardware, and cluster configurations.

- `Pytorch comparison training script`_
- `Pytorch comparison CPU cluster configuration`_
- `Pytorch comparison GPU cluster configuration`_
.. list-table::

   * - **Cluster Setup**
     - **Dataset**
     - **Performance**
     - **Command**
   * - 4 m5.2xlarge nodes (4 workers)
     - FashionMNIST
     - 201.17 s (vs 195.90 s Pytorch)
     - `python workloads/torch_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 4 --cpus-per-worker 8`
   * - 4 m5.2xlarge nodes (16 workers)
     - FashionMNIST
     - 447.14 s (vs 461.75 s Pytorch)
     - `python workloads/torch_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 16 --cpus-per-worker 2`
   * - 4 g4dn.12xlarge nodes (16 workers)
     - FashionMNIST
     - 236.61 s (vs 220.97 s Pytorch)
     - `python workloads/torch_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 16 --cpus-per-worker 4 --use-gpu`

.. _tf-training-parity:

Tensorflow Training Parity
--------------------------

This task checks the performance parity between native Tensorflow Distributed and
Ray Train's distributed TensorflowTrainer.

We demonstrate that the performance is similar (within 10%) between the two frameworks.
Performance may vary greatly across different models, hardware, and cluster configurations.

.. note:: The batch size and number of epochs are different for the GPU benchmark, resulting in a longer runtime.

- `Tensorflow comparison training script`_
- `Tensorflow comparison CPU cluster configuration`_
- `Tensorflow comparison GPU cluster configuration`_
.. list-table::

   * - **Cluster Setup**
     - **Dataset**
     - **Performance**
     - **Command**
   * - 4 m5.2xlarge nodes (4 workers)
     - FashionMNIST
     - 90.61 s (vs 81.26 s Tensorflow)
     - `python workloads/tensorflow_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 4 --cpus-per-worker 8`
   * - 4 m5.2xlarge nodes (16 workers)
     - FashionMNIST
     - 75.34 s (vs 69.51 s Tensorflow)
     - `python workloads/tensorflow_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 16 --cpus-per-worker 2`
   * - 4 g4dn.12xlarge nodes (16 workers)
     - FashionMNIST
     - 495.85 s (vs 479.28 s Tensorflow)
     - `python workloads/tensorflow_benchmark.py run --num-runs 3 --num-epochs 200 --num-workers 16 --cpus-per-worker 4 --batch-size 64 --use-gpu`

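For context on what "Ray Train's distributed TensorflowTrainer" refers to above, a minimal sketch is shown below; the benchmark itself is in the linked `Tensorflow comparison training script`_. The model and data handling here are simplified placeholders (notably, per-worker data sharding is omitted).

.. code-block:: python

    import tensorflow as tf
    from ray.air.config import ScalingConfig
    from ray.train.tensorflow import TensorflowTrainer


    def train_loop_per_worker(config):
        # Ray Train sets TF_CONFIG on each worker, so MultiWorkerMirroredStrategy
        # discovers the other workers automatically.
        strategy = tf.distribute.MultiWorkerMirroredStrategy()
        with strategy.scope():
            model = tf.keras.Sequential(
                [
                    tf.keras.layers.Flatten(input_shape=(28, 28)),
                    tf.keras.layers.Dense(128, activation="relu"),
                    tf.keras.layers.Dense(10),
                ]
            )
            model.compile(
                optimizer="adam",
                loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            )

        # Each worker loads the full dataset here for simplicity.
        (x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
        model.fit(x_train / 255.0, y_train, epochs=config["num_epochs"], batch_size=64)


    trainer = TensorflowTrainer(
        train_loop_per_worker,
        train_loop_config={"num_epochs": 2},
        scaling_config=ScalingConfig(num_workers=4),
    )
    result = trainer.fit()
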
.. _`Bulk Ingest Script`: https://github.com/ray-project/ray/blob/a30bdf9ef34a45f973b589993f7707a763df6ebf/release/air_tests/air_benchmarks/workloads/data_benchmark.py#L25-L40
.. _`Bulk Ingest Cluster Configuration`: https://github.com/ray-project/ray/blob/a30bdf9ef34a45f973b589993f7707a763df6ebf/release/air_tests/air_benchmarks/data_20_nodes.yaml#L6-L15
.. _`XGBoost Training Script`: https://github.com/ray-project/ray/blob/a241e6a0f5a630d6ed5b84cce30c51963834d15b/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py#L40-L58
.. _`XGBoost Prediction Script`: https://github.com/ray-project/ray/blob/a241e6a0f5a630d6ed5b84cce30c51963834d15b/release/air_tests/air_benchmarks/workloads/xgboost_benchmark.py#L63-L71
.. _`XGBoost Cluster Configuration`: https://github.com/ray-project/ray/blob/a241e6a0f5a630d6ed5b84cce30c51963834d15b/release/air_tests/air_benchmarks/xgboost_compute_tpl.yaml#L6-L24
.. _`GPU image batch prediction script`: https://github.com/ray-project/ray/blob/cec82a1ced631525a4d115e4dc0c283fa4275a7f/release/air_tests/air_benchmarks/workloads/gpu_batch_prediction.py#L18-L49
.. _`GPU image training script`: https://github.com/ray-project/ray/blob/cec82a1ced631525a4d115e4dc0c283fa4275a7f/release/air_tests/air_benchmarks/workloads/pytorch_training_e2e.py#L95-L106
.. _`GPU training small cluster configuration`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/compute_gpu_1.yaml#L6-L24
.. _`GPU training large cluster configuration`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/compute_gpu_16.yaml#L5-L25
.. _`Pytorch comparison training script`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/workloads/torch_benchmark.py
.. _`Pytorch comparison CPU cluster configuration`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/compute_cpu_4.yaml
.. _`Pytorch comparison GPU cluster configuration`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/compute_gpu_4x4.yaml
.. _`Tensorflow comparison training script`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/workloads/tensorflow_benchmark.py
.. _`Tensorflow comparison CPU cluster configuration`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/compute_cpu_4.yaml
.. _`Tensorflow comparison GPU cluster configuration`: https://github.com/ray-project/ray/blob/master/release/air_tests/air_benchmarks/compute_gpu_4x4.yaml