Benchmarks
==========

Every experiment is reproducible on Tesla P40 GPUs. Follow the links to the
code for each benchmark.

Transparency
~~~~~~~~~~~~

ResNet-101 Accuracy Benchmark
-----------------------------

========== ========== =============== ============
Batch size torchgpipe nn.DataParallel Goyal et al.
========== ========== =============== ============
256        21.99±0.13 22.02±0.11      22.08±0.06
1K         22.24±0.19 22.04±0.24      N/A
4K         22.13±0.09 N/A             N/A
========== ========== =============== ============

GPipe should be transparent: it must not introduce any additional
hyperparameter tuning. To verify this transparency, we reproduced the top-1
error rate of ResNet-101 on ImageNet, as reported in Table 2(c) of `Accurate,
Large Minibatch SGD <https://arxiv.org/abs/1706.02677>`_ by Goyal et al.

The reproducible code and experiment details are available in
`benchmarks/resnet101-accuracy`_.

.. _benchmarks/resnet101-accuracy: https://github.com/kakaobrain/torchgpipe/tree/master/benchmarks/resnet101-accuracy

Memory
~~~~~~

U-Net (B, C) Memory Benchmark
-----------------------------

.. table::
   :widths: 4,4,4,4

   ========== ============ ========== ============
   Experiment U-Net (B, C) Parameters Memory usage
   ========== ============ ========== ============
   baseline   (6, 72)      362.2M     20.3 GiB
   pipeline-1 (11, 128)    2.21B      20.5 GiB
   pipeline-2 (24, 128)    4.99B      43.4 GiB
   pipeline-4 (24, 160)    7.80B      79.1 GiB
   pipeline-8 (48, 160)    15.82B     154.1 GiB
   ========== ============ ========== ============

The table shows how GPipe facilitates scaling U-Net models. `baseline`
denotes the configuration without pipeline parallelism or checkpointing, and
`pipeline-1`, `-2`, `-4`, `-8` denote that the model is trained with GPipe
using the corresponding number of partitions.

Here we used a simplified U-Net architecture. The size of a model is
determined by the hyperparameters `B` and `C`, which are proportional to the
number of layers and filters, respectively.

The reproducible code and experiment details are available in
`benchmarks/unet-memory`_.

.. _benchmarks/unet-memory: https://github.com/kakaobrain/torchgpipe/tree/master/benchmarks/unet-memory

AmoebaNet-D (L, D) Memory Benchmark
-----------------------------------

====================== ============= ========== ========== ========== ==========
Experiment             baseline      pipeline-1 pipeline-2 pipeline-4 pipeline-8
====================== ============= ========== ========== ========== ==========
AmoebaNet-D (L, D)     (18, 208)     (18, 416)  (18, 544)  (36, 544)  (72, 512)
**torchgpipe**
--------------------------------------------------------------------------------
Parameters             81.5M         319.0M     542.7M     1.06B      1.84B
Model Memory           0.91 GiB      3.57 GiB   6.07 GiB   11.80 GiB  20.62 GiB
Peak Activation Memory Out of memory 0.91 GiB   3.39 GiB   6.91 GiB   10.83 GiB
**Huang et al.**
--------------------------------------------------------------------------------
Parameters             82M           318M       542M       1.05B      1.8B
Model Memory           1.05 GB       3.8 GB     6.45 GB    12.53 GB   24.62 GB
Peak Activation Memory 6.26 GB       3.46 GB    8.11 GB    15.21 GB   26.24 GB
====================== ============= ========== ========== ========== ==========

The table shows the improved memory utilization of AmoebaNet-D with GPipe, as
reported in Table 1 of `GPipe <https://arxiv.org/abs/1811.06965>`_ by Huang
et al. The size of an AmoebaNet-D model is determined by the two
hyperparameters `L` and `D`, which are proportional to the number of layers
and filters, respectively.

We reproduced the same settings as in the paper regardless of the memory
capacity of Tesla P40 GPUs; this is why the baseline, which uses neither
pipeline parallelism nor checkpointing, runs out of memory.

The reproducible code and experiment details are available in
`benchmarks/amoebanetd-memory`_.

.. _benchmarks/amoebanetd-memory: https://github.com/kakaobrain/torchgpipe/tree/master/benchmarks/amoebanetd-memory
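For concreteness, each `pipeline-N` experiment above wraps the model with
`torchgpipe.GPipe` using `N` partitions. The sketch below shows that setup in
miniature, assuming two GPUs are available; the toy `nn.Sequential` model,
the layer sizes, and the `balance` are illustrative placeholders for the real
U-Net and AmoebaNet-D configurations, not the benchmark code itself.

.. code-block:: python

   import torch
   from torch import nn

   from torchgpipe import GPipe

   # A toy nn.Sequential stands in for the real U-Net or AmoebaNet-D
   # models; GPipe requires an nn.Sequential so that it can slice the
   # model into partitions.
   model = nn.Sequential(
       nn.Linear(1024, 4096), nn.ReLU(),
       nn.Linear(4096, 4096), nn.ReLU(),
       nn.Linear(4096, 1024),
   )

   # "pipeline-2": split the 5 layers into 2 partitions on 2 GPUs.
   # 'chunks' is the number of micro-batches, and 'except_last' (the
   # default) recomputes activations during the backward pass for every
   # micro-batch except the last, trading compute for memory.
   model = GPipe(model,
                 balance=[2, 3],            # layers per partition
                 chunks=8,                  # micro-batches per mini-batch
                 checkpoint='except_last')  # activation checkpointing mode

   # The input must live on the first device; the output arrives on the
   # last device of the pipeline.
   input = torch.rand(64, 1024, device=model.devices[0])
   output = model(input)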
Speed
~~~~~

U-Net (5, 64) Speed Benchmark
-----------------------------

========== ========== ========
Experiment Throughput Speed up
========== ========== ========
baseline   28.500/s   1×
pipeline-1 24.456/s   0.858×
pipeline-2 35.502/s   1.246×
pipeline-4 67.042/s   2.352×
pipeline-8 88.497/s   3.105×
========== ========== ========

To verify efficiency in the presence of skip connections, we measured the
throughput of U-Net with various numbers of devices. We chose U-Net because
it has several long skip connections.

The reproducible code and experiment details are available in
`benchmarks/unet-speed`_.

.. _benchmarks/unet-speed: https://github.com/kakaobrain/torchgpipe/tree/master/benchmarks/unet-speed

AmoebaNet-D (18, 256) Speed Benchmark
-------------------------------------

.. table:: (`n`: number of partitions, `m`: number of micro-batches)

   ========== ========== ========== ============
   Experiment Throughput torchgpipe Huang et al.
   ========== ========== ========== ============
   n=2, m=1   26.733/s   1×         1×
   n=2, m=4   41.133/s   1.539×     1.07×
   n=2, m=32  47.386/s   1.773×     1.21×
   n=4, m=1   26.827/s   1.004×     1.13×
   n=4, m=4   44.543/s   1.666×     1.26×
   n=4, m=32  72.412/s   2.709×     1.84×
   n=8, m=1   24.918/s   0.932×     1.38×
   n=8, m=4   70.065/s   2.621×     1.72×
   n=8, m=32  132.413/s  4.953×     3.48×
   ========== ========== ========== ============

The table shows the reproduced speed benchmark on AmoebaNet-D (18, 256), as
reported in Table 2 of `GPipe <https://arxiv.org/abs/1811.06965>`_ by Huang
et al. Note that we replaced `K` in the paper with `n`. The `torchgpipe` and
`Huang et al.` columns report the speed-up relative to the `n=2, m=1`
baseline.

The reproducible code and experiment details are available in
`benchmarks/amoebanetd-speed`_.

.. _benchmarks/amoebanetd-speed: https://github.com/kakaobrain/torchgpipe/tree/master/benchmarks/amoebanetd-speed

ResNet-101 Speed Benchmark
--------------------------

========== ========== ========== ============
Experiment Throughput torchgpipe Huang et al.
========== ========== ========== ============
baseline   95.862/s   1×         1×
pipeline-1 81.796/s   0.853×     0.80×
pipeline-2 135.539/s  1.414×     1.42×
pipeline-4 265.958/s  2.774×     2.18×
pipeline-8 411.662/s  4.294×     2.89×
========== ========== ========== ============

The table shows the reproduced speed benchmark on ResNet-101, as reported in
Figure 3(b) of `the fourth version <https://arxiv.org/abs/1811.06965v4>`_ of
GPipe by Huang et al.

The reproducible code and experiment details are available in
`benchmarks/resnet101-speed`_.

.. _benchmarks/resnet101-speed: https://github.com/kakaobrain/torchgpipe/tree/master/benchmarks/resnet101-speed
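In terms of the torchgpipe API, `n` above corresponds to the number of
partitions (the length of the `balance` argument) and `m` to the `chunks`
argument. The following is a rough, illustrative sketch of how the throughput
of one configuration might be measured; the stand-in model, balance, batch
size, and step count are arbitrary assumptions (at least 4 GPUs are assumed),
and the actual measurement scripts live in the linked benchmark directories.

.. code-block:: python

   import time

   import torch
   from torch import nn

   from torchgpipe import GPipe

   # Illustrative stand-in model instead of AmoebaNet-D (18, 256).
   model = nn.Sequential(*[nn.Linear(512, 512) for _ in range(8)])

   # "n=4, m=32" in the table above: 4 partitions and 32 micro-batches.
   model = GPipe(model, balance=[2, 2, 2, 2], chunks=32)

   batch = torch.rand(128, 512, device=model.devices[0])

   # Synchronize every device before and after timing so that
   # asynchronous CUDA kernels are fully accounted for.
   for device in model.devices:
       torch.cuda.synchronize(device)
   tick = time.time()

   steps = 10
   for _ in range(steps):
       output = model(batch)
       # Include the backward pass in the measurement; gradients simply
       # accumulate here since no optimizer step is needed for timing.
       output.mean().backward()

   for device in model.devices:
       torch.cuda.synchronize(device)
   tock = time.time()

   # Throughput in samples per second, as in the tables above.
   throughput = steps * batch.size(0) / (tock - tick)
   print('%.3f/s' % throughput)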