# Benchmarks

Every experiment is reproducible on Tesla P40 GPUs. Follow the link to the code for each benchmark.

## Transparency
### ResNet-101 Accuracy Benchmark

| Batch size | torchgpipe | nn.DataParallel | Goyal et al. |
|---|---|---|---|
| 256 | 21.99±0.13 | 22.02±0.11 | 22.08±0.06 |
| 1K | 22.24±0.19 | 22.04±0.24 | N/A |
| 4K | 22.13±0.09 | N/A | N/A |
GPipe should be transparent: it should not introduce additional hyperparameter tuning. To verify this transparency, we reproduced the top-1 error rate of ResNet-101 on ImageNet, as reported in Table 2(c) of Accurate, Large Minibatch SGD by Goyal et al.
The reproducible code and experiment details are available in benchmarks/resnet101-accuracy.
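The 1K and 4K batch sizes presumably follow the training recipe of the cited paper: Goyal et al. scale the learning rate linearly with the batch size and ramp it up gradually over the first five epochs. A minimal sketch of that rule (the paper warms up per iteration; we interpolate per epoch here for brevity, and the exact schedule used by the benchmark lives in benchmarks/resnet101-accuracy):

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule: multiply the base learning rate by batch_size / 256."""
    return base_lr * batch_size / base_batch

def warmup_lr(epoch, batch_size, warmup_epochs=5, base_lr=0.1, base_batch=256):
    """Gradual warmup: ramp from base_lr to the scaled rate over the first epochs."""
    target = scaled_lr(batch_size, base_lr, base_batch)
    if epoch >= warmup_epochs:
        return target
    # linear interpolation from base_lr toward the scaled target
    return base_lr + (target - base_lr) * epoch / warmup_epochs
```

For example, a 4K batch starts at the base rate 0.1 and reaches `scaled_lr(4096) == 1.6` after warmup.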
## Memory

### U-Net (B, C) Memory Benchmark

| Experiment | U-Net (B, C) | Parameters | Memory usage |
|---|---|---|---|
| baseline | (6, 72) | 362.2M | 20.3 GiB |
| pipeline-1 | (11, 128) | 2.21B | 20.5 GiB |
| pipeline-2 | (24, 128) | 4.99B | 43.4 GiB |
| pipeline-4 | (24, 160) | 7.80B | 79.1 GiB |
| pipeline-8 | (48, 160) | 15.82B | 154.1 GiB |
The table shows how GPipe facilitates scaling U-Net models. baseline denotes training without pipeline parallelism or checkpointing, and pipeline-1, -2, -4, and -8 denote training with GPipe using the corresponding number of partitions.
Here we used a simplified U-Net architecture. The size of the model is determined by the hyperparameters B and C, which are proportional to the number of layers and the number of filters, respectively.
The reproducible code and experiment details are available in benchmarks/unet-memory.
### AmoebaNet-D (L, D) Memory Benchmark

| Experiment | baseline | pipeline-1 | pipeline-2 | pipeline-4 | pipeline-8 |
|---|---|---|---|---|---|
| AmoebaNet-D (L, D) | (18, 208) | (18, 416) | (18, 544) | (36, 544) | (72, 512) |
| **torchgpipe** | | | | | |
| Parameters | 81.5M | 319.0M | 542.7M | 1.06B | 1.84B |
| Model Memory | 0.91 GiB | 3.57 GiB | 6.07 GiB | 11.80 GiB | 20.62 GiB |
| Peak Activation Memory | Out of memory | 0.91 GiB | 3.39 GiB | 6.91 GiB | 10.83 GiB |
| **Huang et al.** | | | | | |
| Parameters | 82M | 318M | 542M | 1.05B | 1.8B |
| Model Memory | 1.05 GB | 3.8 GB | 6.45 GB | 12.53 GB | 24.62 GB |
| Peak Activation Memory | 6.26 GB | 3.46 GB | 8.11 GB | 15.21 GB | 26.24 GB |
The table shows the improved memory utilization of AmoebaNet-D with GPipe, as reported in Table 1 of GPipe by Huang et al. The size of an AmoebaNet-D model is determined by the two hyperparameters L and D, which are proportional to the number of layers and the number of filters, respectively.
We reproduced the same settings as in the paper, regardless of the memory capacity of Tesla P40 GPUs. The reproducible code and experiment details are available in benchmarks/amoebanetd-memory.
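As a sanity check, the torchgpipe Model Memory figures are consistent with a simple accounting of 12 bytes per parameter in fp32: 4 for the weight, 4 for its gradient, and 4 for an SGD momentum buffer. The parameter counts below come from the table; the 12-bytes-per-parameter breakdown is our assumption, not something the benchmark states:

```python
def model_memory_gib(params, bytes_per_param=12):
    # fp32 weight (4 B) + gradient (4 B) + SGD momentum buffer (4 B) -- assumed
    return params * bytes_per_param / 2**30

# Parameters -> reported Model Memory (GiB), from the torchgpipe rows above
reported = {81.5e6: 0.91, 319.0e6: 3.57, 542.7e6: 6.07, 1.06e9: 11.80, 1.84e9: 20.62}
for params, gib in reported.items():
    # parameter counts are rounded in the table, so allow a small tolerance
    assert abs(model_memory_gib(params) - gib) < 0.1
```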
## Speed

### U-Net (5, 64) Speed Benchmark

| Experiment | Throughput | Speed up |
|---|---|---|
| baseline | 28.500/s | 1× |
| pipeline-1 | 24.456/s | 0.858× |
| pipeline-2 | 35.502/s | 1.246× |
| pipeline-4 | 67.042/s | 2.352× |
| pipeline-8 | 88.497/s | 3.105× |
To verify efficiency in the presence of skip connections, we measured the throughput of U-Net with various numbers of devices. We chose U-Net since it has several long skip connections.
The reproducible code and experiment details are available in benchmarks/unet-speed.
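The Speed up column is each configuration's throughput divided by the baseline throughput, rounded to three decimals; a quick check against the table:

```python
baseline = 28.500  # samples/s, U-Net (5, 64) without GPipe

throughputs = {
    'pipeline-1': 24.456,
    'pipeline-2': 35.502,
    'pipeline-4': 67.042,
    'pipeline-8': 88.497,
}

# speedup = throughput / baseline throughput
speedups = {name: round(t / baseline, 3) for name, t in throughputs.items()}
assert speedups == {'pipeline-1': 0.858, 'pipeline-2': 1.246,
                    'pipeline-4': 2.352, 'pipeline-8': 3.105}
```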
### AmoebaNet-D (18, 256) Speed Benchmark

| Experiment | Throughput | Speed up (torchgpipe) | Speed up (Huang et al.) |
|---|---|---|---|
| n=2, m=1 | 26.733/s | 1× | 1× |
| n=2, m=4 | 41.133/s | 1.539× | 1.07× |
| n=2, m=32 | 47.386/s | 1.773× | 1.21× |
| n=4, m=1 | 26.827/s | 1.004× | 1.13× |
| n=4, m=4 | 44.543/s | 1.666× | 1.26× |
| n=4, m=32 | 72.412/s | 2.709× | 1.84× |
| n=8, m=1 | 24.918/s | 0.932× | 1.38× |
| n=8, m=4 | 70.065/s | 2.621× | 1.72× |
| n=8, m=32 | 132.413/s | 4.953× | 3.48× |
The table shows the reproduced speed benchmark on AmoebaNet-D (18, 256), as reported in Table 2 of GPipe by Huang et al. Here n denotes the number of partitions and m the number of micro-batches; note that we write n for what the paper calls K.
The reproducible code and experiment details are available in benchmarks/amoebanetd-speed.
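The trend along the n and m axes matches the usual pipeline-bubble argument: with m micro-batches flowing through n partitions, the pipeline is busy for m + n − 1 steps instead of the ideal m, so utilization is bounded by m / (m + n − 1). A minimal sketch of this idealized model, which ignores communication and checkpointing recompute costs (which is why the measured speedups above fall short of it):

```python
def ideal_utilization(n, m):
    """Fraction of time each of n partitions is busy while m micro-batches
    traverse an idealized GPipe schedule (no communication or recompute cost)."""
    return m / (m + n - 1)

def ideal_speedup(n, m):
    """Best-case speedup over one device doing all n partitions' work serially."""
    return n * ideal_utilization(n, m)

# More micro-batches shrink the bubble: with n=8 partitions, going from
# m=1 to m=32 raises the ceiling from 1x toward the perfect 8x.
assert ideal_speedup(8, 1) == 1.0
assert round(ideal_speedup(8, 32), 2) == 6.56
```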
### ResNet-101 Speed Benchmark

| Experiment | Throughput | Speed up (torchgpipe) | Speed up (Huang et al.) |
|---|---|---|---|
| baseline | 95.862/s | 1× | 1× |
| pipeline-1 | 81.796/s | 0.853× | 0.80× |
| pipeline-2 | 135.539/s | 1.414× | 1.42× |
| pipeline-4 | 265.958/s | 2.774× | 2.18× |
| pipeline-8 | 411.662/s | 4.294× | 2.89× |
The table shows the reproduced speed benchmark on ResNet-101, as reported in Figure 3(b) of the fourth version of GPipe by Huang et al.
The reproducible code and experiment details are available in benchmarks/resnet101-speed.