# Benchmarks

Every experiment is reproducible on Tesla P40 GPUs. Follow the link to the code for each benchmark.

## Transparency
### ResNet-101 Accuracy Benchmark

| Batch size | torchgpipe | nn.DataParallel | Goyal et al. |
|---|---|---|---|
| 256 | 21.99±0.13 | 22.02±0.11 | 22.08±0.06 |
| 1K | 22.24±0.19 | 22.04±0.24 | N/A |
| 4K | 22.13±0.09 | N/A | N/A |
GPipe should be transparent: it should not introduce additional hyperparameter tuning. To verify this transparency, we reproduced the top-1 error rate of ResNet-101 on ImageNet, as reported in Table 2(c) of Accurate, Large Minibatch SGD by Goyal et al.
The reproducible code and experiment details are available in benchmarks/resnet101-accuracy.
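The 1K and 4K batch sizes presumably follow the training recipe of the cited paper: Goyal et al. scale the learning rate linearly with the batch size and ramp it up gradually over the first five epochs. A minimal sketch of that rule (the paper warms up per iteration; we interpolate per epoch here for brevity, and the exact schedule used by the benchmark lives in benchmarks/resnet101-accuracy):

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch=256):
    """Linear scaling rule: multiply the base learning rate by batch_size / 256."""
    return base_lr * batch_size / base_batch

def warmup_lr(epoch, batch_size, warmup_epochs=5, base_lr=0.1, base_batch=256):
    """Gradual warmup: ramp from base_lr to the scaled rate over the first epochs."""
    target = scaled_lr(batch_size, base_lr, base_batch)
    if epoch >= warmup_epochs:
        return target
    # linear interpolation from base_lr toward the scaled target
    return base_lr + (target - base_lr) * epoch / warmup_epochs
```

For example, a 4K batch starts at the base rate 0.1 and reaches `scaled_lr(4096) == 1.6` after warmup.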
## Memory

### U-Net (B, C) Memory Benchmark

| Experiment | U-Net (B, C) | Parameters | Memory usage |
|---|---|---|---|
| baseline | (6, 72) | 362.2M | 20.3 GiB |
| pipeline-1 | (11, 128) | 2.21B | 20.5 GiB |
| pipeline-2 | (24, 128) | 4.99B | 43.4 GiB |
| pipeline-4 | (24, 160) | 7.80B | 79.1 GiB |
| pipeline-8 | (48, 160) | 15.82B | 154.1 GiB |
The table shows how GPipe facilitates scaling U-Net models. baseline denotes training without pipeline parallelism or checkpointing, and pipeline-1, -2, -4, and -8 denote training with GPipe using the corresponding number of partitions.
Here we used a simplified U-Net architecture. The size of the model is determined by the hyperparameters B and C, which are proportional to the number of layers and the number of filters, respectively.
The reproducible code and experiment details are available in benchmarks/unet-memory.
### AmoebaNet-D (L, D) Memory Benchmark

| Experiment | baseline | pipeline-1 | pipeline-2 | pipeline-4 | pipeline-8 |
|---|---|---|---|---|---|
| AmoebaNet-D (L, D) | (18, 208) | (18, 416) | (18, 544) | (36, 544) | (72, 512) |
| **torchgpipe** | | | | | |
| Parameters | 81.5M | 319.0M | 542.7M | 1.06B | 1.84B |
| Model Memory | 0.91 GiB | 3.57 GiB | 6.07 GiB | 11.80 GiB | 20.62 GiB |
| Peak Activation Memory | Out of memory | 0.91 GiB | 3.39 GiB | 6.91 GiB | 10.83 GiB |
| **Huang et al.** | | | | | |
| Parameters | 82M | 318M | 542M | 1.05B | 1.8B |
| Model Memory | 1.05 GB | 3.8 GB | 6.45 GB | 12.53 GB | 24.62 GB |
| Peak Activation Memory | 6.26 GB | 3.46 GB | 8.11 GB | 15.21 GB | 26.24 GB |
The table shows the improved memory utilization of AmoebaNet-D with GPipe, as reported in Table 1 of GPipe by Huang et al. The size of an AmoebaNet-D model is determined by the two hyperparameters L and D, which are proportional to the number of layers and the number of filters, respectively.
We reproduced the same settings as in the paper, regardless of the memory capacity of Tesla P40 GPUs. The reproducible code and experiment details are available in benchmarks/amoebanetd-memory.
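As a sanity check, the torchgpipe Model Memory figures are consistent with a simple accounting of 12 bytes per parameter in fp32: 4 for the weight, 4 for its gradient, and 4 for an SGD momentum buffer. The parameter counts below come from the table; the 12-bytes-per-parameter breakdown is our assumption, not something the benchmark states:

```python
def model_memory_gib(params, bytes_per_param=12):
    # fp32 weight (4 B) + gradient (4 B) + SGD momentum buffer (4 B) -- assumed
    return params * bytes_per_param / 2**30

# Parameters -> reported Model Memory (GiB), from the torchgpipe rows above
reported = {81.5e6: 0.91, 319.0e6: 3.57, 542.7e6: 6.07, 1.06e9: 11.80, 1.84e9: 20.62}
for params, gib in reported.items():
    # parameter counts are rounded in the table, so allow a small tolerance
    assert abs(model_memory_gib(params) - gib) < 0.1
```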
## Speed

### U-Net (5, 64) Speed Benchmark

| Experiment | Throughput | Speed up |
|---|---|---|
| baseline | 28.500/s | 1× |
| pipeline-1 | 24.456/s | 0.858× |
| pipeline-2 | 35.502/s | 1.246× |
| pipeline-4 | 67.042/s | 2.352× |
| pipeline-8 | 88.497/s | 3.105× |
To verify efficiency in the presence of skip connections, we measured the throughput of U-Net with various numbers of devices. We chose U-Net since it has several long skip connections.
The reproducible code and experiment details are available in benchmarks/unet-speed.
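The Speed up column is each configuration's throughput divided by the baseline throughput, rounded to three decimals; a quick check against the table:

```python
baseline = 28.500  # samples/s, U-Net (5, 64) without GPipe

throughputs = {
    'pipeline-1': 24.456,
    'pipeline-2': 35.502,
    'pipeline-4': 67.042,
    'pipeline-8': 88.497,
}

# speedup = throughput / baseline throughput
speedups = {name: round(t / baseline, 3) for name, t in throughputs.items()}
assert speedups == {'pipeline-1': 0.858, 'pipeline-2': 1.246,
                    'pipeline-4': 2.352, 'pipeline-8': 3.105}
```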
### AmoebaNet-D (18, 256) Speed Benchmark

| Experiment | Throughput | Speed up (torchgpipe) | Speed up (Huang et al.) |
|---|---|---|---|
| n=2, m=1 | 26.733/s | 1× | 1× |
| n=2, m=4 | 41.133/s | 1.539× | 1.07× |
| n=2, m=32 | 47.386/s | 1.773× | 1.21× |
| n=4, m=1 | 26.827/s | 1.004× | 1.13× |
| n=4, m=4 | 44.543/s | 1.666× | 1.26× |
| n=4, m=32 | 72.412/s | 2.709× | 1.84× |
| n=8, m=1 | 24.918/s | 0.932× | 1.38× |
| n=8, m=4 | 70.065/s | 2.621× | 1.72× |
| n=8, m=32 | 132.413/s | 4.953× | 3.48× |
The table shows the reproduced speed benchmark on AmoebaNet-D (18, 256), as reported in Table 2 of GPipe by Huang et al. Here n denotes the number of partitions and m the number of micro-batches; note that we write n for what the paper calls K.
The reproducible code and experiment details are available in benchmarks/amoebanetd-speed.
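The trend along the n and m axes matches the usual pipeline-bubble argument: with m micro-batches flowing through n partitions, the pipeline is busy for m + n − 1 steps instead of the ideal m, so utilization is bounded by m / (m + n − 1). A minimal sketch of this idealized model, which ignores communication and checkpointing recompute costs (which is why the measured speedups above fall short of it):

```python
def ideal_utilization(n, m):
    """Fraction of time each of n partitions is busy while m micro-batches
    traverse an idealized GPipe schedule (no communication or recompute cost)."""
    return m / (m + n - 1)

def ideal_speedup(n, m):
    """Best-case speedup over one device doing all n partitions' work serially."""
    return n * ideal_utilization(n, m)

# More micro-batches shrink the bubble: with n=8 partitions, going from
# m=1 to m=32 raises the ceiling from 1x toward the perfect 8x.
assert ideal_speedup(8, 1) == 1.0
assert round(ideal_speedup(8, 32), 2) == 6.56
```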
### ResNet-101 Speed Benchmark

| Experiment | Throughput | Speed up (torchgpipe) | Speed up (Huang et al.) |
|---|---|---|---|
| baseline | 95.862/s | 1× | 1× |
| pipeline-1 | 81.796/s | 0.853× | 0.80× |
| pipeline-2 | 135.539/s | 1.414× | 1.42× |
| pipeline-4 | 265.958/s | 2.774× | 2.18× |
| pipeline-8 | 411.662/s | 4.294× | 2.89× |
The table shows the reproduced speed benchmark on ResNet-101, as reported in Figure 3(b) of the fourth version of GPipe by Huang et al.
The reproducible code and experiment details are available in benchmarks/resnet101-speed.