Benchmarks

Every experiment is reproducible on Tesla P40 GPUs. Follow the link to code for each benchmark.

Transparency

ResNet-101 Accuracy Benchmark

Batch size

torchgpipe

nn.DataParallel

Goyal et al.

256

21.99±0.13

22.02±0.11

22.08±0.06

1K

22.24±0.19

22.04±0.24

N/A

4K

22.13±0.09

N/A

N/A

GPipe should be transparent not to introduce additional hyperparameter tuning. To verify the transparency, we reproduced top-1 error rate of ResNet-101 on ImageNet, as reported in Table 2(c) of Accurate, Large Minibatch SGD by Goyal et al.

The reproducible code and experiment details are available in benchmarks/resnet101-accuracy.

Memory

U-Net (B, C) Memory Benchmark

Experiment

U-Net (B, C)

Parameters

Memory usage

baseline

(6, 72)

362.2M

20.3 GiB

pipeline-1

(11, 128)

2.21B

20.5 GiB

pipeline-2

(24, 128)

4.99B

43.4 GiB

pipeline-4

(24, 160)

7.80B

79.1 GiB

pipeline-8

(48, 160)

15.82B

154.1 GiB

The table shows how GPipe facilitates scaling U-Net models. baseline denotes the baseline without pipeline parallelism nor checkpointing, and pipeline-1, -2, -4, -8 denotes that the model is trained with GPipe with the corresponding number of partitions.

Here we used a simplified U-Net architecture. The size of a model is determined by hyperparameters B and C which are proportional to the number of layers and filters, respectively.

The reproducible code and experiment details are available in benchmarks/unet-memory.

AmoebaNet-D (L, D) Memory Benchmark

Experiment

baseline

pipeline-1

pipeline-2

pipeline-4

pipeline-8

AmoebaNet-D (L, D)

(18, 208)

(18, 416)

(18, 544)

(36, 544)

(72, 512)

torchgpipe

Parameters

81.5M

319.0M

542.7M

1.06B

1.84B

Model Memory

0.91 GiB

3.57 GiB

6.07 GiB

11.80 GiB

20.62 GiB

Peak Activation Memory

Out of memory

0.91 GiB

3.39 GiB

6.91 GiB

10.83 GiB

Huang et al.

Parameters

82M

318M

542M

1.05B

1.8B

Model Memory

1.05GB

3.8GB

6.45GB

12.53GB

24.62GB

Peak Activation Memory

6.26GB

3.46GB

8.11GB

15.21GB

26.24GB

The table shows the better memory utilization of AmoebaNet-D with GPipe, as stated in Table 1 of GPipe by Huang et al. The size of an AmoebaNet-D model is determined by two hyperparameters L and D which are proportional to the number of layers and filters, respectively.

We reproduced the same settings in the paper with regardless of memory capacity of Tesla P40 GPUs. The reproducible code and experiment details are available in benchmarks/amoebanetd-memory.

Speed

U-Net (5, 64) Speed Benchmark

Experiment

Throughput

Speed up

baseline

28.500/s

pipeline-1

24.456/s

0.858×

pipeline-2

35.502/s

1.246×

pipeline-4

67.042/s

2.352×

pipeline-8

88.497/s

3.105×

To verify efficiency with skip connections, we measured the throughput of U-Net with various number of devices. We chose to use U-Net since it has several long skip connections.

The reproducible code and experiment details are available in benchmarks/unet-speed.

AmoebaNet-D (18, 256) Speed Benchmark

(n: number of partitions, m: number of micro-batches)

Experiment

Throughput

torchgpipe

Huang et al.

n=2, m=1

26.733/s

n=2, m=4

41.133/s

1.539×

1.07×

n=2, m=32

47.386/s

1.773×

1.21×

n=4, m=1

26.827/s

1.004×

1.13×

n=4, m=4

44.543/s

1.666×

1.26×

n=4, m=32

72.412/s

2.709×

1.84×

n=8, m=1

24.918/s

0.932×

1.38×

n=8, m=4

70.065/s

2.621×

1.72×

n=8, m=32

132.413/s

4.953×

3.48×

The table shows the reproduced speed benchmark on AmoebaNet-D (18, 256), as reported in Table 2 of GPipe by Huang et al. Note that we replaced K in the paper with n.

The reproducible code and experiment details are available in benchmarks/amoebanetd-speed.

ResNet-101 Speed Benchmark

Experiment

Throughput

torchgpipe

Huang et al.

baseline

95.862/s

pipeline-1

81.796/s

0.853×

0.80×

pipeline-2

135.539/s

1.414×

1.42×

pipeline-4

265.958/s

2.774×

2.18×

pipeline-8

411.662/s

4.294×

2.89×

The table shows the reproduced speed benchmark on ResNet-101, as reported in Figure 3(b) of the fourth version of GPipe by Huang et al.

The reproducible code and experiment details are available in benchmarks/resnet101-speed.