
ResNet-101 Speed Benchmark

Experiment    Throughput             Speedup
naive-1        92.539 samples/sec    1.000x
pipeline-1     69.960 samples/sec    0.756x
pipeline-2    137.788 samples/sec    1.489x
pipeline-4    243.322 samples/sec    2.629x
pipeline-8    404.084 samples/sec    4.367x

These results can be reproduced on Tesla P40 GPUs; the experiment details are in examples/resnet101_speed_benchmark.
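The Speedup column above appears to be each experiment's throughput divided by the naive-1 throughput. The following is a minimal sketch of that arithmetic, assuming this interpretation; the `speedups` helper and the `throughput` dictionary are illustrative names and are not part of the benchmark code.

```python
# Throughputs (samples/sec) copied from the ResNet-101 speed table above.
throughput = {
    "naive-1": 92.539,
    "pipeline-1": 69.960,
    "pipeline-2": 137.788,
    "pipeline-4": 243.322,
    "pipeline-8": 404.084,
}


def speedups(throughput, baseline):
    """Return each experiment's throughput relative to the baseline experiment.

    Hypothetical helper for illustration: assumes speedup is a plain ratio
    of throughputs, which matches the figures reported in the table.
    """
    base = throughput[baseline]
    return {name: value / base for name, value in throughput.items()}


# Print one line per experiment, rounded to three decimals like the table.
for name, ratio in speedups(throughput, "naive-1").items():
    print(f"{name:<12}{ratio:.3f}x")
```

The same helper can be applied to the AmoebaNet-D speed table by passing its throughputs and using naive-2 as the baseline.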

ResNet-101 Accuracy Benchmark

Experiment          Top-1 error (%)
dataparallel-256    22.02±0.11
dataparallel-1k     22.04±0.24
pipeline-256        21.99±0.13
pipeline-1k         22.24±0.19
pipeline-4k         22.13±0.09

These results can be reproduced on Tesla P40 GPUs; the experiment details are in examples/resnet101_accuracy_benchmark.

AmoebaNet-D Speed Benchmark

Experiment    Throughput            Speedup
naive-2       14.188 samples/sec    1.000x
pipeline-2    20.346 samples/sec    1.434x
pipeline-4    29.074 samples/sec    2.049x
pipeline-8    34.392 samples/sec    2.424x

These results can be reproduced on Tesla P40 GPUs; the experiment details are in examples/amoebanetd_speed_benchmark.

AmoebaNet-D Memory Benchmark

Experiment    AmoebaNet-D (L, F)    # of Model Parameters    Total Model Parameter Memory    Total Peak Activation Memory
naive-1       (6, 208)              90M                      1.00GB
pipeline-1    (6, 416)              358M                     4.01GB                          6.64GB
pipeline-2    (6, 544)              613M                     6.45GB                          11.31GB
pipeline-4    (12, 544)             1.16B                    13.00GB                         18.72GB
pipeline-8    (24, 512)             2.01B                    22.42GB                         35.78GB