Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu
This paper compares the theoretical efficiency of model-parallel and data-parallel distributed stochastic gradient descent training of DNNs. For a typical Switchboard DNN with 46M parameters, the results are not pretty: With modern GPUs and interconnects, model parallelism is optimal with only 3 GPUs in a single server, while data parallelism with a minibatch size of 1024 does not even scale to 2 GPUs.
We further show that data-parallel training efficiency can be improved by increasing the minibatch size (through a combination of AdaGrad and automatic adjustments of learning rate and minibatch size) and data compression. We arrive at an estimated possible end-to-end speed-up of 5 times or more.
We do not address robustness to process failure or other issues that might occur during training, nor differences in convergence speed between ASGD and SGD parameter update patterns.
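As a rough illustration of why minibatch size and data compression matter for data-parallel efficiency, the following minimal sketch models one minibatch as computation followed by a non-overlapped gradient exchange. It is not the paper's analysis. Only the 46M parameter count and the minibatch size of 1024 come from the text above. The function name, the per-GPU throughput `frames_per_second`, the link bandwidth `link_bytes_per_second`, and the all-reduce-style traffic factor are hypothetical placeholders chosen for illustration.

```python
def data_parallel_speedup(num_gpus, minibatch_frames,
                          num_params=46_000_000,        # from the abstract
                          bits_per_value=32,            # assumed gradient precision
                          frames_per_second=20_000.0,   # hypothetical single-GPU throughput
                          link_bytes_per_second=6e9):   # hypothetical interconnect bandwidth
    """Estimate the speed-up of data-parallel SGD over a single GPU.

    Assumptions (not from the paper): compute scales perfectly with the
    number of GPUs, communication does not overlap with computation, and
    each GPU moves 2 * (K - 1) / K times the gradient size per minibatch.
    """
    compute_single = minibatch_frames / frames_per_second
    if num_gpus == 1:
        return 1.0  # nothing to exchange on a single GPU

    # Each GPU processes an equal share of the minibatch.
    compute_parallel = compute_single / num_gpus

    # Gradient exchange cost is independent of the minibatch size.
    gradient_bytes = num_params * bits_per_value / 8
    traffic_factor = 2 * (num_gpus - 1) / num_gpus
    comm = traffic_factor * gradient_bytes / link_bytes_per_second

    return compute_single / (compute_parallel + comm)


if __name__ == "__main__":
    for frames in (1024, 8192):
        for bits in (32, 1):
            s = data_parallel_speedup(2, frames, bits_per_value=bits)
            print(f"2 GPUs, minibatch {frames}, {bits}-bit gradients: {s:.2f}x")
```

In this model the communication cost per minibatch is fixed, so a larger minibatch amortizes it over more computation, and fewer bits per gradient value shrink it directly. Those are the two levers the abstract names. The absolute numbers it prints depend entirely on the placeholder hardware figures.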