Hang Su, Gang Li, Dong Yu, and Frank Seide
We investigate backpropagation-based sequence training of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, for conversational speech transcription. Theoretically, sequence training integrates with backpropagation in a straightforward manner. However, we find that to get reasonable results, heuristics are needed that point to a problem with lattice sparseness: the model must be adjusted to the updated numerator lattices by additional iterations of frame-based cross-entropy (CE) training; and to avoid distortions from “runaway” models, we can either add artificial silence arcs to the denominator lattices, or smooth the sequence objective with the frame-based one (F-smoothing). With the 309h Switchboard training set, the MMI objective achieves a relative word-error rate reduction of 11–15% over CE for matched test sets, and 10–17% for mismatched ones. This includes gains of 4–7% from realigned CE iterations. The BMMI and sMBR objectives gain less. With 2000h of data, gains are 2–9% after realigned CE iterations. Using GPGPUs, MMI is about 70% slower than CE training.
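The F-smoothing idea mentioned above, smoothing the sequence objective with the frame-based one, can be pictured as a weighted mix of the two objectives. The sketch below is illustrative only: the convex-combination form, the function name `f_smoothed_objective`, and the default `seq_weight` are assumptions made for this example and are not taken from the abstract.

```python
def f_smoothed_objective(seq_objective, frame_objective, seq_weight=0.8):
    """Interpolate a sequence-level objective with a frame-level one.

    Assumption: smoothing is modeled as a convex combination. The default
    weight is a hypothetical placeholder, not a value from the abstract.
    """
    if not 0.0 <= seq_weight <= 1.0:
        raise ValueError("seq_weight must lie in [0, 1]")
    # Because the mix is linear, the same weights would apply to the
    # gradients (error signals) of the two objectives.
    return seq_weight * seq_objective + (1.0 - seq_weight) * frame_objective


# Usage with made-up objective values for one utterance:
mixed = f_smoothed_objective(seq_objective=2.0, frame_objective=1.0, seq_weight=0.5)
```

With `seq_weight=1.0` this reduces to pure sequence training, and with `seq_weight=0.0` to pure frame-based CE training.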
In ICASSP 2013
Publisher: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)