Sabato Marco Siniscalchi, Dong Yu, Li Deng, and Chin-hui Lee
In recent years, there has been renewed interest in artificial neural networks (ANNs) for speech applications, and a new trend in advancing speech technology appears to have begun. Two main contributions triggered this trend: 1) a major advance in training the weights of deep neural networks (DNNs), with a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture outperforming a conventional Gaussian mixture model hidden Markov model (GMM-HMM) automatic speech recognition (ASR) system on a challenging business search dataset; and 2) the demonstration that phoneme classification can be boosted by a hierarchical structure of multi-layer perceptrons (MLPs) trained to model long-span temporal patterns, with beneficial effects on language recognition tasks. In this work, we combine these two lines of research and demonstrate that word recognition accuracy can be significantly enhanced by arranging DNNs in a hierarchical structure to model long-term energy trajectories. The proposed solution was evaluated on the 5000-word Wall Street Journal task, yielding consistent and significant improvements in both phone and word recognition accuracy. We also analyze the effects of various modeling choices on system performance and compare several architectural solutions.
In IEEE Signal Processing Letters