Xiaodong He and Li Deng
Automatic speech recognition (ASR) is an enabling technology for a wide range of information processing applications, including speech translation, voice search (i.e., information retrieval with speech input), and conversational understanding. In these speech-centric applications, the output of ASR, a “noisy” text, is fed into downstream processing systems that carry out the designated task of translation, information retrieval, or natural language understanding. In conventional applications, the ASR model is usually trained as a subsystem without considering the downstream systems, which often leads to sub-optimal end-to-end performance. In this paper, we propose a unifying end-to-end optimization framework in which the model parameters of all subsystems, including ASR, are learned by Extended Baum-Welch (EBW) algorithms that optimize criteria directly tied to the end-to-end performance measure. We demonstrate the effectiveness of the proposed approach on a speech translation task using the IWSLT spoken language translation benchmark. Our experimental results show that the proposed method significantly improves translation quality over conventional techniques based on separate, modular subsystem design. We also analyze the EBW-based optimization algorithms employed in our work and discuss their relationship with other popular optimization techniques.
Publisher: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)