Xiaodong He, Li Deng, and Alex Acero
May 2011
Speech translation (ST) is an enabling technology for cross-lingual
oral communication. An ST system consists of two major
components: an automatic speech recognizer (ASR) and a machine
translator (MT). Nowadays, most ASR systems are trained and
tuned by minimizing word error rate (WER). However, WER
counts word errors at the surface level. It does not consider the
contextual and syntactic roles of a word, which are often critical
for MT. In end-to-end ST scenarios, whether WER is a good
metric for the ASR component of the full ST system is an open
question that has not been studied systematically. In this paper, we report our
recent investigation of this issue, focusing on the interactions between
ASR and MT in an ST system. We show that BLEU-oriented global
optimization of ASR system parameters improves translation
quality by an absolute 1.5% in BLEU score, at the cost of a higher WER
than the conventional, WER-optimized ASR system. We also
conducted an in-depth study on the impact of ASR errors on the
final ST output. Our findings suggest that the speech recognizer
component of the full ST system should be optimized by
translation metrics instead of the traditional WER.
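
To illustrate what "counting word errors at the surface level" means, the following is a minimal sketch of the standard WER computation (word-level edit distance divided by reference length). It is not the paper's evaluation code; the function name `word_error_rate` and the example sentences are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example sentences (not taken from the paper).
reference = "do not take the red pill"
hyp_article = "do not take a red pill"    # "the" -> "a"
hyp_negation = "do now take the red pill"  # "not" -> "now"

wer_article = word_error_rate(reference, hyp_article)
wer_negation = word_error_rate(reference, hyp_negation)
```

Both hypotheses contain exactly one substituted word, so WER scores them identically, even though losing the negation is likely to damage a downstream translation far more than swapping an article. This is the kind of mismatch between WER and translation quality that the abstract refers to.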
In Proc. ICASSP
Publisher IEEE
Type Inproceedings