Why Word Error Rate is not a Good Metric for Speech Recognizer Training for the Speech Translation Task?

Speech translation (ST) is an enabling technology for cross-lingual oral communication. An ST system consists of two major components: an automatic speech recognizer (ASR) and a machine translator (MT). Nowadays, most ASR systems are trained and tuned by minimizing word error rate (WER). However, WER counts word errors at the surface level; it does not consider the contextual and syntactic roles of a word, which are often critical for MT. In end-to-end ST scenarios, whether WER is a good metric for the ASR component of the full ST system is an open issue that lacks systematic study. In this paper, we report our recent investigation of this issue, focusing on the interactions between ASR and MT in an ST system. We show that BLEU-oriented global optimization of ASR system parameters improves translation quality by an absolute 1.5% BLEU score, while sacrificing WER relative to the conventional, WER-optimized ASR system. We also conducted an in-depth study of the impact of ASR errors on the final ST output. Our findings suggest that the speech recognizer component of a full ST system should be optimized by translation metrics instead of the traditional WER.
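To make concrete what "counting word errors at the surface level" means, here is a minimal sketch of the standard WER computation (Levenshtein edit distance over words, normalized by reference length). The example sentences are our own illustration, not from the paper: two hypotheses can have identical WER even though one error flips the meaning while the other is largely harmless for translation.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

ref = "she did not sign the contract"
# One substitution each, so identical WER (1/6), but the first error
# destroys the meaning for translation while the second barely matters.
print(wer(ref, "she did now sign the contract"))  # negation lost
print(wer(ref, "she did not sign a contract"))    # article changed
```

This surface-level symmetry is exactly why a WER-optimal recognizer is not necessarily the best front end for MT, motivating the BLEU-oriented optimization studied in the paper.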


In: Proc. ICASSP
Publisher: IEEE


Type: Inproceedings