Microphone Array Processing for Distant Speech Recognition: Towards Real-World Deployment

Kenichi Kumatani; Takayuki Arakawa; Kazumasa Yamamoto; John McDonough; Bhiksha Raj; Rita Singh; Ivan Tashev

APSIPA Annual Summit and Conference | December 2012

Distant speech recognition (DSR) holds out the promise of a natural human-computer interface: it enables verbal interaction with computers without requiring users to wear intrusive body- or head-mounted devices. Recognizing distant speech robustly, however, remains a challenge. This paper provides an overview of DSR systems based on microphone arrays. In particular, we present recent work on acoustic beamforming for DSR, along with experimental results verifying the effectiveness of the algorithms described. Starting from a word error rate (WER) of 14.3% with a single microphone of a linear array, our state-of-the-art DSR system achieved a WER of 5.3%, which is comparable to the 4.2% obtained with a lapel microphone. Furthermore, we report speech recognition experiments on data captured with a popular device, the Kinect [1]. Even when speakers were 4 meters away from the Kinect, our DSR system achieved acceptable recognition performance, a WER of 24.1%, starting from a WER of 42.5% with a single array channel.
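
To put the reported figures in perspective, the WER improvements can also be expressed as relative reductions over the single-channel baseline. The short sketch below is only an illustration of that arithmetic: the function name `relative_wer_reduction` is hypothetical and not taken from the paper, and the only inputs are the WER values quoted in the abstract.

```python
def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Return the relative WER reduction as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer


# WER values (in percent) as quoted in the abstract.
linear_array = relative_wer_reduction(14.3, 5.3)   # single array mic -> full DSR system
kinect = relative_wer_reduction(42.5, 24.1)        # single Kinect channel -> full DSR system

print(f"Linear array: {linear_array:.1f}% relative WER reduction")
print(f"Kinect at 4 m: {kinect:.1f}% relative WER reduction")
```

Under this reading, the linear-array system removes roughly 63% of the single-microphone errors, and the Kinect system roughly 43% of the single-channel errors.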