Li Deng, Zicheng Liu, Zhengyou Zhang, and Alex Acero
One well-known difficulty in creating an effective human-machine interface
via speech input is the adverse effect of concurrent
acoustic noise. To overcome this challenge, we have developed
a joint hardware and software solution. A novel bone-conductive
microphone is integrated with a regular air-conductive one in a
single headset. These two simultaneous sensors capture distinct
signal properties in the speech embedded in acoustic noise. This
paper focuses on exploring the type of dynamic properties
that are relatively invariant between the bone-conductive sensor’s
signal and the clean speech signal; the latter would not be available
to the recognizer. Our approach is based on a nonlinear processing
technique that estimates the unobserved (hidden) vocal tract
resonances, as a representation of such invariant hidden dynamics,
from the available bone-sensor signal. The information about
these dynamic aspects of the clean speech is then fused with other
noisy measurements with the aim of improving the recognition system’s
robustness to acoustic distortion. The fusion technique is based
on a combination of three sets of signals, including the speech
signal synthesized from the vocal tract resonance dynamics extracted
nonlinearly from the bone-sensor signal.
In Proc. of the IEEE Workshop on Multimedia Signal Processing
Publisher: Institute of Electrical and Electronics Engineers, Inc.
© 2007 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.