Arun C. Surendran, CCSP Group, Microsoft Research
John C. Platt, CCSP Group, Microsoft Research
Christopher J.C. Burges, CCSP Group, Microsoft Research
8th International Conference on Spoken Language Processing, to appear (2004).
In this paper, we introduce a new framework for speech
detection using convolutional networks. We propose a network architecture that
can incorporate long and short-term temporal and spectral correlations of
speech in the detection process. The proposed design addresses many
shortcomings of existing speech detectors in a single unified framework. First, it
improves the robustness of the system to environmental variability while still
being fast to evaluate. Second, it can be extended to
work at different time scales for different applications. Finally, it is
discriminative and produces reliable estimates of the probability of presence
of speech in each frame for a wide variety of noise conditions. We propose that
the inputs to the system be features that are measures of the true
signal-to-noise ratio of a set of frequency bands of the signal. These can be
easily and automatically generated by tracking the noise spectrum online. We
present preliminary results on the
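
As a rough illustration of the input features described above, the following Python sketch computes a per-band SNR estimate in dB for each frame while tracking the noise level online. The abstract does not specify the noise tracker, so everything here is an assumption made for illustration: the function name `band_snr_features`, the smoothing constant `alpha`, the rule that the noise estimate follows drops immediately and rises slowly, and the choice to initialize the noise estimate from the first frame. The result is an estimate of the SNR, not the true SNR.

```python
import math


def band_snr_features(frames, alpha=0.95, floor=1e-10):
    """Return per-frame, per-band SNR estimates in dB.

    frames: list of frames, each a list of band energies (linear power).
    alpha:  smoothing constant for the noise tracker (hypothetical value).
    floor:  small constant that avoids taking the log of zero.
    """
    if not frames:
        return []

    # Assumption: the first frame contains noise only.
    noise = list(frames[0])
    features = []

    for energies in frames:
        snr_db = []
        for b, energy in enumerate(energies):
            # SNR estimate uses the noise level tracked up to the previous frame.
            ratio = max(energy, floor) / max(noise[b], floor)
            snr_db.append(10.0 * math.log10(ratio))

            # Simple online tracker: follow decreases at once, increases slowly,
            # so that short bursts of speech barely raise the noise estimate.
            if energy < noise[b]:
                noise[b] = energy
            else:
                noise[b] = alpha * noise[b] + (1.0 - alpha) * energy
        features.append(snr_db)

    return features


# Two bands: a burst of energy appears in band 0 in the third frame.
example = band_snr_features([[1.0, 1.0], [1.0, 1.0], [100.0, 1.0]])
```

In a detector of the kind the abstract describes, a matrix of such features over time and frequency would be the input to the convolutional network; the network itself is not sketched here because the abstract gives no architectural details.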
© 2004 ISCA. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the ISCA and/or the author.