Elizabeth Shriberg, Andreas Stolcke, and Suman Ravuri
As dialog systems evolve to handle unconstrained input and to operate in open environments, addressee detection (detecting speech to the system versus to other people) becomes an increasingly important challenge. We study a corpus in which speakers talk both to a system and to each other, and model two dimensions of speaking style that talkers modify when changing addressee: speech rhythm and vocal effort. For each dimension we design features that do not require speech recognition output, session normalization, speaker normalization, or dialog context. Detection experiments show that rhythm and effort features are complementary, outperform lexical models based on recognized words, and reduce error rates even if word recognition is error-free. Simulated online processing experiments show that all features need only the first couple of seconds of speech. Finally, we find that temporal and spectral stylistic models can be trained on outside corpora, such as ATIS and ICSI meetings, with reasonable generalization to the target task, thus showing promise for domain-independent computer-versus-human addressee detectors.
|Published in||Proc. Interspeech|
|Publisher||ISCA - International Speech Communication Association|