Elizabeth Shriberg, Andreas Stolcke, Dilek Hakkani-Tür, and Larry Heck
New challenges arise for addressee detection when multiple people interact jointly with a spoken dialog system using unconstrained natural language. We study the problem of discriminating computer-directed from human-directed speech in a new corpus of human-human-computer (H-H-C) dialog, using lexical and prosodic features. The prosodic features use no word, context, or speaker information. Results with speech recognition at 19% word error rate (WER) show improvements in equal error rate (EER) from lexical features (EER=23.1%) to prosodic features (EER=12.6%) to a combined model (EER=11.1%). Prosodic features also provide a 35% error reduction over a lexical model using true words (EER from 10.2% to 6.7%). Modeling energy contours with Gaussian mixture models (GMMs) provides a particularly good prosodic model. While lexical models perform well for commands, they confuse free-form system-directed speech with human-human speech. Prosodic models dramatically reduce these confusions, implying that users change speaking style as they shift addressees (computer versus human) within a session. Overall, the results provide strong support for combining simple acoustic-prosodic models with lexical models to detect speaking style differences for this task.
In Proceedings of Interspeech
Publisher: International Speech Communication Association
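The abstract reports its results as equal error rates and as a relative error reduction. The sketch below shows one common way such figures can be computed from per-utterance detector scores. It is illustrative only and is not the authors' evaluation code: the function names, the threshold sweep, and the toy scores are all assumptions introduced here. The EER is approximated as the operating point where the miss rate (computer-directed speech rejected) and the false-alarm rate (human-directed speech accepted) are closest.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: scores for computer-directed utterances (higher = more
    likely computer-directed). nontarget_scores: scores for human-directed
    utterances. Both are hypothetical inputs for this sketch.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # Miss: a computer-directed utterance scored below the threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: a human-directed utterance scored at or above it.
        fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - fa)
        if gap < best_gap:
            # Report the average of the two rates at the closest point.
            best_gap, eer = gap, (miss + fa) / 2
    return eer


def relative_reduction(baseline_error, new_error):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_error - new_error) / baseline_error


# Toy scores, invented for illustration (not data from the paper).
toy_eer = equal_error_rate([0.9, 0.8, 0.7, 0.3], [0.6, 0.4, 0.2, 0.1])

# Using the rounded EERs quoted in the abstract (10.2% and 6.7%).
reduction = relative_reduction(10.2, 6.7)
```

With the rounded EERs quoted above, `relative_reduction(10.2, 6.7)` gives roughly 0.34, in line with the reduction of about 35% stated in the abstract; the small difference presumably reflects rounding of the published figures.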