The long-term goal of the Situated Interaction project is to enable a new
generation of interactive systems that embed interaction and computation deeply
into the natural flow of everyday tasks, activities and collaborations. Example
scenarios include human-robot interaction, e-home, interactive billboards, and
systems that monitor, assist and coordinate teams of experts through complex
tasks and procedures.

Some of the core research areas under investigation are:
- situational awareness (e.g. conversational scene analysis, multimodal sensing and fusion);
- engagement, attention, floor in multi-participant interaction;
- mixed-initiative, multi-participant interaction and dialog control;
- situated grounding, robustness and error handling;
- life-long learning and adaptation;
- spatio-temporal reasoning, behavior and intention recognition;
- high-resolution, coordinated behavioral models for embodied agents.

- Wang, W., Bohus, D., Kamar, E., and Horvitz, E. (2012) - Crowdsourcing the Acquisition of Natural Language Corpora: Methods and Observations, to appear in SLT'2012, Miami, USA [abs]
We study the opportunity for using crowdsourcing methods to acquire language corpora for use in natural language processing systems. Specifically, we empirically investigate three methods for eliciting natural language sentences corresponding to a given semantic form. The methods convey frame semantics to a worker by means of sentences, scenarios, and list-based descriptions. We discuss various performance measures of the crowdsourcing process, analyze the semantic correctness of the collected language and investigate and discuss its naturalness and biases.
- Rosenthal, S., Bohus, D., Horvitz, E. (2012) - Value of Information with Streaming Evidence, Microsoft Research Technical Report, MSR-TR-2012-99 [abs]
We explore a key question facing autonomous situated systems that perform continual sensing: is it best to act based on current evidence or to wait for more evidence that could potentially improve the action selection at the cost of delay? To address this challenge we present methods for computing the expected value of information in settings with streaming, high-dimensional sensory evidence. The proposed approach relies on constructing state inference projection models, i.e. direct conditional models that predict future beliefs over future states given the recent history of streaming evidence. These models can be trained from data in a self-supervised fashion and can be used in systems with modular, hierarchical state inference architectures. We implement the proposed approach to extend the abilities of a physically situated conversational agent to proactively engage people in its environment and illustrate the operation of the method via runtime traces.
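The act-now-versus-wait trade-off described in this abstract can be illustrated with a small sketch. It is not the method from the report: the states, actions, utilities, delay cost and the hand-written "projected" beliefs below are all hypothetical, and the projected beliefs simply stand in for whatever a learned projection model would predict.

```python
from typing import Dict, List, Tuple

Belief = Dict[str, float]                 # state -> probability
Utility = Dict[Tuple[str, str], float]    # (action, state) -> utility


def expected_utility(belief: Belief, action: str, utility: Utility) -> float:
    """Expected utility of taking `action` under the given belief."""
    return sum(p * utility[(action, state)] for state, p in belief.items())


def best_value(belief: Belief, actions: List[str], utility: Utility) -> float:
    """Expected utility of the best available action under the belief."""
    return max(expected_utility(belief, a, utility) for a in actions)


def value_of_waiting(current: Belief,
                     projected: List[Tuple[float, Belief]],
                     actions: List[str],
                     utility: Utility,
                     delay_cost: float) -> float:
    """Expected gain from waiting for more evidence instead of acting now.

    `projected` is a list of (probability, future belief) pairs; here it is
    supplied by hand as a stand-in for a projection model's output.
    """
    act_now = best_value(current, actions, utility)
    act_later = sum(w * best_value(b, actions, utility) for w, b in projected)
    return act_later - delay_cost - act_now


# Hypothetical example: should an agent greet a person who may be approaching?
ACTIONS = ["greet", "stay_quiet"]
UTILITY = {
    ("greet", "engaging"): 1.0, ("greet", "passing"): -1.0,
    ("stay_quiet", "engaging"): -0.5, ("stay_quiet", "passing"): 0.0,
}
NOW = {"engaging": 0.5, "passing": 0.5}
LATER = [(0.5, {"engaging": 0.9, "passing": 0.1}),
         (0.5, {"engaging": 0.1, "passing": 0.9})]

gain = value_of_waiting(NOW, LATER, ACTIONS, UTILITY, delay_cost=0.1)
print("wait" if gain > 0 else "act now", round(gain, 3))
```

With these made-up numbers waiting is worthwhile, because the future evidence is expected to separate the two states sharply; raising `delay_cost` eventually flips the decision to acting immediately.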
- Vinyals, O., Bohus, D., Caruana, R. (2012) - Learning Speaker, Addressee and Overlap Detection Models from Multimodal Streams, to appear in ICMI'2012, Santa Monica, USA [abs]
A key challenge in developing conversational systems is fusing streams of information provided by different sensors to make inferences about the behaviors and goals of people. Such systems can leverage visual and audio information collected through cameras and microphone arrays, including the location of various people, their focus of attention, body pose, the sound source direction, prosody, and speech recognition results. In this paper, we explore discriminative learning techniques for making accurate inferences on the problems of speaker, addressee and overlap detection in multiparty human-computer dialog. The focus is on finding ways to leverage within- and across-signal temporal patterns and to construct representations from the raw streams in an automated manner that are informative for the inference problem. We present a novel extension to traditional decision trees which allows them to incorporate and model temporal signals. We contrast these methods with more traditional approaches where a human expert manually engineers relevant temporal features. The proposed approach performs well even with relatively small amounts of training data, which is of practical importance as designing features that are task dependent is time consuming and not always possible.
- Bohus, D., Kamar, E., Horvitz, E. (2012) - Towards Situated Collaboration, in NAACL Workshop on Future Directions and Challenges in Spoken Dialog Systems: Tools and Data, Montreal, Canada, 2012 [abs]
We outline a set of key challenges for dialog management in physically situated interactive systems, and propose a core shift in perspective that places spoken dialog in the context of the larger collaborative challenge of managing parallel, coordinated actions in the open world.
- Bohus, D., Horvitz, E. (2011) - Decisions about Turns in Multiparty Conversation: From Perception to Action, in ICMI-2011, Alicante, Spain [abs]
We present a decision-theoretic approach for guiding turn taking in a spoken dialog system operating in multiparty settings. The proposed methodology couples inferences about multiparty conversational dynamics with assessed costs of different outcomes, to guide turn-taking decisions. Beyond considering uncertainties about outcomes arising from evidential reasoning about the state of a conversation, we endow the system with awareness and methods for handling uncertainties stemming from computational delays in its own perception and production. We illustrate via sample cases how the proposed approach makes decisions, and we investigate the behaviors of the proposed methods via a retrospective analysis on logs collected in a multiparty interaction study.
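As a rough illustration of coupling an inference with assessed outcome costs, the sketch below picks between taking the turn and waiting by minimizing expected cost. It is a simplified two-action example under assumed costs, not the model from the paper; the names `cost_overlap` and `cost_gap` are hypothetical, and the paper's treatment of perception and production delays is not modeled.

```python
def expected_costs(p_floor_free: float, cost_overlap: float, cost_gap: float) -> dict:
    """Expected cost of each action given P(the floor has been released).

    cost_overlap: assumed cost of speaking while someone else still holds the floor.
    cost_gap: assumed cost of staying silent although the floor is free.
    """
    return {
        "take_turn": (1.0 - p_floor_free) * cost_overlap,
        "wait": p_floor_free * cost_gap,
    }


def decide(p_floor_free: float, cost_overlap: float, cost_gap: float) -> str:
    """Choose the action with the lowest expected cost."""
    costs = expected_costs(p_floor_free, cost_overlap, cost_gap)
    return min(costs, key=costs.get)


def take_turn_threshold(cost_overlap: float, cost_gap: float) -> float:
    """Belief above which taking the turn has the lower expected cost."""
    return cost_overlap / (cost_overlap + cost_gap)


# Overlaps are assumed to be four times as costly as an unnecessary gap.
for p in (0.5, 0.9):
    print(p, decide(p, cost_overlap=4.0, cost_gap=1.0))
print("threshold:", take_turn_threshold(4.0, 1.0))
```

The threshold follows from setting the two expected costs equal: the costlier an overlap is relative to a gap, the more certain the system must be that the floor is free before it speaks.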
- Bohus, D., Horvitz, E. (2011) - Multiparty Turn Taking in Situated Dialog: Study, Lessons and Directions, in SIGdial'2011 [abs] [Supplemental materials and videos]
We report on an empirical study of a multiparty turn taking model for physically situated spoken dialog systems. We discuss subjective and objective performance measures that show how the model, supported with a basic set of sensory competencies and turn-taking policies, can enable interactions with multiple participants in a collaborative task setting. The analysis we conduct brings to the fore several phenomena and frames challenges for managing multiparty turn taking in physically situated interaction.
- Bohus, D., Horvitz, E. (2010) - On the Challenges and Opportunities of Physically Situated Dialog, in AAAI Fall Symposium on Dialog with Robots, Arlington, VA [abs]
We outline several challenges and opportunities for building physically situated systems that can interact in open, dynamic, and relatively unconstrained environments. We review a platform and recent progress on developing computational methods for situated, multiparty, open-world dialog, and highlight the value of representations of the physical surroundings and of harnessing the broader situational context when managing communicative processes such as engagement, turn-taking, language understanding, and dialog management. Finally, we outline an open-world learning challenge that spans these different levels.
- Bohus, D., Horvitz, E. (2010) - Facilitating Multiparty Dialog with Gaze, Gesture and Speech, in ICMI'10, Beijing, China [abs] [Supplemental materials and videos]
We study how synchronized gaze, gesture and speech rendered by an embodied conversational agent can influence the flow of conversations in multiparty settings. We review a computational framework for turn taking that provides the foundation for tracking and communicating intentions to hold, release, or take control of the conversational floor. We then present details of the implementation of the approach in an embodied conversational agent and describe experiments with the system in a shared task setting. Finally, we discuss results showing how the verbal and non-verbal cues used by the avatar can shape the dynamics of multiparty conversation.
- Bohus, D., Horvitz, E. (2010) - Computational Models for Multiparty Turn-Taking, Microsoft Technical Report MSR-TR-2010-115 [abs] [Supplemental materials and videos]
We describe a computational framework for modeling and managing turn-taking in open-world spoken dialog systems. We present a representation and methodology for tracking the conversational dynamics in multiparty interactions, making floor control decisions, and rendering these decisions into appropriate behaviors. We show how the approach enables an embodied conversational agent to participate in multiparty interactions, and to handle a diversity of natural turn-taking phenomena, including multiparty floor management, barge-ins, restarts, and continuations. Finally, we discuss results and lessons learned from experiments.
- Bohus, D., Horvitz, E. (2009) - Dialog in the Open World: Platform and Applications, in Proceedings of ICMI'09, Boston, MA [abs] [Receptionist video] [Questions game video] (ICMI'09 outstanding paper award)
We review key challenges of developing spoken dialog systems that can engage in interaction with one or multiple participants in open, relatively unconstrained environments. We outline a set of core competencies for open-world dialog, and we describe three prototype systems in this space. The systems harness a common underlying conversational framework which integrates an array of predictive models and component technologies, including speech recognition, head and pose tracking, probabilistic models for scene analysis, multiparty engagement and turn taking, and inferences about user long-term goals and activities. We discuss the current models and showcase their function by means of a sample recorded interaction, and we review results from an observational study of open-world, multiparty dialog in the wild.
- Bohus, D., Horvitz, E. (2009) - Learning to Predict Engagement with a Spoken Dialog System in Open-World Settings, in Proceedings of SIGdial'09, London, UK [abs] [note]
We describe a machine learning approach that allows an open-world spoken dialog system to learn to predict engagement intentions in situ, from interaction. The proposed approach does not require any developer supervision, and leverages spatiotemporal and attentional features automatically extracted from a visual analysis of people coming into the proximity of the system to produce models that are attuned to the characteristics of the environment the system is placed in. Experimental results indicate that a system using the proposed approach can learn to recognize engagement intentions at low false positive rates (e.g. 2-4%) up to 3-4 seconds prior to the actual moment of engagement.
Subsequent experiments with the machine learning infrastructure used in this work have revealed a small defect in the model construction and evaluation. The maximum entropy model was trained in a stepwise fashion, where at each step the next best feature was added to the model; stopping was based on a BIC criterion. During this stepwise model building process, the scoring of features was done by assessing performance on the entire dataset (including train + development folds), instead of exclusively on the train folds. Nevertheless, once a feature to be added to a model was selected, the model was trained exclusively on the training folds, i.e. the corresponding feature weight in the max-ent model was determined based only on the training data, and the evaluation was done on the held-out development fold. Subsequent experiments with a correct setup (where the feature scoring is done only by looking at the training folds) on several problems show that this bug does not significantly affect results. While with a correct setup the numbers reported might differ by small amounts, we believe the general results we have reported in this paper stand.
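The corrected setup described in this note, where candidate features are scored on the training folds only, can be sketched as follows. This is not the infrastructure used in the paper: a small pure-Python logistic regression stands in for the maximum entropy trainer, the data is synthetic, and all function names are hypothetical.

```python
import math
import random


def predict(w, x):
    """P(y=1 | x) under a logistic (binary max-ent) model; w[0] is the bias."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    z = max(-30.0, min(30.0, z))  # clamp so exp() and log() stay finite
    return 1.0 / (1.0 + math.exp(-z))


def fit(rows, labels, steps=200, lr=0.5):
    """Fit weights by plain batch gradient ascent on the log-likelihood."""
    n, d = len(rows), len(rows[0])
    w = [0.0] * (d + 1)
    for _ in range(steps):
        grad = [0.0] * (d + 1)
        for x, y in zip(rows, labels):
            err = y - predict(w, x)
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w


def bic(w, rows, labels):
    """BIC = k * ln(n) - 2 * log-likelihood (lower is better)."""
    ll = sum(math.log(predict(w, x) if y else 1.0 - predict(w, x))
             for x, y in zip(rows, labels))
    return len(w) * math.log(len(rows)) - 2.0 * ll


def project(rows, feats):
    """Keep only the selected feature columns."""
    return [[r[j] for j in feats] for r in rows]


def forward_select(train_rows, train_labels):
    """Greedy stepwise selection; candidates are scored on training rows only."""
    def score(feats):
        x = project(train_rows, feats)
        return bic(fit(x, train_labels), x, train_labels)

    selected, best = [], score([])
    remaining = set(range(len(train_rows[0])))
    while remaining:
        cand_score, cand = min((score(selected + [j]), j) for j in sorted(remaining))
        if cand_score >= best:  # BIC stopping rule: no candidate improves the score
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected


def make_data(n, seed=0):
    """Synthetic data: feature 0 drives the label, features 1 and 2 are noise."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
    labels = [1 if r[0] + 0.3 * rng.gauss(0, 1) > 0 else 0 for r in rows]
    return rows, labels


rows, labels = make_data(200)
train_x, train_y = rows[:150], labels[:150]   # training folds
dev_x, dev_y = rows[150:], labels[150:]       # held-out development fold

selected = forward_select(train_x, train_y)   # the dev fold is never seen here
w = fit(project(train_x, selected), train_y)
dev_acc = sum((predict(w, x) > 0.5) == bool(y)
              for x, y in zip(project(dev_x, selected), dev_y)) / len(dev_y)
print("selected features:", selected, "dev accuracy:", round(dev_acc, 2))
```

The defect described in the note corresponds to passing the full dataset, rather than only the training rows, to the scoring step inside `forward_select`; the weights themselves would still be fit on the training rows alone.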
- Bohus, D., Horvitz, E. (2009) - Models for Multiparty Engagement in Open-World Dialog, in Proceedings of SIGdial'09, London, UK [abs] [video] (SIGdial'09 best paper award)
We present computational models that allow spoken dialog systems to handle multi-participant engagement in open, dynamic environments, where multiple people may enter and leave conversations, and interact with the system and with others in a natural manner. The models for managing the engagement process include components for (1) sensing the engagement state, actions and intentions of multiple agents in the scene, (2) making engagement decisions (i.e. whom to engage with, and when) and (3) rendering these decisions in a set of coordinated low-level behaviors in an embodied conversational agent. We review results from a study of interactions "in the wild" with a system that implements such a model.
- Bohus, D., Horvitz, E. (2009) - Open-World Dialog: Challenges, Directions and a Prototype, in Proceedings of IJCAI Workshop on Knowledge and Reasoning in Practical Dialog Systems, Pasadena, CA [abs] [video]
We present an investigation of open-world dialog, centering on systems that can perform conversational dialog in an open-world context, where multiple people with different needs, goals, and long-term plans may enter, interact, and leave an environment. We outline and discuss a set of challenges and core competencies required for supporting the kind of fluid multiparty interaction that people expect when conversing and collaborating with other people. Then, we focus as a concrete example on the challenges faced by receptionists who field requests at the entries to corporate buildings. We review the subtleties and difficulties of creating an automated receptionist that can work with people on solving their needs with the ease and etiquette expected from a human receptionist. Finally, we review details of the construction and operation of a working prototype.

Here is a project overview video:

Here are a couple of videos illustrating multiparty engagement and interaction in the receptionist and trivia game domains:
Receptionist multiparty engagement: illustrates reasoning about engagement and interleaved conversations in the Receptionist domain. Notice the inferences about engagement state, actions, intentions, as well as high-level goals and activities. The red dot shows the system's gaze direction.

Questions game multiparty engagement: illustrates how a situated conversational agent can initiate engagement with bystanders to solicit help and attract them into playing a multiparticipant questions game. Notice the inferences about engagement state, actions, intentions, as well as high-level goals and activities. The red dot shows the system's gaze direction.

Here is a set of videos illustrating the initial Receptionist prototype:

Basic interaction: illustrates a basic single-participant interaction with the system. Notice the various layers of scene analysis (the system tracks the user's face and pose, infers information about clothing, affiliation, task goals, etc.) and the natural engagement model (the system engages as the user approaches).

Scene inferences and grounding: the system infers user goals from scene analysis (the user is dressed formally, hence most likely external, hence probably wants registration), but grounds this information through dialog. Notice also the grounding of the building number.

Attention modeling and engagement: the system monitors the user's attention (using information from the face detector and pose tracker) and engages the user accordingly.

Handling people waiting in line: the system monitors multiple users in the scene and acknowledges the presence of a waiting user with a quick glance (the red dot shows the system's gaze) and by engaging them temporarily towards the end of the conversation.

Re-engagement: same as above, except that when the system turns back, the initial user is no longer paying attention. Knowing that a person is waiting in line, the system draws the user's attention and re-engages by saying "Excuse me!"

Multi-participant dialog: the system infers from the scene (and confirms through dialog) that the two participants are in a group together. The system then carries on a multi-participant conversation. Notice the gaze model (red dot), which is informed by who the speaking participant is and also by certain elements in the discourse structure.

Multi-participant dialog with side conversation: similar to the previous interaction; at the end, the users engage in a side conversation. The system understands that the utterances are not addressed to it and, after a while, interrupts the two users to convey the shuttle information. Notice also the touch-screen interaction that is used as a fallback for cases when speech recognition fails.