Boosting-Based Multimodal Speaker Detection for Distributed Meeting Videos

Cha Zhang, Pei Yin, Yong Rui, Ross Cutler, Paul Viola, and Xinding Sun

Abstract

Identifying the active speaker in a video of a distributed meeting can help remote participants understand the dynamics of the meeting. A straightforward application of such analysis is to stream a high-resolution video of the speaker to the remote participants. In this paper, we present the challenges we met while designing a speaker detector for the Microsoft RoundTable distributed meeting device, and propose a novel boosting-based multimodal speaker detection (BMSD) algorithm. Instead of performing sound source localization (SSL) and multi-person detection (MPD) separately and then fusing their individual results, the proposed algorithm fuses audio and visual information at the feature level, using boosting to select features from a combined pool of audio and visual features simultaneously. The result is a very accurate speaker detector with extremely high efficiency. In experiments that include hundreds of real-world meetings, the proposed BMSD algorithm reduces the error rate of the SSL-only approach by 24.6%, and that of the SSL and MPD fusion approach by 20.9%. To the best of our knowledge, this is the first real-time multimodal speaker detection algorithm deployed in commercial products.
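
As a rough illustration of what feature-level fusion by boosting can look like, the sketch below trains a plain discrete AdaBoost classifier with single-feature threshold stumps on a pool that concatenates audio and visual features, so each round is free to pick a feature from either modality. It is not the BMSD algorithm itself: the feature names (audio:ssl_score, visual:motion, and the two noise columns), the synthetic data, and the labeling rule are all assumptions made up for this example.

```python
import math
import random


def stump_predict(x, feature, threshold, polarity):
    """One-feature threshold classifier returning +1 or -1."""
    return polarity if x[feature] >= threshold else -polarity


def train_adaboost(X, y, n_rounds):
    """Discrete AdaBoost; each round selects the single best feature stump."""
    n, d = len(X), len(X[0])
    weights = [1.0 / n] * n
    model = []  # list of (feature, threshold, polarity, alpha)
    for _ in range(n_rounds):
        best = None  # (weighted error, feature, threshold, polarity)
        for j in range(d):  # search the combined audio + visual pool
            for thr in sorted({row[j] for row in X}):
                for pol in (1, -1):
                    err = sum(w for w, row, label in zip(weights, X, y)
                              if stump_predict(row, j, thr, pol) != label)
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((j, thr, pol, alpha))
        # Increase the weight of examples the chosen stump got wrong.
        weights = [w * math.exp(-alpha * label * stump_predict(row, j, thr, pol))
                   for w, row, label in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, x):
    """Sign of the alpha-weighted vote of all selected stumps."""
    score = sum(a * stump_predict(x, j, thr, pol) for j, thr, pol, a in model)
    return 1 if score >= 0 else -1


# Hypothetical combined feature pool: two audio and two visual columns.
FEATURE_NAMES = ["audio:ssl_score", "audio:noise", "visual:motion", "visual:noise"]

# Synthetic data (assumption): a window is "speaking" only when the audio
# and visual evidence together are strong; the noise columns carry no signal.
rng = random.Random(0)
X, y = [], []
for _ in range(120):
    ssl, motion = rng.random(), rng.random()
    X.append([ssl, rng.random(), motion, rng.random()])
    y.append(1 if ssl + motion > 1.0 else -1)

model = train_adaboost(X, y, n_rounds=8)
selected = sorted({FEATURE_NAMES[j] for j, _, _, _ in model})
accuracy = sum(predict(model, row) == label for row, label in zip(X, y)) / len(X)
print("selected features:", selected)
print("training accuracy:", accuracy)
```

Because the label in this toy setup depends on both modalities, the boosted classifier ends up drawing stumps from both the audio and the visual columns, which is the intuition behind selecting from a joint pool rather than fusing two separately trained detectors.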

Details

Publication type: Article
Published in: IEEE Trans. on Multimedia
URL: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4694847
Publisher: IEEE