A large number of problems in computer vision involve making predictions over exponentially (or even infinitely) large structured-output spaces, e.g. the space of all segmentations of an image, the space of all object-part hierarchies under a context-free grammar, or the space of all pixel-level depth predictions.
In order to build intelligent vision systems that can reason about these tasks, we must address the challenges of 1) representation: how do we store and represent beliefs over exponentially or infinitely large output spaces? 2) learning: how do we learn these beliefs from data? 3) inference: how do we predict under these beliefs? and 4) their interactions: the richer the model, the harder it becomes to learn and to perform inference in. In this talk, I will present a sampling of my recent work that addresses some of these challenges.
While a lot of progress has been made on the 'static' version of the MAP inference problem, a number of situations call for dynamic inference algorithms that adapt and reorder computation to focus on the 'important' parts of the problem. I will present a novel measure for identifying these important parts and demonstrate how it speeds up inference algorithms in a variety of settings.
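To make the idea of reordering computation concrete, here is a minimal, purely illustrative sketch of a dynamic update schedule: a fixed-point iteration that, instead of sweeping variables in a fixed order, always updates the variable whose pending change (its residual) is largest. The function name, the residual-based priority, and the toy fixed-point problem are all assumptions for illustration; they are not the specific importance measure presented in the talk.

```python
import heapq

def residual_schedule(update_fns, x0, tol=1e-8, max_iters=10000):
    """Dynamic fixed-point iteration with residual-based prioritization.

    update_fns[i](x) returns the new value of variable i given the current
    state x. Rather than a fixed sweep, we keep a priority queue keyed by
    each variable's residual |update_fns[i](x) - x[i]| -- a simple proxy
    for how 'important' updating that variable is right now.
    """
    x = list(x0)

    def residual(i):
        return abs(update_fns[i](x) - x[i])

    heap = [(-residual(i), i) for i in range(len(x))]
    heapq.heapify(heap)
    iters = 0
    while heap and iters < max_iters:
        _, i = heapq.heappop(heap)
        r = residual(i)          # recompute: the queued priority may be stale
        if r < tol:
            continue             # nothing important left to do for i
        x[i] = update_fns[i](x)
        # Updating x[i] changes other variables' residuals. For clarity we
        # re-push every variable; a real implementation would only re-push
        # the neighbors of i in the dependency graph.
        for j in range(len(x)):
            rj = residual(j)
            if rj >= tol:
                heapq.heappush(heap, (-rj, j))
        iters += 1
    return x
```

For example, on the coupled pair x = (y+1)/2, y = (x+1)/2, the schedule converges to the fixed point (1, 1), spending its updates wherever the residual is currently largest.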
Next, I will talk about our recent work on the M-Best-Mode problem, which involves extracting not just the single most probable solution, but a 'diverse' set of the M most probable solutions in discrete graphical models (like MRFs/CRFs). Extracting the top M modes of the distribution allows us to better exploit the beliefs that our model holds.
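One way to formalize this kind of problem is as a sequence of diversity-constrained maximizations; the sketch below uses assumed notation (P for the model's distribution, Delta for a dissimilarity measure, k for a minimum-diversity threshold) and is an illustration of the general idea rather than the talk's exact formulation.

```latex
% Having already found modes y^{(1)}, \dots, y^{(m-1)}, the m-th solution
% solves a MAP problem subject to diversity constraints:
\begin{align*}
  y^{(m)} = \arg\max_{y}\; & P(y \mid x) \\
  \text{s.t.}\quad & \Delta\!\big(y,\, y^{(i)}\big) \ge k
    \qquad \text{for } i = 1, \dots, m-1,
\end{align*}
% where \Delta is a dissimilarity between labelings (e.g. Hamming distance)
% and k > 0 forces each new solution to differ from those already found.
```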
Joint work with Pushmeet Kohli (MSRC), Vladimir Kolmogorov (IST), Sebastian Nowozin (MSRC), Greg Shakhnarovich (TTIC), Ashutosh Saxena (Cornell), Daniel Tarlow (UToronto) and Payman Yadollahpour (TTIC).